International Journal of Speech Technology
ISSN 1381-2416
Volume 18, Number 3
Int J Speech Technol (2015) 18:479–487
DOI 10.1007/s10772-015-9289-1
Investigating the psychometric properties of the Speech User
Interface Service Quality questionnaire
James R. Lewis (IBM Corporation, Boca Raton, FL, USA; jimlewis@us.ibm.com)
Mary L. Hardzinski (State Farm Mutual Automobile Insurance Company, Bloomington, IL, USA; mhardzinski@yahoo.com)

Received: 30 October 2014 / Accepted: 25 June 2015 / Published online: 5 July 2015
© Springer Science+Business Media New York 2015
Abstract The Speech User Interface Service Quality
(SUISQ) questionnaire is a standardized instrument for the
assessment of the usability of interactive voice response
(IVR) applications, developed by Polkosky (Toward a
social-cognitive psychology of speech technology: affective
responses to speech-based e-service, 2005; Mediated
interpersonal communication, 2008). During its develop-
ment, participants rated the quality of recorded interactions
rather than interactions in which they participated, leaving
open the question of the extent to which the findings would
generalize to personal as opposed to observed interactions.
The results of a large-scale unmoderated usability study of a
natural-language speech recognition IVR demonstrated the
utility of the SUISQ for the purpose of assessing personal
experiences with service-providing speech user interfaces.
The psychometric properties of construct validity and reli-
ability were very similar to those reported by Polkosky.
Additional item analyses led to the definition of two subsets
of the full set of 25 SUISQ items—a reduced version
(SUISQ-R, 14 items) and a maximally-reduced version
(SUISQ-MR, 9 items). The SUISQ-R had similar psycho-
metric properties to the full SUISQ, but analysis of the
SUISQ-MR revealed some weaknesses in its reliability and
construct validity. This replication of the original SUISQ
findings in a markedly different context of measurement
and the availability of a shorter, psychometrically qualified,
version of the questionnaire (SUISQ-R) should enhance its
utility for usability practitioners who work on the devel-
opment and assessment of speech-recognition IVRs.
Keywords IVR · Interactive voice response · Subjective
assessment of IVR quality · Psychometric evaluation · SUI
service quality questionnaire · Usability questionnaire
1 Introduction
Designers strive to produce usable designs. This is, however,
not easy to do—especially in the complex design space of
using speech technologies to provide automated service over
a telephone (Lewis 2011). One critical aspect of the devel-
opment of usable speech-enabled interactive voice response
(IVR) applications is the measurement of their usability.
The direct measurement of usability is not possible
because it is not a property of a person or thing (Lewis 2012;
Sauro and Lewis 2012). Usability is an emergent property that
depends on the interactions among users, products, tasks and
environments, as recognized in the international standard ISO
9241 (International Standards Organization 1998). The ISO
standard defines three major components of usability mea-
surement: effectiveness, efficiency, and satisfaction. The first
two are performance metrics, typically collected as successful
task completions for effectiveness and task completion times
for efficiency. Satisfaction, in contrast, is a subjective mea-
surement related to perceived usability, typically collected
using a standardized questionnaire.
1.1 Standardization of measurement
A standardized measurement is one for which there is an
established procedure for collecting and presenting the
measurement, such as the measurement of time in seconds
or temperature in degrees Celsius. Standardized measures
have a number of advantages in the practice of science and
engineering. Standardized measurements support objec-
tivity of studies and make studies easier to replicate. A
number of usability researchers have demonstrated that
standardized usability questionnaires are more reliable
(more consistently produce the same measurement under
the same circumstances) than homegrown or ad hoc
questionnaires (Hornbæk 2006; Hornbæk and Law 2007;
Sauro and Lewis 2009).
The development of standardized measures requires a
substantial amount of work. Once developed, however, they
are extremely economical. Standardization also makes it
easier for practitioners to communicate their results in a
way that other practitioners will understand. Standardiza-
tion also aids the assessment of the generalization of results.
As part of the development of a standardized question-
naire, it is typical practice for the developer to report
measurements of its reliability and validity. These are the
fundamental elements of psychometric qualification
(Nunnally 1978).
1.2 Brief review of psychometric practice
1.2.1 Reliability
Reliability is an assessment of the consistency of a mea-
surement. The most common measurement of a scale’s
reliability is coefficient alpha (Nunnally 1978), a measure
of internal consistency. Coefficient alpha can range from 0
(completely unreliable) to 1 (perfectly reliable). For pur-
poses of research or evaluation in which the final score will
be the average of ratings from more than one questionnaire,
the typical minimally acceptable reliability is .70 (Lan-
dauer 1988; Nunnally 1978).
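To make the computation concrete, the following is a minimal sketch (ours, not from the article) of how coefficient alpha could be estimated with Python and pandas, assuming item responses are stored with one row per respondent and one column per item:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Estimate coefficient alpha for a set of scale items.

    `items` holds one row per respondent and one column per item,
    all belonging to the same scale and coded in the same direction.
    """
    items = items.dropna()                      # listwise deletion of incomplete responses
    k = items.shape[1]                          # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example with hypothetical 7-point ratings from five respondents on three items
ratings = pd.DataFrame({
    "item1": [7, 6, 5, 7, 4],
    "item2": [6, 6, 5, 7, 3],
    "item3": [7, 5, 6, 6, 4],
})
print(round(cronbach_alpha(ratings), 2))        # here well above the .70 criterion
```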
1.2.2 Validity
Validity refers to the extent to which a measurement
actually measures what it claims to measure. There are a
number of different approaches to the assessment of
validity, including content validity, criterion-related
validity, and construct validity. A questionnaire has valid
content when the initial pool of items comes from sources
that have a rational relationship to the measurement of
interest. There are no metrics for content validity; rather,
content validity is an outcome of an appropriate process for
the creation of candidate items.
Researchers commonly use the correlation coefficient to
assess criterion-related validity (the relationship between
the measure of interest and a different concurrent or pre-
dictive measure). The magnitude of the correlation does
not need to be large to provide evidence of validity, but the
correlation should be statistically significant. A common
minimum criterion for the magnitude of correlations that
support the validity hypothesis is .30.
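As a brief illustration (ours, with hypothetical data), the correlation between a questionnaire score and a concurrent criterion such as a satisfaction rating can be checked against this criterion with scipy:

```python
from scipy import stats

# Hypothetical paired observations: overall questionnaire scores and
# concurrent satisfaction ratings from the same respondents.
questionnaire_scores = [5.1, 6.3, 4.2, 6.8, 5.5, 3.9, 6.1, 4.8]
satisfaction_ratings = [4, 5, 3, 5, 4, 2, 5, 3]

r, p = stats.pearsonr(questionnaire_scores, satisfaction_ratings)
print(f"r = {r:.2f}, p = {p:.3f}")

# Evidence for criterion-related validity: statistically significant and
# at least moderate in magnitude (a common minimum criterion is |r| >= .30).
print((p < .05) and (abs(r) >= .30))
```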
The most common method for assessing construct
validity is factor analysis. Factor analysis is a statistical
procedure that examines the correlations among variables
to discover groups of related variables (Nunnally 1978).
Because summated (Likert) scales are more reliable than
single-item scores and it is easier to interpret and present a
smaller number of scores, it is common to conduct a factor
analysis to determine if there is a statistical basis for the
formation of measurement scales based on factors. Gen-
erally, a factor analysis requires a minimum of five par-
ticipants per item to ensure stable factor estimates
(Nunnally 1978). There are a number of methods for esti-
mating the number of factors in a set of scores when
conducting exploratory analyses, including discontinuity
and parallel analysis (Cliff 1987; Coovert and McNelis
1988). When previous research has established an expected
number of factors, there is a shift of focus from exploratory
to confirmatory analysis.
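The sketch below (ours, using simulated data) illustrates one of the estimation methods mentioned above, parallel analysis: eigenvalues of the observed item correlation matrix are compared with the mean eigenvalues of random data of the same size, and the number of observed eigenvalues exceeding the random benchmark suggests how many factors to retain.

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_sims: int = 100, seed: int = 0) -> int:
    """Suggest a number of factors by comparing observed eigenvalues
    with the mean eigenvalues of same-sized random normal data."""
    rng = np.random.default_rng(seed)
    n_obs, n_items = data.shape

    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    random_eigs = np.zeros((n_sims, n_items))
    for i in range(n_sims):
        sim = rng.standard_normal((n_obs, n_items))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    threshold = random_eigs.mean(axis=0)

    # Count how many observed eigenvalues exceed the random benchmark.
    return int(np.sum(observed > threshold))

# Hypothetical example: 200 respondents, 25 items sharing four latent factors.
rng = np.random.default_rng(1)
latent = rng.standard_normal((200, 4))
loadings = rng.standard_normal((4, 25)) * 0.8
items = latent @ loadings + rng.standard_normal((200, 25))
print(parallel_analysis(items))   # ideally close to the 4 simulated factors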
1.2.3 Number of scale steps
Despite the difficulties that individual respondents some-
times have in matching their subjective ratings with scale
anchors (Dillman 2000; Sudman et al. 1996; Tourangeau
et al. 2000), scale reliability (typically assessed with
coefficient alpha) increases as the number of scale steps
increases (Nunnally 1978). As the number of scale steps
increases from two to twenty, there is an initially rapid
increase in reliability that tends to level off at about seven
steps (Nunnally 1978). After eleven steps there is very little
gain (but no loss) in reliability as the number of steps
increases. Lewis (1993) found that mean differences
between experimental groups measured with questionnaire
items having seven steps correlated more strongly with the
observed significance level of statistical tests than did
similar measurements using items that had only five scale
steps, supporting the use of seven rather than five scale
steps.
1.3 Previous research on standardized
questionnaires for speech user interfaces
1.3.1 Mean Opinion Scale (MOS)
The MOS questionnaire has been widely used for the
assessment of speech heard over a telephone channel and
for the assessment of synthetic speech, recommended by
the International Telecommunication Union (Schmidt-
Nielsen 1995; ITU 1994; van Bezooijen and van Heuven
1997). The MOS is a Likert-style questionnaire, typically
with seven 5-point scale items addressing the following
text-to-speech (TTS) characteristics: (1) Global Impression, (2) Listening
Effort, (3) Comprehension Problems, (4) Speech Sound
Articulation, (5) Pronunciation, (6) Speaking Rate, and (7)
Voice Pleasantness. In the most typical use of the MOS,
naïve listeners assign scores for each item after listening to
speech stimuli, usually sentences (Schmidt-Nielsen 1995).
Factor analysis of these items indicated that they supported
two underlying constructs: Intelligibility and Naturalness
(Kraft and Portele 1995; Lewis 2001).
Polkosky and Lewis (2003) investigated the reliability
and validity of the MOS and used psychometric principles
to revise and improve the scale. This work resulted in the
MOS-Revised (MOS-R). Four subsequent experiments
expanded the MOS-R beyond its previous focus on Intel-
ligibility and Naturalness, to include measurement of the
Prosody and Social Impression of synthetic voices. The
result of this work was the MOS-Expanded (MOS-X), a
rating scale shown to be reliable, valid, and sensitive for
high-quality evaluation of synthetic speech in applied
industrial settings (a total of 15 items, with 4 for Intelli-
gibility, 4 for Naturalness, 3 for Prosody, and 4 for Social
Impression). Although the MOS-X and related question-
naires are excellent instruments for their intended purpose,
they do not address a sufficient scope for the assessment of
the overall perceived usability of an IVR.
1.3.2 Subjective Assessment of Speech System Interfaces
(SASSI)
The SASSI (Hone and Graham 2000) is a questionnaire
developed for the assessment of users’ subjective experi-
ences with speech recognition systems. Starting with an
initial pool of 50 items, the final version of the SASSI had
34 items distributed across six scales: System Response
Accuracy (9 items), Likeability (9 items), Cognitive
Demand (5 items), Annoyance (5 items), Habitability (5
items) and Speed (2 items). The reliabilities of these scales,
assessed with coefficient alpha, were respectively .90, .91,
.88, .77, .75, and .69. The database of completed SASSI
questionnaires used for the psychometric analyses con-
tained 214 questionnaires, collected during usability stud-
ies of four different applications.
In contrast to the MOS, the SASSI covers a much broader
scope of usability attributes for systems employing speech
recognition. A number of researchers have used the SASSI
in their evaluations of speech systems. For example, the
Association for Computing Machinery (ACM) Digital
Library shows over 30 citations from 2005 through 2013.
Despite its popularity, there are aspects of the SASSI that
reduce its utility when assessing IVR applications. The
primary focus of the SASSI is on the consequences of
speech input and how it affects perceived usability and
affect (positive and negative). Furthermore, a key goal of
the SASSI developers was to build a questionnaire that
would be generalizable across a broad spectrum of speech
applications, from extremely limited use of speech input in
small devices to in-car systems to natural language under-
standing (NLU) queries. Having items applicable across
this range of products resulted in a questionnaire that did not
address some of the key characteristics of IVR applications.
Due to their common use in enterprise customer service
to direct users to skill groups in call centers for human
assistance or to automated self-service applications, the
assessment of IVRs requires attention to aspects of
usability that are not applicable to the broad range of
speech-enabled applications. Specifically, the assessment
of IVRs requires attention not only to the quality of speech
input (the focus of the SASSI) or speech output (the focus
of the MOS), but also to the quality of the delivered ser-
vice, which is one of the key elements of the Speech User
Interface Service Quality (SUISQ) questionnaire (Polkosky
2005, 2008).
1.4 The SUISQ questionnaire
The framework for the SUISQ is a questionnaire developed
specifically for the assessment of key usability attributes of
IVR applications (Polkosky 2005, 2008). An initial pool of
76 items was obtained from the literatures of social psy-
chology, communication, and services marketing. Follow-
ing several rounds of item and factor analysis, the final
version of the SUISQ contained 25 items; factor analysis
indicated the presence of four factors, corresponding to its
four scales (see Appendix 1). The four scales of the SUISQ
(with number of items, estimated reliability, and correla-
tions with customer satisfaction shown in parentheses) are:
User Goal Orientation (UGO: 8 items, α = .92, r = .71),
Customer Service Behaviors (CSB: 8 items, α = .89,
r = .43), Speech Characteristics (SC: 5 items, α = .87,
r = .40), and Verbosity (V: 4 items, α = .69, r = −.27).
The User Goal Orientation items relate to the system’s
efficiency, user trust, confidence in the system, and clarity
of the speech interface. Customer Service Behavior
includes items that relate to the friendliness and politeness
of the system, its speaking pace, and its use of familiar
terms. The Speech Characteristics factor relates to natu-
ralness and enthusiasm of the system voice. Verbosity
includes items related to the talkativeness and repetitive-
ness of the system. In experiments in which participants
listened to recordings of interactions with IVR applications
and rated them using the SUISQ, Polkosky (2005) found
significant correlations between all four metrics and cus-
tomer satisfaction, with participants preferring higher
levels of the first three scales and lower levels of Verbosity.
To obtain the data required to develop the SUISQ,
Polkosky (2005) recruited 862 students from the University
of South Florida Psychology Department’s participant pool
(688 females, 161 males, mean age of 20.6), and dis-
tributed them about equally among six interface stimuli
(Tennis Scoreboard, Directory Dialer, Flights, Movies,
Financial Services, and Prescription Refill). Participants
listened to a recorded interaction for their assigned stimu-
lus, and then completed a questionnaire that included all
the candidate items for the SUISQ plus a variety of con-
current measures including customer satisfaction. To
ensure that participants attended to the assigned interac-
tion, they completed a short post-session quiz. The results
of the quizzes showed that participants recalled the details
of the interactions with reasonable accuracy.
Although she cited precedents in marketing and inter-
personal communication studies for using third-party
observers to provide ratings of sentiments (e.g., Cargile
et al. 1994; Dabholkar and Bagozzi 2002; Patterson 1996),
Polkosky stated:
One of the most important limitations of the present
research was the use of observers instead of actual
interface users. Findings from social cognition high-
light this issue for not only applied speech technology
research, but also marketing and interpersonal com-
munication studies, which frequently use observers to
generate data on conversational and service interac-
tions. In contrast, findings from the social-cognitive
literature warn that interactants and observers may
have different affective outcomes. Thus, the present
results are limited to observers of speech interface
usage and do not necessarily apply to users themselves.
This methodological problem has important implica-
tions because the use of observers is an efficient and
practical means of conducting applied research. It
should be a central goal of future research efforts that
potential differences in user and observer affective
responses be explored. (Polkosky 2005, p. 85–86)
Thus, even though findings from the marketing and
interpersonal communication literature suggest that obser-
vers of interactions might provide ratings of sentiment
similar to those who actually experienced the interactions, it
is an open research question as to whether participants who
actually experience the interaction would provide responses
to the SUISQ that would have similar psychometric prop-
erties as those reported by Polkosky (2005, 2008).
1.5 Research goals
Our primary research goal was to investigate the psycho-
metric properties of the SUISQ with data collected from
participants (callers who actually interacted with the IVR
to complete assigned tasks) rather than observers (people
who listened to recorded interactions between a caller and
an IVR). A secondary research goal was to conduct addi-
tional item analyses to explore the reliability and validity
of versions of the SUISQ containing fewer than 25 items.
2 Psychometric evaluation of the SUISQ
questionnaire
2.1 Method
As part of a larger research effort, 549 employees from a
large corporation (415 females, 134 males) volunteered to
complete tasks with a test version of a banking IVR using
natural-language call routing (Kuo et al. 2003; Lee et al.
2000), administered via an unmoderated remote usability testing sys-
tem (Albert et al. 2010). The participants’ ages covered a
wide span, with 10.6 % from 18 to 29 years old, 24.0 %
from 30 to 39, 29.1 % from 40 to 49, 32.1 % from 50 to 59,
and 4.2 % over 60. Eighty-five percent of participants used
a land-line during the evaluation; 15 % used a cell phone.
Over one-third of the participants indicated that they used
automated speech systems at least once a week, and about
half indicated use once or twice per month. Over half of the
respondents indicated that they were Comfortable or Very
Comfortable using these types of systems; just under a
third indicated they were Uncomfortable or Very
Uncomfortable.
There were three task groups, each with three different
tasks. Participants attempted to complete the tasks in their
assigned task group (Group 1, Group 2, or Group 3). The
Group 1 tasks were to pay a bill, review transactions from
the last three months, and get information about a maturing
certificate-of-deposit (CD). For Group 2, the tasks were to
update an address, transfer funds, and get information
about a health savings account (HSA). The Group 3 tasks
were to troubleshoot problems getting into an account,
getting the payoff information for a car, and reporting a lost
debit card. After completing their assigned group of tasks,
participants completed the SUISQ (presented online by the
unmoderated usability testing system with items in the
order specified by Polkosky, 2008) and provided a rating of
satisfaction (‘‘Overall how satisfied are you with your
experience using the automated speech system’’ using a
5-point scale anchored with Extremely Dissatisfied,Dis-
satisfied,Neither Satisfied nor Dissatisfied,Satisfied, and
Extremely Satisfied). They also indicated via self-report
whether they did not accomplish any tasks (Comple-
tion =0), accomplished some tasks (Completion =1), or
accomplished all tasks (Completion =2).
2.2 Results
These initial analyses investigated the extent to which
using the SUISQ as published by Polkosky (2005, 2008),
but in this different context of measurement, produced
results similar to those in the original research.
2.2.1 Reliability
The estimated reliability for the Overall scale (using all 25
items) was .93. For the specific scales, the values of
coefficient alpha were .94 for UGO, .91 for CSB, .78 for
SC, and .71 for V. These results were comparable with
those reported by Polkosky (2005) which were, respec-
tively, .92 (UGO), .89 (CSB), .87 (SC), and .69
(V) (Polkosky did not report an Overall reliability). The
values for coefficient alpha were within .02 for UGO, CSB,
and V. The greatest difference was for SC. In both Polk-
osky (2005) and the present study, however, the reliability
exceeded the typical minimum criterion of .70.
2.2.2 Construct validity
Table 1 shows the results of a varimax-rotated principal
components analysis (PCA) of the SUISQ items. Almost all of the
items (23 out of 25) aligned with the same component as in
Polkosky (2005). Item 7 (‘‘The system was organized and
logical’’) aligned with UGO instead of the expected CSB,
and Item 14 (‘‘The system’s voice was pleasant’’) aligned
with CSB instead of the expected SC.
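For readers who want to reproduce this kind of analysis, the sketch below (ours, with simulated ratings standing in for the SUISQ responses) extracts four principal components from the item correlation matrix and applies a varimax rotation to the loadings; each item would then be assigned to the component on which it loads most heavily, as in Table 1.

```python
import numpy as np

def varimax(loadings: np.ndarray, max_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Orthogonally rotate a loading matrix toward simple structure."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag(np.diag(rotated.T @ rotated)) / p)
        )
        rotation = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):   # stop when the criterion no longer improves
            break
        d = d_new
    return loadings @ rotation

def pca_loadings(data: np.ndarray, n_components: int) -> np.ndarray:
    """Unrotated principal-component loadings from the item correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

# Hypothetical stand-in for 549 respondents x 25 SUISQ items (1-7 ratings).
rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(549, 25)).astype(float)

rotated = varimax(pca_loadings(responses, n_components=4))
assignments = np.abs(rotated).argmax(axis=1)   # component with the largest loading per item
print(assignments)
```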
2.2.3 Criterion-related validity
Correlations between the SUISQ scales and the satisfaction
rating provided evidence of criterion-related (concurrent)
validity. Specifically, the correlations (all p < .01) were
UGO: .74, CSB: .36, SC: .23, and V: −.27. The correla-
tions were within .04 of the values reported by Polkosky
(2005) for UGO, CSB, and V. The correlation for SC was
lower than that reported by Polkosky (.43), but was still
statistically significant and in the same direction.
2.2.4 Sensitivity
As evidence of scale sensitivity, a mixed-model ANOVA
using SUISQ Scale as a within-subjects variable and
Completion as a between-subjects variable revealed that
more successful participants gave significantly better
overall SUISQ ratings (F(2, 519) = 33.7, p < .01), with
significant improvement for each higher level of Comple-
tion (Bonferroni multiple comparisons, p < .05). There
was also a significant Scale × Completion interaction (F(6,
1557) = 24.5, p < .01). As shown in Fig. 1, the effect of
Completion on mean rating was strongest for UGO and
weakest for SC and V.
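This type of mixed-model analysis is straightforward to set up; the sketch below (ours, using the pingouin library and hypothetical long-format data, not the study data) shows one way to do so, with one row per participant per SUISQ scale score:

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: 30 participants (10 per Completion level),
# one row per participant per SUISQ scale score.
rng = np.random.default_rng(0)
rows = []
for pid in range(30):
    completion = pid // 10                      # 0, 1, or 2
    for scale in ["UGO", "CSB", "SC", "V"]:
        base = 3.5 + 0.8 * completion           # better ratings with more success
        rows.append({"participant": pid, "completion": completion,
                     "scale": scale, "rating": base + rng.normal(0, 0.7)})
df = pd.DataFrame(rows)

# Scale is the within-subjects factor, Completion the between-subjects factor.
aov = pg.mixed_anova(data=df, dv="rating", within="scale",
                     subject="participant", between="completion")
print(aov[["Source", "F", "p-unc"]])

# Bonferroni-adjusted comparisons between Completion levels
# (this helper is named pairwise_ttests in older pingouin releases).
posthoc = pg.pairwise_tests(data=df, dv="rating", between="completion", padjust="bonf")
print(posthoc[["A", "B", "p-corr"]])
```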
3 Psychometric evaluation of the SUISQ-R
questionnaire
3.1 Method
Having established a considerable degree of consistency
between the psychometric outcomes reported by Polkosky
(2005, 2008) and this independent set of data collected in a
very different context, it seemed reasonable to analyze the
SUISQ items to develop a shorter questionnaire, the
SUISQ-Reduced (SUISQ-R). The goal at this stage was to
identify 3–4 items per scale that would still have accept-
able psychometric properties.
3.2 Results
3.2.1 Item analysis
Analysis of factor loadings and correlations with satisfac-
tion indicated that the items listed in Table 2 should pro-
vide sufficient representation for the SUISQ-R to have
acceptable psychometric properties (see Appendix 2).

Table 2 Selected items for the SUISQ-R scales
Scale  Items
UGO    1, 5, 13, 17
CSB    6, 11, 23, 25
SC     18, 20, 24
V      2, 15, 22
3.2.2 Reliability
The SUISQ-R scales had acceptable reliability as measured
with coefficient alpha (UGO: .91, CSB: .88, SC: .80, V:
.67, Overall: .88). V had the lowest reliability, falling just
under the typical criterion of .70. Researchers who
require all reliabilities to exceed .70 could modify the
SUISQ-R by including all four V items (see Table 1).
3.2.3 Criterion-related validity
The SUISQ-R scales significantly correlated with the rating
of satisfaction (UGO: .70, CSB: .32, SC: .21, V: −.32,
Overall: .54; all p < .01).
3.2.4 Construct validity
A PCA conducted with the items of the SUISQ-R con-
firmed that the items aligned as expected on the scales.
3.2.5 Sensitivity
For the SUISQ-R scales in the same mixed-model ANOVA
as that reported in the previous section, the main effect of
Completion (F(2, 519) = 31.8, p < .01) and the
Completion × Scale interaction (F(6, 1557) = 18.9,
p < .0001) were statistically significant, with patterns of
means similar to those in Fig. 1.
4 Psychometric evaluation of the SUISQ-MR
questionnaire
4.1 Method
To explore a maximally-reduced version of the SUISQ, we
created an initial version that had only two items per scale
(the SUISQ-MR).
4.2 Results
4.2.1 Item analysis
Analysis of factor loadings and correlations with satisfac-
tion led to the assignment of the items listed in Table 3
to each scale.

Table 3 Preliminary assignment of items to SUISQ-MR scales
Scale  Items
UGO    13, 17
CSB    6, 11
SC     20, 24
V      2, 22
4.2.2 Reliability
The reliabilities of the initial SUISQ-MR scales were
estimated as follows using coefficient alpha: UGO: .88,
CSB: .75, SC: .68, V: .47.
Table 1 Principal components analysis of the 25 items
Item Content UGO CSB SC V
13 I would be likely to use this system again .858 .228 .146 -.124
12 I could trust this system to work correctly .834 .205 .117 -.088
17 I felt confident using this system .834 .245 .159 -.088
10 The system would help me be productive .831 .155 .078 -.089
5 I could find what I needed without any difficulty .805 .190 .031 -.073
3 The system gave me a good feeling about being a customer of this business .800 .180 .162 -.025
1 The system made me feel like I was in control .799 .219 .028 -.098
19 The quality of this system made me want to remain a customer of this business .794 .164 .297 -.105
7* The system was organized and logical .628 .439 -.009 -.099
6 The system used everyday words .336 .758 .041 -.099
11 The system seemed polite .256 .739 .316 -.105
9 The system spoke at a pace that was easy to follow .127 .736 .079 -.214
14* The system’s voice was pleasant .188 .726 .434 -.084
4 The system used terms I am familiar with .355 .711 -.054 -.041
25 The system seemed professional in its speaking style .271 .668 .400 -.156
23 The system seemed friendly .290 .648 .482 -.150
21 The system seemed courteous .260 .599 .447 -.163
18 The system’s voice sounded like a regular person .096 .139 .808 -.054
20 The system’s voice sounded natural .164 .242 .797 -.140
24 The system’s voice sounded enthusiastic or full of energy .127 .238 .658 -.045
16 The system’s voice sounded like people I hear on the radio or television .027 -.004 .585 .121
22 I felt like I had to wait too long for the system to stop talking so I could respond -.139 -.161 -.036 .730
2 The messages were repetitive -.185 .084 -.084 .706
8 The system gave me more details than I needed .075 -.199 .011 .701
15 The system was too talkative -.223 -.431 .029 .655
Factor loadings greater than .500 appear in bold
Fig. 1 Scale × Completion interaction. Note: so that all scales have a
consistent alignment in the figure, with higher numbers indicating a
better outcome, the scale for V has been reversed using the formula
Vr = 8 − V
For this maximally-reduced version, two of the scales
had estimated reliabilities less than .70. SC was just under
the criterion (by .02); V was .23 below it. Thus, the final
version of the SUISQ-MR included the three V items from
the SUISQ-R, with estimated reliability of .67 (see
Appendix 3).
4.2.3 Criterion-related validity
The scales for the final version of the SUISQ-MR signifi-
cantly correlated with satisfaction ratings (UGO: .70, CSB:
.29, SC: .22, V: −.28, Overall: .55; all p < .01).
4.2.4 Construct validity
A PCA conducted with the SUISQ-MR items indicated
some structural weaknesses. Specifically, the UGO and
CSB items coalesced onto one component, and the V items
split across two.
4.2.5 Sensitivity
The results of the mixed-model ANOVA using the SUISQ-
MR scales had outcomes similar to those reported for the
SUISQ and SUISQ-R, with a statistically significant main
effect of Completion (F(2, 519) = 32.0, p < .0001) and a
significant Completion × Scale interaction (F(6, 1557) =
19.8, p < .0001).
5 Discussion
The results provided compelling evidence that the SUISQ,
developed based on the ratings of observers of speech
interactions rather than participants in those interactions,
works very well with participants. In addition to this
fundamental difference in the measurement context, there
were also major differences in backgrounds (students vs.
corporate employees), age ranges, and tasks. Despite these
differences, the psychometric properties of the SUISQ
were strikingly consistent with those originally reported by
Polkosky (2005, 2008).
The attempts to develop shorter versions of the SUISQ
(SUISQ-R and SUISQ-MR) met with mixed success. The
psychometric properties of the 14-item SUISQ-R strongly
suggested that it would be an adequate substitution for the
full SUISQ. Its only weakness was that the reliability of V
was slightly less than .70. It would, however, have rounded
up to .7, so this did not seem to be a critical weakness.
Researchers who require scales with estimated reliability
greater than .70 could accomplish this by using the 4-item
version of V from the full SUISQ, resulting in a 15- rather
than 14-item instrument.
The 9-item SUISQ-MR, however, suffered from a
number of weaknesses. The 2-item version of V had
extremely low reliability, which drove the decision to keep
the best three items for that scale despite the goal of
maximal reduction. The results for criterion-related (con-
current) validity and sensitivity were acceptable. There
were some slight weaknesses in the reliability estimates for
SC and V. Its greatest problem was its deviation from the
expected PCA structure, suggesting a serious weakness in
construct validity.
In summary, this replication of the original SUISQ
findings in a markedly different context of measurement
and the availability of a shorter, psychometrically qualified,
version of the questionnaire (SUISQ-R) should enhance its
utility for usability practitioners who work on the devel-
opment and assessment of speech-recognition IVRs.
Researchers who need the richest possible set of items for
diagnostic purposes should consider using the full SUISQ.
Those who absolutely require the shortest possible ques-
tionnaire might consider the SUISQ-MR. The SUISQ-R,
however, is probably the best choice for most researchers,
given its relative brevity and acceptable psychometric
properties.
6 Limitations to generalization
One limitation was that our completion metric was based
on self-reports rather than an objective assessment of
completion due to a limitation of the unmoderated usability
test tool in the context of evaluating a phone application.
Without downplaying the importance of the perception of
task completion, future research would benefit from also
obtaining objective measurement of task completion.
Although we had an a priori reason to explore a variety
of four-factor solutions based on the results of Polkosky
(2005, 2008), it is possible that the use of one sample for
the various versions that we explored may have led to
overfitting the model. Also, our data came from an evalu-
ation of one type of voice application, so there is a clear
need to replicate the findings using other voice applica-
tions, preferably using a variety of input styles (natural
language call routing, directed dialog, and touchtone).
Successful replication using independent data sets from a
variety of voice applications in the future would enhance
the generalizability of these findings.
Appendix 1
The Standard SUI service quality (SUISQ) questionnaire
1. The system made me feel like I was in control.
2. The messages were repetitive.
3. The system gave me a good feeling about being a
customer of this business.
4. The system used terms I am familiar with.
5. I could find what I needed without any difficulty.
6. The system used everyday words.
7. The system was organized and logical.
8. The system gave me more details than I needed.
9. The system spoke at a pace that was easy to follow.
10. The system would help me be productive.
11. The system seemed polite.
12. I could trust this system to work correctly.
13. I would be likely to use this system again.
14. The system’s voice was pleasant.
15. The system was too talkative.
16. The system’s voice sounded like people I hear on the
radio or television.
17. I felt confident using this system.
18. The system’s voice sounded like a regular person.
19. The quality of this system made me want to remain a
customer of this business.
20. The system’s voice sounded natural.
21. The system seemed courteous.
22. I felt like I had to wait too long for the system to stop
talking so I could respond.
23. The system seemed friendly.
24. The system’s voice sounded enthusiastic or full of
energy.
25. The system seemed professional in its speaking
style.
SUISQ scales (based on specification in Polkosky 2005)
User goal orientation (UGO): average items 1, 3, 5, 10, 12, 13, 17, and 19.
Customer service behavior (CSB): average items 4, 6, 7, 9, 11, 21, 23, and 25.
Speech characteristics (SC): average items 14, 16, 18, 20, and 24.
Verbosity (V): average items 2, 8, 15, and 22 (to reverse score: Vr = 8 − V).
Overall: average of UGO, CSB, SC, and Vr.
Appendix 2
See Table 4.
Table 4 The reduced SUI service quality (SUISQ-R) questionnaire
Item  Original  Scale  Item content
1     13        UGO    I would be likely to use this system again
2     17        UGO    I felt confident using this system
3      5        UGO    I could find what I needed without any difficulty
4      1        UGO    The system made me feel like I was in control
5      6        CSB    The system used everyday words
6     11        CSB    The system seemed polite
7     25        CSB    The system seemed professional in its speaking style
8     23        CSB    The system seemed friendly
9     18        SC     The system's voice sounded like a regular person
10    20        SC     The system's voice sounded natural
11    24        SC     The system's voice sounded enthusiastic or full of energy
12    22        V      I felt like I had to wait too long for the system to stop talking so I could respond
13     2        V      The messages were repetitive
14    15        V      The system was too talkative

SUISQ-R scales
User goal orientation (UGO): average items 1–4
Customer service behavior (CSB): average items 5–8
Speech characteristics (SC): average items 9–11
Verbosity (V): average items 12–14 (to reverse score: Vr = 8 − V)
Overall: average of UGO, CSB, SC, and Vr
Appendix 3
See Table 5.

Table 5 The maximally-reduced SUI service quality (SUISQ-MR) questionnaire
Item  Original  Scale  Item content
1     13        UGO    I would be likely to use this system again
2     17        UGO    I felt confident using this system
3      6        CSB    The system used everyday words
4     11        CSB    The system seemed polite
5     20        SC     The system's voice sounded natural
6     24        SC     The system's voice sounded enthusiastic or full of energy
7     22        V      I felt like I had to wait too long for the system to stop talking so I could respond
8      2        V      The messages were repetitive
9     15        V      The system was too talkative

SUISQ-MR scales
User goal orientation (UGO): average items 1–2
Customer service behavior (CSB): average items 3–4
Speech characteristics (SC): average items 5–6
Verbosity (V): average items 7–9 (to reverse score: Vr = 8 − V)
Overall: average of UGO, CSB, SC, and Vr
References
Albert, W., Tullis, T., & Tedesco, D. (2010). Beyond the usability lab:
Conducting large-scale online user experience studies. Burling-
ton: Morgan Kaufmann.
Cargile, A., Giles, H., Ryan, E., & Bradac, J. (1994). Language
attitudes as a social process: A conceptual model and new
directions. Language & Communication, 14, 211–236.
Cliff, N. (1987). Analyzing multivariate data. San Diego: Harcourt
Brace Jovanovich.
Coovert, M. D., & McNelis, K. (1988). Determining the number of
common factors in factor analysis: A review and program.
Educational and Psychological Measurement, 48, 687–693.
Dabholkar, P., & Bagozzi, R. (2002). An attitudinal model of
technology-based self-service: Moderating effects of consumer
traits and situational factors. Journal of the Academy of
Marketing Science, 30(3), 184–201.
Dillman, D. A. (2000). Mail and Internet surveys: The tailored design
method (2nd ed.). New York: John Wiley.
Hone, K. S., & Graham, R. (2000). Towards a tool for the subjective
assessment of speech system interfaces (SASSI). Natural
Language Engineering, 6(3–4), 287–303.
Hornbæk, K. (2006). Current practice in measuring usability:
Challenges to usability studies and research. International
Journal of Human-Computer Studies, 64(2), 79–102.
Hornbæk, K., & Law, E.L. (2007). Meta-analysis of correlations
among usability measures. In Proceedings of CHI 2007 (pp.
617–626). San Jose: ACM.
International Standards Organization. (1998). Ergonomic require-
ments for office work with visual display terminals (VDTs)—Part
11: Guidance on usability (ISO 9241-11:1998(E)). Geneva: ISO.
International Telecommunication Union. (1994). A method for
subjective performance assessment of the quality of speech
voice output devices (ITU-T Recommendation P.85). Geneva:
ITU.
Kraft, V., & Portele, T. (1995). Quality evaluation of five German
speech synthesis systems. Acta Acustica, 3, 351–365.
Kuo, H. J., Siohan, O., & Olive, J. P. (2003). Advances in natural
language call routing. Bell Labs Technical Journal, 7(4),
155–170.
Landauer, T. K. (1988). Research methods in human-computer
interaction. In M. Helander (Ed.), Handbook of human–computer
interaction (pp. 905–928). New York: Elsevier.
Lee, C.-H., Carpenter, B., Chou, W., Chu-Carroll, J., Reichl, W.,
Saad, A., & Zhou, Q. (2000). On natural language call routing.
Speech Communication, 31, 309–320.
Lewis, J. R. (1993). Multipoint scales: Mean and median differences
and observed significance levels. International Journal of
Human-Computer Interaction, 5, 383–392.
Lewis, J.R. (2001). Psychometric properties of the mean opinion
scale. In Proceedings of HCI International 2001: Usability
Evaluation and Interface Design (pp. 149–153). Mahwah:
Lawrence Erlbaum.
Lewis, J. R. (2011). Practical speech user interface design. Boca
Raton: Taylor & Francis Group.
Lewis, J. R. (2012). Usability testing. In G. Salvendy (Ed.), Handbook
of human factors and ergonomics (pp. 1267–1312). New York:
John Wiley.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-
Hill.
Patterson, M. L. (1996). Social behavior and social cognition: A
parallel process approach. In J. L. Nye & A. M. Brower (Eds.),
What’s social about social cognition? Research on socially
shared cognition in small groups (pp. 87–105). Thousand Oaks:
Sage.
Polkosky, M. D. (2005). Toward a social-cognitive psychology of
speech technology: Affective responses to speech-based e-
service. Unpublished doctoral dissertation. University of South
Florida.
Polkosky, M. D. (2008). Machines as mediators: The challenge of
technology for interpersonal communication theory and
research. In E. Konjin (Ed.), Mediated interpersonal communi-
cation (pp. 34–57). New York: Routledge.
Polkosky, M. D., & Lewis, J. R. (2003). Expanding the MOS:
Development and psychometric evaluation of the MOS-R and
MOS-X. International Journal of Speech Technology, 6,
161–182.
Sauro, J., & Lewis, J. R. (2009). Correlations among prototypical
usability metrics: Evidence for the construct of usability. In
Proceedings of CHI 2009 (pp. 1609–1618). Boston: ACM.
Sauro, J., & Lewis, J. R. (2012). Quantifying the user experience:
Practical statistics for user research. Waltham: Morgan
Kaufmann.
Schmidt-Nielsen, A. (1995). Intelligibility and acceptability testing
for speech technology. In A. Syrdal, R. Bennett, & S. Greenspan
(Eds.), Applied speech technology (pp. 195–232). Boca Raton:
CRC Press.
Sudman, S., Bradburn, N. M., & Schwartz, N. (1996). Thinking about
answers: The application of cognitive processes to survey
methodology. San Francisco: Jossey-Bass Publishers.
Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of
survey response. Cambridge: Cambridge University Press.
van Bezooijen, R., & van Heuven, V. (1997). Assessment of synthesis
systems. In D. Gibbon, R. Moore, & R. Winski (Eds.), Handbook
of standards and resources for spoken language systems (pp.
481–563). New York: Mouton de Gruyter.
... For example, scales like Experience in Virtual Environment Questionnaire (EVEQ) (Takatalo et al., 2008) or the NASA Task Load Index (Hart & Staveland, 1988) are developed based on very specific research contexts due to which they might not be effective in the human-CA interaction scenario. Likewise, scales such as the Subjective Assessment of Speech System Interfaces (SASSI) (Hone & Graham, 2000) or the Speech User Interface Service Quality (SUISQ) questionnaire (Lewis & Hardzinski, 2015), although have been developed keeping speech interfaces in mind, however such interfaces are "nonsmart" that do not possess any type of intelligence or decision making capabilities like the CAs. For instance, both SASSI and SUISQ were created to test Interactive Voice Response (IVR) applications. ...
... D. Polkosky & Lewis, 2003), SUISQ (M. Polkosky, 2005), SUISQ-R (Lewis & Hardzinski, 2015), or SUISQ-MR (Lewis & Hardzinski, 2015), the focus was on TTS systems. For developing all the scales pre-recorded audio clips have been used that are assessed by the users. ...
... D. Polkosky & Lewis, 2003), SUISQ (M. Polkosky, 2005), SUISQ-R (Lewis & Hardzinski, 2015), or SUISQ-MR (Lewis & Hardzinski, 2015), the focus was on TTS systems. For developing all the scales pre-recorded audio clips have been used that are assessed by the users. ...
Article
Conversational agents are growing in popularity, and as such they must provide a good user experience and meet the needs of the users. Yet, how to measure the user experience in the conversational AI scenario remains an open and urgent question to be solved, which may hinder further empirical studies on human-agent interactions. In fact, there have been very few studies that explore how users perceive interacting with these conversational agents, which is important to ensure their sustainability. Accordingly, in this work, we used a subjective technique by following a measurement approach to develop a standardized measurement instrument/scale called the Conversational Agent Scale for User Experience. In terms of the methodology, we used a mixed-method approach involving an iterative process spanning across six different user studies (three qualitative and three quantitative) at different points of time for the purpose of dimension identification , item generation and subsequent item refinements. As a part of the qualitative studies we conducted a Systematic Literature Review, semi-structured interviews with users of conversational agents, and expert interviews. For the quantitative studies a lab-based experiment was performed for the pre-test, followed by two online surveys as a part of pilot testing and scale development. The qualitative studies initially identified a total of 13 distinct user experience dimensions and 418 measurement items. Finally, the Exploratory Factor Analysis as a part of the main survey resulted in 9 user experience dimensions (practicality perception, proficiency, humanness, sentiment, robustness, etiquette & mannerism, personality, anthropomorphism, and ease of use) and 34 measurement items. This was supplemented by conducting another Confirmatory Factor Analysis for establishing the reliability and validity of the proposed scale, together with checking the model-fit indices. All the 9 dimensions had sufficient validity, and a reasonable level of statistical reliability. Some of the user experience dimensions like humanness, personality, anthropomorph-ism, and etiquette & mannerism show the uniqueness of the conversational AI scenario from the traditional usage factors used commonly while evaluating graphical user interface-based systems. This work fills the gap of a lack of research on how to classify and measure the conversational experience of users and provides a reference for practitioners and designers in developing these agents and continuously improving the usage experience.
... According to the result, only 9.5% of the scales (n =2) were created with smart speakers, and 19% of the scales (n=4) were created using multiple interfaces. The majority, comprising 67% of the scales (n = 14), were created using diverse software interfaces, including text-to-speech [114], third party applications such as gaming applications, travel software, and customer relationship management software [94]. In terms of the publication year, 50% of the studies (n = 9) have been published between 2000 and 2010, 11% of the studies (n = 2) have been published between 2011 to 2015; and 39% of the studies (n = 7) published between 2016 to 2022. ...
... Although the scale modifications were carried out for different reasons, the result shows that in most cases, there is at least an interval of a decade between an original scale and its modified version. For instance, [94] created SUISQ-R and SUISQ-MR, which are a modifications of SUISQ [92], which occurred after nearly a decade. Likewise, [98] created UEQ-S, which is a modified version of UEQ [97] roughly after a similar time frame. ...
... Content validity is significantly affected by the number of items on a scale; therefore, modified scales often opt for item reduction by eliminating dimensions to increase their content validity. In cases of item reduction, a deductive method should be employed for principal component selection [94]. ...
Article
Full-text available
The use of Voice Assistants (VA) in both commercial and personal contexts has experienced significant growth, emphasizing the importance of assessing their user experience (UX) for long-term viability. Currently, the development of appropriate scales that capture user viewpoints after interacting with a system has become a popular method for measuring UX of the Graphical User Interface (GUI) systems. However, the applicability of these scales that are meant for GUI systems on VA is still questionable, hence the need for analyzing the nature of previous scales used for measuring UX of VA. Additionally, in order to keep track of the state of UX research in the VA domain, it is crucial to understand the dimensions of UX that are being utilized. In this study, a comprehensive Systematic Literature Review (SLR) was carried out to identify 21 individual scales used for measuring UX of VA. Furthermore, this study present the evaluation criteria for assessing the rigor of operationalization during the development of these scales. The study analysis reveals that the scales used for measuring UX of VA extends beyond the traditional VUDA (value, usability, desirability, adoptability) principles and incorporates novel aspects such as anthropomorphism and machine personality. Future VA UX researchers should also acknowledge the variations in the rigorous measures employed during scale development, notwithstanding some common and accepted practices. Consequently, an overview is provided, along with suggestions for prospective studies in the field of VA UX research.
... The well-known PARAdigm for Dialogue System Evaluation framework [21] suggests that the interaction quality with chatbots should be considered as a weighted product of success in achieving the tasks (maximize task success) at an acceptable cost (efficiency and quality of chatbot's performance). However, as acknowledged by Borsci and colleagues [22][23][24] without a specifically designed instrument to measure chatbot user experience, experts cannot reliably compare their results and often use qualitative instruments, or scales developed for point-and-click interaction (e.g., the System Usability Scale [25], Usability metrics for User Experience Lite version (UMUX-Lite) [26]), or scales developed for conversational interfaces-e.g., Speech User Interface Service Quality scale [27] and the Subjective Assessment of Speech System Interfaces [28]. Satisfaction scales like SUS and UMUX are highly reliable (Cronbach's α > 0.7 [29]) and validated measures to capture the quality experience by users after the interaction with interactive systems; nevertheless, these scales were not meant to capture the aspects associated with the textual or verbal dialogical exchanges, i.e., the ability to maintain the sense of the conversation and its context in an efficient and effective [30,31]. ...
Article
Full-text available
Intelligent systems, such as chatbots, are likely to strike new qualities of UX that are not covered by instruments validated for legacy human–computer interaction systems. A new validated tool to evaluate the interaction quality of chatbots is the chatBot Usability Scale (BUS) composed of 11 items in five subscales. The BUS-11 was developed mainly from a psychometric perspective, focusing on ranking people by their responses and also by comparing designs’ properties (designometric). In this article, 3186 observations (BUS-11) on 44 chatbots are used to re-evaluate the inventory looking at its factorial structure, and reliability from the psychometric and designometric perspectives. We were able to identify a simpler factor structure of the scale, as previously thought. With the new structure, the psychometric and the designometric perspectives coincide, with good to excellent reliability. Moreover, we provided standardized scores to interpret the outcomes of the scale. We conclude that BUS-11 is a reliable and universal scale, meaning that it can be used to rank people and designs, whatever the purpose of the research.
... Researchers have identified various observable factors that influence human-AI interaction, including appearance (especially personified appearance) [118][119][120][121][122][123][124], voice characteristics like tone, pitch, and style [125][126][127][128][129][130][131][132][133][134][135], dialogue [136][137][138], movement and behavior [138][139][140], and emotional and social expression [138,[141][142][143][144][145]. Fine-tuning these factors allows AI system designers to enhance the impact of virtual agents or physical robots and strengthen their relationship with users. ...
Preprint
Full-text available
This thesis investigates the psychological factors that influence belief in AI predictions, comparing them to belief in astrology- and personality-based predictions, and examines the "personal validation effect" in the context of AI, particularly with Large Language Models (LLMs). Through two interconnected studies involving 238 participants, the first study explores how cognitive style, paranormal beliefs, AI attitudes, and personality traits impact perceptions of the validity, reliability, usefulness, and personalization of predictions from different sources. The study finds a positive correlation between belief in AI predictions and belief in astrology- and personality-based predictions, highlighting a "rational superstition" phenomenon where belief is more influenced by mental heuristics and intuition than by critical evaluation. Interestingly, cognitive style did not significantly affect belief in predictions, while paranormal beliefs, positive AI attitudes, and conscientiousness played significant roles. The second study reveals that positive predictions are perceived as significantly more valid, personalized, reliable, and useful than negative ones, emphasizing the strong influence of prediction valence on user perceptions. This underscores the need for AI systems to manage user expectations and foster balanced trust. The thesis concludes with a proposal for future research on how belief in AI predictions influences actual user behavior, exploring it through the lens of self-fulfilling prophecy. Overall, this thesis enhances understanding of human-AI interaction and provides insights for developing AI systems across various applications.
... Although EEG has served as a proficient tool for quantifying objective assessments of emotional responses to chatbot interventions, it is customary to use usability scales for capturing the subjective dimensions pertaining to emotions and experiences in the context of chatbot systems. While a universally accepted benchmark for conducting usability tests on chatbots remains elusive, numerous studies have gravitated toward the adoption of the System Usability Scale (SUS-10) [24][25][26][27][28][29] and the Speech User Interface Service Quality scale. SUS-10 captures the overall usability of a system independently of the platform or interface. ...
Article
Full-text available
Background The use of chatbots in mental health support has increased exponentially in recent years, with studies showing that they may be effective in treating mental health problems. More recently, the use of visual avatars called digital humans has been introduced. Digital humans have the capability to use facial expressions as another dimension in human-computer interactions. It is important to study the difference in emotional response and usability preferences between text-based chatbots and digital humans for interacting with mental health services. Objective This study aims to explore to what extent a digital human interface and a text-only chatbot interface differed in usability when tested by healthy participants, using BETSY (Behavior, Emotion, Therapy System, and You) which uses 2 distinct interfaces: a digital human with anthropomorphic features and a text-only user interface. We also set out to explore how chatbot-generated conversations on mental health (specific to each interface) affected self-reported feelings and biometrics. Methods We explored to what extent a digital human with anthropomorphic features differed from a traditional text-only chatbot regarding perception of usability through the System Usability Scale, emotional reactions through electroencephalography, and feelings of closeness. Healthy participants (n=45) were randomized to 2 groups that used a digital human with anthropomorphic features (n=25) or a text-only chatbot with no such features (n=20). The groups were compared by linear regression analysis and t tests. Results No differences were observed between the text-only and digital human groups regarding demographic features. The mean System Usability Scale score was 75.34 (SD 10.01; range 57-90) for the text-only chatbot versus 64.80 (SD 14.14; range 40-90) for the digital human interface. Both groups scored their respective chatbot interfaces as average or above average in usability. Women were more likely to report feeling annoyed by BETSY. Conclusions The text-only chatbot was perceived as significantly more user-friendly than the digital human, although there were no significant differences in electroencephalography measurements. Male participants exhibited lower levels of annoyance with both interfaces, contrary to previously reported findings.
Preprint
Full-text available
Conversational AI (CAI) systems, which encompass voice- and text-based assistants, are on the rise and have been widely integrated into people's everyday lives. Despite their widespread adoption, users voice concerns regarding privacy, security and trust in these systems. However, the composition of these perceptions, their impact on technology adoption and usage, and the relationship between privacy, security and trust perceptions in the CAI context remain open research challenges. This study contributes to the field by conducting a Systematic Literature Review and offers insights into the current state of research on privacy, security and trust perceptions in the context of CAI systems. The review covers application fields and user groups, sheds light on the empirical methods and tools used for assessment, provides insights into the reliability and validity of privacy, security and trust scales, and extensively investigates the subconstructs of each item as well as additional concepts that are collected concurrently. We point out that the perceptions of trust, privacy and security overlap, based on the subconstructs we identified. While the majority of studies investigate one of these concepts, only a few were found to explore privacy, security and trust perceptions jointly. Our research aims to inform the development and use of reliable scales for users' privacy, security and trust perceptions and to contribute to the development of trustworthy CAI systems.
Chapter
Full-text available
Covers the basics of usability testing plus some statistical topics (sample size estimation, confidence intervals, and standardized usability questionnaires).
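As a hedged illustration of one of the listed topics, and not an excerpt from the chapter itself, the short Python sketch below computes a t-based 95% confidence interval around a mean questionnaire score with SciPy; the eight scores are hypothetical.

import numpy as np
from scipy import stats

scores = np.array([72.5, 80.0, 65.0, 77.5, 85.0, 70.0, 62.5, 90.0])  # hypothetical ratings
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")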
Book
Although speech is the most natural form of communication between humans, most people find using speech to communicate with machines anything but natural. Drawing from psychology, human-computer interaction, linguistics, and communication theory, Practical Speech User Interface Design provides a comprehensive yet concise survey of practical speech user interface (SUI) design. It offers practice-based and research-based guidance on how to design effective, efficient, and pleasant speech applications that people can really use. Focusing on the design of speech user interfaces for IVR applications, the book covers speech technologies including speech recognition and production, ten key concepts in human language and communication, and a survey of self-service technologies. The author, a leading human factors engineer with extensive experience in research, innovation, and design of products with speech interfaces used worldwide, covers both high- and low-level decisions and includes VoiceXML code examples. To help articulate the rationale behind various SUI design guidelines, he includes a number of detailed discussions of the applicable research. The techniques for designing usable SUIs are not obvious and, to be effective, must be informed by a combination of critically interpreted scientific research and leading design practices. The blend of scholarship and practical experience found in this book establishes research-based leading practices for the design of usable speech user interfaces for interactive voice response applications.
Book
You're being asked to quantify your usability improvements with statistics. But even with a background in statistics, practitioners are often hesitant to statistically analyze their data, unsure which statistical tests to use and how to defend the use of small test sample sizes. This book provides a practical guide to solving common quantitative problems that arise in usability testing. It addresses questions you face every day, such as: Is the current product more usable than our competition? Can we be sure at least 70% of users can complete the task on the 1st attempt? How long will it take users to purchase products on the website? The book shows you which test to use and provides a foundation in both the statistical theory and best practices for applying it. The authors draw on decades of statistical literature from Human Factors, Industrial Engineering, and Psychology, as well as their own published research, to provide the best solutions. They provide concrete solutions (Excel formulas, links to their own web calculators) along with an engaging discussion of the statistical reasons why the tests work and how to effectively communicate the results. *Provides practical guidance on solving usability testing problems with statistics for any project, including those using Six Sigma practices *Shows practitioners which test to use, why the tests work, and best practices in application, along with easy-to-use Excel formulas and web calculators for analyzing data *Recommends ways for practitioners to communicate results to stakeholders in plain English. © 2012 Jeff Sauro and James R. Lewis Published by Elsevier Inc. All rights reserved.
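To make the "at least 70% of users complete the task on the 1st attempt" question concrete, here is a minimal sketch, assuming hypothetical task outcomes and SciPy 1.7 or later; it uses a one-sided exact binomial test rather than the book's own Excel formulas or web calculators.

from scipy.stats import binomtest

successes, attempts = 9, 10  # hypothetical task-completion data
test = binomtest(successes, attempts, p=0.70, alternative='greater')
print(f"observed rate = {successes / attempts:.0%}, one-sided p = {test.pvalue:.3f}")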
Book
Examines the psychological processes involved in answering different types of survey questions. The book proposes a theory about how respondents answer questions in surveys, reviews the relevant psychological and survey literatures, and traces out the implications of the theories and findings for survey practice. Individual chapters cover the comprehension of questions, recall of autobiographical memories, event dating, questions about behavioral frequency, retrieval and judgment for attitude questions, the translation of judgments into responses, special processes relevant to the questions about sensitive topics, and models of data collection. The text is intended for: (1) social psychologists, political scientists, and others who study public opinion or who use data from public opinion surveys; (2) cognitive psychologists and other researchers who are interested in everyday memory and judgment processes; and (3) survey researchers, methodologists, and statisticians who are involved in designing and carrying out surveys.
Article
1. Introduction.
2. Methods for Determining Cognitive Processes and Questionnaire Problems.
3. Answering a Survey Question: Cognitive and Communicative Processes.
4. Psychological Sources of Context Effects in Survey Measurement.
5. The Direction of Context Effects: What Determines Assimilation or Contrast in Attitude Measurement.
6. Order Effects Within a Question: Presenting Categorical Response Alternatives.
7. Autobiographical Memory.
8. Event Dating.
9. Counting and Estimation.
10. Proxy Reporting.
11. Implications for Questionnaire Design and the Conceptualization of the Survey Interview.
Chapter
This chapter discusses the conduct of research to guide the development of more useful and usable computer systems. Experimental research in human-computer interaction involves varying the design or deployment of systems, observing the consequences, and inferring from observations what to do differently. For such research to be effective, it must be owned—instituted, trusted and heeded—by those who control the development of new systems. Thus, managers, marketers, systems engineers, project leaders, and designers as well as human factors specialists are important participants in behavioral human-computer interaction research. This chapter is intended as much for those with backgrounds in computer science, engineering, or management as for human factors researchers and cognitive systems designers. It is argued in this chapter that the special goals and difficulties of human-computer interaction research make it different from most psychological research as well as from traditional computer engineering research. The main goal, the improvement of complex, interacting human-computer systems, requires behavioral research but is not sufficiently served by the standard tools of experimental psychology such as factorial controlled experiments on pre-planned variables. The chapter contains about equal quantities of criticism of inappropriate general research methods, description of valuable methods, and prescription of specific useful techniques.
Article
How to measure usability is an important question in HCI research and user interface evaluation. We review current practice in measuring usability by categorizing and discussing usability measures from 180 studies published in core HCI journals and proceedings. The discussion distinguish several problems with the measures, including whether they actually measure usability, if they cover usability broadly, how they are reasoned about, and if they meet recommendations on how to measure usability. In many studies, the choice of and reasoning about usability measures fall short of a valid and reliable account of usability as quality-in-use of the user interface being studied. Based on the review, we discuss challenges for studies of usability and for research into how to measure usability. The challenges are to distinguish and empirically compare subjective and objective measures of usability; to focus on developing and employing measures of learning and retention; to study long-term use and usability; to extend measures of satisfaction beyond post-use questionnaires; to validate and standardize the host of subjective satisfaction questionnaires used; to study correlations between usability measures as a means for validation; and to use both micro and macro tasks and corresponding measures of usability. In conclusion, we argue that increased attention to the problems identified and challenges discussed may strengthen studies of usability and usability research.