SayWAT: Augmenting Face-to-Face Conversations
for Adults with Autism
LouAnne E. Boyd1, Alejandro Rangel2, Helen Tomimbang1, Andrea Conejo-Toledo1,
Kanika Patel1, Monica Tentori2, Gillian R. Hayes1
1Department of Informatics
UC Irvine, Irvine, USA
{boydl, htomimba, aconejot, kanikap, hayesg}@uci.edu
2Department of Computer Science
CICESE, Ensenada, Mexico
arangel@cicese.edu.mx, mtentori@cicese.mx
ABSTRACT
During face-to-face conversations, adults with autism
frequently use atypical rhythms and sounds in their speech
(prosody), which can result in misunderstandings and
miscommunication. SayWAT is a Wearable Assistive
Technology that provides feedback to wearers about their
prosody during face-to-face conversations. In this paper, we
describe the design process that led to five design
guidelines that governed the development of SayWAT and
present results from two studies involving our prototype
solution. Our results indicate that wearable assistive
technologies can automatically detect atypical prosody and
deliver feedback in real time without disrupting the wearer
or the conversation partner. Additionally, we provide
suggestions for wearable assistive technologies for social
support.
Author Keywords
Autism, communication, prosody, wearable computing,
social skills, assistive technology.
ACM Classification Keywords
K.4.2 [Social Issues]: Assistive technologies for persons with disabilities; H.5.5 [Sound and Music Computing]: Signal analysis, synthesis, and processing
INTRODUCTION
Atypical prosody is one of the most noticeable
characteristics of autism [15]—making speakers
inadvertently sound angry, bored, or tired. It is also one of
the most difficult social skills to change over a lifetime,
often requiring intensive and extensive intervention [8,17].
Prosody includes the rhythm and sounds in speech, and
refers to the acoustic way words are spoken to convey
meaning through changes in pitch, volume, and rate of
speech. Atypical prosodic patterns can heavily limit the
opportunities for individuals with autism to establish social
connections during face-to-face conversations. These
challenges appear to contribute negatively to peers' perceptions of the speaker and the overall social interaction experience [8,28]. These characteristics are evident in the struggles of some individuals with autism to develop peer relationships, recognize emotions, and generally develop social skills for interactive communication [20,24,27].
Over time, atypical prosody contributes to chronic social
difficulties that impact quality of life for people with
autism. These challenges typically lead to social isolation,
low rates of employment, and significant mental health
concerns [10,11,19]. Young adults with autism have the
lowest rates of participation in employment compared with
youth with other disabilities [26]. Those who are employed
often are employed below their level of education and have
difficulty maintaining stable employment [26]. Young
adults with autism also have some of the highest rates of
anxiety and depression [11,32] with a suicide rate 28 times
higher than the general population [19].
Although intensive intervention programs exist for social
skills generally and prosody in particular, social skills
therapists have been overwhelmed in recent years by the sheer
number of people needing services. We are currently
experiencing a massive shortfall in the number of trained
professionals available to provide expert social skills
support [10]. Technologies have the potential to bridge that
gap in services, a goal that drove this research.
In this work, we were interested in understanding how
wearable assistive technologies might provide adults with
autism with awareness of prosody without disrupting the
social interaction. We followed a user-centered design
approach to uncover key design guidelines for such a
solution. Building on these guidelines, we developed and
evaluated SayWAT, a wearable assistive technology that
uses Google Glass to display visual information in the
wearer’s peripheral vision. Our results indicate that
wearable assistive technologies can provide awareness of
social communication without disrupting conversations.
The primary contributions of this work are (1)
establishment of the feasibility of this approach for
supporting adults with autism and (2) empirically validated
design guidelines for wearable assistive technologies to
support face-to-face communication in situ.
BACKGROUND AND RELATED WORK
In this section, we first provide relevant background on
prosody improvement. We then describe research related to
assistive technologies for social communication more
broadly. Taken together, this research indicates that
technological interventions, especially wearables, may be
instrumental in overcoming the challenges associated with
atypical prosody. More generally, technologies to support
social skills could benefit from greater mobility and
awareness of the context of the social intervention. We seek
to address both of these areas of research in this work.
Prosody
Conveying and interpreting emotions during conversations
requires using and understanding body language (facial
expressions, gestures), spoken words, and prosody. Prosody
helps people understand the meaning of spoken words. For
example, when using sarcasm, the words being spoken do
not match their meaning. In this example, prosody helps
people understand that the speaker is not being sincere.
Prosody has been shown to convey 70%-90% of the
meaning of an utterance [16]. Thus, challenges in using
prosody to convey meaning can be a major hindrance to
making oneself understood by others. Atypical prosody in
people with autism has been described as being “monotone”
[22,28]. A monotone voice is one that is "flat," robotic, or mechanical, resulting from a narrow range in pitch along with short pauses between phrases. For example, in the
acoustical graph of a brief conversation in Figure 1, the first
adult is speaking with a pitch range that spreads across 220 Hz, while the second speaker, an adult with autism, is speaking with a narrow pitch range of 25 Hz.
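To make the notion of pitch range concrete, the following minimal sketch (our illustration, not code from this work) computes the spread of a fundamental-frequency (f0) contour in the way Figure 1 suggests; the contour values are invented, with zeros marking unvoiced frames.

import numpy as np

def pitch_range_hz(f0_contour):
    # Spread of voiced pitch values in Hz; zeros mark unvoiced frames.
    voiced = f0_contour[f0_contour > 0]
    if voiced.size == 0:
        return 0.0
    return float(voiced.max() - voiced.min())

typical = np.array([110, 180, 250, 140, 0, 330, 120])   # wide pitch variation
monotone = np.array([118, 121, 0, 125, 119, 122, 117])  # narrow pitch variation

print(pitch_range_hz(typical))   # 220.0 Hz -> typical prosody
print(pitch_range_hz(monotone))  # 8.0 Hz   -> monotone, "flat"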
Best practices for prosody intervention include providing
multiple opportunities to practice typical prosody along
with feedback in the context of use. This feedback tends to
take the form of a teacher or therapist commenting on role
played scenarios. However, the goal of such interactions is
to make the practiced conversations as natural as possible.
People participating in social skills training typically
engage in these practice sessions regularly over several
weeks or even years. These kinds of training sessions are
common in schools, but social skill supports for adults with
autism are significantly lacking [10]. Additionally, even for
children who may have opportunities to roleplay regularly,
there are limited opportunities to practice these skills
outside of a school setting. The focus of this research, then,
is on bridging that gap through technological supports that
can provide frequent feedback and opportunities to practice
skills in natural settings.
Assistive Technologies for Communication
A variety of tools have been created and studied to support
social communication for people with autism. In this
section, we describe those that are most relevant to this
research, notably those to support prosody and related
communication challenges.
Only one research effort has explicitly focused on using
technologies for teaching prosody. SpeechPrompts [29] was
designed on a tablet to support speech therapists during the
treatment of prosody in school-aged children with autism
and other developmental disabilities. Simmons et al. found more typical volume and stress patterns at the post-treatment
measures. All ten speech-language pathologists reported
their students maintained engagement and enjoyed the
sessions when using SpeechPrompts [29].
Figure 1: A conversation shown over time between a person with autism (light yellow bar), exhibiting a monotone voice, and a person without autism (blue striped bar) exhibiting typical prosody. The black bar denotes no one speaking. Top: amplitude of air displacement of the sound. Bottom: range of pitch.

Although not explicitly focused on prosody, other efforts have been more focused on supporting users during social
interactions. For example, MOSOCO is an augmented
reality system on a mobile phone platform that provides
step-by-step guidance during social interactions in real life
situations. This work builds on the idea of Interaction
Immediacy [30] for identifying when to intervene in a
social setting. In a study of the use of MOSOCO with 12
fourth graders (nine neuro-typical and three with autism),
Escobedo et al. found MOSOCO improved social
interactions for both children with autism and their typically
developing peers [9].
Wearable assistive technologies [4,9,23,25] have the
potential to provide feedback in the moment of need. For
example, the Facial Analysis system and Emotion Bubbles
project [18] paired two technologies to enable users to
monitor their own emotional expression using facial
recognition techniques. Studies examining the paired
wearable camera and minicomputer showed that users are
able to modify their emotional expression in response to the
reactions of peers during everyday social interactions [18].
Recent research demonstrates that users value the support
of augmented social interactions on Google Glass [33].
This body of work shows that wearable technologies are
appropriate tools to augment social communication for
individuals with autism. Building on this past work, we apply
the concept of augmenting social interaction with
immediate visual feedback on pitch and volume for adults
with autism. This visual augmentation allows users to
comfortably and privately monitor their behavior in a
natural conversational setting. No tools to date address the
difficulties with prosody for individuals with autism during
face-to-face conversations in real life situations.
SAYWAT DESIGN PROCESS
Our design process aimed to explore the manner in which
adults with autism might use and experience wearable
technologies to improve their face-to-face communication
style. After identifying prosody as a key issue, we then
worked to understand whether detecting atypical prosody
was even possible in real time in such a way that an
intervention can be delivered. These questions inherently
require answers that can only be found in use. Thus, our
wearable designs had to work in real time without
distracting the wearer from the primary task.
The first author led social skills training sessions for
approximately twelve years. To supplement these first-person experiences, however, we conducted additional
empirical work to drive our design efforts. Specifically, we
conducted a series of interviews and design workshops with
experts to ensure appropriate designs of our technologies.
In this section, we describe the methods we undertook to
develop design guidelines that eventually led to the
implementation of SayWAT: a wearable assistive
technology that provides feedback to wearers about their
prosody during face-to-face conversations.
Fieldwork
We interviewed two psychologists with expertise in treating
adults with autism who commonly face social missteps
during real-life social situations. Interviews lasted between
90 minutes and two hours and were conducted in person at
the schools where the psychologists work. Interview topics
focused on understanding the supports and strategies used
to teach age-appropriate social skills. Specifically, the
psychologists explained how they teach conflict avoidance
as well as greater awareness of atypical prosody.
All field notes and interview transcripts were analyzed
together iteratively by a subset of the research team. A
combination of deductive and inductive analysis was used.
We explicitly looked for deviations from and agreement
with the social skills literature and best practices learned
from the first author's years as a practitioner. However, we also coded for emergent themes, particularly in relation to the potential for new designs and technologies that had not previously been possible. Using these results,
we developed scenarios depicting the context of
intervention and personas illustrating potential users.
Design Sessions
Building on the results of our interviews and observations,
we identified prosody as a key issue that has thus far been
underexplored in social skills supports. We then assembled
a multidisciplinary design team, which included two school
psychologists, an acoustic engineer, a graphic designer, a
speech and language pathologist, a computer science
graduate student, and an information science graduate
student who is also a former autism treatment specialist. In
the US and Mexico, we conducted seven multidisciplinary
design sessions during which teams met in subgroups over
the course of nine weeks. In total, these sessions lasted
approximately 12 hours and produced dozens of sketches of
early stage prototypes as well as a variety of graphics to be
considered for use in our eventual system. Extensive field
notes were taken during and immediately following each
design session, and the artifacts were retained for analysis.
We iteratively crafted our design guidelines, using a
constant comparative method across our interview,
observation, and design session data. We used open and
axial coding and micro-analysis of the interview data from
our formative study. All codes were integrated in an affinity
diagram that was further discussed by the team during
several sessions. For each design session, we used this
affinity diagram as input, and the design team envisioned
low-fidelity prototypes taking into account the results
portrayed in the diagram. As we encountered specific
design and technical concerns, we returned to our data,
crafted several alternative solutions, conducted brief trials
with research team members, and then modified our design
guidelines in light of technical constraints and experiences
with the prototypes. This iterative approach resulted in the
principles below.
DESIGN GUIDELINES
We developed five key design guidelines specific to the
creation of SayWAT. In this section, we describe those
principles.
1. Focus on awareness, not instruction.
Given the challenges seen in the clinical and educational
literature around teaching people about atypical prosody
[9,21], it is perhaps unsurprising that we often observed
students unable to understand how to “appropriately” use
prosody. Statements like “Change the rhythm of your voice
when you are happy” are extremely hard to act on in
practice. Additionally, feedback like “Speak higher when
you are asking a question” requires that the person giving
the feedback knows that the utterance will be a question,
which is often indeterminable until after the utterance has
been made. Thus, we recognized early in our design process
the need to focus on supporting awareness of atypical
prosody for the user rather than direct instruction. In this
way, users can understand their own behaviors and possibly
even experiment with modifications.
Additionally, therapists we interviewed stressed the need
for this awareness to be built during the social interaction.
Existing therapeutic practices and social skills interventions
already do a fairly good job in helping students to reflect on
their performance after an interaction. However, currently
little is possible in the way of support during the interaction
without substantially disrupting the conversation through,
for example, therapist intervention. In our design, we
focused on providing real-time feedback through a
wearable platform. In this way, the user can begin to build
awareness visually during conversations without disruption
of those interactions. By simply being worn, such a device
can provide information that allows the wearer to observe
the social misstep in a natural context. Ideally, by building
real-time awareness, wearers can then develop their own
strategies to respond to these alerts rather than having to
interpret specific instruction on the fly.
2. Make alerts rapidly understandable.
Given the goal of providing awareness in real-time, some
mechanisms for specific feedback are required. In our
design sessions, we explored the idea of using visual, audio,
and even haptic feedback for conversation missteps. As the
audio channel is already occupied with understanding and
attending to tone of voice, volume, and other audio cues in
the conversation, visual feedback for social communication
is recommended [6]. Haptic feedback can be challenging as
well during social conversations, particularly when trying to
convey a variety of information. Thus, although the visual channel is also crowded, with body language and facial expressions, we chose to focus our attention on providing understandable visual feedback that would not distract too much from that channel or the conversation at hand.
Because alerts could easily distract the wearer from the
primary task of interacting with another person, visual cues
must be able to be consumed or ignored quickly. Using
simple shapes, figures, colors, or minimal text can
particularly support people with autism who often have
difficulty ignoring details [6] or filtering out information in
their environment [22]. Alongside the need to provide
feedback frequently but not too frequently, this feedback
must be displayed for an appropriate amount of time to be
interpreted without being overwhelming. At some point,
providing awareness of a pervasive issue on a constant
basis will likely create information overload.
3. Provide feedback only when needed.
Individuals receiving social skills training often struggle
with the amount of feedback they receive. In our
observations, students commonly needed to stop the role-
playing activity to attempt to process all of the information
being thrust upon them. Likewise, teachers and therapists
can have trouble identifying when a student is struggling
and needs information about performance and when to stay
quiet and let the practice session play out. Technologies are
notoriously unhelpful when it comes to knowing when not
to bother users with “help” and often require some shared
agency with the human user [14]. At the same time, not
providing support when it is needed—or expected—
emerged as nearly as problematic a concern for interview
participants as providing too much support.
Alerts that are rare and attention grabbing could pressure
the wearer to act upon them at each occurrence, without
necessarily knowing how to do so. On the other hand, alerts
that are too frequent may lead wearers to ignore them
altogether, presuming an error in the system or just getting
overloaded by the feedback. Thus, in our work, we aimed to
provide continuous monitoring of the audio stream. As
described in the following implementation section,
SayWAT ultimately supported this continuous monitoring
for two dimensions of prosody (volume and pitch) as a first
step towards a comprehensive solution for monitoring
social communication.
4. Support data collection and reporting for both users
and clinicians.
Collecting clinically meaningful data may be in tension
with what information the user is interested in receiving in
live conversations. Additionally, although clinicians we
interviewed wanted massive amounts of data—including
the audio recordings themselves—considerations around
the privacy of users (and their conversation partners) [21]
with potentially sensitive personal information need to be thought through. Who can see the alerts? Who knows the
alerts are occurring? Who has access to performance data
after the interaction? How much should a conversation
partner know about the assistive device?
To support sensitive data collection for any kind of assistive
technology, designers must consider the privacy and
security levels the platform permits. In the case of prosody
support, recording of audio in particular can be sensitive or
even illegal [1,13,31]. Thus, the system should store as little
conversation audio as needed to process the prosody levels.
Although clinicians invariably request support for
performance data of any intervention, this data collection
must be secondary to the primary task [2,31]. To support
data collection without distracting from the primary task,
systems to support prosody interventions should collect
meta-data regarding pitch range and volume and alert
triggering in the background.
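As one hedged sketch of this guideline, the snippet below logs only derived meta-data (a timestamp, pitch range, volume, and any alert) and never retains the audio buffer itself. The record format and names are our own illustration, not SayWAT's implementation.

import json
import time

def log_frame(pitch_range_hz, rms, alert, logfile="session_log.jsonl"):
    # Append one meta-data record; the raw audio is never written out.
    record = {
        "t": time.time(),              # when the analysis window was processed
        "pitch_range_hz": pitch_range_hz,
        "rms": rms,
        "alert": alert,                # e.g., None, "flat", or "loud"
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")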
5. Build self-efficacy over time.
The long-term goal of many assistive technologies and
interventions is not to improve skills just while wearing the
system but ideally to build those skills without need for the
device. In terms of social interactions, awareness as noted
above and confidence in one’s ability to respond to that
self-awareness are major steps towards improvement. A
variety of strategies are available for creating confidence
and feelings of self-efficacy, including allowing the users to
determine their own goals and to customize thresholds for
intervention based on those goals.
Additionally, it may take time for the alerts to be
meaningful to the wearer. Thus, practice over time—both of
the skills supported and in using the device itself—can
make a dramatic difference in the acceptability and
usefulness of the system. Systems should allow users to
easily navigate out of an assistive mode for brief periods of
time to allow wearers to practice their skills both with and
without the support without alerting their conversation
partner to whether or not they are receiving alerts. Over
time, it is our expectation that wearers might use the system
less and less frequently but have it available in case they
need it. In this way, they may better be able to internalize
the initial awareness provided by such a support.
THE SAYWAT SYSTEM
Based on the results of our fieldwork and design sessions,
we developed a prototype system that provides awareness
of prosody missteps during face-to-face social interactions.
SayWAT encourages micro-interactions using a hands-free, heads-up Google Glass display, in two modes:
• Volume mode: users receive an alert when their
volume is “too high”; and
• Pitch mode: users receive an alert when their pitch
range is “flat.”
To support rapid interpretation—and potential dismissal—
SayWAT provides either iconography or a single word for
rapid processing of the feedback (see Figure 2). For
example, when SayWAT detects that the user is
substantially louder than the ambient sound, it displays a
voice meter animation with a color spectrum from green to
yellow to red (Figure 2a left). Similarly, users receive an
alert when their pitch range is atypically small in the form
of the single word “flat” in white text on top of a black
background (Figure 2b right). This design leverages recent
recommendations to use a simple static word for
nonverbal communication support while speaking [7]. The
feedback loop focuses on opportunities to improve, similar
to a sign held up for “um” during a speech. We found in
early trials that constant feedback was too overwhelming,
and users preferred warnings only.
Alerts are automatically dismissed once the system either
detects a change indicating the user has corrected the issue
or a timeout window has been reached. Specifically, as
recommended by research on micro-interactions [25], we
defined a three-second-delay interval following each
prompt when users cannot be prompted again.
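The alert lifecycle just described can be summarized as a small state machine. The sketch below is our reading of that behavior, with names of our own choosing rather than SayWAT's actual code: an alert is dismissed early if the misstep stops, times out after three seconds otherwise, and is followed by a three-second interval during which no new prompt fires.

import time

class AlertManager:
    DISPLAY_S = 3.0   # maximum time an alert stays on the display
    COOLDOWN_S = 3.0  # interval after a prompt during which no new alert fires

    def __init__(self):
        self.active_since = None   # time the current alert appeared, if any
        self.cooldown_until = 0.0  # earliest time the next alert may fire

    def update(self, misstep_detected, now=None):
        # Return True while an alert should be visible on the display.
        now = time.monotonic() if now is None else now
        if self.active_since is not None:
            corrected = not misstep_detected
            timed_out = now - self.active_since >= self.DISPLAY_S
            if corrected or timed_out:
                self.active_since = None
                self.cooldown_until = now + self.COOLDOWN_S
                return False
            return True
        if misstep_detected and now >= self.cooldown_until:
            self.active_since = now
            return True
        return False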
Because there is no clear definition for “flat” or “loud” for
young adults with autism, we empirically defined alert
thresholds to establish ground truth. We conducted audio
analysis of data from an available sample from previous
work [12] to determine a parameter for pitch range from a
sample population. We analyzed 52 minutes of audio
samples from interview recordings of 14 adults. As pitch
range differs between men and women, we grouped these
recordings by gender and diagnostic label. From audio
samples of seven young men with autism and three without, we extracted the standard deviation from the mean pitch range derived from the fundamental frequency for the neurotypical adults (mean SD = 31 Hz, range of SD = 23-38 Hz) and from the adults with autism (mean SD = 18 Hz, range of SD = 12-29 Hz). We also examined the acoustic recordings of one woman with intellectual disabilities (SD = 22 Hz) and three typically developing women (mean SD = 48 Hz, range of SD = 39-60 Hz). Based on this analysis, we defined "flat" prosody as having frequency variation of less than 25 Hz for two consecutive seconds. We defined "loud" events as those that have amplitude three times higher than the average amplitude of the conversation for at least three seconds (see Table 1).

Figure 2: An individual with autism (shown with his permission) during the field study receiving on his Google Glass (left) a volume alert when speaking "too high" and (right) a pitch alert when talking "flat."
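As a concrete reading of the threshold derivation above, the sketch below computes per-speaker pitch variation (the standard deviation of the f0 contour) for each group and places a cutoff between the group means. The paper does not state the exact rule used to arrive at 25 Hz, so the midpoint below, like the per-speaker sample values, is our assumption for illustration.

import numpy as np

def pitch_sd_hz(f0_contour):
    # Standard deviation of voiced f0 values for one speaker, in Hz.
    voiced = f0_contour[f0_contour > 0]
    return float(np.std(voiced))

# Hypothetical per-speaker SDs chosen to echo the reported group statistics
# (neurotypical mean SD of about 31 Hz; autism mean SD of about 18 Hz).
neurotypical_sds = np.array([23.0, 32.0, 38.0])  # mean = 31.0 Hz
autism_sds = np.array([12.0, 14.0, 28.0])        # mean = 18.0 Hz

threshold = (neurotypical_sds.mean() + autism_sds.mean()) / 2
print(threshold)  # 24.5 Hz, close to the paper's 25 Hz "flat" cutoff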
SayWAT records audio and immediately processes it
without storing the raw data. This design choice reflects our
concern about privacy and confidentiality, as noted above,
and limitations on the device itself. If audio were stored
alongside its meta-data, only very short session data could
be collected, minimizing the potential benefit for later
analysis by clinicians and researchers.
To detect thresholds at which alerts need to be triggered,
SayWAT uses a Hamming window function to cluster two
seconds of audio signals from which it extracts features. To
detect “loud” episodes, SayWAT calculates the root mean
square (RMS) from the signal amplitude. To detect “flat”
episodes, SayWAT calculates the pitch range from the
fundamental frequency. The detection of the fundamental
frequency on the device uses a pitch extraction algorithm based on the YIN algorithm [5], using a sampling rate of 8 kHz and 16-bit depth for the audio analysis.
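A hedged sketch of this per-window feature extraction follows, substituting librosa's YIN implementation for the on-device extractor (librosa applies its own per-frame windowing internally, standing in for the Hamming-window step). The sampling rate and window length mirror the text; the fmin/fmax bounds and function names are our assumptions.

import numpy as np
import librosa

SR = 8000      # sampling rate reported in the text (8 kHz)
WINDOW_S = 2   # two-second analysis window

def extract_features(samples):
    # samples: two seconds of mono float audio at 8 kHz.
    rms = float(np.sqrt(np.mean(samples ** 2)))          # amplitude RMS, for "loud"
    f0 = librosa.yin(samples, fmin=60, fmax=400, sr=SR)  # YIN pitch track, per frame
    pitch_range = float(f0.max() - f0.min())             # pitch range, for "flat"
    return {"rms": rms, "pitch_range_hz": pitch_range}

# e.g., features = extract_features(np.random.randn(SR * WINDOW_S) * 0.01)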
To ensure that alerts are understood but not bothersome,
SayWAT uses thresholds both for when to provide the
information and for how long to display it, with three
seconds as the maximum time any alert is shown. Alerts are
dismissed in less than three seconds if the user corrects the
speech in that time.
SayWAT Scenario of Use
In this scenario, an adult with autism uses SayWAT in a
social interaction, demonstrating some of the key
functionality in the system.
Jane, nervous about meeting her college roommate for the first time, picks up her SayWAT device and places it over her glasses. She adjusts the device so that the screen is in
full view and taps the side of the device to select “volume”
mode, as she has been known to speak so loudly that people
complain. She walks into her dorm room to find Lucy, her
new roommate awaiting her. Lucy greets her, and SayWAT
measures the ambient noise in the room—both Lucy’s voice
and the music she is playing in the background. When Jane
begins to talk, SayWAT measures her volume against the
room, finding Jane to be too loud. SayWAT delivers an alert
in the form of a red speaker with several lines showing. As
she adjusts her volume and speaks more softly, the alert
disappears. After an hour of chatting and unpacking, Jane
has become much more comfortable matching her volume,
and SayWAT is rarely supplying alerts anymore. So, Jane
decides to check in on her pitch, knowing she can tend to
sound a bit bored, and she wants Lucy to know how excited
she is. She taps the side of her device, switching modes, and
begins talking again. A “flat” alert immediately pops up.
The alert is displayed for three seconds then fades away, as
Jane was not able to modulate her voice during that time
frame. After the alert, the system rests for three seconds and
then alerts her again. After a few alerts, she successfully
modulates her voice, and Lucy, noticing the change in her
emotion, smiles at her new friend.
EVALUATION
To address whether SayWAT can accurately detect atypical
prosody and intervene when detected, we conducted a
laboratory study of short conversations between individuals
with autism and those without. Additionally, this
experimental study provided evidence that the intervention
can be efficacious in supporting prosody improvement.
Finally, to validate our design guidelines and assess the
acceptability and perceived utility of the system, we
supplemented the laboratory study with exit interviews and
conducted a second more naturalistic study with ten adults
with autism attending an employment-training program in
an adult education facility. We describe the methods used in
each of these studies and our combined results here.
Methods
Experimental Evaluation
We assessed feasibility of our approach and technical
accuracy through a within-subjects study with four young
adults with autism (one woman, aged 22-25) interacting
socially with research volunteers from UC Irvine.
Participants were recruited from our local contacts; the
gender ratio of 5:1 is fairly representative of the distribution
of autism in the population [3]. They were not screened for
specific prosodic concerns, because there are limited tools
to measure prosody in a standardized way [8]. However, all
participants exhibited challenges with social interactions.
The behavioral research lab in which this study took place
contains five individual rooms that accommodate one to
two people in an interview setting and a configurable group
conference room. All rooms are equipped with audio and
video recording equipment connected to a control room that
is equipped with a camera remote controller, recording
controllers, and a post-production station.
Prosody (mode)    Feature                  Threshold                  Sustained time
Loud (volume)     Amplitude                RMS 3 times > average      For 2 seconds
Flat (pitch)      Fundamental frequency    Pitch range < 25 Hz        For 3 seconds

Table 1: Summary of features and thresholds defined for the detection of "loud" and "flat."
Participants first reviewed the study information and
provided consent. Each participant was invited to explore
the Google Glass device for as long as they chose to
become comfortable wearing it. For between one and five
minutes each, participants used the device (without the
SayWAT system running), and asked researchers usability
questions about the device. Then each participant interacted with one research volunteer serving as a conversational partner for five-minute sessions, randomly rotating
through the following three conditions, three times each:
• No treatment mode: participants wearing Google
Glass with the SayWAT software turned off.
• Pitch mode: participants wearing the system and
receiving text-based alerts when their voice was “flat”.
• Volume mode: participants wearing the system and
receiving visual alerts when their speech volume did not match the surrounding noise.
All sessions were audio and video recorded, and logs were
stored indicating every time an alert was triggered and for
how long that alert remained active. Both participants with
autism and their conversational partners completed a survey with
a five-point Likert scale (1 = inadequate, 5 = excellent) to
rate the quality of each five-minute conversation
immediately after completion.
Upon completion of the experimental portion of the study,
each participant was interviewed briefly (on average ten
minutes). These short and highly structured interviews
allowed us to follow up on anything interesting observed
during the experiments and included questions like: “If you
received alerts, what did you do?”, “Do you think the
system helped you modulate your voice?” and “Is there
something you wish wearable technology could do for
you?” All interviews were recorded and transcribed
verbatim for analysis.
Naturalistic Study
To establish the usability and acceptance of wearing the
device during face-to-face conversations, we conducted a naturalistic study in a real-world office setting. The
naturalistic study was conducted with ten young adults with
autism recruited from a work training program in Southern
California (mean age = 24.9, SD=3.7).
Participants first listened to an orientation to the system, and those who consented were then given the opportunity to use
SayWAT during face-to-face social conversations.
Participants were asked to participate as part of a “speed
dating” exercise with their coworkers. Each conversation
lasted approximately three to five minutes, with the
participants rotating amongst partners and sharing three
SayWAT systems amongst each other.
Analysis
We analyzed log data from the laboratory study by first
calculating the total and average of alerts each participant
received for each condition. This required one researcher to
manually coded all recorded videos for pitch and volume
events by speaker. Finally, we used a correlation test to
compare the perceptions that adults with autism had of the quality of each conversation, by condition, with those of their typical partners to provide a measure of ecological validity.
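Assuming Pearson's r (the paper reports r values but does not name the test), a minimal sketch of this comparison looks like the following; the rating arrays are invented for illustration.

from scipy.stats import pearsonr

wearer_ratings = [4, 5, 4, 3, 5, 4]    # hypothetical wearer quality ratings (1-5)
partner_ratings = [4, 5, 5, 3, 5, 4]   # hypothetical partner quality ratings (1-5)

r, p = pearsonr(wearer_ratings, partner_ratings)
print(f"r = {r:.2f}, p = {p:.3f}")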
As in the original design study, we used a constant
comparative method across our interview and observation
data from both the laboratory and naturalistic studies
together. We used a deductive approach to assess whether
and how our implementation of the design guidelines
through SayWAT met the needs of participants.
Additionally, we used an inductive approach to understand
new emergent design concerns and considerations for future
wearable assistive technologies.
Feasibility
To determine if the system was accurately detecting the
thresholds set for pitch and volume, we calculated the
agreement between the log entries and the video recording
values for pitch and volume. We found agreement to be
95.3% for pitch and 97.4% for volume. Since detecting a
threshold is independent of who is speaking, we determined
who was speaking by hand coding all sessions for speakers
and then summing up the number of alerts triggered by
participants as captured in the logs. System logs indicated that each participant received approximately 150 combined alerts across the pitch and volume sessions (M=22.08, SD=8.3), so each had ample opportunity to come in contact with awareness alerts about their prosody.
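The agreement figures above are consistent with a simple percent-agreement computation of the following form (a sketch with invented event flags, not the study's analysis code).

def percent_agreement(log_flags, video_flags):
    # Share of analysis windows on which the log and the video coding agree.
    matches = sum(a == b for a, b in zip(log_flags, video_flags))
    return 100.0 * matches / len(video_flags)

log_flags = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]    # hypothetical alert flags from logs
video_flags = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # hypothetical hand-coded flags
print(percent_agreement(log_flags, video_flags))  # 90.0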
Once assured that our approach was in fact feasible, as
described above, it was also important to understand the
ways in which the designs, formed from the five guiding
principles, impacted the social interactions observed in our
studies. In the following subsections, we describe the ways
in which SayWAT successfully addressed the design issues
as measured by its efficacy and the validation of the
guidelines through acceptability of the system as perceived
by potential users.
Efficacy
The ultimate goal of a system like SayWAT is to improve
social communication both within and outside of contexts
in which a person might use the system. Determination of
this kind of impact on outcomes, however, requires a large
sample size and long-term engagement with the system. As
a step towards understanding the potential for wearable
assistive technologies to impact this kind of social
communication, in this work, we assessed whether there
was an immediate effect of SayWAT in reducing volume
and pitch missteps during the laboratory study.
Specifically, we compared how many times a user hit the
threshold values for pitch and volume during each
conversation as a measure of atypical prosody in a given
conversation. Fewer such events over time would indicate
that the wearer is improving in their ability to match
volume to their conversation partner and in their ability to
vary pitch as is more typical. When in the volume condition
(receiving alerts about volume), users reduced the number
of times they hit the volume thresholds (too loud) over the
course of the three volume sessions (see Figure 3 left), but
there was no clear pattern of reduction in alerts across the
pitch sessions (see Figure 3 right).
Of course, the exact volume or pitch of an utterance matters
most in terms of the way in which the people in the
conversation perceive the speech. Thus, we also compared
ratings of the conversation’s quality by both the wearer and
his/her conversation partner. Participants rated the quality
of their conversations higher in the volume and no treatment conditions, and slightly less positively in the pitch
mode. Conversations in the volume mode were rated more
highly on a 5-point scale (M=4.38, SD=.80) than in the
other conditions (pitch M=4.1, SD=.71; baseline M= 4.2,
SD=.71). Additionally, scores from the participants and
their partners in the volume condition were highly
correlated (r=.94). No such relationship was found for the
no treatment condition (r=.10) or pitch condition (r=.64).
Although not entirely conclusive, these results indicate that
volume may be more easily addressed through an
intervention than pitch. Additionally, people with autism
may have an easier time assessing their own abilities in
terms of volume.
Usability and Acceptability
Even the most effective interventions will fail if people
receiving them do not find them to be useful and cannot
tolerate their delivery. Thus, in this work, we were also
interested in understanding how natural it was to use
SayWAT, whether the feedback delivered by the system
was understandable and actionable, and any other potential
issues that emerged through the use of the system.
Wearing an Intervention
Overall, participants reported that SayWAT was simple,
and sometimes enjoyable to use. In particular, a major
concern of our design was that wearers not be distracted
from their primary task of engaging in conversation.
“[it] was ok, I wouldn’t say distracting, interesting
seeing something just right there” -Laboratory
Participant, male, age 25 (LP3)
“[I] just let it go, I did fine, I think I really enjoyed
it”. –-Laboratory Participant, female, age 22 (LP4)
Nearly every participant across the two studies described
the system as fairly natural. They commented they were
able to blend its feedback into their interactions. For
example, one participant in the naturalistic study noted:
“sometimes I would see the volume-meter, I didn’t
even realize I had it on”. - Naturalistic study
participant, male, age 27 (FP1)
Despite the overall positive responses, however, there were
some concerns around wearing the Google Glass device
itself. For example, one participant in the naturalistic study
described the device as too prominent in his line of vision:
“your natural inclination is to go a little cross-eyed” -
Naturalistic study participant, male, age 24 (FP4)
These results indicate that for short-term engagement (e.g.,
one to two hours as in our study), visual wearable feedback
is likely to be appropriate. However, for longer-term use,
additional work is needed to understand the potential
opportunities and challenges of our approach.
Understanding and Responding to Feedback
Although our primary goal with SayWAT was to improve
awareness of one’s own social missteps, this awareness is
more useful when one can respond to it. Thus, we queried
participants not only about how well they understood the
feedback provided to them but also about how well they felt
capable of responding to it.
Figure 3: Laboratory participants' mean number of occurrences when participants met criteria for receiving an alert (what we call events) for volume (left) (i.e., too loud compared to the speech partner) and for pitch range (right) (i.e., too little variance over a three-second time window). Events are measured even when in a condition that does not trigger that alert for that event. Lower values indicate fewer events. The X-axis marks each of the three sessions dedicated to each condition.
Overall, most participants described appreciating the
feedback and being able to understand it.
“sometimes you can’t tell what’s going on with
yourself, you kinda need feedback and that’s what
[SayWAT] does”. -Naturalistic study participant,
male, age 27 (FP3)
“It wasn’t so bad, maybe it could help me out” -
Naturalistic Study participant, male, age 24 (FP2)
One participant explicitly asked for more support from the
system:
“directives, directions. Step through a conversation,
like it would be good to emote in a certain way.
Picture and symbols, that’s the strongest way to
translate an idea” – Naturalistic study participant,
male, age 24 (FP4)
Alerts were delivered only when a threshold was met, and
every participant received alerts in treatment sessions. Yet,
one participant reported not needing the alerts.
“For me, I am calm right now so it might be more
useful when I am excited…I wish it could do both
functions, I didn’t really need to be prompted”. -
Laboratory Participant, male, age 25 (LP2)
Two laboratory participants and one naturalistic study
participant reported seeing no alerts at all, despite
verification by us that the system was working and log
analysis indicating that alerts were indeed generated.
“I didn’t (look), I didn’t hear it beeping”.
-Naturalistic study participant, male, age 24 –FP2
As noted by this participant’s quote, this could be a
usability error on our part—he was expecting an audible
alert, and we generated only visual ones. Open questions thus remain about exploring additional channels for providing feedback.
Similarly, another participant noted that the system was
“very difficult, you have to put in right position to see the
whole screen”. - Naturalistic study participant, male, age 28
(FP6)
Despite these issues, both laboratory study participants who stated they saw no alerts (LP1 and LP2) had fewer atypical volume events in the volume condition (see Figure 5). As we probed further, in some cases the initial claim of
not seeing the alert changed to a discussion of not wanting
to see the alert. For example,
“[I] was wrapped in the conversation, didn’t really
want to look at it. ... would see flat a few times, like
200 times it was the volume thing, mostly ignore it.” -
Laboratory Participant, male, age 25 (LP1)
Participants described ignoring alerts for a variety of
reasons, but one theme that was common regarded the
difficulty of taking action in the pitch condition. In the
volume condition, however, the alerts seemed to be much
more understandable,
“I could see the volume, it made sense. It was very
natural, I just did what was natural in my normal
voice, if I tried to go up or down it seemed
unnatural...after getting adapted to it, the very low
ones were my partner, then the larger ones was me,
after figuring that out, it was much easier to do it”. -
Laboratory Participant, male, age 25 (LP3)
“(I saw the volume alert) somewhere in the yellow, it
felt normal”. – Naturalistic study participant, male,
age 24 (FP5)
The ability to comprehend volume more readily than pitch
feedback likely explains the demonstration of efficacy for
volume but not for pitch described in the previous section.
Potential for Collaborative Assistive Technologies
Although created for social situations, SayWAT is an
individual tool. A major challenge of delivering feedback
based on audio data, however, is that the other speaker
influences the information received. As described
previously, in the volume condition, this is quite explicit.
User speech volume is measured against the context of the
ambient volume, to the degree possible. In the pitch
condition, the conversation partner’s speech was intended
to be ignored. In both conditions, however, the limitations
of the Glass hardware made this an imperfect system,
sometimes delivering feedback when the wearer was not
speaking. Participants had varied responses to the receipt of
this feedback, but nearly all of them noted it:
“it would sometimes say flat but that’s when he was
talking.” – Laboratory Participant, male, age 25 LP2
In some cases, people even described wanting more such
feedback. For example, one participant suggested that the
feedback could be related to their conversational partner’s
interest in their interaction:
“If it could know a person’s interests due to the tone
of their voice that might be interesting. If it can pick
up if the person is bored, the person you are talking
to, it would be pretty interesting, like allow the person
speaking to shift gears to get to something they
know…who they are talking to”. – Laboratory
Participant, male, age 25 (LP3)
His interest in exploring the potential of an augmented social interaction leaves us with questions about how we
might adapt our design decisions to be flexible enough to
provide more structure for interested users. For example,
alerts about a conversation partner’s prosody could help the
wearer to identify nonverbal behaviors or even emotions in
their conversational partners. This type of exploration could
prove to be useful in understanding the conversational
partner’s state as well as the impact of one’s own prosody
in the conversation, and ultimately the communication
between them.
LIMITATIONS AND DISCUSSION
Interventions are most successful when the people receiving
them perceive them to be useful and can tolerate their
delivery. In our case, our team had a collective seventy
years of experience working with people with autism. Thus,
we initially relied on our past experiences and recognized
gaps in the research literature. However, by working with
potential end users earlier in the design process, we might
have found additional opportunities to develop supports or
changed our approach. Inherent to that concept is the need
to expand how and when HCI researchers engage with
populations who can be difficult to access or to draw into
research and design projects. In particular, in our case,
people with autism often struggle with abstract concepts.
Thus, creating new design processes that allow for
participatory and co-design without having to conceptualize
abstractions would improve the access of these individuals
to important research and design experiences.
Additionally, in mental health applications—whether
related to autism, depression, or any other experience—the
challenges of the condition can actually make it difficult for
a potential end user to see the worth of the technology.
Inherent to prosody challenges is that people with autism
often cannot hear the difference in their own prosody as
compared to others with whom they are speaking. Thus, an
individual with autism may not be aware of atypical
prosody much less be seeking a solution for it. These
tensions leave open questions for exploring novel design
and implementation methods, as well as potentially
marketing and communicating about such designs.
With these challenges and limitations as a caveat, here we
reflect on how the adults with autism responded to our
design in particular. First, in regard to focusing on
awareness, participants confirmed they saw the alerts and acknowledged the information was new and, in a few cases, asked for more information or instruction. Second, in
regard to making alerts rapidly understandable, participants
understood the alerts without distraction and were able to
explore their voice. However, some participants described
needing to decipher if the alert was related to their way of
speaking or that of their partner. In future versions, the system could be easier to understand by clearly identifying to whom the alert relates: the wearer or the conversation partner. This attribution may also have become easier over time. Perhaps further orientation to the
system at the start would have improved understanding
during the implementation in live conversations. Thirdly, in
regard to providing feedback only when needed, in general,
the participants described being able to see the alerts when
they wanted to see them and could tune them out when not
wanted. Fourthly, in regard to supporting data collection, users expressed interest in data attribution: knowing which data were their own and wondering about data that had to do with the other person. Lastly, in regard to building self-efficacy over time, we found evidence of users' willingness to explore and consider the utility of such
a system. These results indicate that for short-term engagement (e.g., the one to two hours of wearing that participants did in our study), visual wearable feedback is
likely to be appropriate. However, for longer-term use,
additional work is needed to understand the potential
opportunities and challenges of our approach.
CONCLUSION AND FUTURE WORK
Social skills training, an accessibility issue for adults with
autism [8], can be augmented with wearable assistive
technologies to support face-to-face conversations in real
time. We developed SayWAT, an application to support
awareness of two dimensions of atypical prosody—volume
and pitch—and evaluated it through two studies with adults
with autism.
Using a mixed method approach to understand the design
considerations for wearable assistive technologies to
support social awareness, we developed five key design
guidelines for the creation of a system to support face-to-
face conversations for adults with autism. We then used
these guidelines to design, develop, and evaluate SayWAT.
This work demonstrates that usage of the design guidelines
produces a successful system for automatically detecting
and providing feedback on atypical prosody. Our evaluation
shows the system can be usable by adults with autism with
minimal training while maintaining sociability.
This work contributes empirically derived and validated
recommendations for designing wearable assistive
technologies. In particular, we demonstrated how ephemeral interfaces on a heads-up display can fill the ambiguous dual role of being noticeable and actionable to the wearer when needed but subtle when not.
SayWAT goes beyond the current literature in that the system extends prosody support beyond the therapy session into natural interactions. Glass, although not yet widely adopted, is a less stigmatizing solution compared to traditional assistive devices. In a future version, the
system could be less visible and customizable for each user
to provide the feedback most desired at the time, thus
allowing for individualized exploration of the neurotypical
social world.
ACKNOWLEDGMENTS
We thank all of the individuals who participated in this study—the
participants, their families; research assistants Kimberley Nachelle
Klein and Korayma Arriaga, Juan Leon and NFAR; STAR lab
members. We thank Hana lab for use of their lab space. We
acknowledge the generous support of Google research. This
research was approved by human subjects protocol 2015-1917.
This work is supported by Robert and Barbara Kleist, the
UCMexus-CICESE Graduate Student Short-Term Research,
CONACYT Fellowship of the second author, and the UCMexus
Collaborative Grant of the last two authors.
REFERENCES
1. Gregory D. Abowd, Gillian R. Hayes, Giovanni
Iachello, Julie A. Kientz, Shwetak N. Patel, Molly M.
Stevens, and Khai N. Truong. 2005. Prototypes and
Paratypes: Designing Mobile and Ubiquitous
Computing Applications. IEEE Pervasive Computing
4, 4: 67–73.
2. Gregory D. Abowd and Elizabeth D. Mynatt. 2000. Charting past, present, and future research in ubiquitous computing. ACM Transactions on Computer-Human Interaction 7, 1: 29–58.
3. Jon Baio. 2015. Prevalence of Autism Spectrum
Disorder Among Children Aged 8 Years — Autism
and Developmental Disabilities Monitoring Network,
11 Sites, United States, 2010. Retrieved from cdc.gov
4. Jérémy Bauchet, Hélène Pigot, Sylvain Giroux, Dany
Lussier-Desrochers, Yves Lachapelle, and Mounir
Mokhtari. 2009. Designing judicious interactions for
cognitive assistance. In Proceedings of the eleventh
international ACM SIGACCESS conference on
Computers and accessibility - ASSETS ’09, 11.
5. Alain de Cheveigné and Hideki Kawahara. 2002. YIN,
a fundamental frequency estimator for speech and
music. The Journal of the Acoustical Society of
America 111, 4: 1917.
6. P. Crissey. 2009. Teaching Communication Skills to Children with Autism. Attainment.
7. Ionut Damian, Chiew Seng Tan, Tobias Baur,
Johannes Schöning, Kris Luyten, and Elisabeth André.
2015. Augmenting Social Interactions: Realtime
Behavioural Feedback using Social Signal Processing
Techniques. In Proceedings of the 33rd Annual ACM
Conference on Human Factors in Computing Systems,
565–574.
8. Joshua J. Diehl and Rhea Paul. 2009. The assessment
and treatment of prosodic disorders and neurological
theories of prosody. International Journal of Speech-
Language Pathology 11, 4: 287–292.
9. Lizbeth Escobedo, David H Nguyen, LouAnne Boyd,
Sen Hirano, Alejandro Rangel, Daniel Garcia-Rosas,
Monica Tentori, and Gillian Hayes. 2012. MOSOCO :
A Mobile Assistive Tool to Support Children with
Autism Practicing Social Skills in Real-Life Situations.
In Proceedings of the Conference on Human Factors
in Computing Systems: 2589–2598.
10. Peter F. Gerhardt and Ilene Lainer. 2011. Addressing
the Needs of Adolescents and Adults with Autism: A
Crisis on the Horizon. Journal of Contemporary
Psychotherapy 41, 1: 37–45.
11. Mohammad Ghaziuddin, Neera Ghaziuddin, and John
Greden. 2002. Depression in persons with autism:
Implications for research and clinical care. Journal of
autism and developmental disorders 32, 4: 299–306.
12. Gillian R. Hayes, V. Erick Custodio, Oliver L.
Haimson, Kathy Nguyen, Kathryn E. Ringland, Rachel
Rose Ulgado, Aaron Waterhouse, and Rachel Weiner.
2015. Mobile video modeling for employment
interviews for individuals with autism. Journal of
Vocational Rehabilitation 43, 3: 275–287.
13. Gillian R. Hayes, Shwetak Patel, Khai N. Truong, Giovanni Iachello, Julie A. Kientz, Rob Farmer, and Gregory D. Abowd. 2004. The Personal Audio Loop: Designing a Ubiquitous Audio-Based Memory Aid. In Mobile Human-Computer Interaction, 168–174.
14. Eric Horvitz. 1999. Principles of mixed-initiative user
interfaces. In Proceedings of the SIGCHI conference
on Human Factors in Computing Systems, 159-166.
15. Leo Kanner. 1943. Autistic Disturbances of Affective Contact. Nervous Child 2: 217–250. Retrieved from http://neurodiversity.com/library_kanner_1943.pdf
16. Mark Knapp, Judith Hall, and Terrence Horgan. 2013.
Nonverbal Communication in Human Interaction.
Cengage Learning.
17. Scott Lindgren and Alissa Doobay. 2011. Evidence-
Based Interventions for Autism Spectrum Disorders.
18. Miriam Madsen, Rana El Kaliouby, Matthew
Goodwin, and Rosalind Picard. 2008. Technology for
just-in-time in-situ learning of facial affect for persons
diagnosed with an autism spectrum disorder.
Proceedings of the 10th international ACM
SIGACCESS Conference on Computers and
Accessibility, 19–26.
19. Susan Dickerson Mayes, Angela A. Gorman, Jolene
Hillwig-Garcia, and Ehsan Syed. 2013. Suicide
ideation and attempts in children with autism.
Research in Autism Spectrum Disorders 7, 1: 109–119.
20. Carla A. Mazefsky and Donald P. Oswald. 2007.
Emotion perception in Asperger’s syndrome and high-
functioning autism: the importance of diagnostic
criteria and cue intensity. Journal of autism and
developmental disorders 37, 6: 1086–95.
http://doi.org/10.1007/s10803-006-0251-6
21. David H. Nguyen, Gabriella Marcu, Gillian R. Hayes,
Khai N. Truong, James Scott, Marc Langheinrich, and Christof Roduner. 2009. Encountering SenseCam:
Personal recording technologies in everyday life.
Proceedings of the 11th International Conference on
Ubiquitous Computing: 165–174.
22. Rhea Paul, Amy Augustyn, Ami Klin, and Fred R.
Volkmar. 2005. Perception and Production of Prosody
by Speakers with Autism Spectrum Disorders. Journal
of Autism and Developmental Disorders 35, 2: 205–
220.
23. Ronald J. Seiler. 2007. Assistive Technology for
Individuals with Cognitive Impairments. The Idaho
Assistive Technology Project, Center on Disabilities
and Human Development, University of Idaho,
Moscow.
24. M. D. Rutherford, Simon Baron-Cohen, and Sally Wheelwright. 2002. Reading the Mind in the Voice: A
Study with Normal Adults and Adults with Asperger
Syndrome and High Functioning Autism. Journal of
autism and developmental disorders. 32, 3: 189-194.
25. M. Satyanarayanan. 2004. Augmenting cognition.
IEEE Pervasive Computing 3, 4–5.
26. P. T. Shattuck, S. C. Narendorf, B. Cooper, P. R.
Sterzing, M. Wagner, and J. L. Taylor. 2012.
Postsecondary Education and Employment Among
Youth With an Autism Spectrum Disorder.
PEDIATRICS 129, 6: 1042–1049.
http://doi.org/10.1542/peds.2011-2864
27. Lawrence D. Shriberg, Rhea Paul, Lois M. Black, and Jan P. van Santen. 2011. The hypothesis of apraxia of
speech in children with autism spectrum disorder.
Journal of autism and developmental disorders 41, 4:
405–26.
28. Lawrence D. Shriberg, Rhea Paul, Jane L. McSweeny,
Ami Klin, Donald J. Cohen, and Fred R. Volkmar.
2001. Speech and prosody characteristics of
adolescents and adults with high-functioning autism
and Asperger syndrome. Journal of Speech, Language,
and Hearing Research 44, 5: 1097–1115.
29. Elizabeth Schoen Simmons, Rhea Paul, and Frederick
Shic. 2014. The Use of Mobile Technology in the
Treatment of Prosodic Deficits in Autism Spectrum
Disorders. Retrieved September 22, 2015 from
http://digitalcommons.sacredheart.edu/speech_fac/114/
30. Monica Tentori and Gillian R. Hayes. 2010. Designing
for interaction immediacy to enhance social skills of
children with autism. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems,
1-8, 51.
31. Khai N. Truong and Gillian R. Hayes. 2007.
Ubiquitous Computing for Capture and Access.
Foundations and Trends® in Human–Computer
Interaction 2, 2: 95–171.
32. S. W. White, A. Scarpa, C. M. Conner, B. B. Maddox,
and S. Bonete. 2014. Evaluating Change in Social
Skills in High-Functioning Adults With Autism
Spectrum Disorder Using a Laboratory-Based
Observational Measure. Focus on Autism and Other
Developmental Disabilities.
33. Qianli Xu, Michal Mukawa, Liyuan Li, Joo Hwee Lim,
Cheston Tan, Shue Ching Chia, Tian Gan, and
Bappaditya Mandal. 2015. Exploring users’ attitudes
towards social interaction assistance on Google Glass.
In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, 9–12.