IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 21, NO. 1, JANUARY 2013 23
A Voice-Input Voice-Output Communication Aid
for People With Severe Speech Impairment
Mark S. Hawley, Stuart P. Cunningham, Phil D. Green, Pam Enderby, Rebecca Palmer, Siddharth Sehgal, and
Abstract—A new form of augmentative and alternative
communication (AAC) device for people with severe speech
impairment—the voice-input voice-output communication aid
(VIVOCA)—is described.The VIVOCA recognizes the disordered
speech of the user and builds messages, which are converted into
synthetic speech. System development was carried out employing
user-centered design and development methods, which identified
and refined key requirements for the device. A novel methodology
for building small vocabulary, speaker-dependent automatic
speech recognizers with reduced amounts of training data, was
applied. Experiments showed that this method is successful in
generating good recognition performance (mean accuracy 96%)
on highly disordered speech, even when recognition perplexity
is increased. The selected message-building technique traded
off various factors including speed of message construction and
range of available message outputs. The VIVOCA was evaluated
in a field trial by individuals with moderate to severe dysarthria
and confirmed that they can make use of the device to produce
intelligible speech output from disordered speech input. The
trial highlighted some issues which limit the performance and
usability of the device when applied in real usage situations, with
mean recognition accuracy of 67% in these circumstances. These
limitations will be addressed in future work.
Index Terms—Augmentative and alternative communication,
automatic speech recognition, dysarthria, voice output communi-
cannot use natural speech reliably to communicate, especially
with strangers . For instance, the speech of people with mod-
erate to severe dysarthria—the most common speech disorder
POKEN language communication is a fundamental factor
in quality of life, but as many as 1.3% of the population
Manuscriptreceived October 25, 2011; revised April02, 2012; accepted May
11, 2012. Date of publication August 03, 2012; date of current version January
04, 2013. This work was sponsored by the U.K. Department of Health New and
Emerging Application of Technology (NEAT) programme under Grant E090.
The views expressed in this publication are those of the authors and not neces-
sarily those of the Department of Health.
M. S. Hawley, P. Enderby, and R. Palmer are with the School of
Health and Related Research, University of Sheffield, S1 4DA Sheffield,
U.K. (e-mail: email@example.com; firstname.lastname@example.org;
S. P. Cunningham and S. Sehgal are with the Department of Human Com-
munication Sciences, University of Sheffield, S1 4DA Sheffield, U.K. (email:
P. D. Green is with the Department of Computer Science, University of
Sheffield, S1 4DA Sheffield, U.K. (e-mail: email@example.com).
P. O’Neill is with the Assistive Technology Team, Barnsley Hos-
pital NHS Foundation Trust, Barnsley, S75 2EP Barnsley, U.K. (e-mail:
Digital Object Identifier 10.1109/TNSRE.2012.2209678
affecting 170 per 100000 of population —is usually unintel-
ligible to unfamiliar communication partners. For these people,
their speech impairmentcan preclude them from interacting in a
manner that allows them to exploit their potential in education,
employment and recreation.
Speech impairment is often associated with severe physical
disabilities as a result of progressive neurological conditions
such as motor neurone disease, congenital conditions such as
cerebral palsy, or acquired neurological conditions as a result of
stroke or traumatic brain injury. Current technological tools for
communication, voice-output communication aids (VOCAs),
generally rely on a switch or keyboard for input. Consequently,
they can be difficult to use and tiring for many users, and they
do not readily facilitate natural communication as they are rela-
tively slow and disrupt eye contact . O’Keefe et al.  report
that users need a device which is physically easy to operate in
a wide range of positions and environments. Many people with
VOCAs often prefer to speak rather than use the aid, even if
their speech is largely unintelligible, as it is a more natural form
of communication . In addition, Todman et al. , found that
listeners rated users of a communication aid as more socially
competent if they had a more rapid rate of delivery. It is there-
fore desirable that a new communication aid retain, as far as
possible, the speed and, ideally, the naturalness of spoken com-
Despite its apparent attractiveness as an access method, the
potential complications of recognizing impaired speech have
meant the prospect of spoken access to technology remains un-
fulfilled. Commercially available automatic speech recognition
(ASR) systems can work well for some people with mild and
there is an inverse relationship between the degree of impair-
ment and the accuracy of speech recognition. For people with
severe speech impairment, commercial speech recognition sys-
tems are not a viable access solution. Moreover, the small-scale
laboratory experiments reported in ,  do not represent the
range of environmental conditions that are likely to be encoun-
tered in realistic usage, which is known to degrade recognition
Thus, while ASR has been used for many years as a method
of access to technology by some people with disabilities but
channel for VOCAs. Previous prototypes of voice-input voice-
have not been tested extensively with users or reached the stage
of becoming available as commercial products , . A dif-
ferent approach has been proposed by Wisenburn and Higgin-
1534-4320/$31.00 © 2012 IEEE
24IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 21, NO. 1, JANUARY 2013
botham who explored the potential for using speech recognition
in a VOCA to recognize the speech of a conversation partner
, . This process was then used to present suggested ut-
terances to the speech impaired user based on what their con-
versation partner had said.
In recent years there has been a realization that, for people
with severe speech impairment, an alternative approach to
ASR must be followed. The authors’ previous work has been
successful in developing speech controlled interfaces to home
control systems (also known as environmental control sys-
tems or ECS) for people with severe dysarthria . In this
work, we applied statistical ASR techniques, based on hidden
Markov models (HMMs), to the speech of severely dysarthric
speakers to produce speaker dependent recognition models,
and developed a novel methodology for recognizer-building.
This approach relied on a user-training phase in which the user
practised speaking to the recognizer, whilst receiving consistent
visual feedback based on the similarity between their current
attempt and the distribution of their previous attempts. This
enabled the user to become more efficient at producing the
target utterances, by reducing variation in their vocalizations,
while at the same time facilitating the collection of additional
speech examples that were then used to train the final recog-
nizer. These enhancements resulted in speech recognition being
a viable means of controlling assistive technology for small
input vocabularies, even for people with severe speech disor-
ders . More recently, Sharma and Hasegawa-Johnson 
have demonstrated that maximum a priori (MAP) adaptation
from speaker-independent ASR can improve recognition rates,
sometimes producing better performance than the equivalent
speaker-dependent ASR, though this has not yet been applied
in an assistive technology context.
This paper describes the development of a VIVOCA which
is intended to recognize and interpret an individual’s disordered
speech and deliver the required message in clear synthesized
II. SYSTEM DESCRIPTION
The development made use of a user-centred design and de-
considered the views of both potential VIVOCA users and of
speech and language therapists/pathologists who provide voice-
output communication aids . A wide range of user require-
ments were elicited and the VIVOCA was implemented to meet
these requirements where feasible. The development process
was iterative and the implementation was gradually refined by
testing developments with a group of four potential VIVOCA
users. These four users were people with moderate or severe
dysarthria, with one individual having additional verbal dys-
praxia. They tested the device at each stage and gave feedback
to allow us to improve the VIVOCA.
Fig. 1 shows a schematic of the system and its major com-
ponents. The user speaks into a microphone and the speech is
processed and recognized by a speech recognizer. The recog-
nized words are passed to a message building module. Depen-
dent on this input, the message building module will update the
screen, potentially supply audio feedback to the user, and deter-
mine the range of possible future inputs. This process continues
Fig. 1. Schematic diagram of the voice-input voice-output communication aid
in an iterative fashion as the user builds their message. When
the message is complete it is passed to the speech synthesizer,
producing intelligible spoken output via a speaker. The system
components are described below.
A. Speech Recognition
In prevailing methods, automatic speech recognition (ASR)
is based on statistical models (usually HMMs) of speech units.
These models are trained on a large corpus (perhaps hundreds
of hours) of data recorded by many speakers. For a large vo-
cabulary system, the speech units will be at the level of indi-
vidual speech sounds, phones. The resulting speaker-indepen-
dent recognizer can be adapted for an individual speaker, given
a small amount of enrolment speech data from that speaker.
However, this ASR technique is unsuitable for speakers with
severe speech disorders because the amount of material avail-
able for training is severely limited (as speaking often requires
great effort), the material is highly variable, often has a lim-
ited phonetic repertoire, and is too different from the “normal”
speech used in training speaker-independent models for many
conventional adaptation techniques to be of assistance. Instead,
we have introduced a new methodology for building small vo-
cabulary, speaker-dependent personal recognizers with reduced
amounts of training data. Using this approach, which we out-
line below, accurate recognition of severely dysarthric speech
has been shown to be feasible for relatively small vocabularies
Initial recordings were collected from the user. Depending
on their preference these were collected using either a headset
microphone (Sony-Erikson Akono HBH-300 headset con-
nected via Bluetooth), or a desktop microphone (Acoustic
Magic Voice Tracker array microphone), connected to a laptop
computer (Dell Inspiron 1100). Signals were sampled at 8
kHz, the maximum sampling frequency on the Bluetooth audio
The recordings consist of isolated productions of each of
the words that are required for the recognizer’s input vocabu-
lary. These examples are used to train the initial whole word
models. In this study we used HMMs with 11 states, with a
straight-through arrangement. The acoustic vectors were 12
Mel-frequency cepstral coefficients (MFCCs) derived from
a 26-channel filterbank with a 25 ms analysis window and
10 ms frame-rate. Energy normalization and cepstral mean
normalization were also applied to the input features. This is
a conventional ASR front-end. The models were trained using
the HMM toolkit  with the Baum-Welch algorithm.
HAWLEY et al.: A VOICE-INPUT VOICE-OUTPUT COMMUNICATION AID FOR PEOPLE WITH SEVERE SPEECH IMPAIRMENT
This approach is straightforward for a typical speaker, but it
is more problematic for the intended users of a VIVOCA due
to their speech impairment. This means there is a scarcity of
variability in the productions of speakers with dysarthria .
However, the approach described in detail in  is to use the
initial recordings to estimate models that can be incorporated
into a “user-training” application. This application prompts the
user repeatedly to speak each of the words in the initial recog-
nition vocabulary. Each utterance is recorded, but crucially the
user is given feedback on “closeness of fit” of each attempt to
their own recognition model. This is determined from the log
probability of the model generating the word by the most likely
path (computed by the Viterbi algorithm) , . The user is
also guided in this process by being able to listen to their “best
attempt so far,” so that they may attempt to replicate it. Here,
be most likely to generate. Previous studies have shown that
both listening to their previous best attempt, and repeated prac-
, and this has the additional benefit of providing additional
training examples for retraining recognition models .
At the conclusion of this user-training step, the recognition
models can be re-estimated using both the initial training exam-
ples, and subsequent examples collected with the user training
application, producing a recognizer which is more accurate and
robust to variations in the user’s speech. The process can, of
course, be repeated.
We have previously found that recognition accuracies above
80% for isolated words, and above 70% for commands (short
strings of words) are consistently attainable for small vocabu-
laries of severely dysarthric speech . Whilst home control
tasks can be carried out with a relatively small number of con-
trol inputs (and small input vocabulary of around 10–15 words),
supporting speech communication requires more flexibility in
its output and is therefore likely to require a larger input vocab-
ulary. For speaker-dependent recognition it is known that word
recognition accuracy falls with increasing vocabulary size ,
and this reduction is likely to be exacerbated when speech input
is highly variable, as is the case with dysarthric speech. There-
fore, a major challenge in this work is to be able to accommo-
date larger input vocabularies whilst retaining acceptable levels
of word accuracy.
B. Message Building
The message building module constructs messages, which
the user wishes to communicate, from the recognized input
words. Using input speech to drive output speech from a VOCA
presents a new challenge which has not been addressed in any
depth in previous research. The simplest, and in many ways the
ideal, form of message building, given that we are recognizing
word units, would be to recognize each word individually
and speak out the same word in a clearer (synthesized) voice.
However, since the accuracy of the recognition of severely
dysarthric speech decreases rapidly as the input vocabulary size
increases, this is not currently possible and we are in practice
constrained to find methods which generate meaningful mes-
sages but which require relatively small input vocabularies.
We need to constrain the “perplexity” of the recognition task,
that is the number of words which the recognizer must choose
between at a given point in the process. Doing this also has the
advantage of making it easier for the user to recall what the
recognizer will accept at any point.
Given this constraint, we considered a number of candidate
message building methods drawn from augmentative and alter-
native communication (AAC)  as well as from other means
of coding language, such as spelling or texting, for their suit-
ability to be used in the VIVOCA.
able to communicate . We define communication rate as the
number of words per unit time which are generated correctly
according to the intentions of the user. In this definition, the
correction of errors must be taken into account, and this is a
major consideration when dealing with speech recognition, as
error rates are high compared to most other input methods.
Modelling communication rate for different message
building methods showed that, in conditions of high recogni-
tion accuracy, methods with the larger input vocabularies give
greater communication rates, in line with expectations. How-
ever, due to the time cost of correcting errors, communication
rate falls off rapidly with decreasing recognition rate . By
combining the models of communication rate with information
on recognition accuracy for a range of input vocabulary sizes,
we are able to estimate the communication rate of the message
building methods for different levels of severity of dysarthria.
For individuals with mild and moderate dysarthria, the positive
relationship between input vocabulary and communication
rate is retained. However, for those people with more severe
dysarthria, this is not the case. As a result of the reduction of
recognition rate with increasing input vocabulary, the positive
relationship between input vocabulary and communication rate
no longer holds. For people with severe dysarthria, message
building techniques requiring a smaller input vocabulary are,
counter-intuitively, more efficient.
As part of the user-centred design and development process,
design meetings were held between the research team and po-
tential users at which different message-building methods were
considered, with knowledge of the modelled communication
rates. Users prioritized methods which tended to have high
communication rate. Some users, however, also regarded a
large output vocabulary as being vital regardless of its effect
on communication rate. We therefore decided to implement a
hybrid translation method as a combination of phrase building
and spelling, as follows.
Phrase building is used to generate frequently used phrases
requiring rapid generation, such as answering the phone,
conversational fillers or communicating immediate needs/prob-
lems. For example, inputting the sequence of words “want”
“drink” “water” could generate the phrase “Can I have a drink
of water please.” Using this approach in a structured way
greatly reduces the recognition perplexity. Spelling may be
used for the remainder of less-frequently used words allowing
unlimited output vocabulary where greater precision and con-
versational range are required, though at the cost of greater
perplexity and much lower communication rate.
26IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 21, NO. 1, JANUARY 2013
The components of the message building method, and the
input and output vocabularies were individually tailored to the
needs and wishes of each participant. For instance, they were
able to choose a configuration that could be used to map words
onto phrases or allow the spelling of words or a combination
of both. It is envisaged that, once the VIVOCA becomes more
widely available, this individualization will be carried out by
the user’s speech and language pathologist or similar clinical
C. Speech Synthesis
One of the user requirements for the system output was that
it should be possible to have both prerecorded and synthetic
output, and that the synthetic output should be as natural
sounding as possible.
To satisfy these requirements, the system software was de-
signed to work with both prerecorded output (in the form of
waveform files) and to interface with a speech synthesizer. As
with the process of speech recognition, speech synthesis is a
prototype we utilized a small footprint speech synthesizer de-
signed for mobile computer platforms called Flite , which
is a variant of the larger and popular synthesis system known as
Festival . A specially compiled version of the Flite software
was prepared for the Windows Mobile for Pocket PC operating
The acceptability of the output was evaluated with potential
users. Several users preferred prerecorded output for their
system, either because of the potential for delay introduced
by the computationally intensive synthesis process, or the
perceived poor quality of the synthetic speech. In some cases
we made use of the high-quality synthesis provided by the
Festival system to synthezize the required outputs and store
these as waveform files that could be played out on the device
D. Hardware and Software Implementation
In order to meet users’ requirements, the hardware upon
which VIVOCA was implemented needed to be small and light
and have a suitable visual interface. When the VIVOCA devel-
opment began, in 2005, the most suitable hardware was judged
to be a personal digital assistant (PDA). The models used in this
study were the HP iPAQ HX2700 running Windows Mobile
5.0 for Pocket PC. The PDA takes voice input from the user
via a microphone, which can either be head-worn (Bluetooth
or wired) or lapel-type, or the internal microphone of the PDA.
The PDA’s internal speaker was found to produce speech at
too low a volume for practical use in any but quiet ambient
conditions. Therefore, a separate amplifier and speaker were
used for the spoken output. Fig. 2 shows the PDA running the
VIVOCA application with input via a Bluetooth microphone.
The central processing units (CPUs) in PDAs do not have
support for rapid numerical computation, and havenodedicated
hardware for floating-point calculations. It is still possible to
perform floating-point operations on a PDA, however, they re-
quire software emulation of the dedicated hardware found on
more powerful processors. This emulation is much slower than
a dedicated unit and experimentation showed that this reduction
Fig. 2. A member of the project team demonstrating the prototype VIVOCA
device. The user is wearing a headset microphone, and the VIVOCA software
is running on the PDA mounted onto his wheelchair.
in speed introduced a significant overhead for speech recogni-
A solution is to use an alternative method to represent real
numbers—namely a fixed-point representation. A fixed-point
representation is a method for using binary integers to represent
fractional numbers, however, for any fixed-point representation
it is necessary to use a dedicated library of functions to perform
basic mathematical operations. Therefore, the first step that was
required, before any of the components of the system could be
implemented, was the development and testing of a library of
mathematical operations for our chosen fixed point representa-
We developed and tested a library for fixed-point arithmetic
which has a Q format Q18.14. The library consisted of the basic
operations on fixed point numbers (add, subtract, etc.), as well
as more complex operations such as logarithm, exponential and
the trigonometric functions, all of which are required for the
signal processing and speech recognition parts of the system.
A further consequence of the limitations of the CPUs avail-
able on PDAs meant that, for the prototype system, the software
to train recognition models and configure the aid was not lo-
cated on a PDA, rather on a conventional PC. This decision was
taken to expedite the time that would be required to train a set
of models on a device with such limited processing power.
1) User Interface—User Training Application: The first in-
terface that the user experiences is the user-training applica-
tion. Fig. 3 shows a series of example screen displays from
the PDA-based user training application. The interface initially
prompts the user to speak a word (panel A). When the user
the amount of the black circle that is filled with color. In Fig. 3
panel B, a high (89%) score is shown, and the circle is nearly
entirely filled; whereas panel C shows a low (37%) score.
2) User Interface—VIVOCA Application: In order to fa-
cilitate eye contact between communication partners, we
had originally envisaged that users would be able to use the
HAWLEY et al.: A VOICE-INPUT VOICE-OUTPUT COMMUNICATION AID FOR PEOPLE WITH SEVERE SPEECH IMPAIRMENT
Fig. 3. The user-training application interface. In panel A the user is prompted
to say the word “mike.” Panel B shows the result when the word has been rec-
ognized with a high level of accuracy. Panel C shows the result when the word
has been recognized with a low level of accuracy. The size of the shaded circle
is proportionate to the accuracy, and is filled green when the accuracy is greater
VIVOCA without referring to a screen, receiving essential au-
dible prompts via an earpiece. Our user requirements research,
however, indicated that users regarded a screen as important.
We have, therefore, implemented both screen-based and audio
interfaces. Fig. 4 shows the screen-based interface, in which
the available vocabulary is listed in the white space. The user
chooses the appropriate input word and the phrase, as it builds
up, is shown in the top panel. For the audio interface, each
available vocabulary item for each stage is presented sequen-
tially to the user as a reminder. Clearly, the audio interface can
become unwieldy for large vocabularies. In order to initiate
the recognition of a series of words (the input phrase) leading
to the generation of an output phrase, the user is required to
press a switch to indicate that the recognizer should begin to
“listen.” As the users are generally people with severe physical
disabilities, the switch is chosen and set up for each individual
The final prototype device was evaluated in a user trial de-
signed to assess the process of configuring the device for a new
user as well as the performance of the device in real communi-
where each of the stages assessed a particular aspect of the con-
figuration and usage of the device.
Ethical approval was obtained from Barnsley and North
Sheffield National Health Service (NHS) ethics committees
to recruit NHS patients into the trial. The principal inclusion
criterion was that the participants should have moderate or
severe speech impairment. Such impairment would result in
Fig. 4. The VIVOCA message building interface. The panels show an instan-
tiation of how a user builds up a sentence by speaking keywords to the device.
Panel A illustrates the “top-level” choice of words available to this particular
user. After saying the word “drink” the screen changes to that shown in panel
B. Saying the word “cup” causes the screen to change to that shown in panel
C. The user completes the utterance by saying “tea,” and as that corresponds to
the final possible choice the entire output phrase is then spoken by the speech
low conversational intelligibility, typically less than 50%. The
assessment of participants was conducted by a speech and
language therapist using the Frenchay Dysarthria Assessment
A total of nine participants took part in the evaluation, in-
cluding two of the participants from the development phase. In-
formation pertaining to the nineparticipants is shown in Table I.
The group of participants contained existing users of VOCAs,
as well as people who did not normally use any augmentative
or alternative communication method to support their commu-
nication. The group also contained participants with both pro-
gressive and stable speech impairment.
The stages of the evaluation are shown in Table II. At stage 1
the researcher discussed with the participant how they might
wish to use the VIVOCA device. Through these discussions a
set of possible outputs was identified to cover a range of usage
scenarios, and from these outputs the researcher defined a suit-
able vocabulary of input words that could be used to control the
device to produce the outputs. The participant was then given
time to reflect on the target output and the input vocabulary. On
a second visit the researcher was able to make modifications to
the proposed configuration in the light of the reflections of the
Once the participant was satisfied with the configuration,
recordings of their speech could be made (stage 2). The par-
ticipant was recorded speaking around 20 repetitions of each
of the words in the input vocabulary. To enable the participant
to complete this stage they were supplied with a laptop and
software to enable them to record their own speech in their
28IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 21, NO. 1, JANUARY 2013
PARTICIPANTS IN THE USER TRIAL
STAGES OF THE EVALUATION
own time. Once they had completed sufficient repetitions of
each of the words in the input vocabulary, the recordings
were downloaded from the laptop and used to train the initial
speech recognition system. At the completion of stage 2 the
recognition accuracy of the initial models for each participant
was tested using leave-one-out cross validation.
For stage 3, the participant used the user-training application
for a period of 2–4 weeks, during which they were asked to
practise for an hour a day where possible. Once completed it
was possible to compare the recognition accuracy of the initial
and final models.
The final system was configured and provided to the partic-
ipant for a period of familiarization (stage 4). During this pe-
riod the participant could identify changes they would like to
be made to the system, and build confidence by practising using
After the familiarization period, and any issues with the
functionality of the participant’s system had been identified and
remedied, the final user trial commenced (stage 5). Although
we had originally expected users to use a headset microphone
as an input device, in practice wearing a headset was difficult
for the users who had associated physical disability and all but
one (D1) chose to rely on the internal microphone of the PDA.
During the trial, participants were asked to maintain a diary
to record how they used the device and their general feelings
about the device. A researcher also regularly contacted the
participant to ensure no difficulties had been encountered.
At the end of the user trial, at stage 6, a series of evaluations
were conducted. These included testing the performance of the
system. For this test, each participant completed a number of
communication acts which involved them activating the device
to produce a desired output using spoken commands. In order to
be deemed successful, all the words in an input phrase needed
to be recognized correctly. Each act was repeated three times,
however for some participants it was not possible to complete
every possible output of the system. Some participants’ config-
urations (e.g., D2 and E1) had a large number of possible out-
puts and thishad obvious risksfor tiring the participant. In these
cases a random subset of the possible outputs was used. Partic-
ipants were also interviewed to obtain their views about the de-
vice and the trial they had completed.
During the evaluation trial four of the participants withdrew
from the project, three of whom had progressive speech disor-
ders. These occurred at different stages in the project, and for a
variety of reasons. One withdrew (E7) early in the evaluation as
they found the process of recording sufficient speech examples
to be too tiring. Two others withdrew (E4 and E6) as they were
unable to maintain sufficient volume when speaking to activate
speech recognition. Participant E2 withdrew after completing
the first four stages of the evaluation as, after the familiarization
stage, his speech production ability and general health deterio-
The participants identified a range of scenarios in which they
would like to use the device. Principally these were for com-
municating immediate needs (“I would like a drink, please”); or
communicating in social or commercial situations (“Please be
patient it may take me sometime to answer you.”) It is notice-
able that most of the participants settled on a relatively small
number of goal-oriented communications. Only 2 of the partic-
ipants wished to use the system in a less constrained manner by
having the facility to spell words into the system (D2 and E7).
Table III shows the recognition accuracy for the initial
(stage 2) and final (stage 3) recognition models for each of the
evaluation users to complete both stages.
After participants had completed the user training (stage 3)
the recognition accuracy improved for all participants com-
pared, with the exception of E3, whose recognition score
remained stable at 99%. This improvement is a result of both
the increase in the amount of data used to train the recognition
models and the reduction in variability the user has gained from
repeated practise of the words.
Table IV shows VIVOCA performance for four users who
completed all six stages of the evaluation.
The results show that the VIVOCA device performed better
when used for phrase building than when being used for
spelling. This is due to the fact that in the phrase building
mode the perplexity is relatively small, as it is equivalent to
the number of competing words, typically 3–10 in these trials.
Conversely, when the user is spelling input using the NATO
alphabet there are always at least 28 competing models (one for
each letter plus space and delete). The users who tried to use
the device for spelling (D1 and E1) both had lower accuracy
HAWLEY et al.: A VOICE-INPUT VOICE-OUTPUT COMMUNICATION AID FOR PEOPLE WITH SEVERE SPEECH IMPAIRMENT
RESULTS FROM FIRST THREE STAGES OF THE EVALUATION
PERFORMANCE OF THE SYSTEM TESTED AT THE END OF THE TRIAL
rates. In the case of D1 the accuracy was lower when spelling
compared to phrase building (62% versus 75%).
was good. All thought that the device offered the prospect of
easier and more rapid communication, though one did not think
that the device would be appropriate for her needs. Most felt
that the limited number of outputs of the trial device limited
its usefulness. They agreed that the device would become more
useful the more outputs it could produce.
Users commented that they sometimes found it harder to use
the VIVOCA to communicate than to use their usual commu-
nication method of either speaking or speaking supported by
a conventional VOCA. They all related this reduced ease of
communication to the accuracy of the speech recognition in
the VIVOCA. Several commented that they thought that device
could lead to improved communication if the accuracy of the
speech recognition was higher. One stated “I wouldn’t get frus-
trated if it got it right.” Participants also stated that the recogni-
tion errors meant that the device was sometimes slower to use
than their conventional VOCA.
A voice output communication aid controllable by automatic
speech recognition has been successfully produced and tested.
The development of the device followed a user-centred iterative
process, whereby a group of users evaluated each stage of the
development and this led to modifications and improvements
to the device. The eventual aim is to develop a VIVOCA de-
vice which can be commercialized and the design features that
emerged from this process should make the final device more
appropriate to a wider group of end users.
Our major challenge was to achieve accurate recognition of
disordered speech at the larger vocabulary sizes required to
achieve a practical VIVOCA. As shown in Table III, we have
been able to construct recognizers with recognition accuracy in
excess of 85%, for vocabulary sizes which are larger than pre-
viously reported for such speakers under similar test conditions
(e.g., ) and large enough to construct a basic VIVOCA.
This indicates that the methodology we have adopted to rec-
ognize dysarthric speech is viable for the task of producing
a voice-input voice-output communication aid, and this is a
significant finding of our research.
This study has, however, once more highlighted a major
difference in recognition accuracy in controlled conditions,
compared to the accuracy attained under realistic usage con-
ditions (comparing results in Tables III and IV). Some of the
reduction in performance is due to the fact that Table IV reports
results for the recognition of a string of input words, whereas
Table III reports results for single words. The misrecognition
of any word in the string sequence results in an incorrectly
completed task. However, due to the hierarchical nature of the
message-building process it would be impossible to quantify
this effect using a linear model of error accumulation. There are
other factors influencing performance which may not simply
be characterized as an example of the classic “training-testing
mismatch” that has beset speech recognition technology for
many years. In this study, the initial training recordings and
final evaluations were conducted in the participant’s home,
therefore there were not major differences between the acoustic
environments. There were, however, two notable differences
in usage conditions, which affect the recognition accuracy:
namely, the different microphones required and the need for
the user to press a switch to activate the recognizer. As, for
reasons of comfort and convenience, most of our users felt
unable to use a close-talking microphone for everyday usage,
instead choosing to use the PDA’s internal microphone, the
mismatch between the microphones used for training and
testing were a significant cause of degraded performance. In
addition, in using the VIVOCA, the user must press a switch
in order to indicate to the recognizer that it should expect to
receive and recognize a word or word string (an action which
is not required in the recording or user training applications).
This push-to-speak feature was introduced in order to reduce
the possibility of activation of the VIVOCA by environmental
30IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 21, NO. 1, JANUARY 2013
sounds. However, due to the nature of the physical disabilities
of our users, pressing a switch can require considerable effort,
often produces associated body and head movement, and can
even affect the quality of speech produced. This additional
head movement and degraded speech exacerbates the problem
of not being able to use close-talking microphones.
Previous work with speech-input home control systems had
similar practical issues  and demonstrated similar perfor-
mance degradation. Whereas for home control applications
some users found the level of performance acceptable, for
control of a VOCA, feedback from users has confirmed that
such performance is not acceptable and that higher accuracy at
larger vocabulary sizes is essential.
While users held a positive view of the technology concept,
they felt that in its current form it would not substantially add
to their communication ability or independence. However, the
majority of participants felt that, if the prototype could be im-
proved in terms of accuracy and range of output, it would be a
viable and useful aid to their communication. We conclude that
some fundamental practical issues of using speech recognition
with disabled users must be addressed before the VIVOCA can
become a viable tool.
Our future work will address two practical limitations of the
performance. The first of these will be to develop a VIVOCA
based on a platform which supports a better quality internal mi-
crophone, which is most users’ preferred option.
In a second development we aim to remove the necessity for
push-to-speak operation. We will introduce a “word-spotting”
ASR mode so that users are relieved of the burden of pressing
a switch each time they wish to speak. The expectation is that
this will also improve recognition accuracy.
This paper has described the development of portable, voice
output communication aid controllable by automatic speech
recognition. The device can be configured to enable the user to
create either simple or complex messages using a combination
of a relatively small set of input “words.” Evaluation with a
group of potential users showed that they can make use of the
device to produce intelligible speech output. The evaluation
also, however, highlighted several issues which limit the perfor-
mance and usability of the device, confirming that further work
is required before it becomes an acceptable tool for people with
moderate to severe dysarthria. Overcoming these limitations
will be the focus of our future research.
The authors wish to thank all the participants who gave up
their time to take part in this study.
 D. Beukelman and P. Mirenda, Augmentative and Alternative Commu-
nication, 3rd ed.Baltimore, MD: Paul H. Brookes, 2005.
 P. Enderby and L. Emerson, Does Speech and Language Therapy
Work?.London, U.K.: Singular, 1995.
 C. L. Kleinke, “Gaze and eye contact: A research review,” Psychol.
Bull., vol. 100, no. 1, pp. 78–100, 1986.
 B. O’Keefe, N. Kozak, and R. Schuller, “Research priorities in aug-
mentative and alternative communication as identified by people who
use AAC and their facilitators,” Augmentative Alternative Commun.,
vol. 23, no. 1, pp. 89–96, 2007.
 J. Murphy, “I prefer contact this close: Perceptions of AAC by people
with motor neurone disease and their communication partners,” Aug-
mentative Alternative Commun., vol. 20, pp. 259–271, 2004.
 J. Todman, N. Alm, J. Higginbotham, and P. File, “Whole utterance
approaches in AAC,” Augmentative Alternative Commun., vol. 24, no.
3, pp. 235–254, 2008.
 L. J. Ferrier, H. C. Shane, H. F. Ballard, T. Carpenter, and A. Benoit,
“Dysarthric speakers’ intelligibility and speech characteristics in
relation to computer speech recognition,” Augmentative Alternative
Commun., vol. 11, no. 3, pp. 165–175, 1995.
puterized speech recognition: Influence of intelligibility and percep-
tual consistency on recognition accuracy,” Augmentative Alternative
Commun., vol. 14, no. 1, pp. 51–56, 1998.
 R. N. Bloor, K. Barrett, and C. Geldard, “The clinical application of
microcomputers in the treatment of patients with severe speech dys-
function,”IEEColloquium High-TechHelp Handicapped, pp.9/1–9/2,
 U. Sandler and Y. Sonnenblick, “A system for recognition and transla-
tion of the speech of handicapped individuals,” in 9th Mediterranean
Electrotech. Conf. (MELECON’98), 1998, pp. 16–19.
 B. Wisenburn and D. J. Higginbotham, “An AAC application using
speaking partner speech recognition to automatically produce contex-
Commun., vol. 24, no. 2, pp. 100–109, 2008.
 B. Wisenburn and D. J. Higginbotham, “Participant evaluations of rate
and communication efficacy of an AAC application using natural lan-
guage processing,” Augmentative Alternative Commun., vol. 25, no. 2,
pp. 78–89, 2009.
 M. S. Hawley et al., “A speech-controlled environmental control
system for people with severe dysarthria,” Med. Eng. Phys., vol. 29,
no. 5, pp. 586–593, 2007.
 H. V. Sharma and M. Hasegawa-Johnson, “State-transition interpola-
tion and MAP adaptation for HMM-based dysarthric speech recogni-
tion,” in NAACL HLT Workshop Speech Language Process. Assistive
Technol., 2010, pp. 72–79.
 R. Palmer, P. Enderby, and M. S. Hawley, “A voice input voice output
Technol., vol. 4, no. 2, pp. 4–14, 2010.
 S. Young et al., Cambridge University Engineering Department The
HTK book, 2006.
 P. D. Green, J. Carmichael, A. Hatzis, P. Enderby, M. S. Hawley, and
M. Parker, “Automatic speech recognition with sparse training data
for dysarthric speakers,” in Eur. Conf. Speech Commun. Technol. (Eu-
rospeech), 2003, pp. 1189–1192.
 L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov
models,” IEEE ASSP Mag., Jun. 1986.
 R. Palmer, P. Enderby, and S. P. Cunningham, “The effect of three
practice conditions on the consistency of chronic dysarthric speech,”
J. Med. Speech-Language Pathol., vol. 12, no. 4, pp. 183–188, 2004.
 M. Parker, S. P. Cunningham, P. Enderby, M. S. Hawley, and P.
D. Green, “Automatic speech recognition and training for severely
dysarthric users of assistive technology: The STARDUST project,”
Clin. Linguistics Phonetics, vol. 20, no. 2–3, pp. 149–156, 2006.
 J. N. Holmes and W. J. Holmes, Speech Recognition and Synthesis,
2nd ed.London, U.K.: Taylor Francis, 2001.
 S. L. Glennon and D. C. DeCoste, Handbook of Alternative and Aug-
mentative Communication. San Diego, CA: Singular, 1997.
 J. Todman, “Rate and quality of conversations using a text-storage
AAC system: Single-case training study,” Augmentative Alternative
Commun., vol. 16, no. 3, pp. 164–179, 2000.
 M. S. Hawley, S. P. Cunningham, F. Cardinaux, A. Coy, S. Seghal,
and P. Enderby, “Challenges in developing a voice input voice output
communication aid for people with severe dysarthria,” in Proc.
AAATE—Challenges Assistive Technol., 2007, pp. 363–367.
 A. Black and K. Lenzo, “Flite: A small fastrun-time synthesis engine,”
in 4th ISCA Speech Synthesis Workshop, 2001, pp. 157–162.
 A. W. Black, P. Taylor, and R. Caley, The Festival Speech Synthesis
System.Edinburgh, Univ. Edinburgh, 1999.
Austin, TX: Pro-ed, 2008.
HAWLEY et al.: A VOICE-INPUT VOICE-OUTPUT COMMUNICATION AID FOR PEOPLE WITH SEVERE SPEECH IMPAIRMENT
Mark S. Hawley received the Ph.D. degree from the University of Sheffield,
He is Professor in the School of Health and Related Research, Head of the
Rehabilitation and Assistive Technology Research Group, and Director of the
Centre for Assistive Technology and Digital Healthcare, at the University of
Sheffield. He is also Honorary Consultant Clinical Scientist at Barnsley Hos-
Prof. Hawley is a Chartered Scientist and Member of the Institute for Physics
and Engineering in Medicine. He was awarded the Honorary Fellowship of The
Royal College of Speech and Language Therapists in 2007 for his service to
speech therapy research.
U.K., in 1997 and 2003, respectively.
He is a lecturer in Human Communication Sciences at the University of
Sheffield, Sheffield, U.K. His primary research interests are in the automatic
recognition of disordered speech, and the development of speech-based
interfaces for assistive technology.
Phil D. Green received the B.Sc. degree in cybernetics from the University
of Reading, Reading, U.K., in 1967, and the Ph.D. degree in automatic speech
recognition from the University of Keele, Keele, U.K., in 1971.
He holds a Chair in Computer Science in the Department of Computer Sci-
ence at the University of Sheffield, U.K., where he previously held the posts
of Lecturer, Senior Lecturer, Reader, and Head of Department. He heads the
Speech and Hearing Research Group, has authored around 100 publications in
speech science and technology and acted as Principal Investigator for around 15
Pam Enderby is a qualified speech and language therapist and received the
Ph.D. degree on the subject of dysarthria from Bristol University, Bristol, U.K.
She was awarded an honorary doctorate from the Department of Computing
and Mathematics, the University of West of England, Bristol, U.K., 2000; and
an MBE for services to speech and language therapy, in 1993.
Prof. Enderby is Professor of Community Rehabilitation at the University of
Sheffield and has recently stepped down from being the clinical director of the
South Yorkshire Comprehensive Local Research Network. She has conducted
clinical research through most of her career combining posts in the NHS with
that of University of Sheffield.
Rebecca Palmer received a first class bachelor’s degree in linguistics and lan-
guage pathology from the University of Reading, Reading, U.K., in 1999, fol-
lowed by the Ph.D. degree in the treatment of dysarthria from the University of
Sheffield, Sheffield, U.K., in 2005.
She is currently a Senior Clinical Lecturer at the University of Sheffield,
Sheffield, U.K. Previous posts held include highly specialist speech and lan-
guage therapist and rehabilitation trials manager for the NIHR Trent Stroke Re-
search Network. Her primary research interest is the use of speech and language
technology to assist improved communication of stroke survivors.
Dr. Palmer is a member of the Royal College of Speech and Language Ther-
apists and the Health Professions Council.
Siddharth Sehgal received the B.Sc. degree in mathematics and a M.Sc. de-
gree in applied operational research from the University of Delhi, Delhi, India,
in 1996 and 1999, respectively. He received the M.Sc. degree in advanced com-
puter science from the University of Sheffield, Sheffield, U.K., in 2004. He also
has a Diploma in network centered computing from National Institute of Infor-
mation Technology, Delhi, Delhi, India, in 2001. He is currently pursuing the
Ph.D. degree at the University of Sheffield, Sheffield, U.K.
He worked from 2000 to 2003 as a Software Engineer with Futuresoft India
Private Limited and Birlasoft, Delhi, India. Since 2005, he has been a Research
Associate in the Department of Computer Science and Human Communication
Sciences at the University of Sheffield, Sheffield, U.K.
Peter O’Neill received the B.Sc. (Hons.) degree in software engineering from
Sheffield Hallam University, Sheffield, U.K., in 1996 and has worked in the
assistive technology domain for the last 16 years. He attained his Doctorate
with the title “Enhancing the Prescription of Electronic Assistive Technology,”
from Sheffield Hallam University, in 2006.
He has held the position of Research Associate at Barnsley Hospital NHS
fellow at the University of Sheffield, Sheffield, U.K.