The Vocal Joystick


Jeff A. Bilmes, Jonathan Malkin, Xiao Li, Susumu Harada§, Kelley Kilanski,
Katrin Kirchhoff, Richard Wright, Amarnag Subramanya, James A. Landay§,
Patricia Dowden, Howard Chizeck
Dept. of Electrical Engineering
§Dept. of Computer Science & Eng.
Dept. of Linguistics
Dept. of Speech & Hearing Science
University of Washington
Seattle, WA
ABSTRACT
The Vocal Joystick is a novel human-computer interface mecha-
nism designed to enable individuals with motor impairments to make
use of vocal parameters to control objects on a computer screen (but-
tons, sliders, etc.) and ultimately electro-mechanical instruments
(e.g., robotic arms, wireless home automation devices). We have
developed a working prototype of our “VJ-engine” with which indi-
viduals can now control computer mouse movement with their voice.
The core engine is currently optimized according to a number of
criteria. In this paper, we describe the engine system design, engine
optimization, and user-interface improvements, and outline some of
the signal processing and pattern recognition modules that were suc-
cessful. Lastly, we present new results comparing the vocal joystick
with a state-of-the-art eye tracking pointing device, and show that
not only is the Vocal Joystick already competitive, for some tasks it
appears to be an improvement.
1. INTRODUCTION
Many existing human-computer interfaces (e.g., the mouse and key-
board, touch screens, pen tablets, etc.) are ill-suited to individuals
with motor impairments. Specialized (and often expensive) human-
computer interfaces have been developed specifically for this group,
including sip-and-puff switches [1], head mice [2, 1, 3], eye-gaze
and eye tracking devices [4], chin joysticks [5], and tongue switches
[6]. While many individuals with motor impairments have complete
use of their vocal system, these assistive devices do not make full
use of it. Sip and puff switches, for example, control a device by
sending binary signals and thus have relatively low communication
bandwidth, making it difficult to perform complex control tasks.
Natural spoken language is regarded as an obvious choice for
a human-computer interface. Despite significant research efforts in
automatic speech recognition (ASR), however, existing ASR sys-
tems are still not perfectly robust to a wide variety of speaking
conditions, noise, and accented speakers, and they have not yet
been universally adopted as a dominant human-computer interface.
In addition, while natural speech is optimal for human-to-human
communication, it may be sub-optimal for manipulating computers,
windows-icons-mouse-pointer (WIMP) interfaces, and other electro-
mechanical devices (such as a prosthetic robotic arm). Standard spo-
ken language commands, moreover, are ideal for discrete but not
for continuous operations. For example, to move a cursor from the
bottom-left to the upper-right of a screen, a user could repeatedly
utter “up” and “right”, or alternatively “stop” and “go” after setting
an initial trajectory and rate, but this can be inefficient. Other
methods for controlling mouse movement with speech have also been
developed [7, 8, 9], but none of these take advantage of the fully
continuous nature of the human vocal system.

(This material is based on work supported by the National Science
Foundation under grant IIS-0326382.)
For the above reasons, we have developed an alternative voice-
based assistive technology termed the Vocal Joystick (VJ) [10]. Un-
like standard ASR, our system goes beyond the capabilities of se-
quences of discrete speech sounds, and exploits continuous vocal
characteristics such as pitch, vowel quality, and loudness which
are then mapped to continuous control parameters. Several video
demonstrations of the Vocal Joystick system are available online. In previous work,
we gave a high-level overview of the vocal joystick [10] and details
regarding motion acceleration [11] and adaptation [12, 13]. In this
work, we provide details about our design goals, the signal process-
ing and pattern recognition modules, and we report on a new user-
study that shows that the VJ compares favorably to a standard mod-
ern eye tracking device.
2. THE VOCAL JOYSTICK
The Vocal Joystick system maps from human vocalic effort to a set of
control signals used to drive a mouse pointer or robotic arm. It also
allows a small set of discrete spoken commands usable as mouse
clicks, button presses, and modality shifts. We use a “joystick” as
an analogy since it has the ability to simultaneously specify several
continuous degrees of freedom along with a small number of button
presses, and we consider this to be a generalization of a mouse.
In developing the VJ system, we have drawn on our ASR back-
ground to produce a system that, as best as possible, meets the fol-
lowing goals: 1) easy to learn: the VJ system should be easy to learn
and remember in order to keep cognitive load at a minimum. 2) easy
to speak: using a VJ-controlled device should not produce undue
strain on the human vocal system. It should be possible to use the
system for many hours at a time. 3) easy to recognize: the VJ sys-
tem should be as noise robust as possible, and should try to include
vocal sounds that are as acoustically distinct as possible. 4) percep-
tual: the VJ system should respect any perceptual expectations that
a user might have and also should be perceptually consistent (e.g.,
given knowledge of some aspects of a VJ system, a new vocal effort
should, say, move the mouse in an expected way). 5) exhaustive:
to improve communications bandwidth, the system should utilize as
many capabilities of the human vocal apparatus as possible, without
conflicting with goal 1. 6) universal: our design should use vocal
characteristics that minimize the chance that regional dialects or
accents will preclude its use. 7) complementary: the system should
be complementary with existing ASR systems. We do not mean to
replace ASR, but rather augment it. 8) resource-light: a VJ system
should run using few computational resources (CPU and memory)
and leave sufficient computational headroom for a base application
(e.g., a web browser, spreadsheet). 9) infrastructure: the VJ system
should be like a library, that any application can link to and use.
Unlike standard speech recognition, the VJ engine exploits the
ability of the human voice to produce continuous signals, thus go-
ing beyond the capabilities of sequences of discrete speech sounds
(such as syllables or words). Examples of these vocal parameters
include pitch variation, type and degree of vowel quality, and loud-
ness. Other possible (but not yet employed) qualities are degree of
vibrato, low-frequency articulator modulation, nasality, and velocity
and acceleration of the above.
2.1. Primary Vocal Characteristics
Three continuous vocal characteristics are currently extracted by the
VJ engine: energy, pitch, and vowel quality, yielding four simultane-
ous degrees of freedom. The first of these, localized acoustic energy,
is used for voice activity detection. In addition, it is normalized rela-
tive to the current detected vowel, and is used by our mouse applica-
tion to control the velocity of cursor movement. For example, a loud
voice causes a large movement while a quiet voice causes a “nudge.”
The second parameter, pitch, is also extracted but is currently unused
in existing applications (it thus constitutes a free parameter available
for future use). The third parameter is vowel quality. Unlike con-
sonants, which are characterized by a greater degree of constriction
in the vocal tract and which are inherently discrete in nature, vowels
are highly energetic and thus are well suited for environments where
both high accuracy and noise-robustness are crucial. Vowels can be
characterized using a 2-D space parameterized by F1 and F2, the first
and second vocal tract formants (resonant frequencies). We classify
vowels, however, directly and map them onto the 2-D vowel space
characterized by tongue height and tongue advancement (Figure 1)
(we found F1/F2 estimation to be too unreliable for this application).
In our initial VJ system, and in our VJ mouse control, we use the four
corners of this chart to map to the four principal directions of up, down,
left, and right as shown in Figure 1. We have also produced an 8-
and 9-class vowel system to enable more non-simultaneous degrees
of freedom and more precise specification of diagonal directions.
We also utilize a “neutral” schwa [ax] as a carrier vowel for when
other parameters (pitch and/or amplitude) are to be controlled with-
out any positional change. These other vowels (and their directional
correlates) are also shown in Figure 1.
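To make the mapping concrete, the sketch below maps a classified vowel to a unit direction vector, treating the chart of Figure 1 as a 3x3 compass with the neutral schwa [ax] as the motionless center. This is an illustrative assignment only; the engine's actual class set and vowel-to-direction correspondence (Figure 1, right) may differ.

```python
import math

# Illustrative 9-class vowel-to-direction table in the spirit of Figure 1:
# columns = tongue advancement (front/central/back), rows = tongue height
# (high/mid/low). The VJ engine's actual assignments may differ.
VOWEL_DIRECTIONS = {
    "iy": (-1, +1), "ix": (0, +1), "uw": (+1, +1),   # high vowels
    "ey": (-1,  0), "ax": (0,  0), "ow": (+1,  0),   # mid vowels ([ax] = neutral)
    "ae": (-1, -1), "a":  (0, -1), "aa": (+1, -1),   # low vowels
}

def direction(vowel):
    """Return a unit direction vector for a classified vowel.

    The carrier vowel [ax] yields (0, 0): pitch/loudness can then be
    controlled without any positional change, as described above.
    """
    dx, dy = VOWEL_DIRECTIONS[vowel]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return (0.0, 0.0)
    return (dx / norm, dy / norm)
```
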
In addition to the three continuous vocal parameters, “discrete
sounds” are also employed. We select (or reject) a candidate dis-
crete sound according to both linguistic criteria and the system cri-
teria mentioned in Section 2. So far, however, we have not utilized
more than two or three discrete sounds since our primary application
has been mouse control. Our research has thus focused on real-time
extraction of continuous parameters since that is less like standard
ASR technology.
3. THE VJ ENGINE
We have developed a portable modular library (the VJ engine) that
can be incorporated into a variety of applications. Following the
goals from Section 2, the engine shares common signal processing
operations in multiple modules, to produce real-time performance
while leaving considerable computational headroom for the applica-
tions being driven by the VJ engine.
Tongue advancement (columns: front, central, back) by tongue height
(rows: high, mid, low):
        Front   Central   Back
High:   [iy]    [ix]      [uw]
Mid:    [ey]    [ax]      [ow]
Low:    [ae]    [a]       [aa]
Fig. 1. Left: Vowel configurations as a function of their dominant ar-
ticulatory configurations. Right: Vowel-direction mapping: vowels
corresponding to directions for mouse movement in the WIMP VJ
cursor control.
Fig. 2. The vocal joystick engine system structure.
The VJ engine consists of three main components: acoustic sig-
nal processing, pattern recognition, and motion control (see Fig-
ure 2). First, the signal processing module extracts short-term
acoustic features, such as energy, autocorrelation coefficients, lin-
ear prediction coefficients and mel frequency cepstral coefficients
(MFCCs). These features are piped into the pattern recognition
module, where energy smoothing, pitch and formant tracking, vowel
classification and discrete sound recognition take place. This stage
also involves pattern recognition methods such as neural networks,
support vector machines (SVMs), and dynamic Bayesian networks
(see [12, 14, 15]). Finally, energy, pitch, vowel quality, and discrete
sounds become acoustic parameters that are transformed into direc-
tion, speed, and other motion-related parameters for the back-end
application.

An important first stage in the signal processing module is voice
activity detection (VAD). We categorize each frame into one of three
categories: silence, pre-active, or active, based on energy
and zero-crossing information. Pre-active frames may (or may not)
indicate the beginning of voice activity, for which only frontend fea-
ture extraction is executed. Active frames are those identified as
truly containing voice activity. Pattern recognition tasks, includ-
ing pitch tracking and vowel classification, are performed for these
frames. No additional computation is used for silence frames. If si-
lence frames occur after an unvoiced segment within a length range,
however, discrete sound recognition will be triggered.
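The three-way frame categorization above can be sketched as a simple decision rule; the energy and zero-crossing thresholds below are illustrative placeholders, not the engine's tuned values.

```python
def classify_frame(energy, zero_crossings,
                   e_low=0.01, e_high=0.05, zc_max=50):
    """Three-way VAD frame categorization sketch (thresholds illustrative).

    silence    -> no further computation
    pre-active -> front-end feature extraction only
    active     -> full pattern recognition (pitch tracking, vowel
                  classification)
    """
    if energy < e_low:
        return "silence"
    # Moderate energy, or a high zero-crossing rate suggesting unvoiced
    # sound, marks a frame that may (or may not) begin voice activity.
    if energy < e_high or zero_crossings > zc_max:
        return "pre-active"
    return "active"
```
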
The goal of the signal processing module is to extract low-level
acoustic features that can be used in estimating the four high-level
acoustic parameters. The acoustic waveforms are sampled at a rate
of 16,000 Hz, and a frame is generated every 10 ms. The extracted
frame-level features are energy, normalized cross-correlation coef-
ficients (NCCC), formants, and MFCCs [16, 17]. In addition, we
employ delta features and online (causal) mean subtraction and vari-
ance normalization.
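As a sketch of the online (causal) mean subtraction and variance normalization step, the class below normalizes a scalar feature stream using exponentially decayed running statistics; the decay constant and epsilon are assumptions, not the engine's settings.

```python
import math

class CausalNormalizer:
    """Causal per-frame mean subtraction and variance normalization.

    Running mean and variance are maintained with an exponential decay,
    so normalization depends only on past frames (a requirement for
    real-time operation). Constants here are illustrative.
    """

    def __init__(self, decay=0.995, eps=1e-6):
        self.decay, self.eps = decay, eps
        self.mean, self.var = 0.0, 1.0
        self.initialized = False

    def step(self, x):
        if not self.initialized:
            self.mean, self.initialized = x, True
        else:
            self.mean = self.decay * self.mean + (1 - self.decay) * x
            d = x - self.mean
            self.var = self.decay * self.var + (1 - self.decay) * d * d
        return (x - self.mean) / math.sqrt(self.var + self.eps)
```
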
Our pitch tracker module is based on several novel ideas. Many
pitch trackers require meticulous design of local and transition costs.
The forms of these functions are often empirically determined and
their parameters are tuned accordingly. In the VJ project, we use
a graphical modeling framework to automatically optimize pitch
tracking parameters in the maximum likelihood sense. Specifically,
we use a dynamic Bayesian network to represent the pitch- and
formant-tracking process and learn the costs using an EM algorithm
[14, 15]. Experiments show that this framework not only expe-
dites pitch tracker design, but also yields good performance for both
pitch/F1/F2 estimation and voicing decision.
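The decoding half of such a tracker can be illustrated with a plain dynamic-programming search over per-frame F0 candidates. Unlike the VJ engine, which learns its costs with EM in a dynamic Bayesian network [14, 15], this sketch uses a hand-set quadratic log-F0 transition penalty.

```python
import math

def smooth_pitch(candidates, jump_penalty=0.01):
    """Pick an F0 track through per-frame candidates by dynamic programming.

    `candidates` is a list of frames; each frame is a list of
    (f0_hz, local_cost) pairs. The chosen path minimizes local cost plus
    a quadratic transition cost on log-F0 jumps (penalty illustrative).
    """
    # Trellis entry: (total_cost, backpointer_into_previous_frame, f0)
    prev = [(cost, None, f0) for f0, cost in candidates[0]]
    trellis = [prev]
    for frame in candidates[1:]:
        cur = []
        for f0, cost in frame:
            total, back = min(
                (p_cost + cost
                 + jump_penalty * (math.log(f0) - math.log(p_f0)) ** 2, j)
                for j, (p_cost, _, p_f0) in enumerate(prev))
            cur.append((total, back, f0))
        trellis.append(cur)
        prev = cur
    # Backtrack from the cheapest final candidate.
    i = min(range(len(prev)), key=lambda j: prev[j][0])
    track = []
    for frame in reversed(trellis):
        _, back, f0 = frame[i]
        track.append(f0)
        i = back if back is not None else 0
    return list(reversed(track))
```

With a small penalty the cheaper octave jump is taken; a larger penalty keeps the track smooth:
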
Vowel classification accuracy is crucial for overall VJ perfor-
mance since these categories determine motion direction. Additional
requirements include real-time, consistent, and noise robust perfor-
mance. Vowel classification in the VJ framework differs from con-
ventional phonetic recognition in two ways: First, the vowels are
longer duration than in normal speech. Second, instantaneous classi-
fication is essential for real-time performance. In our system, we uti-
lize posterior probabilities of a discriminatively trained multi-layer
perceptron (MLP) using MFCC features as input. We have also de-
veloped a novel algorithm for real-time adaptation of the MLP and
SVM parameters [12, 13], which considerably increases the accuracy
of our VJ classifier.
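A minimal sketch of instantaneous vowel classification from frame-level network outputs: per-frame posteriors (a softmax over hypothetical MLP activations) are averaged over a short window and the arg-max class is taken. The real engine's network, features, and adaptation [12, 13] are not reproduced here.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of activations."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classify_vowel(frame_logits, classes):
    """Average per-frame posteriors over a window, return the arg-max class.

    `frame_logits` stands in for MLP output activations, one list per
    10 ms frame; averaging a few frames trades a little latency for
    stability in the instantaneous decision.
    """
    n = len(frame_logits)
    avg = [0.0] * len(classes)
    for z in frame_logits:
        for i, p in enumerate(softmax(z)):
            avg[i] += p / n
    return classes[max(range(len(classes)), key=lambda i: avg[i])]
```
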
We have also found that acceleration has yielded significant im-
provements to VJ performance [11]. Unlike normal mouse accelera-
tion, which adjusts a mapping from a 2-D desktop location to a 2-D
computer screen location, a VJ system must map from vocal tract
articulatory change to positional screen changes. We utilize the idea
of “intentional loudness,” where we normalize energy based on how
a user intends to affect the mouse pointer, and have developed a
nonlinear mapping that user studies have shown to be preferable to
no vocal acceleration.
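An illustrative energy-to-speed curve (not the published mapping of [11]): frame energy is first normalized by the running mean energy of the detected vowel, so intrinsically loud vowels do not move the cursor faster, and a power-law gain is then applied. All constants are assumptions.

```python
def cursor_speed(energy, vowel_mean_energy, gain=400.0, exponent=1.5,
                 floor=1e-4):
    """Map frame energy to cursor speed (pixels/sec), a sketch of
    'intentional loudness': normalize by the detected vowel's mean
    energy, then apply a nonlinear (power-law) acceleration curve.
    The gain, exponent, and floor are illustrative, not from [11].
    """
    rel = max(energy, floor) / max(vowel_mean_energy, floor)
    return gain * rel ** exponent
```
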
Our discrete sound recognition module uses a fairly standard
HMM system. We currently use consonant-only patterns for discrete
sounds, like /ch/ and /t/ (sufficient for a 1-button mouse), and we use
a temporal threshold to reject extraneous speech. This not only sig-
nificantly reduces false positives (clicks), but also saves computation
since only pure unvoiced segments of a certain length will trigger the
discrete sound recognition module to start decoding.
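The temporal gating described above can be sketched as a simple predicate; the duration bounds below are hypothetical.

```python
def should_decode(segment_is_unvoiced, duration_ms, min_ms=40, max_ms=400):
    """Trigger HMM decoding only for a purely unvoiced segment whose
    length falls in a plausible range (bounds illustrative). This
    rejects extraneous speech (reducing false clicks) and saves
    computation, since most segments never reach the decoder.
    """
    return segment_is_unvoiced and min_ms <= duration_ms <= max_ms
```
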
3.1. User Study: Comparison of VJ and Eye Tracker
We performed a study comparing a VJ-mouse with a standard eye
tracking mouse. Specifically, we investigated the difference in users’
performance between the Vocal Joystick (VJ) system and the eye
tracker (ET) system.
The eye tracker consisted of a single infrared camera that was
designed to focus on the user’s dominant eye and to track the move-
ment of the iris by analyzing the reflection of an infrared beam em-
anating from the camera. This particular system required the user’s
head to stay fairly steady, so we utilized a chin rest for the partic-
ipants to rest their chin and reduce the amount of head movement
(something not needed by a VJ-based system). Clicking is per-
formed by dwelling, or staring at the desired point for a fixed amount
of time. The dwelling time threshold is configurable, but we used the
default (0.25 seconds) throughout the experiment.
We recruited 12 participants from the UW Computer Science
department to participate in our experiment. Of the 12, there were
five females and seven males, ranging in age from 21 to 27. Seven
of the participants were native English speakers and the rest of them
were from Europe and Asia. Five participants wore glasses, two
wore contact lenses, and the rest had uncorrected vision. We ex-
posed each participant to two different modalities: VJ and ET. For
each modality, we had the participants perform two tasks: Target
Acquisition task (TA) and the Web Browsing task (WB). The order
of the tasks within each modality was fixed (TA then WB). The par-
ticipants completed both tasks under one modality before moving on
to the other modality. Before starting on any task for each modal-
ity, the participants were given a description of the system they were
about to use, followed by a calibration phase (for VJ, the partici-
pants were asked to vocalize the four vowels for two seconds each;
for ET, the participants were asked to look at a sequence of points
on the screen based on the Eye Tracker calibration software). They
were then given 90 seconds to try out the system on their own to get
familiar with the controls.
All experimental conditions were shown on a 19-inch 1024x768
24-bit color LCD display. The VJ system was running on a Dell
Inspiron 9100 laptop with a 3.2 GHz Intel Pentium IV processor
running the Fedora Core 2 operating system. A head-mounted An-
drea NC-61 microphone was used as the audio input device. The ET
system was running on a HP xw4000 desktop with a 2.4 GHz Intel
Pentium IV processor running Windows XP Service Pack 2. The ET
camera was Eye Response Technologies WAT-902HS model, and the
software was Eye Response Technologies ERICA version
For the TA tasks, we wrote an application that sequentially dis-
plays the starting point and the target for each trial within a maxi-
mized window and tracks the user’s clicks and mouse movements. A
Firefox browser was used for the WB tasks. The browser was screen
maximized such that the only portion of the screen not displaying
the contents of the web page was the top navigation toolbar, which
was 30 pixels high.
The TA task consisted of sixteen different experimental conditions
with one trial each. A trial consisted of starting at a fixed region
at the center of the screen (a 30-pixel-wide square) and attempting
to click on a circular target that appeared with a randomized size,
distance, and angle from the center region.
The WB task consisted of one trial in which the user was shown
a sequence of web links that they needed to click through and was
told to follow the same links using a particular modality (VJ or ET).
The participants were first guided through the sequence by the ex-
perimenter, and then asked to go through the links themselves to
ensure that they were familiar with the order and the location of
each link. They were also instructed that if they click on a wrong
link, they must click on the browser’s back button on their own to
return to the previous page and try again. Once the participant was
familiar with the link sequence, they were asked to navigate through
those links using a particular modality. The time between when the
participant started using the modality and when the participant suc-
cessfully clicked on the last link was recorded, as well as the num-
ber of times they clicked on a wrong link. The sequence of links
consisted of clicking on six links starting from the CNN homepage.
Most of the links were 15 pixels high, link widths ranged from 30 to
100 pixels, and distances between links ranged from 45 to 400 pixels,
covering directions corresponding roughly to six different approach
angles.
The mean task completion speed (inverse time) across all par-
ticipants for each of the 16 conditions across the two modalities is
shown in Figure 3. The higher bars on the graphs indicate faster
performance. The circles represent the target size, distance, and an-
gle relative to the start position (middle square) for all conditions.
The error bars represent 95% confidence intervals (as do the error
bars in the other figure in this section). The mean task completion
times (which include any missed-link error recovery time) for the
web browsing task are shown in Figure 4.

Fig. 3. Mean task completion “speed” (1/seconds) for the TA task
across modalities.

Fig. 4. Web browsing task completion times (sec) across modalities.
Overall, our results suggest that the Vocal Joystick allowed the
users to perform simple TA tasks at a comparable speed as the par-
ticular eye tracker we used, and that for the web browsing task, the
VJ was significantly faster than ET. This is quite encouraging given
that the VJ system is quite new.
3.2. Related Work
There are a number of systems that have used the human voice in
novel ways for controlling mouse movement. We point out, however,
that the Vocal Joystick is conceptually different from the other
systems in several important respects, including both latency and
design. First, VJ overcomes the latency problem in vocal control.
VJ allows the user to make instantaneous directional changes using
one’s voice (e.g., the user can dynamically draw a “U” or “L” shape
in one breath). Olwal and Feiner’s system [8] moves the mouse only
after recognizing entire words. In Igarashi’s system [7], one needs
first to specify direction, and then afterwards a sound to move in the
said direction. De Mauro’s system [18] moves the mouse after the
user has finished vocalizing. The VJ, by contrast, has latency (time
between control parameter change in response to a vocal change) on
the order of reaction time (currently, approximately 60 ms), so direc-
tion and other parameters can change during vocalization. The other
key difference from previous work is that VJ is general software in-
frastructure, designed from the outset not only for mouse control,
but also for controlling robotic arms, wheelchairs, normal joystick
signals, etc. A VJ system is customizable, e.g., the vowel-to-space
mapping can be changed by the user. Our software system, more-
over, is generic. It outputs simultaneous control parameters corre-
sponding to vowel quality, pitch, formants (F1/F2), and amplitude
(i.e., we have unused degrees of freedom in the mouse application).
The system can be plugged into either a mouse driver or any other
REFERENCES
[1] “Pride Mobility Products Group sip-n-puff system/head array control.”
[2] “Origin Instruments sip/puff switch and head mouse,” 2005.
[3] “HeadMaster head mouse,” 2003.
[4] “Assistive Technologies’ eye gaze system for computer access,” 2003.
[5] “Chin joystick,” 2003.
[6] “Prentrom devices,” 2003.
[7] T. Igarashi and J. F. Hughes, “Voice as sound: Using non-verbal voice
input for interactive control,” in 14th Annual Symposium on User Inter-
face Software and Technology. ACM UIST’01, November 2001.
[8] A. Olwal and S. Feiner, “Interaction techniques using prosodic features
of speech and audio localization,” in IUI ’05: Proc. 10th Int. Conf. on
Intelligent User Interfaces. New York, NY, USA: ACM Press, 2005,
pp. 284–286.
[9] “Dragon NaturallySpeaking MouseGrid™,” ScanSoft Inc., 2004.
[10] J. Bilmes, X. Li, J. Malkin, K. Kilanski, R. Wright, K. Kirchhoff,
A. Subramanya, S. Harada, J. Landay, P. Dowden, and H. Chizeck,
“The vocal joystick: A voice-based human-computer interface for in-
dividuals with motor impairments,” in Human Language Technology
Conf. and Conf. on Empirical Methods in Natural Language Process-
ing, Vancouver, October 2005.
[11] J. Malkin, X. Li, and J. Bilmes, “Energy and loudness for speed control
in the vocal joystick,” in Proc. IEEE Automatic Speech Recognition and
Understanding (ASRU), Nov. 2005.
[12] X. Li, J. Bilmes, and J. Malkin, “Maximum margin learning and adapta-
tion of MLP classifiers,” in 9th European Conference on Speech Com-
munication and Technology (Eurospeech’05), Lisbon, Portugal, Sep-
tember 2005.
[13] X. Li and J. Bilmes, “Regularized adaptation of discriminative classi-
fiers,” in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal
Processing, 2006.
[14] X. Li, J. Malkin, and J. Bilmes, “A graphical model approach to pitch
tracking,” in Proc. Int. Conf. on Spoken Language Processing, 2004.
[15] J. Malkin, X. Li, and J. Bilmes, “A graphical model for formant
tracking,” in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal
Processing, 2005.
[16] J. Deller, J. Proakis, and J. Hansen, Discrete-time Processing of Speech
Signals. MacMillan, 1993.
[17] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A
Guide to Theory, Algorithm, and System Development. Prentice Hall.
[18] C. de Mauro, M. Gori, M. Maggini, and E. Martinelli, “A voice device
with an application-adapted protocol for Microsoft Windows,” in Proc.
IEEE Int. Conf. on Multimedia Comp. and Systems, vol. II, Firenze,
Italy, 1999, pp. 1015–1016.
... Three commercial speech-based target acquisition techniques are mouse motion voice commands, Mouse Grid, and Show Numbers (Nuance Communications, 2011;Odell & Mukerjee, 2007;Pugliese & Gould, 1998). Mouse motion voice commands-what some researchers have called the "Speech Cursor" (Harada, Landay, Malkin, Li, & Bilmes, 2006)-are provided in Nuance's Dragon Naturally Speaking product and produce mouse movement with commands like "move mouse up" and "move mouse right." The mouse cursor moves at a steady pace until the user says "stop." ...
... For example, projects have taken advantage of humming (Igarashi & Hughes, 2001;Sporka, Kurniawan, Mahmud, & Slavík, 2006; and whistling (Sporka, Kurniawan, & Slavík, 2004). Our own VoiceDraw project, in the same spirit as EyeDraw before it, enables people with motor impairments to move a paintbrush across a canvas by uttering the vowel sounds defined by the Vocal Joystick (Harada et al., 2006;Harada, Wobbrock, & Landay, 2007). Our VoiceDraw project is described in the next section. ...
... Our work utilizes the Vocal Joystick, a recognition engine for non-speech voice-based sounds that enables the smooth control of anything from a mouse cursor to a robotic arm (Bilmes, Li, Malkin, Kilanski, Wright, Kirchhoff, Subramanya, Harada, Landay, Dowden & Chizeck, 2005;Harada et al., 2006). The Vocal Joystick responds to vocalizations of vowel sounds, where different sounds indicate different directions. ...
Pointing to targets in graphical user interfaces remains a frequent and fundamental necessity in modern computing systems. Yet for millions of people with motor impairments, children, and older users, pointing—whether with a mouse cursor, a stylus, or a finger on a touch screen—remains a major access barrier because of the fine-motor skills required. In a series of projects inspired by and contributing to ability-based design, we have reconsidered the nature and assumptions behind pointing, resulting in changes to how mouse cursors work, the types of targets used, the way interfaces are designed and laid out, and even how input devices are used. The results from these explorations show that people with motor difficulties can acquire targets in graphical user interfaces when interfaces are designed to better match the abilities of their users. Ability-based design, as both a design philosophy and a design approach, provides a route to realizing a future in which people can utilize whatever abilities they have to express themselves not only to machines, but to the world.
... Various types of non-speech input such as humming and whistling are used to provide continuous input [28], and Igarashi et al. demonstrate how duration, pitch and tonguing of sounds are used for interactive controls [13]. Closely related, others present interaction techniques using prosodic features of speech and non-verbal metrics [10,20]. Sakamoto et al. ...
Conference Paper
We present an alternate approach to smartwatch interactions using non-voice acoustic input captured by the device's microphone to complement touch and speech. Whoosh is an interaction technique that recognizes the type and length of acoustic events performed by the user to enable low-cost, hands-free, and rapid input on smartwatches. We build a recognition system capable of detecting non-voice events directed at and around the watch, including blows, sip-and-puff, and directional air swipes, without hardware modifications to the device. Further, inspired by the design of musical instruments, we develop a custom modification of the physical structure of the watch case to passively alter the acoustic response of events around the bezel; this physical redesign expands our input vocabulary with no additional electronics. We evaluate our technique across 8 users with 10 events exhibiting up to 90.5% ten-fold cross validation accuracy on an unmodified watch, and 14 events with 91.3% ten-fold cross validation accuracy with an instrumental watch case. Finally, we share a number of demonstration applications, including multi-device interactions, to highlight our technique with a real-time recognizer running on the watch.
... Nowadays, there are devices oriented to help handicapped people [17]; some examples of biomedical researchers' work deals with heart signals, arm aid devices and (less frequently) with voice controlled systems. Introducing speech modules on autonomous systems make these devices more useful and easier to work with. ...
Full-text available
In this paper we describe speaker and command recognition related experiments, through quantile vectors and Gaussian Mixture Modelling (GMM). Over the past several years GMM and MFCC have become two of the dominant approaches for modelling speaker and speech recognition applications. However, memory and computational costs are important drawbacks, because autonomous systems suffer processing and power consumption constraints; thus, having a good trade-off between accuracy and computational requirements is mandatory. We decided to explore another approach (quantile vectors in several tasks) and a comparison with MFCC was made. Quantile acoustic vectors are proposed for speaker verification and command recognition tasks and the results showed very good recognition efficiency. This method offered a good trade-off between computation times, characteristics vector complexity and overall achieved efficiency.
... Different systems based on Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) synthesis have already been researched for their use by the handicapped community. The STARDUST project aimed to provide oral commands and control of a home environment [1,2]; the Vocal Joystick was designed to provide accessibility to computers by heavily impaired users [3,4]; and, finally, the VIVOCA project created communicative aids able to recreate the speech from a disordered user [5,6]. ...
Full-text available
This thesis deals with the research and development of speech technology-based systems for the requirements of users with different impairments and disabilities, with the final aim of im-proving their quality of life. Speech disorders are shown to be a major challenge in the work with these users. This work per-forms all the steps in the research in speech technologies: start-ing with the acquisition of an oral corpus from young impaired speakers, the analysis of the acoustic and lexical variations in the disordered speech and the characterization of speaker de-pendent Automatic Speech Recognition (ASR) systems adapted to the acoustic and lexical variants introduced by these speak-ers. Furthermore, automated methods for detection and correc-tion of lexical mispronunciations are also evaluated. The results of the experiments show the on-going possibility for develop-ing a fully personalized ASR system for handicapped users that learns the speaker's speech characteristics on-line: while the user interacts with the recognition system. The development of speech therapy tools based on the knowledge gained is another outcome of the present thesis, where the development of "Co-munica" aims to improve the possibilities for semi-automated speech therapy in Spanish.
To help both designers and people with tetraplegia fully realize the benefits of voice assistant technology, we conducted interviews with five people with tetraplegia in the home to understand how this population currently uses voice-based interfaces as well as other technologies in their everyday tasks. We found that people with tetraplegia use voice assistants in specific places, such as in their beds, or when traveling in their wheelchair. In addition, we note the inefficiencies for people with tetraplegia when using voice assistance.
Programming heavily relies on entering text using traditional QWERTY keyboards, which poses challenges for people with limited upper-body movement. Developing tools using a publicly available speech recognition API could provide a basis for keyboard-free programming. In this paper, we describe our efforts in the design, development, and evaluation of a voice-based IDE to support people with limited dexterity. We report on a formative Wizard of Oz (WOz) based design process to gain an understanding of how people would use, and what they expect from, a speech-based programming environment. Informed by the findings from the WOz, we developed VocalIDE, a prototype speech-based IDE with features such as Context Color Editing that facilitates vocal programming. Finally, we evaluate the utility of VocalIDE with 8 participants who have upper limb motor impairments. The study showed that VocalIDE significantly improves the participants' ability to make navigational edits and select text while programming.
Deep Kernel Learning (DKL) has been proven to be an effective method to learn complex feature representations by combining the structural properties of deep learning with the nonparametric flexibility of kernel methods, and it can be naturally used for supervised dimensionality reduction. However, if only limited training data are available, its performance can be compromised, because the parameters of the deep structure embedded in the model are numerous and difficult to optimize efficiently. To address this issue, we propose the Shared Deep Kernel Learning model, which combines DKL with the shared Gaussian Process Latent Variable Model. The novel method not only improves performance without increasing model complexity but also learns hierarchical features by sharing the deep kernel. Comparisons with several supervised dimensionality reduction methods and a deep learning approach verify the advantages of the proposed model.
We present a joint demonstration between the Robotics, Autonomous Systems, and Controls Laboratory (RASCAL) at UC Davis and the Columbia University Robotics Group, wherein a human-in-the-loop robotic grasping platform in the Columbia lab (New York, NY) is controlled to select and grasp an object by a C3-C4 spinal cord injury (SCI) subject in the UC Davis lab (Davis, CA) using a new single-signal, multi-degree-of-freedom surface electromyography (sEMG) human-robot interface. The grasping system breaks the grasping task into a multi-stage pipeline that can be navigated with only a few inputs. It integrates pre-planned grasps with on-line grasp planning capability and an object recognition and target selection system capable of handling multi-object scenes with moderate occlusion. Previous work performed in the RASCAL lab demonstrated that by continuously modulating the power in two individual bands in the frequency spectrum of a single sEMG signal, users were able to control a cursor in 2D for cursor to target tasks. Using this paradigm, four targets were presented in order for the subject to command the multi-stage grasping pipeline. We demonstrate that using this system, operators are able to grasp objects in a remote location using a robotic grasping platform.
With the recent progress in computer hardware and computer graphics (CG) techniques, applications using 3D virtual space are becoming popular. So far, a mouse and a keyboard have generally been used in these applications. While the mouse is a very successful input device for continuously controlling 2D objects, it is not necessarily intuitive for controlling 3D objects. For controlling 3D objects such as an avatar or a moving camera in a virtual space, a speech interface has the potential to be a more natural and powerful alternative to a mouse. We propose a speech-based direct manipulation interface, based on stretched word-end voicing, that controls continuous movements of 3D objects. By combining the proposed method with normal word-based commands, both continuous movements and discrete actions are seamlessly controlled, so everything can be controlled using speech. The proposed method is implemented as an interface to the Second Life system. We compare it with a conventional speech-based method that specifies the start and end timing of motions. Analyses based on human subjects show that the proposed method is superior to the conventional speech-based method. Moreover, we show that the best result is obtained when both methods are combined.
We present a novel voice-based human-computer interface designed to enable individuals with motor impairments to use vocal parameters for continuous control tasks. Since discrete spoken commands are ill-suited to such tasks, our interface exploits a large set of continuous acoustic-phonetic parameters like pitch, loudness, vowel quality, etc. Their selection is optimized with respect to automatic recognizability, communication bandwidth, learnability, suitability, and ease of use. Parameters are extracted in real time, transformed via adaptation and acceleration, and converted into continuous control signals. This paper describes the basic engine, prototype applications (in particular, voice-based web browsing and a controlled trajectory-following task), and initial user studies confirming the feasibility of this technology.
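A minimal sketch of the engine's final stage as described above: a recognized vowel picks a movement direction, and the loudness estimate scales the speed. The four-vowel table and its angles are hypothetical placeholders, not the interface's actual vowel-to-direction mapping.

```python
import math

# Hypothetical vowel-to-direction table: the interface maps vowel quality
# to direction and loudness to speed; these entries are placeholders.
VOWEL_ANGLES = {"a": 270.0, "i": 90.0, "u": 180.0, "ae": 0.0}  # degrees

def control_signal(vowel, loudness, max_speed=30.0):
    """Convert one frame's recognized vowel and 0..1 loudness estimate
    into a 2-D cursor velocity (dx, dy)."""
    angle = math.radians(VOWEL_ANGLES[vowel])
    speed = max_speed * min(max(loudness, 0.0), 1.0)  # clamp, then scale
    return speed * math.cos(angle), speed * math.sin(angle)

dx, dy = control_signal("i", 0.5)  # half-loudness "i": straight up at 15 px
```

Running such a conversion every frame yields the smooth, continuous motion that discrete spoken commands cannot provide.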
We describe several approaches for using prosodic features of speech and audio localization to control interactive applications. This information can be applied to parameter control, as well as to speech disambiguation. We discuss how characteristics of spoken sentences can be exploited in the user interface; for example, by considering the speed with which a sentence is spoken and the presence of extraneous utterances. We also show how coarse audio localization can be used for low-fidelity gesture tracking, by inferring the speaker's head position.
We present a novel approach to estimating the first two formants (F1 and F2) of a speech signal using graphical models. Using a graph that takes advantage of less commonly used features of Bayesian networks, both v-structures and soft evidence, the model presented here shows that it can learn to perform reasonably without large amounts of training data, even with minimal processing on the initial signal. It far outperforms a factorial HMM using the same assumptions and suggests that, with further refinement, the model may produce high-quality formant tracks.
We introduce a novel method for adapting discriminative classifiers (multi-layer perceptrons (MLPs) and support vector machines (SVMs)). Our method is based on the idea of regularization, whereby the optimization cost criterion to be minimized includes a penalty according to how "complex" the system is. Specifically, our regularization term penalizes according to how different an adapted system is from the unadapted system, thus avoiding the problem of overtraining when only a small amount of adaptation data is available. We justify this approach using a max-margin argument. We apply this technique to MLPs and produce a working real-time system for rapid adaptation of vowel classifiers in the context of the Vocal Joystick project. Overall, we find that our method outperforms all other MLP-based adaptation methods we are aware of. Our technique, however, is quite general and can be used whenever rapid adaptation of MLP or SVM classifiers is needed (e.g., from a speaker-independent to a speaker-dependent classifier in a hybrid MLP/HMM or SVM/HMM speech-recognition system).
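The regularization idea is easy to sketch on a toy linear classifier (the paper applies it to MLPs and SVMs): minimize the task loss plus a penalty on how far the adapted weights move from the unadapted, speaker-independent weights. All values below are illustrative, not the paper's setup.

```python
import math

def adapt(w0, data, lam, lr=0.05, steps=300):
    """Adapt a linear classifier to a small data set while penalizing
    deviation from the unadapted (speaker-independent) weights w0, i.e.
    minimize  sum_i log(1 + exp(-y_i * w . x_i)) + lam * ||w - w0||^2.
    A toy logistic-regression stand-in for the paper's MLP/SVM setting."""
    w = list(w0)
    for _ in range(steps):
        # gradient of the regularizer: pulls w back toward w0
        grad = [2.0 * lam * (wj - w0j) for wj, w0j in zip(w, w0)]
        for x, y in data:
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coef = -y / (1.0 + math.exp(margin))  # logistic-loss gradient
            for j, xj in enumerate(x):
                grad[j] += coef * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

w0 = [1.0, -1.0]                            # unadapted model
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]  # tiny adaptation set
w_strong = adapt(w0, data, lam=5.0)         # strong penalty: stays near w0
w_weak = adapt(w0, data, lam=0.01)          # weak penalty: drifts further
```

With little adaptation data, the strong penalty keeps the adapted model close to the speaker-independent one, which is exactly the overtraining protection the abstract describes.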
Conventional MLP classifiers used in phonetic recognition and speech recognition may encounter local minima during training, and they often lack an intuitive and flexible adaptation approach. This paper presents a hybrid MLP-SVM classifier and its associated adaptation strategy, where the last layer of a conventional MLP is learned and adapted in the maximum-separation-margin sense. This structure also provides a support-vector-based adaptation mechanism which better interpolates between a speaker-independent model and speaker-dependent adaptation data. Preliminary experiments on vowel classification have shown promising results for both MLP learning and adaptation problems.
The interaction between the user and a graphical interface relies on a protocol that defines the semantics of the actions performed by the user on the input devices. This protocol binds each action to an effect on the graphical interface. We propose a system that adapts this protocol to the particular context of the graphical interface in order to speed up many operations. This approach is particularly useful for helping people with physical disabilities interact comfortably with the most common software tools, like web browsers. In particular, we show how the proposed scheme can improve the usability of a voice input device that can be used to control the pointer on the screen instead of a mechanical mouse.
Electrical Engineering: Discrete-Time Processing of Speech Signals. Commercial applications of speech processing and recognition are fast becoming a growth industry that will shape the next decade. Now students and practicing engineers of signal processing can find in a single volume the fundamentals essential to understanding this rapidly developing field. IEEE Press is pleased to publish a classic reissue of Discrete-Time Processing of Speech Signals. Specially featured in this reissue is the addition of valuable World Wide Web links to the latest speech data references. This landmark book offers a balanced discussion of both the mathematical theory of digital speech signal processing and critical contemporary applications. The authors provide a comprehensive view of all major modern speech processing areas: speech production physiology and modeling, signal analysis techniques, coding, enhancement, quality assessment, and recognition. You will learn the principles needed to understand advanced technologies in speech processing, from speech coding for communications systems to biomedical applications of speech analysis and recognition. Ideal for self-study or as a course text, this far-reaching reference book offers an extensive historical context for concepts under discussion, end-of-chapter problems, and practical algorithms. Discrete-Time Processing of Speech Signals is the definitive resource for students, engineers, and scientists in the speech processing field.
We propose and describe several methods for using speech power as an estimate of intentional loudness, along with a mapping from this loudness estimate to a continuous control signal. This is performed in the context of a novel voice-based human-computer interface designed to enable individuals with motor impairments to use vocal tract parameters for both discrete and continuous control tasks. The interface uses vocal gestures to control continuous movement and discrete sounds for other events. We conduct a user preference survey to gauge user reaction to the various methods in a mouse-cursor control context. We find that loudness is an effective mechanism for controlling mouse cursor movement speed when mapping vocalic gestures to spatial position.
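A minimal sketch of the idea: estimate loudness from frame power (RMS), compress it to a 0..1 level, and map that level to a cursor speed. The floor, ceiling, and exponent below are illustrative choices, not the paper's calibrated mapping.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def loudness_to_speed(rms, floor=1e-4, ceil=1.0, gamma=2.0, max_speed=40.0):
    """Map an RMS loudness estimate to a cursor speed in pixels per frame:
    log compression to a 0..1 level, then a power-law curve. The floor,
    ceiling, and gamma are illustrative, not the paper's calibration."""
    rms = min(max(rms, floor), ceil)
    level = math.log(rms / floor) / math.log(ceil / floor)  # 0..1
    return max_speed * level ** gamma

quiet = [0.001 * s for s in (1, -1) * 128]  # stand-in low-amplitude frame
loud = [0.5 * s for s in (1, -1) * 128]     # stand-in high-amplitude frame
print(round(loudness_to_speed(frame_rms(quiet)), 2))  # 2.5
print(round(loudness_to_speed(frame_rms(loud)), 2))   # 34.21
```

The log step roughly matches perceived loudness, so equal vocal effort changes produce similar speed changes across the dynamic range.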
Many pitch trackers based on dynamic programming require meticulous design of local cost and transition cost functions. The forms of these functions are often empirically determined and their parameters are tuned accordingly. Parameter tuning usually requires great effort without a guarantee of optimal performance. This work presents a graphical model framework to automatically optimize pitch tracking parameters in the maximum likelihood sense. Therein, probabilistic dependencies between pitch, pitch transition and acoustical observations are expressed using the language of graphical models, and probabilistic inference is accomplished using the Graphical Model Toolkit (GMTK). Experiments show that this framework not only expedites the design of a pitch tracker, but also yields remarkably good performance for both pitch estimation and voicing decision.
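The hand-tuned dynamic-programming baseline the abstract contrasts with can be sketched generically: pick one pitch candidate per frame minimizing summed local costs plus transition costs (a Viterbi-style search). The toy cost functions below are illustrative; the paper's contribution is learning such parameters in a graphical-model framework instead of tuning them by hand.

```python
def track_pitch(frames, candidates, local_cost, trans_cost):
    """Viterbi-style pitch tracking: choose one candidate per frame to
    minimize the total of per-frame local costs and pairwise transition
    costs between consecutive frames."""
    best = [local_cost(frames[0], c) for c in candidates]
    back = []
    for t in range(1, len(frames)):
        prev = best
        best, ptr = [], []
        for c in candidates:
            costs = [prev[j] + trans_cost(p, c)
                     for j, p in enumerate(candidates)]
            j = min(range(len(candidates)), key=costs.__getitem__)
            ptr.append(j)                       # best predecessor for c
            best.append(costs[j] + local_cost(frames[t], c))
        back.append(ptr)
    j = min(range(len(best)), key=best.__getitem__)
    path = [candidates[j]]
    for ptr in reversed(back):                  # trace predecessors back
        j = ptr[j]
        path.append(candidates[j])
    return path[::-1]

# Toy example: each frame's "observation" is a raw pitch estimate in Hz.
frames = [100.0, 103.0, 210.0, 108.0]           # 210 is an octave glitch
candidates = [100.0, 105.0, 110.0, 200.0, 210.0]
local = lambda obs, c: abs(obs - c)             # fit to the observation
trans = lambda p, c: 0.5 * abs(p - c)           # smoothness penalty
print(track_pitch(frames, candidates, local, trans))
```

With the transition penalty in place, the octave glitch in frame 3 is smoothed toward the neighboring low-pitch frames rather than followed literally.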
We describe the use of non-verbal features of voice for the direct control of interactive applications. Traditional speech recognition interfaces are based on an indirect, conversational model: first the user gives a command, and then the system performs the corresponding operation. Our goal is to achieve more direct, immediate interaction, as with a button or joystick, by using lower-level features of voice such as pitch and volume. We are developing several prototype interaction techniques based on this idea, such as "control by continuous voice," "rate-based parameter control by pitch," and "discrete parameter control by tonguing." We have implemented several prototype systems, and they suggest that voice-as-sound techniques can enhance the traditional voice recognition approach.
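The "control by continuous voice" technique mentioned above can be illustrated with a toy loop: an action continues exactly as long as voicing power stays above a threshold, like holding down a button. The threshold and step values below are arbitrary assumptions for the sketch.

```python
def continuous_voice_control(power_frames, threshold=0.01, step=5):
    """'Control by continuous voice': an action (here, moving a scroll
    position) continues for as long as voicing is detected. Illustrative
    sketch of the interaction style, not the authors' implementation."""
    position = 0
    for p in power_frames:
        if p >= threshold:   # user is vocalizing: keep moving
            position += step
    return position

# Voicing for three frames, silence, then voicing for two more frames:
powers = [0.2, 0.3, 0.25, 0.0, 0.0, 0.15, 0.2]
print(continuous_voice_control(powers))  # 25
```

No words are recognized at all; only the presence of voicing matters, which is what makes the interaction feel immediate rather than conversational.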