Talking to a system and oneself: A study from a Speech-to-Speech, Machine
Translation mediated Map Task
Hayakawa Akira, Fasih Haider, Saturnino Luz,
Loredana Cerrato, Nick Campbell
ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin, Ireland
Usher Institute of Population Health Sciences & Informatics, University of Edinburgh, UK
{campbeak, haiderf, cerratol, nick}@tcd.ie, S.Luz@ed.ac.uk
Abstract
The results of a comparison between three different speech types — On-Talk, speaking to a computer; Off-Talk Self, speaking to oneself; and Off-Talk Other, speaking to another person — uttered by subjects in a collaborative interlingual task mediated by an automatic speech-to-speech translation system are reported here. The characteristics of the three speech types show significant differences in terms of speech rate (F(2,2719) = 101.7; p < 2e-16), and for this reason a detection method was implemented to see whether they could also be detected with good accuracy on the basis of their acoustic and biological characteristics. Acoustic and biological measures provide good results in distinguishing between On-Talk and Off-Talk, but have difficulty distinguishing the sub-categories of Off-Talk: Self and Other.
Index Terms: speech recognition, human-computer interaction, computational paralinguistics
1. Introduction
People talking to a computer can sometimes speak aside to themselves. This may be to repeat what is displayed on the computer screen, to think out loud, to vent frustration, or to personify the computer and give it a hypothetical pat on the back.
This behaviour has been observed before in speakers interact-
ing with elaborate automatic dialogue systems [1], [2], and has
been referred to as Off-Talk following Oppermann et al. [3],
where Off-Talk is defined as comprising “every utterance that is
not directed to the system as a question, a feedback utterance
or as an instruction”. Batliner et al. [2] referred to On-Talk
as a default register for interaction with computers. Several studies have shown that how people talk to computers differs from how they talk to other humans [4], [5]. In this study, we do not try to define Computer Talk, but simply to differentiate On-Talk and Off-Talk in a Speech-to-Speech (S2S), Machine Translation (MT) mediated, task-oriented interaction. In a previous study [6] with the ILMT-s2s corpus, we concluded that
subjects preferred a communication setup where they could not
see the interlocutor. A possible side effect of this setup is the
reduction of back channeling — metadata related to the under-
standing/completion of an instruction is not transmitted to the
interlocutor, since facial cues and gestures usually carry this in-
formation. We think that a system trained to recognise these utterances could enhance its performance by either not reacting to them, or by processing them in a special way, for instance on a meta level, as an indication of the (mal-)functioning of the system or as an additional feedback channel to the interlocutor.
Previous studies contrasting On-Talk and Off-Talk, focussing on the phonetic and prosodic delivery of utterances [2], have shown that Computer Talk (i.e. On-Talk) is generally similar to talking to someone who is hard of hearing: more hyperarticulated, with higher energy. Branigan et al. [4] even mention that communication with a computer is more exaggerated than communication with a fellow human. Automatic speech recognition (ASR) systems do not always work as they should, and this can trigger different repair strategies from speakers. These strategies are meant to make the speech easier for the system to understand, but often end up being even more difficult for ASR systems to process, causing an increase in recognition error rates. What has been less investigated is the speaker's reaction in the form of Off-Talk consisting of comments about the malfunctioning of the communication, whether due to the system or to the difficulty of the task. In our study we look at On-Talk and two variants of Off-Talk produced by users of a computer system that mediates their interlingual S2S interactions in a collaborative task.
2. Material
The data used in this study is part of the ILMT-s2s corpus [7] and comprises the speech of 30 subjects: 15 annotated and recorded dialogues between speakers of two different languages (English and Portuguese), together with biological signals recorded by means of biosignal tracking devices.
2.1. The ILMT-s2s System
Two subjects, seated in two different rooms, used the ILMT-s2s system (Figure 1) to communicate with each other. The ILMT-s2s system uses off-the-shelf components to perform speech-to-speech machine translation. It is activated by a “Push-to-talk” button that the subject clicks and holds for the duration of the utterance and releases once the utterance is finished. Neither subject can hear the other’s voice, since the output of the ASR and MT is delivered by a synthetic voice.
[Figure 1 shows the ILMT-s2s system: each subject's screen has a "Push to talk" button and three panels ("What I said", "Translation", "What the other said"); the speech is passed through the Google Speech API (ASR), Microsoft Bing translation (MT) and the Apple TTS system voice (TTS), connected by a Python script over the Internet.]
Figure 1: ILMT-s2s system used to collect the data
2.2. Audio, Video and Biosignal Recordings
Two audio and five video sources are included in the ILMT-s2s corpus. Of these, the audio from the two video cameras that captured the images in Figure 2 was used for this study, since they recorded the whole dialogue from start to end.
To record the biosignals, a Mind Media B.V. NeXus-4 was used to collect the Heart Rate (HR) from Blood-Volume Pulse (BVP) readings, the Skin Conductance (SC), and the brain's electrical activity through electroencephalography (EEG). The BVP sensor was placed on the index finger, and the SC sensors on the middle and ring fingers. EEG sensors were placed at F4, C4 and P4, with a ground channel at A1 of the 10–20 location system. The sampling frequencies for the SC, HR and EEG were 32 kHz, 32 kHz and 1,024 kHz respectively.
2.3. The Subjects and Recording Environment
The subjects were recruited via the Trinity College Dublin digital noticeboard or through personal connections. Fifteen recordings were collected, involving fifteen native English speakers (5, 10) and fifteen native Portuguese speakers (11, 4), aged between 18 and 45. Each recording session was conducted in a working office; sessions lasted between 20 and 74 minutes and contain between 43 and 219 transcribed utterances. During each recording session one subject was fitted with the biosignal recording device, while the other was not (Figure 2).
Figure 2: Subjects during recordings (without and with the biosignal monitor)
2.4. The Map Task Technique
Maps from the HCRC Map Task corpus [8] were used to elicit
the task-oriented conversation between the subjects. Of the sixteen HCRC Map Task maps, map 01 and map 07 were used, with a copy translated into Portuguese for the Portuguese speaker. As with the HCRC Map Task, the subjects in each recording were given the role of either Information Giver (IG) or Information Follower (IF), where the IG has a map with a route drawn on it. The IG has to instruct the IF to draw the route on his/her unmarked copy of the map. Each map contains a number of landmarks (e.g., “white mountain”, “baboons”, “crest falls”) which may or may not be common to both maps (Figure 3). This difference between the IG’s and IF’s maps, combined with the fact that neither subject can see the other’s map, adds to the complexity of the task.
2.5. On-Talk, Off-Talk Labels
We used the dedicated annotation tool ELAN [9] to label the transcription with On-Talk and Off-Talk (Self and Other). As mentioned in § 1, On-Talk denotes locations within the dialogue where the subject is talking to the ILMT-s2s system in order to communicate with the other subject, and Off-Talk denotes utterances that are not directed at the ILMT-s2s system. Off-Talk was further subcategorised into Self and Other: Self being Off-Talk to oneself, and Other being Off-Talk to another person in the room; the other person could be the technician of the experiment, who entered the room on the few occasions when the system crashed, or a university member using the office for other purposes.
[Figure 3 reproduces HCRC Map Task map 01 as given to the Information Giver (IG, Map 1g) and the Information Follower (IF, Map 1f); landmarks such as "baboons" and "cobbled street" appear only on the IG map, while "slate mountain", "banana tree" and "lemon grove" appear only on the IF map.]
Figure 3: Map 01, with differences highlighted
On-Talk locations were retrieved from the ILMT-s2s sys-
tem’s log and all other utterances were annotated manually for
Off-Talk Self and Off-Talk Other (Table 1).
Table 1: Total no. of utterance types in the ILMT-s2s corpus.

Utterances   w/ Biosignals   w/o Biosignals    Total
On-Talk            1,110            1,329      2,439
Off-Talk             579              610      1,189
  Self               370              478        848
  Other              209              132        341
Total              1,689            1,939      3,628
3. Method and Results
Based on the following speech rate comparison of the data, a significant difference between On-Talk and Off-Talk was observed in the subjects' speech rate (§ 3.2). Since the speech rates of all three talk types overlap, we also experimented with the data to see whether On-Talk and Off-Talk can be detected automatically by other means (§ 3.3).
3.1. Method: Speech rate comparison
For this analysis, a 180 wpm TTS rendering of all the utterances was made using the same synthetic voice as the ILMT-s2s system and then segmented using Praat [10] to obtain a reference utterance duration, as in our previous study [11]. The reference utterance duration was then used to calculate a percentage difference from the original subject utterance, (1 - S/T), where S is the duration of the speaker's utterance and T is the duration of the TTS output; a positive result indicates speech faster than the ILMT-s2s system's TTS output and a negative result indicates slower speech. However, due to the higher ratio of single-word utterances (e.g., “umm”, “ok”, “yes”, “what?”, “ah”, etc.) in Off-Talk Self, single-word utterances were removed from the data to reduce the inflation of the standard deviation that they cause (sd of all utterances with one-word utterances: 74.14; without: 47.09). This resulted in 2,093 On-Talk and 629 Off-Talk (395 Self and 243 Other) utterance speech rates being used for this analysis.
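To make the measure concrete, the following minimal Python sketch computes the percentage difference (1 - S/T) and drops single-word utterances as described above. The durations and word strings are hypothetical placeholders, not corpus values.

def speech_rate_difference(speaker_duration_s, tts_duration_s):
    """Return the percentage difference (1 - S/T) * 100.

    Positive values mean the speaker was faster than the 180 wpm TTS
    reference, negative values mean slower.
    """
    return (1.0 - speaker_duration_s / tts_duration_s) * 100.0

# Hypothetical utterances: (words, S in seconds, T in seconds)
utterances = [
    ("go to the left side of the mountain", 3.10, 2.45),
    ("yes", 0.40, 0.35),            # single-word utterances are excluded
    ("one inch away from the left of the paper", 2.60, 2.90),
]

for words, s, t in utterances:
    if len(words.split()) > 1:      # exclude single-word utterances
        print(f"{words!r}: {speech_rate_difference(s, t):+.1f}%")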
Preliminary tests of the dialogues show that within the first thirty seconds of a dialogue there is no significant difference in speech rate between On-Talk and Off-Talk (F(1,44) = 0.031; p = 0.862). Even when the data is expanded to the first one hundred seconds, the difference is only marginally significant (F(1,121) = 4.005; p = 0.048). This suggests that the subjects started the dialogues with similar speech rates.
Furthermore, as previously reported in [12], [11], a correlation between Word Error Rate (WER) and hyperarticulation has been identified. However, we observed that for the fourteen subjects whose first ASR results were one hundred percent accurate, the onset of hyperarticulation preceded the first ASR error. If hyperarticulation were a reaction to WER, it should start only after the first ASR error. This indicates that communication through the ILMT-s2s system was not the only cause of hyperarticulation for the subjects.
To establish that there is a difference between the talk types, the following null hypothesis is tested on each individual subject, and on the various categories into which the subjects can be divided within the corpus settings.
H0: The means of the utterance speech rate differences are the same for all talk types.
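As an illustration of how H0 can be tested for a single subject, the following sketch applies a one-way ANOVA (scipy.stats.f_oneway) to speech-rate differences grouped by talk type. The group sizes and values below are random placeholders, not corpus data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
on_talk        = rng.normal(-40, 45, size=120)   # hypothetical rate differences (%)
off_talk_self  = rng.normal(-10, 45, size=25)
off_talk_other = rng.normal(  0, 45, size=12)

# 2-type test: On-Talk vs. all Off-Talk
f2, p2 = stats.f_oneway(on_talk, np.concatenate([off_talk_self, off_talk_other]))

# 3-type test: On-Talk vs. Off-Talk Self vs. Off-Talk Other
f3, p3 = stats.f_oneway(on_talk, off_talk_self, off_talk_other)

print(f"2 types: F = {f2:.2f}, p = {p2:.3g}")
print(f"3 types: F = {f3:.2f}, p = {p3:.3g}")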
3.2. Result: Speech rate comparison
Of the 30 subjects, 15 have less than 12.22% Off-Talk utterances in their dialogues, of whom 3 have no Off-Talk utterances at all — 1,047 On-Talk and 52 Off-Talk utterances (43 Self and 9 Other; mean proportion of Off-Talk within each dialogue: 5.03%, sd: 4.2%). The remaining 15 subjects have between 12.24% and 61.95% Off-Talk utterances in their dialogues — 1,046 On-Talk and 577 Off-Talk utterances (352 Self and 225 Other; mean proportion of Off-Talk within each dialogue: 34.83%, sd: 14.0%).
The ANOVA results show that 15 of the 30 subjects have a significant difference between the 2 types, On-Talk and Off-Talk. Of these 15 subjects, 2 have only 8.05% and 10.81% Off-Talk in their dialogues, while the other 13 have between 12.24% and 61.95% (mean: 36.33%, sd: 14.2%).
When the test is run for the 3 types (On-Talk, Off-Talk Self, Off-Talk Other), 17 of the same 30 subjects show a significant difference in speech rate. Compared to the 2-type comparison above, two additional subjects, with 12.22% and 32.53% Off-Talk respectively, show significant differences.
Following the ANOVA tests of individual subjects, and to confirm that this difference is not merely a characteristic of a specific category within the corpus, the test was also performed with the subjects divided by category. The results show a significant difference in all categories (Table 2 and Figure 4). The removal of two-word utterances yielded similar results.
[Figure 4 plots the speech-rate differences of the 3 talk types (On-Talk, Off-Talk Self, Off-Talk Other) for each category (All, IG, IF, Male, Female, Eng, Port, w/ Video, w/o Video, w/ Bio, w/o Bio), marking which comparisons reach significance.]
Figure 4: The 3 talk types plotted with 0 indicating the same speech rate as the TTS reference output, positive % points as faster than the TTS output, and negative % points as slower
Table 2: H0 of the 3 talk types in each category

Category         ANOVA results
H0 – All         F(2,2719) = 101.7;  p < 2e-16 (***)
H0 – IG          F(2,1475) = 86.59;  p < 2e-16 (***)
H0 – IF          F(2,1241) = 21.30;  p < 8.06e-10 (***)
H0 –             F(2,1465) = 84.24;  p < 2e-16 (***)
H0 –             F(2,1251) = 41.17;  p < 2e-16 (***)
H0 – En          F(2,1457) = 89.27;  p < 2e-16 (***)
H0 – Pt          F(2,1259) = 29.96;  p < 1.94e-13 (***)
H0 – Pt-Pt       F(2,1131) = 8.15;   p < 0.000305 (***)
H0 – w/ Video    F(2,1574) = 79.15;  p < 2e-16 (***)
H0 – w/o Video   F(2,1142) = 23.46;  p < 1.03e-10 (***)
H0 – w/ Bio      F(2,1397) = 48.78;  p < 2e-16 (***)
H0 – w/o Bio     F(2,1127) = 54.13;  p < 2e-16 (***)
3.3. Method: Detection of On-Talk & Off-Talk
For the following experiments, the start and end times of the On-Talk and Off-Talk label annotations were used to segment the synchronised audio and biosignal files. Two of the fifteen EEG recordings provided faulty readings and were excluded from the dataset. This resulted in 1,127 On-Talk and 554 Off-Talk (422 Self and 132 Other) utterance locations being used for this experiment.
For the detection of On-Talk and Off-Talk we extract features from the audio and biosignals and explore the potential of these features to identify On-Talk and Off-Talk in two experiments:
Exp. 1: A 2-class experiment in which we only distinguish between On-Talk and Off-Talk.
Exp. 2: A 3-class experiment in which we distinguish between On-Talk, Off-Talk Self and Off-Talk Other.
3.3.1. Feature Extraction
The following features were used for the classification task.
Audio features: For the classification task we use the INTERSPEECH 2013 Computational Paralinguistics Challenge (ComParE) feature set [13]. This contains energy, spectral, cepstral (MFCC) and voicing-related low-level descriptors, as well as other descriptors such as logarithmic harmonic-to-noise ratio (HNR), spectral harmonicity, and psychoacoustic spectral sharpness. To discard the most irrelevant acoustic features, a K-means clustering algorithm is employed: the feature set is divided into 9 clusters, and only the cluster containing the highest number of features is retained for classification. As a result, the total number of acoustic features is reduced from 6,373 to 6,356.
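The paper does not detail how each acoustic feature is represented for the clustering step; the sketch below assumes that each feature is represented by its standardised column of values across all utterances, clusters the features into 9 groups with K-means, and keeps the largest group. The matrix sizes are toy values standing in for the ComParE features.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def select_largest_feature_cluster(X, n_clusters=9, seed=0):
    """X: (n_utterances, n_features) acoustic feature matrix.
    Returns the indices of the features in the most populated cluster."""
    Xz = StandardScaler().fit_transform(X)                 # standardise each feature
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(Xz.T)   # cluster the features
    counts = np.bincount(labels, minlength=n_clusters)
    return np.flatnonzero(labels == counts.argmax())       # keep the largest cluster

# Toy example with random data standing in for the ComParE feature matrix
X = np.random.default_rng(1).normal(size=(200, 500))
kept = select_largest_feature_cluster(X)
print(f"kept {kept.size} of {X.shape[1]} features")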
Biosignal features: For the biosignals (HR, SC and EEG) we calculate the Shannon entropy, mean, standard deviation, median, mode, maximum value, minimum value, maximum ratio, minimum ratio, energy and power. This feature set is calculated for each biosignal and for its first and second order derivatives, giving 33 features per biosignal in total. The EEG gamma signals from sensors A and B (10–20 system: F4–C4 and C4–P4) are considered in this study due to their higher prediction power for mental task classification [14]. The minimum ratio of an observation is measured by counting the number of instances which have a lower value than both their preceding and following instances, divided by the total number of instances in that observation. Similarly, the maximum ratio of an observation is measured by counting the number of instances which have a higher value than both their preceding and following instances, divided by the total number of instances in that observation.
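The biosignal feature set can be sketched as follows: 11 statistics per segment, repeated for the first and second order derivatives (3 x 11 = 33 features). The histogram-based entropy estimator and the rounding used to take the mode of a continuous signal are assumptions, as the paper does not specify them.

import numpy as np

def shannon_entropy(x, bins=32):
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def extrema_ratios(x):
    """Maximum/minimum ratio as defined in the text: the fraction of samples
    that are higher (lower) than both their neighbours."""
    inner, prev, nxt = x[1:-1], x[:-2], x[2:]
    max_ratio = np.sum((inner > prev) & (inner > nxt)) / len(x)
    min_ratio = np.sum((inner < prev) & (inner < nxt)) / len(x)
    return max_ratio, min_ratio

def segment_statistics(x):
    values, counts = np.unique(np.round(x, 3), return_counts=True)
    mode = values[counts.argmax()]                  # mode over rounded values
    max_ratio, min_ratio = extrema_ratios(x)
    return [shannon_entropy(x), np.mean(x), np.std(x), np.median(x), mode,
            np.max(x), np.min(x), max_ratio, min_ratio,
            np.sum(x ** 2),                         # energy
            np.mean(x ** 2)]                        # power

def biosignal_features(x):
    """33 features: the 11 statistics for the signal, its 1st and 2nd derivatives."""
    feats = []
    for sig in (x, np.diff(x), np.diff(x, n=2)):
        feats.extend(segment_statistics(sig))
    return np.asarray(feats)

# Toy segment standing in for e.g. a skin-conductance reading
print(biosignal_features(np.random.default_rng(2).normal(size=256)).shape)  # (33,)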
3.3.2. Classification Method
The classification method was implemented in MATLAB (http://uk.mathworks.com/products/matlab/) using the Statistics and Machine Learning Toolbox, and employed discriminant analysis in 10-fold cross-validation experiments. The classification method works by assuming that the feature sets of the classes to be discerned are drawn from different Gaussian distributions and adopting a pseudo-linear discriminant analysis (i.e. using the pseudo-inverse of the covariance matrix [15]).
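The original classifier is MATLAB's pseudo-linear discriminant analysis; the Python sketch below re-implements the same idea (a linear discriminant with a pooled covariance matrix inverted via the pseudo-inverse, evaluated with 10-fold cross-validation) on placeholder data, and is not the authors' code.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

class PseudoLinearDA:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                     for c in self.classes_) / (len(y) - len(self.classes_))
        self.icov_ = np.linalg.pinv(pooled)            # pseudo-inverse, cf. [15]
        return self

    def predict(self, X):
        # Linear discriminant score for each class, then pick the largest
        scores = np.stack([
            X @ self.icov_ @ m - 0.5 * m @ self.icov_ @ m + np.log(p)
            for m, p in zip(self.means_, self.priors_)], axis=1)
        return self.classes_[scores.argmax(axis=1)]

# Toy data standing in for the fused audio/biosignal features and labels
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 40))
y = rng.choice(["On-Talk", "Off-Talk"], size=300)

preds, truth = [], []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = PseudoLinearDA().fit(X[tr], y[tr])
    preds.append(clf.predict(X[te]))
    truth.append(y[te])
print(f1_score(np.concatenate(truth), np.concatenate(preds),
               average=None, labels=["On-Talk", "Off-Talk"]))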
3.4. Result: Detection of On-Talk & Off-Talk
The following results were obtained. See Table 3 and Figure 5
for details.
[Table 3 and Figure 5 report the discriminant analysis F-scores (%) for each feature set (EEG, HR, SC, HR + SC, All Bio, Audio, Audio + EEG, Audio + HR + SC, Audio + All Bio) against the classes of Exp. 1 (On-Talk, Off-Talk) and Exp. 2 (On-Talk, Off-Talk Self, Off-Talk Other).]
Table 3: Discriminative Analysis Method Results – F-Score (%)
Figure 5: Discriminative Analysis Method Results
The results of experiment 1 show that the acoustic and biological measures contribute significantly to the prediction of On-Talk and Off-Talk. The acoustic feature set provides the best performance, with maximum F-scores of 94.14% for On-Talk and 87.55% for Off-Talk. The SC feature set performs better than the other biological features, and a fusion of the bio feature sets increases prediction further. A fusion of acoustic and bio features improves performance in two cases, but has almost no effect compared to the audio features alone when the audio features are fused with all the bio features.
From the results of experiment 2 we can see that the 3-class results for On-Talk are almost the same as the 2-class On-Talk results. Results for Off-Talk Other are poor using bio features alone (max. 13.33%) but improve considerably when combined with the acoustic feature set (38.13%); considering that the dataset is imbalanced, with fewer instances of Off-Talk Other (7.85%), these results can be regarded as quite good. HR is found to have more prediction power than EEG and SC, and the fusion of biosignals improves the prediction. A decrease in the Off-Talk Other results is observed when the audio feature set (36.64%) is combined with EEG (33.60%) or with HR and SC (34.63%). This might be due to the lower number of bio features, since when we fuse them all together (All Bio: HR, SC, and EEG), increasing the number of bio features, we obtain the highest F-score (38.13%), as expected. Although the acoustic feature set performs best compared to the other signal sets, we believe there is still room for improvement from the biosignals, since they currently use a limited number of features (only 33 per signal) and may contain noise components (e.g., subjects' head movements).
4. Discussion and Conclusion
The main motivation of this study, apart from its novelty, was to verify whether there is a distinguishable difference between On-Talk, Off-Talk Self and Off-Talk Other that would allow an interactive system to provide better performance and a better understanding of the interlocutor. This was achieved, with a clear significant difference, a moderate Cohen's d estimate and good prosodic prediction results. However, the sub-finding that hyperarticulation was not initiated by the ASR WER is of interest, and the significant differences between the On-Talk of IG and IF (m = 47.05, sd = 44.43 vs. m = 26.99, sd = 42.48) and of female and male subjects (m = 48.49, sd = 48.23 vs. m = 27.92, sd = 38.24) shown in Figure 4 need further investigation. It is plausible that the contrast between the perceived simplicity of the map task and the actual complexity of providing understandable instructions caused the initial hyperarticulation. This, combined with the difficulty of using the computer-mediated interaction system, may be the cause of this difference, and it will be interesting to see whether the speakers of the original HCRC Map Task displayed similar hyperarticulation differences.
It must also be mentioned that the method described in § 3.3 in general provides good results for predicting On-Talk and Off-Talk, but the results of experiment 2 leave a need to explore other prosodic and biological discriminative feature sets (notably using the higher frequency bands of the EEG signal).
5. Acknowledgements
This research is supported by Science Foundation Ireland
through the CNGL Programme (Grant 12/CE/I2267) in the
ADAPT Centre (www.adaptcentre.ie) at Trinity College Dublin,
and by the FP7-METALOGUE project under Grant No. 611073.
6. References
[1] A. Batliner, C. Hacker, and E. Nöth, “To talk or not to talk with a computer: On-talk vs. off-talk,” How People Talk to Computers, Robots, and Other Artificial Communication Partners, p. 79, 2006.
[2] ——, “To talk or not to talk with a computer: Taking into account the user’s focus of attention,” Journal on Multimodal User Interfaces, vol. 2, no. 3-4, pp. 171–186, 2009.
[3] D. Oppermann, F. Schiel, S. Steininger, and N. Beringer, “Off-talk – a problem for human-machine-interaction?” in Proceedings of INTERSPEECH’01: the 2nd Annual Conference of the International Speech Communication Association. Aalborg, Denmark: Citeseer, 2001, pp. 2197–2200.
[4] H. P. Branigan, M. J. Pickering, J. Pearson, and J. F. McLean, “Linguistic alignment between people and computers,” Journal of Pragmatics, vol. 42, no. 9, pp. 2355–2368, 2010.
[5] K. Fischer, “How people talk with robots: Designing dialog to reduce user uncertainty,” AI Magazine, vol. 32, no. 4, pp. 31–38, 2011.
[6] L. Cerrato, A. Hayakawa, N. Campbell, and S. Luz, “A speech-to-speech, machine translation mediated map task: An exploratory study,” in Proceedings of the Future and Emerging Trends in Language Technology, Seville, Spain, 2015, in press.
[7] A. Hayakawa, S. Luz, L. Cerrato, and N. Campbell, “The ILMT-s2s Corpus — A Multimodal Interlingual Map Task Corpus,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA), 2016, in press.
[8] A. H. Anderson, M. Bader, E. G. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. S. Thompson, and R. Weinert, “The HCRC Map Task Corpus,” Language and Speech, vol. 34, no. 4, pp. 351–366, Oct. 1991. [Online]. Available: http://las.sagepub.com/content/34/4/351
[9] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes, “ELAN: a professional framework for multimodality research,” in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 2006, pp. 1556–1559.
[10] P. Boersma and V. van Heuven, “Speak and unspeak with Praat,” Glot International, vol. 5, no. 9-10, pp. 341–347, 2001.
[11] A. Hayakawa, L. Cerrato, N. Campbell, and S. Luz, “A Study of Prosodic Alignment in Interlingual Map-Task Dialogues,” in Proceedings of ICPhS XVIII (18th International Congress of Phonetic Sciences), The Scottish Consortium for ICPhS 2015, Ed., no. 0760. Glasgow, United Kingdom: University of Glasgow, 2015.
[12] A. J. Stent, M. K. Huffman, and S. E. Brennan, “Adapting speaking after evidence of misrecognition: Local and global hyperarticulation,” Speech Communication, vol. 50, no. 3, pp. 163–178, 2008.
[13] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., “The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism,” in Proceedings of INTERSPEECH’13: the 14th Annual Conference of the International Speech Communication Association. Lyon, France: ISCA, 2013, pp. 148–152.
[14] H. Liu, J. Wang, C. Zheng, and P. He, “Study on the effect of different frequency bands of EEG signals on mental tasks classification,” in Proceedings of the 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE-EMBS 2005). IEEE, 2006, pp. 5369–5372.
[15] S. Raudys and R. P. W. Duin, “Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix,” Pattern Recognition Letters, vol. 19, no. 5-6, pp. 385–392, Apr. 1998.
In this paper we examine the two-way relationship between hyperarticulation and evidence of misrecognition of computer-directed speech. We report the results of an experiment in which speakers spoke to a simulated speech recognizer and received text feedback about what had been “recognized”. At pre-determined points in the dialog, recognition errors were staged, and speakers made repairs. Each repair utterance was paired with the utterance preceding the staged recognition error and coded for adaptations associated with hyperarticulate speech: speaking rate and phonetically clear speech. Our results demonstrate that hyperarticulation is a targeted and flexible adaptation rather than a generalized and stable mode of speaking. Hyperarticulation increases after evidence of misrecognition and then decays gradually over several turns in the absence of further misrecognitions. When repairing misrecognized speech, speakers are more likely to clearly articulate constituents that were apparently misrecognized than those either before or after the troublesome constituents, and more likely to clearly articulate content words than function words. Finally, we found no negative impact of hyperarticulation on speech recognition performance.