Evaluating automatic laughter segmentation in meetings using acoustic and
acoustic-phonetic features
Khiet P. Truong and David A. van Leeuwen
TNO Human Factors
P.O. Box 23, 3769 ZG, Soesterberg, The Netherlands
{khiet.truong, david.vanleeuwen}@tno.nl
ABSTRACT
In this study, we investigated automatic laughter seg-
mentation in meetings. We first performed laughter-
speech discrimination experiments with traditional
spectral features and subsequently used acoustic-
phonetic features. In segmentation, we used Gaus-
sian Mixture Models that were trained with spec-
tral features. For the evaluation of the laughter seg-
mentation we used time-weighted Detection Error
Tradeoff curves. The results show that the acoustic-
phonetic features perform relatively well given their
sparseness. For segmentation, we believe that incor-
porating phonetic knowledge could lead to improve-
ment. We will discuss possibilities for improvement
of our automatic laughter detector.
Keywords: laughter detection, laughter
1. INTRODUCTION
Since laughter can be an important cue for identi-
fying interesting discourse events or emotional user states, laughter has gained interest from researchers across multiple disciplines. Although
there seems to be no unique relation between laugh-
ter and emotions [12, 11], we all agree that laugh-
ter is a highly communicative and social event in
human-human communication that can elicit emo-
tional reactions. Further, we have learned that it is a
highly variable acoustic signal [2]. We can chuckle,
giggle or make snort-like laughter sounds that may sound different for each person. Sometimes, peo-
ple can even identify someone just by hearing their
laughter. Due to its highly variable acoustic proper-
ties, laughter is expected to be difficult to model and
detect automatically.
In this study, we will focus on laughter recogni-
tion in speech in meetings. Previous studies [6, 13]
have reported relatively high classification rates, but
these were obtained with either given pre-segmented
segments or with a sliding n-second window. In our
study, we tried to localize spontaneous laughter in meetings more accurately on a frame basis. We
did not make distinctions between different types
of laughter, but we rather tried to build a generic
laughter model. Our goal is to automatically detect
laughter events for the development of affective sys-
tems. Laughter event recognition implies automat-
ically positioning the start and end time of laugh-
ter. One could use an automatic speech recognizer
(ASR) to recognize laughter which segments laugh-
ter as a by-product. However, since the aim of an
automatic speech recognizer is to recognize speech,
it is not specifically tuned for detection of non-verbal
speech elements such as laughter. Further, an ASR
system employing a full-blown transcription may
be a bit computationally inefficient for the detec-
tion of laughter events. Therefore, we rather built
a relatively simple detector based on a small num-
ber of acoustic models. We started with laughter-
speech discrimination (which was performed on pre-
segmented homogeneous trials), and subsequently,
performed laughter segmentation in meetings. After
inspection of some errors of the laughter segmenta-
tion in meetings, we believe that incorporating pho-
netic knowledge could improve performance.
In the following sections we describe the material
used in this study (Section 2), our methods (Section
3) and we explain how we evaluated our results (Sec-
tion 4). Subsequently, we show our results (Section
5) and discuss how we can improve laughter seg-
mentation (Section 6).
2. DATABASE
We used spontaneous meetings from the ICSI Meet-
ing Recorder Corpus [8] to train and test our laugh-
ter detector (Table 1). The corpus consists of 75
recorded meetings with an average of 6 participants
per meeting and a total of 53 unique speakers. We
used the close-talk recordings of each participant.
The first 26 ICSI ‘Bmr’ (‘Bmr’ is a naming conven-
tion of the type of meeting at ICSI) meetings were
used for training and the last 3 ICSI ‘Bmr’ meet-
ings (10 unique speakers, 2 female and 8 male) were
used for testing. Some speakers in the training set
were also present in the test set. Note that the manu-
ally produced laughter annotations were not always
precise, e.g., onset and offset of laughter were not
always marked.
Table 1: Amount of data used in our analyses (duration, number of segments in brackets).

            Training (26 Bmr meetings)   Testing (3 Bmr meetings)
  Speech    81 min (2422)                10 min (300)
  Laughter  83 min (2680)                10 min (279)
For training and testing, we used only clearly audible laughter events (as perceived by the first author). The segments consisted solely of audible laughter, which means that so-called "speech-laughs" or "smiled speech" were not investigated.
3. METHOD
3.1. Acoustic modeling
3.1.1. Laughter-speech discrimination
For laughter-speech discrimination, we used cepstral and acoustic-phonetic features. Firstly, Gaussian Mixture Models (GMMs) were trained with Perceptual Linear Prediction (PLP) features [5]. Twelve PLP coefficients and one log energy component, plus their 13 first-order derivatives (measured over five consecutive frames), were extracted every 16 ms over a window with a length of 32 ms. A 'soft detector' score is obtained by computing the log-likelihood ratio of the data given the laughter and speech GMMs, respectively.
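To make this scoring concrete, the following is a minimal sketch of such a frame-based GMM setup. It is not the exact pipeline used here: MFCC plus delta features (computed with librosa) stand in for the PLP features, scikit-learn provides the GMMs, and the number of mixture components is an assumption; only the 32 ms / 16 ms framing and the 13 static plus 13 dynamic feature layout follow the description above.

```python
# Hedged sketch of frame-based GMM laughter/speech scoring (MFCCs stand in for PLP).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def frame_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.032 * sr),       # 32 ms window
                                hop_length=int(0.016 * sr))  # 16 ms step
    delta = librosa.feature.delta(mfcc, width=5)             # derivatives over 5 frames
    return np.vstack([mfcc, delta]).T                        # (n_frames, 26)

def train_class_gmm(wav_paths, n_components=32):             # one GMM per class
    X = np.vstack([frame_features(p) for p in wav_paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)

def soft_detector_score(X, gmm_laugh, gmm_speech):
    # Mean frame log-likelihood ratio of a segment given the laughter and speech GMMs
    return float(np.mean(gmm_laugh.score_samples(X) - gmm_speech.score_samples(X)))
```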
Secondly, we used utterance-based acoustic-
phonetic features that were measured over the whole
utterance, such as mean log F0, standard deviation
of log F0, range of log F0, the mean slope of F0, the
slope of the Long-Term Average Spectrum (LTAS)
and the fraction of unvoiced frames (some of these
features have proven to be discriminative [13]).
These features were all extracted with PRAAT [4].
Linear Discriminant Analysis (LDA) was used as the discrimination method, which has the advantage that we can obtain information about the contribution of each feature to the discriminative power by examining the standardized discriminant coefficients, which can be interpreted as feature weights. The posterior probabilities of the LDA classification were used as 'soft detector' scores. Statistics of F0 were chosen because some studies have reported significant F0 differences between laughter and speech [2] (although contradictory results have also been reported [3]).
A level of ‘effort’ can be measured by the slope of
the LTAS: the less negative the slope is, the more
vocal effort is expected [9]. The fraction of unvoiced frames was chosen because of the characteristic alternation between voiced and unvoiced portions that is often present in laughter; as suggested by [3], the percentage of unvoiced frames is therefore expected to be larger in laughter than in speech (see Fig. 1).

Figure 1: Example of laughter with a typical voiced/unvoiced alternating pattern, showing a waveform (top) and a spectrogram (0-5000 Hz) with pitch track (0-500 Hz) (bottom).
Note that F0 can only be measured in the vocalized parts of laughter. A disadvantage
of such features is that they cannot easily be used
for a segmentation problem because these features
describe relatively slow-varying patterns in speech
that require a larger time-scale for feature extraction
(e.g., an utterance). In segmentation, a higher res-
olution of extracted features (e.g., frame-based) is
needed because accurate localization of boundaries
of events is important.
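As an illustration of this utterance-level route, the sketch below computes a subset of the features above (log F0 statistics and the fraction of unvoiced frames; the F0 slope and LTAS slope are omitted) and feeds them to an LDA classifier. It is not the original PRAAT-based setup: librosa's pYIN tracker replaces PRAAT's pitch extraction, the pitch search range is an assumption, and the features are z-scored so that the LDA weights can be read roughly as standardized coefficients.

```python
# Hedged sketch of utterance-level acoustic-phonetic features + LDA discrimination.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def utterance_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=600, sr=sr)  # assumed pitch range
    log_f0 = np.log(f0[voiced])                 # F0 exists only in voiced frames
    return np.array([log_f0.mean(),             # mean log F0
                     log_f0.std(),              # standard deviation of log F0
                     np.ptp(log_f0),            # range of log F0
                     1.0 - voiced.mean()])      # fraction of unvoiced frames

# X: (n_utterances, 4) feature matrix, y: 1 = laughter, 0 = speech (built elsewhere)
clf = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
# clf.fit(X, y)
# 'Soft detector' scores:            clf.predict_proba(X_test)[:, 1]
# Weights on the z-scored features:  clf[-1].coef_
```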
3.1.2. Laughter segmentation
For laughter segmentation, i.e., localizing laughter
in meetings, we used PLP features and trained three
GMMs: laughter, speech and silence. Silence was
added because we encountered much silence in the
meetings, and we needed a way to deal with it. In
order to determine the segmentation of the acoustic signal into segments representing the N defined classes (in our case N = 3), we used a very simple Viterbi decoder [10]. In an N-state parallel topology the decoder finds the maximum-likelihood state sequence, and we used this state sequence as the segmentation result. We controlled the number of state transitions, i.e., the segment boundaries, by using a small state transition probability. The transition probabilities a_ij from state i to state j ≠ i were estimated on the basis of the average duration of the class-i segments and the number of class-j segments following class-i segments in the training data. The self-transition probabilities a_ii were chosen so that Σ_j a_ij = 1. After the segmentation into segments {s_i}, i = 1, ..., N_s, we calculated the average log-likelihood L_im over each segment i for each of the models m. We defined a log-likelihood ratio as L_laugh - max(L_speech, L_silence). These log-likelihood ratios determine the final class membership.
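A minimal numpy sketch of this decoding step follows. It assumes per-frame log-likelihood matrices obtained from the three class GMMs (e.g., via their score_samples method), an assumed class ordering (laughter, speech, silence), a flat initial-state prior and add-one smoothing in the transition estimate; it approximates the recipe above rather than reproducing the original implementation.

```python
# Hedged sketch of 3-class Viterbi segmentation with GMM frame log-likelihoods.
import numpy as np

def estimate_log_trans(mean_dur_frames, follow_counts):
    """Rough version of the transition recipe above: the probability of leaving
    class i per frame is 1 / (average class-i segment length in frames), spread
    over the classes that follow i in training; a_ii makes each row sum to 1."""
    N = len(mean_dur_frames)
    A = np.zeros((N, N))
    for i in range(N):
        p_next = (follow_counts[i] + 1.0) / (follow_counts[i].sum() + N)  # smoothed
        A[i] = p_next / mean_dur_frames[i]
        A[i, i] = 1.0 - A[i, np.arange(N) != i].sum()
    return np.log(A)

def viterbi_segment(frame_loglik, log_trans):
    """Maximum-likelihood state sequence through an N-state parallel topology.
    frame_loglik[t, c] = log p(frame t | class-c GMM)."""
    T, N = frame_loglik.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = frame_loglik[0]                       # flat initial-state prior
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + frame_loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

def segment_llrs(path, frame_loglik, laugh=0, speech=1, silence=2):
    """Average log-likelihood per decoded segment and the log-likelihood ratio
    L_laugh - max(L_speech, L_silence) used for the final decision."""
    bounds = np.flatnonzero(np.diff(path)) + 1
    out = []
    for seg in np.split(np.arange(len(path)), bounds):
        L = frame_loglik[seg].mean(axis=0)
        out.append((int(seg[0]), int(seg[-1]) + 1, L[laugh] - max(L[speech], L[silence])))
    return out    # (start_frame, end_frame_exclusive, llr) per decoded segment
```

In this parallel topology the segment boundaries are simply the frames where the decoded state changes, which is why a small off-diagonal transition probability directly limits the number of boundaries.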
4. EVALUATION METRIC
For laughter-speech discrimination, we used the
Equal Error Rate (EER) as a single performance
measure, adopted from the detection framework. In
laughter-speech discrimination, we can identify two
types of errors: a false alarm, i.e., a speech seg-
ment is falsely detected as laughter, and a miss, i.e.,
a laughter segment is incorrectly detected as speech.
The EER is defined as the error rate where the false
alarm rate is equal to the miss rate.
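As a concrete illustration (not the evaluation tooling used here), the EER can be estimated from pooled detector scores with a simple threshold sweep, assuming higher scores indicate laughter:

```python
# Minimal sketch of an Equal Error Rate computation from pooled detector scores.
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])   # missed laughter
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds]) # false alarms
    i = np.argmin(np.abs(miss - fa))       # operating point where the two rates cross
    return (miss[i] + fa[i]) / 2.0
```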
The evaluation of the automatic laughter segmen-
tation was not so straightforward. One of the rea-
sons to define log-likelihood ratios for the segments
found by the detector, is to be able to compare the
current results based on segmentation to other re-
sults that were obtained with given pre-segmented
segments and that were evaluated with a trial-based
DET analysis (Detection Error Tradeoff [7]). In
this analysis we could analyze a detector in terms
of DET plots and post-evaluation measures such as
Equal Error Rate and minimum decision costs. In
order to make comparison possible we extended the
concept of the trial-based DET analysis to a time-
weighted DET analysis for two-class decoding [14].
The basic idea (see Fig. 2) is that each segment in the hypothesis segmentation may have sub-segments that are either
- correctly classified (hits and correct rejects),
- missed, i.e., classified as speech (or other) while the reference says laughter, or
- false alarms, i.e., classified as laughter while the reference says speech (or other).
We can now form tuples (λ_i, T_i^e), where T_i^e is the duration of the sub-segment of segment i and e is the evaluation over that sub-segment: either 'correct', 'missed' or 'false alarm'. These tuples can then be used in an analysis very similar to the DET analysis. Define θ as the threshold determining the operating point in the DET plot. The false alarm probability is then estimated from the set T_θ of all tuples for which λ_i > θ:

(1)  p_{FA} = \frac{1}{T_{non}} \sum_{i \in T_\theta} T_i^{FA}
Figure 2: Definitions of correct classifications and erroneous classifications in time, comparing the reference and output laughter/non-laughter (1/0) tracks; sub-segments are labeled as correct time, miss time or false alarm time.
and similarly the miss probability can be estimated as

(2)  p_{miss} = \frac{1}{T_{tar}} \sum_{i \notin T_\theta} T_i^{miss}

Here T_tar and T_non indicate the total time of the target class (laughter) and the non-target class (e.g., speech) in the reference segmentation.
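The sketch below illustrates Eqs. (1) and (2), assuming each hypothesis sub-segment has already been reduced to a tuple (score, duration, label) by comparison with the reference; constructing those tuples from the two segmentations is omitted.

```python
# Hedged sketch of the time-weighted false alarm and miss probabilities of
# Eqs. (1) and (2). Each tuple is (llr_score, duration_seconds, label) with
# label in {'correct', 'miss', 'fa'}; t_tar and t_non are the total reference
# laughter and non-laughter times.
import numpy as np

def time_weighted_det(tuples, t_tar, t_non, thresholds):
    scores = np.array([s for s, _, _ in tuples])
    durs = np.array([d for _, d, _ in tuples])
    labels = np.array([e for _, _, e in tuples])
    p_fa, p_miss = [], []
    for theta in thresholds:
        above = scores > theta                                          # decided 'laughter'
        p_fa.append(durs[above & (labels == "fa")].sum() / t_non)       # Eq. (1)
        p_miss.append(durs[~above & (labels == "miss")].sum() / t_tar)  # Eq. (2)
    return np.array(p_fa), np.array(p_miss)
```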
5. RESULTS
We tested laughter-speech discrimination and laugh-
ter segmentation on a total of 27 individual chan-
nels of the close-talk recordings taken from three
ICSI ‘Bmr’ meetings. For laughter-speech discrim-
ination we tested with pre-segmented laughter and
speech segments, while for laughter segmentation, full-length channels of whole meetings were used. The scores (log-likelihood ratios or posterior probabilities) obtained for these audio channels were pooled to obtain the EERs in Table 2. In order
to enable better comparison between the laughter-
speech discrimination and the laughter segmenta-
tion results, we have also performed a segmenta-
tion experiment in which we concatenated the laugh-
ter and speech segments (used in the discrimina-
tion task) randomly to each other and subsequently
performed laughter segmentation on this chain of
laughter-speech segments. Thus the difference in
performance in Fig. 3 is mainly caused by the pres-
ence of other sounds, such as silence, in meetings. A
disadvantage of the time-weighted DET curve (used
for laughter-segmentation) is that it does not take
into account the absolute number of times there was
an error.
Many of the errors in laughter segmentation were
introduced by sounds like, e.g., breaths, coughs,
background noises or crosstalk (softer speech from
other participants). It seems that unvoiced units in laughter, especially, can be confused with these types of sounds (and vice versa).
Table 2: EERs of laughter-speech discrimination and laughter segmentation (tested on 3 ICSI Bmr meetings). The lower the EER, the better the performance.

  Task                                          Features/model   EER
  Discrimination, pre-segmented segments        GMM PLP          0.060
  Discrimination, pre-segmented segments        LDA PRAAT        0.118
  Segmentation, concatenated laughter/speech    GMM PLP          0.082
  Segmentation, whole meetings                  GMM PLP          0.109

Figure 3: Time-weighted DET curves of laughter segmentation, tested on 3 ICSI Bmr meetings (miss probability vs. false alarm probability, both in %): 8.2% EER for the concatenated pre-segmented laughter/speech trials and 10.9% EER for automatic laughter segmentation of whole meetings.

The LDA analysis with the PRAAT features in the laughter-speech discrimination indicated that mean log F0 and the fraction of unvoiced frames had the highest weights, which means that these two features contributed the most discriminative power to the model. The LDA model in combination with these features seems to perform relatively well, given the small number of features used.
6. DISCUSSION AND CONCLUSIONS
We believe that the performance of the laughter seg-
menter can be improved by incorporating phonetic
knowledge into the models. In a previous study [13],
a fusion between spectral and acoustic-phonetic fea-
tures showed significant improvement in laughter-
speech discrimination. However, acoustic-phonetic
features are usually measured over a longer time-
scale which makes it difficult to use these for seg-
mentation. Currently, we are modeling laughter as
a whole with GMMs that are basically one-state
Hidden Markov Models (HMMs). The results of
the LDA analysis indicate that we could employ
phonetic information about the voiced (where we
can measure F0) and unvoiced parts of laughter
(the fraction of unvoiced frames appeared to be dis-
criminative). We could use HMMs to model sub-
components of laughter which are based on phonetic
units, e.g., a VU (voiced-unvoiced) syllable could be
such a phonetic unit. With HMMs, we can then better model the time-varying patterns of laughter, such as the characteristic repeating /haha/ pattern, through the HMM state topology and state transition probabilities. However, for this purpose, a large database
containing different laughter sounds which are an-
notated on different phonetic levels is needed. In
addition, our laughter segmentation model may be
too generic. We could build more specific laugh-
ter models for, e.g., voiced laughter, which appears
to be perceived as ‘more positive’ by listeners [1].
Further, we have used a time-weighted DET analysis, which has the important advantage of behaving like a standard DET analysis, so that comparisons with other studies that use DET analyses are easier to make. Disadvantages are that it does not take into account the number of times that a detector has made an error, and that our time-weighted evaluation may have been too strict (it is not clear what exactly defines the beginning and end of laughter).
We are currently implementing an online laugh-
ter detector which will be used in an interactive af-
fective application. Additional challenges arose dur-
ing the development of our online laughter detector,
such as how to perform online normalization. In the
future, we intend to improve our laughter detector
by employing more phonetic properties of laughter.
ACKNOWLEDGEMENTS
This work was supported by a Dutch Bsik project:
MultimediaN.
7. REFERENCES
[1] Bachorowski, J.-A., Owren, M. 2001. Not all laughs are alike: Voiced but not unvoiced laughter readily elicits positive affect. Psychological Science 12, 252-257.
[2] Bachorowski, J.-A., Smoski, M., Owren, M. 2001. The acoustic features of human laughter. J. Acoust. Soc. Am. 110, 1581-1597.
[3] Bickley, C., Hunnicutt, S. 1992. Acoustic analysis of laughter. Proc. ICSLP, 927-930.
[4] Boersma, P. 2001. Praat: system for doing phonetics by computer. Glot International.
[5] Hermansky, H. 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87, 1738-1752.
[6] Kennedy, L., Ellis, D. 2004. Laughter detection in meetings. NIST ICASSP 2004 Meeting Recognition Workshop, 118-121.
[7] Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M. 1997. The DET curve in assessment of detection task performance. Proc. Eurospeech, 1895-1898.
[8] Morgan, N., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Janin, A., Pfau, T., Shriberg, E., Stolcke, A. 2001. The meeting project at ICSI. Proc. Human Language Technologies Conference, 1-7.
[9] Pittam, J., Gallois, C., Callan, V. 1990. The long-term spectrum and perceived emotion. Speech Communication 9, 177-187.
[10] Rabiner, L., Juang, B. 1986. An introduction to Hidden Markov Models. IEEE ASSP Magazine 3, 4-16.
[11] Russell, J., Bachorowski, J., Fernandez-Dols, J. 2003. Facial and vocal expressions of emotion. Annu. Rev. Psychology 54, 329-349.
[12] Schröder, M. 2003. Experimental study of affect bursts. Speech Communication 40, 99-116.
[13] Truong, K., Van Leeuwen, D. 2005. Automatic detection of laughter. Proc. Interspeech, 485-488.
[14] Van Leeuwen, D., Huijbregts, M. 2006. The AMI speaker diarization system for NIST RT06s meeting data. Proc. MLMI, 371-384.