Evaluating automatic laughter segmentation in meetings using acoustic and
acoustic-phonetic features
Khiet P. Truong and David A. van Leeuwen
TNO Human Factors
P.O. Box 23, 3769 ZG, Soesterberg, The Netherlands
{khiet.truong, david.vanleeuwen}@tno.nl
ABSTRACT
In this study, we investigated automatic laughter seg-
mentation in meetings. We first performed laughter-
speech discrimination experiments with traditional
spectral features and subsequently used acoustic-
phonetic features. In segmentation, we used Gaus-
sian Mixture Models that were trained with spec-
tral features. For the evaluation of the laughter seg-
mentation we used time-weighted Detection Error
Tradeoff curves. The results show that the acoustic-phonetic features perform relatively well given the small number of features used. For segmentation, we believe that incor-
porating phonetic knowledge could lead to improve-
ment. We will discuss possibilities for improvement
of our automatic laughter detector.
Keywords: laughter detection, laughter
1. INTRODUCTION
Since laughter can be an important cue for identi-
fying interesting discourse events or emotional user-
states, laughter has gained interest from researchers across multiple disciplines. Although there seems to be no unique relation between laughter and emotions [12, 11], it is widely agreed that laugh-
ter is a highly communicative and social event in
human-human communication that can elicit emo-
tional reactions. Further, we have learned that it is a
highly variable acoustic signal [2]. We can chuckle,
giggle or make snort-like laughter sounds that may sound different for each person. Sometimes, peo-
ple can even identify someone just by hearing their
laughter. Due to its highly variable acoustic proper-
ties, laughter is expected to be difficult to model and
detect automatically.
In this study, we will focus on laughter recogni-
tion in speech in meetings. Previous studies [6, 13]
have reported relatively high classification rates, but
these were obtained with either given pre-segmented
segments or with a sliding n-second window. In our
study, we tried to localize spontaneous laughter in
meetings more accurately on a frame basis. We
did not make distinctions between different types
of laughter, but we rather tried to build a generic
laughter model. Our goal is to automatically detect
laughter events for the development of affective sys-
tems. Laughter event recognition implies automat-
ically positioning the start and end time of laugh-
ter. One could use an automatic speech recognizer
(ASR) to recognize laughter which segments laugh-
ter as a by-product. However, since the aim of an
automatic speech recognizer is to recognize speech,
it is not specifically tuned for detection of non-verbal
speech elements such as laughter. Further, an ASR
system that produces a full transcription is computationally inefficient if the only goal is to detect laughter events. Therefore, we instead built
a relatively simple detector based on a small num-
ber of acoustic models. We started with laughter-
speech discrimination (which was performed on pre-
segmented homogeneous trials) and subsequently performed laughter segmentation in meetings. After
inspection of some errors of the laughter segmenta-
tion in meetings, we believe that incorporating pho-
netic knowledge could improve performance.
In the following sections we desribe the material
used in this study (Section 2), our methods (Section
3) and we explain how we evaluated our results (Sec-
tion 4). Subsequently, we show our results (Section
5) and discuss how we can improve laughter seg-
mentation (Section 6).
2. DATABASE
We used spontaneous meetings from the ICSI Meet-
ing Recorder Corpus [8] to train and test our laugh-
ter detector (Table 1). The corpus consists of 75
recorded meetings with an average of 6 participants
per meeting and a total of 53 unique speakers. We
used the close-talk recordings of each participant.
The first 26 ICSI ‘Bmr’ (‘Bmr’ is a naming conven-
tion of the type of meeting at ICSI) meetings were
used for training and the last 3 ICSI ‘Bmr’ meet-
ings (10 unique speakers, 2 female and 8 male) were
used for testing. Some speakers in the training set
were also present in the test set. Note that the manu-
ally produced laughter annotations were not always
precise, e.g., onset and offset of laughter were not
always marked.
Table 1: Amount of data used in our analyses (duration; number of segments in brackets).

              Training (26 Bmr meetings)   Testing (3 Bmr meetings)
  Speech      81 min (2422)                10 min (300)
  Laughter    83 min (2680)                10 min (279)
For training and testing, we used only clearly audible laughter events (as perceived by the first author). The segments consisted solely of audible laughter, which means that so-called “speech-laughs” or “smiled speech” were not investigated.
3. METHOD
3.1. Acoustic modeling
3.1.1. Laughter-speech discrimination
For laughter-speech discrimination, we used cepstral
and acoustic-phonetic features. Firstly, Gaussian
Mixture Models (GMMs) were trained with Percep-
tual Linear Prediction Coding (PLP) features [5].
Twelve PLP coefficients, one log energy component, and their 13 first-order derivatives (measured over five consecutive frames) were extracted every 16 ms over a window of 32 ms. A ‘soft
detector’ score is obtained by determining the log
likelihood ratio of the data given the laughter and
speech GMMs respectively.
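As an illustration of this ‘soft detector’, the following Python code is a minimal sketch (not the implementation used in this study): it trains diagonal-covariance GMMs on pooled frames and scores a test utterance by its average per-frame log-likelihood ratio. Feature extraction is assumed to have been done elsewhere, and the number of Gaussians and all names are illustrative.

```python
# Sketch of the GMM-based 'soft detector' described above (not the original
# implementation). Feature extraction (12 PLP coefficients + log energy and
# their 13 first-order deltas, 16 ms step / 32 ms window) is assumed to be
# done elsewhere; each utterance is a (n_frames, 26) numpy array. The number
# of Gaussians is an arbitrary choice for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(utterances, n_components=32, seed=0):
    """Train a diagonal-covariance GMM on frames pooled over all utterances."""
    frames = np.vstack(utterances)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=seed).fit(frames)

def llr_score(utterance, laugh_gmm, speech_gmm):
    """Soft detector score: average per-frame log-likelihood ratio
    log p(x | laughter) - log p(x | speech)."""
    return float((laugh_gmm.score_samples(utterance)
                  - speech_gmm.score_samples(utterance)).mean())

# Hypothetical usage on pre-segmented trials:
# laugh_gmm  = train_gmm(train_laughter_utterances)
# speech_gmm = train_gmm(train_speech_utterances)
# scores = [llr_score(u, laugh_gmm, speech_gmm) for u in test_utterances]
```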
Secondly, we used acoustic-phonetic features measured over the whole utterance, such as the mean of log F0, the standard deviation
of log F0, range of log F0, the mean slope of F0, the
slope of the Long-Term Average Spectrum (LTAS)
and the fraction of unvoiced frames (some of these
features have proven to be discriminative [13]).
These features were all extracted with PRAAT [4].
Linear Discriminant Analysis (LDA) was used as the discrimination method; it has the advantage that the contribution of each feature to the discriminative power can be inspected through the standardized discriminant coefficients, which can be interpreted as feature weights. The posterior
probabilities of the LDA classification were used as
‘soft detector’ scores. Statistics of F0 were chosen because some studies have reported significant F0 differences between laughter and speech [2] (al-
though contradictory results have been reported [3]).
A level of ‘effort’ can be measured by the slope of
the LTAS: the less negative the slope is, the more
vocal effort is expected [9]. The fraction of unvoiced frames was chosen because of the characteristic alternating voiced/unvoiced pattern that is often present in laughter; the percentage of unvoiced frames is therefore expected to be larger in laughter than in speech (as suggested by [3]), see Fig. 1.

Figure 1: Example of laughter with the typical voiced/unvoiced alternating pattern, showing a waveform (top) and a spectrogram with pitch contour (bottom).
Note that F0 can only be measured in the vocalized parts of laughter. A disadvantage
of such features is that they cannot easily be used
for a segmentation problem because these features
describe relatively slow-varying patterns in speech
that require a larger time-scale for feature extraction
(e.g., an utterance). In segmentation, a higher res-
olution of extracted features (e.g., frame-based) is
needed because accurate localization of boundaries
of events is important.
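As a minimal sketch of this kind of LDA-based discrimination on utterance-level features, assuming the six PRAAT-derived features have already been extracted (the scikit-learn pipeline below is our illustration, not the toolkit used in the study), one could proceed as follows; z-scoring the features first is one common way to obtain coefficients that can be read as feature weights.

```python
# Illustrative sketch (not the original PRAAT/LDA pipeline) of the
# discrimination on utterance-level acoustic-phonetic features. X is an
# (n_utterances, 6) array of the features listed above, y is 1 for laughter
# and 0 for speech; z-scoring first makes the LDA coefficients comparable
# across features, which is one way to read them as feature weights.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

FEATURES = ["mean_logF0", "sd_logF0", "range_logF0",
            "mean_F0_slope", "LTAS_slope", "fraction_unvoiced"]

def fit_lda(X, y):
    scaler = StandardScaler().fit(X)
    lda = LinearDiscriminantAnalysis().fit(scaler.transform(X), y)
    weights = dict(zip(FEATURES, lda.coef_.ravel()))   # feature weights
    return scaler, lda, weights

def soft_scores(scaler, lda, X):
    """Posterior probability of the laughter class, used as detector score."""
    return lda.predict_proba(scaler.transform(X))[:, 1]
```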
3.1.2. Laughter segmentation
For laughter segmentation, i.e., localizing laughter
in meetings, we used PLP features and trained three
GMMs: laughter, speech and silence. Silence was
added because we encountered much silence in the
meetings, and we needed a way to deal with it. In
order to determine the segmentation of the acous-
tic signal into segments representing the $N$ defined classes (in our case $N = 3$), we used a very simple
Viterbi decoder [10]. In an N-state parallel topol-
ogy the decoder finds the maximum likelihood state
sequence. We used the state sequence as the seg-
mentation result. We controlled the number of state
transitions, or the segment boundaries, by using a
small state transition probability. The state transition probability $a_{ij}$ from state $i$ to state $j \neq i$ was estimated on the basis of the average duration of the segments $i$ and the number of segments $j$ following $i$ in the training data. The self-transition probabilities $a_{ii}$ were chosen so that $\sum_j a_{ij} = 1$. After the segmentation into segments $\{s_i\}$, $i = 1, \ldots, N_s$, we calculated the average log-likelihood $L_{im}$ over each segment $i$ for each of the models $m$. We defined a log-likelihood ratio as $L_{\mathrm{laugh}} - \max(L_{\mathrm{speech}}, L_{\mathrm{silence}})$. These log-likelihood ratios determine the final class membership.
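A minimal sketch of this segmentation step, under the assumptions stated above (three models in a parallel topology, a small switching probability, Viterbi decoding of the maximum-likelihood state sequence, segment-level log-likelihood ratios); all function and variable names are illustrative.

```python
# Minimal sketch of the segmentation step described above: three GMMs
# (laughter, speech, silence) in an N-state parallel topology, a small
# switching probability, Viterbi decoding of the maximum-likelihood state
# sequence, and segment-level log-likelihood ratios. Names are illustrative.
import numpy as np

def make_log_trans(n_states=3, switch_prob=0.01):
    """Transition matrix with a small probability of switching state;
    self-transition probabilities are chosen so that each row sums to one."""
    trans = np.full((n_states, n_states), switch_prob / (n_states - 1))
    np.fill_diagonal(trans, 1.0 - switch_prob)
    return np.log(trans)

def viterbi(frame_loglik, log_trans):
    """frame_loglik: (T, N) per-frame log-likelihoods; returns the best state path."""
    T, N = frame_loglik.shape
    delta = frame_loglik[0].copy()                  # uniform start probabilities assumed
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans         # scores[i, j]: prev state i -> state j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + frame_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def segments_with_llr(frame_loglik, path, laugh=0, others=(1, 2)):
    """Cut the state path into segments and attach the segment-level
    log-likelihood ratio L_laugh - max(L_speech, L_silence)."""
    bounds = [0] + [t for t in range(1, len(path)) if path[t] != path[t - 1]] + [len(path)]
    segments = []
    for b, e in zip(bounds[:-1], bounds[1:]):
        mean_ll = frame_loglik[b:e].mean(axis=0)    # average log-likelihood per model
        llr = mean_ll[laugh] - max(mean_ll[m] for m in others)
        segments.append((b, e, int(path[b]), llr))
    return segments

# Hypothetical usage, with per-frame scores stacked from the three GMMs:
# frame_ll = np.column_stack([g.score_samples(plp_frames)
#                             for g in (laugh_gmm, speech_gmm, silence_gmm)])
# path = viterbi(frame_ll, make_log_trans())
# segments = segments_with_llr(frame_ll, path)
```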
4. EVALUATION METRIC
For laughter-speech discrimination, we used the
Equal Error Rate (EER) as a single performance
measure, adopted from the detection framework. In
laughter-speech discrimination, we can identify two
types of errors: a false alarm, i.e., a speech seg-
ment is falsely detected as laughter, and a miss, i.e.,
a laughter segment is incorrectly detected as speech.
The EER is defined as the error rate where the false
alarm rate is equal to the miss rate.
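For completeness, a small sketch of how an EER can be computed from pooled detector scores; interpolation conventions differ between toolkits, so this is only one reasonable variant, not the evaluation code used in this study.

```python
# Minimal sketch of an EER computation from pooled detector scores; the
# positive class is laughter. Interpolation conventions differ between
# toolkits, so the exact value may vary slightly.
import numpy as np

def equal_error_rate(scores, labels):
    """scores: detector scores; labels: True for laughter, False for speech."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.sort(scores)
    p_fa = np.array([(scores[~labels] > t).mean() for t in thresholds])
    p_miss = np.array([(scores[labels] <= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(p_fa - p_miss)))            # operating point where rates meet
    return (p_fa[i] + p_miss[i]) / 2.0
```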
The evaluation of the automatic laughter segmen-
tation was less straightforward. One of the reasons to define log-likelihood ratios for the segments found by the detector is to be able to compare the current results, based on segmentation, to other re-
sults that were obtained with given pre-segmented
segments and that were evaluated with a trial-based
DET analysis (Detection Error Tradeoff [7]). In this framework, a detector can be characterized in terms of DET plots and post-evaluation measures such as the Equal Error Rate and minimum decision costs. In
order to make comparison possible we extended the
concept of the trial-based DET analysis to a time-
weighted DET analysis for two-class decoding [14].
The basic idea (see Fig. 2) is that each segment in the hypothesis segmentation may have sub-segments that are either:
• correctly classified (hits and correct rejects),
• missed, i.e., classified as speech (or other) while the reference says laughter, or
• a false alarm, i.e., classified as laughter while the reference says speech (or other).
We can now form tuples $(\lambda_i, T_i^e)$, where $T_i^e$ is the duration of the sub-segment of segment $i$ and $e$ is the evaluation over that sub-segment: either ‘correct’, ‘missed’ or ‘false alarm’. These tuples can now be used in an analysis very similar to the DET analysis. Define $\theta$ as the threshold determining the operating point in the DET plot. Then the false alarm probability is estimated from the set $T_\theta$ of all tuples for which $\lambda_i > \theta$:

(1)   $p_{\mathrm{FA}} = \frac{1}{T_{\mathrm{non}}} \sum_{i \in T_\theta} T_i^{\mathrm{FA}}$
Figure 2: Definitions of correct classifications and erroneous classifications in time, comparing the reference and output segmentations (1 = laughter, 0 = non-laughter); sub-segments are marked as correct, miss, or false alarm.
and similarly the miss probability can be estimated as

(2)   $p_{\mathrm{miss}} = \frac{1}{T_{\mathrm{tar}}} \sum_{i \notin T_\theta} T_i^{\mathrm{miss}}$

Here $T_{\mathrm{tar}}$ and $T_{\mathrm{non}}$ indicate the total time of the target class (laughter) and the non-target class (e.g., speech) in the reference segmentation.
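The following sketch shows one way to turn such tuples into a time-weighted DET curve according to Eqs. (1) and (2); the bookkeeping of per-segment target and non-target durations is our interpretation of the text, not the evaluation code used in this study.

```python
# Sketch of the time-weighted DET computation of Eqs. (1) and (2). Each
# hypothesis segment i is represented by its score lam[i] and by the
# durations of reference laughter (tar_dur[i]) and reference non-laughter
# (non_dur[i]) that it overlaps; this bookkeeping is our interpretation.
import numpy as np

def time_weighted_det(lam, tar_dur, non_dur, thresholds=None):
    """Return (p_fa, p_miss) arrays over the given score thresholds."""
    lam, tar_dur, non_dur = map(np.asarray, (lam, tar_dur, non_dur))
    T_tar, T_non = tar_dur.sum(), non_dur.sum()
    if thresholds is None:
        thresholds = np.sort(lam)
    p_fa, p_miss = [], []
    for theta in thresholds:
        accepted = lam > theta                           # segments decided 'laughter'
        p_fa.append(non_dur[accepted].sum() / T_non)     # Eq. (1): false-alarm time
        p_miss.append(tar_dur[~accepted].sum() / T_tar)  # Eq. (2): missed laughter time
    return np.array(p_fa), np.array(p_miss)
```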
5. RESULTS
We tested laughter-speech discrimination and laugh-
ter segmentation on a total of 27 individual chan-
nels of the close-talk recordings taken from three
ICSI ‘Bmr’ meetings. For laughter-speech discrim-
ination we tested with pre-segmented laughter and
speech segments, while for laughter segmentation, full-length channels of whole meetings were used. The scores (log-likelihood ratios or posterior probabilities) obtained for these audio channels were pooled to obtain EERs (Table 2). In order
to enable better comparison between the laughter-
speech discrimination and the laughter segmenta-
tion results, we also performed a segmentation experiment in which we randomly concatenated the laughter and speech segments (used in the discrimination task) and subsequently performed laughter segmentation on this chain of laughter-speech segments. Thus the difference in
performance in Fig. 3 is mainly caused by the pres-
ence of other sounds, such as silence, in meetings. A
disadvantage of the time-weighted DET curve (used for laughter segmentation) is that it does not take into account the absolute number of detection errors.
Many of the errors in laughter segmentation were
introduced by sounds like, e.g., breaths, coughs,
background noises or crosstalk (softer speech from
other participants). It seems that especially the unvoiced parts of laughter can be confused with these types of sounds (and vice versa).
Table 2: EERs of laughter-speech discrimination and laughter segmentation (tested on 3 ICSI Bmr meetings). The lower the EER, the better the performance.

  Task                                              Model       EER
  Discrimination (pre-segmented laughter/speech)    GMM PLP     0.060
  Discrimination (pre-segmented laughter/speech)    LDA PRAAT   0.118
  Segmentation (concatenated laughter/speech)       GMM PLP     0.082
  Segmentation (whole meetings)                     GMM PLP     0.109

Figure 3: Time-weighted DET curves of laughter segmentation, tested on 3 ICSI Bmr meetings; false alarm probability (%) versus miss probability (%). EERs: 8.2% for the concatenated pre-segmented laughter/speech trials and 10.9% for automatic laughter segmentation of whole meetings.

The LDA analysis with the PRAAT features in the laughter-speech discrimination indicated that mean log F0 and the fraction of unvoiced frames had the highest weights, which means that these two features contributed the most discriminative power to the model. The LDA model in combination with these features seems to perform relatively well, given
the small number of features used.
6. DISCUSSION AND CONCLUSIONS
We believe that the performance of the laughter seg-
menter can be improved by incorporating phonetic
knowledge into the models. In a previous study [13],
a fusion between spectral and acoustic-phonetic fea-
tures showed significant improvement in laughter-
speech discrimination. However, acoustic-phonetic features are usually measured over a longer time-scale, which makes them difficult to use for segmentation. Currently, we are modeling laughter as
a whole with GMMs that are basically one-state
Hidden Markov Models (HMMs). The results of
the LDA analysis indicate that we could employ
phonetic information about the voiced (where we
can measure F0) and unvoiced parts of laughter
(the fraction of unvoiced frames appeared to be dis-
criminative). We could use HMMs to model sub-
components of laughter which are based on phonetic
units, e.g., a VU (voiced-unvoiced) syllable could be
such a phonetic unit. With HMMs, we can then better model the time-varying patterns of laughter, such as the characteristic repeating /haha/ pattern, through the HMM state topology and state transition probabilities. However, for this purpose, a large database
containing different laughter sounds which are an-
notated on different phonetic levels is needed. In
addition, our laughter segmentation model may be
too generic. We could build more specific laugh-
ter models for, e.g., voiced laughter, which appears
to be perceived as ‘more positive’ by listeners [1].
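As a toy illustration of such a voiced-unvoiced topology (not a trained model; all probabilities are invented for the sketch), the following code encodes a cyclic V/U structure whose sampled state sequences exhibit the alternating pattern of a /haha/ bout.

```python
# Toy sketch of a cyclic voiced/unvoiced (V/U) laughter topology; the
# transition probabilities are invented for illustration and would in
# practice be trained on phonetically annotated laughter.
import numpy as np

STATES = ["V", "U"]                      # V = voiced 'ha', U = unvoiced aspiration
startprob = np.array([0.8, 0.2])         # assumption: bouts tend to start voiced
transmat = np.array([
    [0.6, 0.4],                          # V -> V (sustain) or V -> U (end of syllable)
    [0.5, 0.5],                          # U -> V (next 'ha') or U -> U (breathy tail)
])

def sample_states(n=12, seed=0):
    """Sample a state sequence to show the alternating V/U pattern."""
    rng = np.random.default_rng(seed)
    seq = [rng.choice(2, p=startprob)]
    for _ in range(n - 1):
        seq.append(rng.choice(2, p=transmat[seq[-1]]))
    return "".join(STATES[s] for s in seq)

print(sample_states())                   # e.g. 'VUVVUVUVUUVU'
```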
Further, we have used a time-weighted DET analysis; its main advantage is its DET-like behavior, which makes comparisons with other studies that use DET analyses easier. Disadvantages are that it does not take into
account the number of times that a detector has made
an error, and our time-weighted evaluation could
have been too strict (it is not clear what exactly de-
fines the beginning and end of laughter).
We are currently implementing an online laugh-
ter detector which will be used in an interactive af-
fective application. Additional challenges arose dur-
ing the development of our online laughter detector,
such as how to perform online normalization. In the
future, we intend to improve our laughter detector
by employing more phonetic properties of laughter.
ACKNOWLEDGEMENTS
This work was supported by a Dutch Bsik project:
MultimediaN.
7. REFERENCES
[1] Bachorowski, J.-A., Owren, M. 2001. Not all
laughs are alike: Voiced but not unvoiced laughter
readily elicits positive affect. Psychological Sci-
ence 12, 252–257.
[2] Bachorowski, J.-A., Smoski, M., Owren, M.
2001. The acoustic features of human laughter.
J.Acoust.Soc.Am. 110, 1581–1597.
[3] Bickley, C., Hunnicutt, S. 1992. Acoustic analysis
of laughter. Proc. ICSLP 927–930.
[4] Boersma, P. 2001. Praat: system for doing phonet-
ics by computer. Glot International.
[5] Hermansky, H. 1990. Perceptual linear predictive
(PLP) analysis of speech. J.Acoust.Soc.Amer. 87,
1738–1752.
[6] Kennedy, L., Ellis, D. 2004. Laughter detection in
meetings. NIST ICASSP 2004 Meeting Recognition
Workshop 118–121.
[7] Martin, A., Doddington, G., Kamm, T., Ordowski,
M., Przybocki, M. 1997. The DET curve in as-
sessment of detection task performance. Proc. Eu-
rospeech 1895–1898.
[8] Morgan, N., Baron, D., Edwards, J., Ellis, D., Gel-
bart, D., Janin, A., Pfau, T., Shriberg, E., Stolcke,
A. 2001. The meeting project at ICSI. Proc. Human
Language Technologies Conference 1–7.
[9] Pittam, J., Gallois, C., Callan, V. 1990. The long-
term spectrum and perceived emotion. Speech
Communication 9, 177–187.
[10] Rabiner, L., Juang, B. 1986. An introduction to
Hidden Markov Models. IEEE ASSP Magazine 3,
4–16.
[11] Russell, J., Bachorowski, J., Fernandez-Dols, J.
2003. Facial and vocal expressions of emotion.
Annu.Rev.Psychology 54, 329–349.
[12] Schröder, M. 2003. Experimental study of affect
bursts. Speech Communication 40, 99–116.
[13] Truong, K., Van Leeuwen, D. 2005. Automatic de-
tection of laughter. Proc. Interspeech 485–488.
[14] Van Leeuwen, D., Huijbregts, M. 2006. The AMI
speaker diarization system for NIST RT06s meet-
ing data. Proc. MLMI 371–384.