Figure 2
Definitions of correct classifications and erroneous classifications in time. 1=laughter, 0=non-laughter

Source publication
Article
Full-text available
In this study, we investigated automatic laughter segmentation in meetings. We first performed laughter-speech discrimination experiments with traditional spectral features and subsequently used acoustic-phonetic features. In segmentation, we used Gaussian Mixture Models that were trained with spectral features. For the evaluation of the laughte...
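The laughter-speech discrimination setup described in the abstract can be illustrated with a minimal sketch: one Gaussian Mixture Model per class trained on spectral frames, and a per-segment log-likelihood ratio as the detection score. This is only an assumption-laden stand-in, not the authors' implementation; MFCCs are used here as a readily available spectral feature in place of PLP, and all file names and model sizes are placeholders.

```python
# Minimal sketch of GMM-based laughter/speech discrimination (assumptions noted above).
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def spectral_frames(wav_path, sr=16000, n_mfcc=13):
    # MFCCs as a stand-in spectral representation; the study used PLP/spectral features.
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (n_frames, n_mfcc)

def train_gmm(wav_paths, n_components=32):
    X = np.vstack([spectral_frames(p) for p in wav_paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)

def laughter_score(wav_path, laughter_gmm, speech_gmm):
    # Average frame log-likelihood ratio; positive values favour laughter.
    X = spectral_frames(wav_path)
    return laughter_gmm.score_samples(X).mean() - speech_gmm.score_samples(X).mean()

# laughter_gmm = train_gmm(["laugh_001.wav"])    # hypothetical training files
# speech_gmm   = train_gmm(["speech_001.wav"])
# is_laughter  = laughter_score("segment.wav", laughter_gmm, speech_gmm) > 0.0
```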

Context in source publication

Context 1
... Error Tradeoff [7]). This analysis characterizes a detector in terms of DET plots and post-evaluation measures such as the Equal Error Rate and minimum decision costs. In order to make comparison possible, we extended the concept of trial-based DET analysis to a time-weighted DET analysis for two-class decoding [14]. The basic idea (see Fig. 2) is that each segment in the hypothesis segmentation may have sub-segments that are ...
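The time-weighted DET idea referenced here can be sketched at frame level: instead of counting trials, miss and false-alarm rates are computed from the amount of laughter and non-laughter time that is misclassified at a given threshold. The frame rate, interval representation and helper names below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of time-weighted miss/false-alarm rates (Fig. 2 style, at frame resolution).
import numpy as np

def intervals_to_frames(intervals, n_frames, frame_rate=100.0):
    """Convert (start, end) intervals in seconds to a boolean frame mask (True = laughter)."""
    mask = np.zeros(n_frames, dtype=bool)
    for start, end in intervals:
        mask[int(start * frame_rate):int(end * frame_rate)] = True
    return mask

def time_weighted_rates(scores, ref_mask, threshold):
    """Miss and false-alarm rates weighted by time rather than by trial counts."""
    hyp_mask = scores >= threshold                 # frames decided as laughter
    miss_time = np.sum(ref_mask & ~hyp_mask)       # laughter time classified as non-laughter
    fa_time = np.sum(~ref_mask & hyp_mask)         # non-laughter time classified as laughter
    p_miss = miss_time / max(np.sum(ref_mask), 1)
    p_fa = fa_time / max(np.sum(~ref_mask), 1)
    return p_miss, p_fa

# Sweeping the threshold over the score range traces a time-weighted DET curve;
# the equal error rate is the operating point where p_miss and p_fa coincide.
```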

Similar publications

Article
Full-text available
Purpose: The authors investigated the effects of acoustic cues (i.e., pitch height, pitch contour, and pitch onset and offset) and phonetic context cues (i.e., syllable onsets and rimes) on lexical tone perception in Cantonese-speaking children. Method: Eight minimum pairs of tonal contrasts were presented in either an identical phonetic context...
Article
Full-text available
It is well known that gender-dependent (male/female) acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model when the gender is successfully detected or a priori known. Speakers need not be split into only two groups. An algorithm to make higher number...
Article
Full-text available
Previous experiments in speech perception using the selective adaptation procedure have found a shift in the locus of the category boundary for a series of speech stimuli following repeated exposure to an adapting syllable. The locus of the boundary moves toward the category of the adapting syllable. Most investigators have interpreted these findin...
Article
Full-text available
In this study, we determine the acoustic correlates of primary and secondary stress in Tongan. Vowels with primary stress show differences in f0, intensity, duration, F1, and spectral measures compared to unstressed vowels, but a linear discriminant analysis suggests f0 and duration are the best cues for discriminating vowels with primary stress fr...
Article
Full-text available
The expression of bird song is expected to signal male quality to females. ‘Quality’ is determined by genetic and environmental factors, but, surprisingly, there is very limited evidence if and how genetic aspects of male quality are reflected in song. Here, we manipulated the genetic make-up of canaries (Serinus canaria) via inbreeding, and studie...

Citations

... Many other works focus on laughter detection using a segmentation-by-classification scheme. In [19], a stream is segmented into laughter, speech, and silence intervals using PLP features and GMMs. A 3-state Viterbi decoder is first used to find the most likely sequence of states given a stream. ...
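The 3-state decoding step mentioned in this context can be sketched as a Viterbi pass over per-frame class log-likelihoods (for instance those produced by per-class GMMs like the ones sketched earlier). The transition penalty and state ordering below are illustrative choices, not the cited system's settings.

```python
# Hedged sketch of 3-state (laughter/speech/silence) Viterbi decoding over frame log-likelihoods.
import numpy as np

def viterbi_segment(frame_loglik, switch_penalty=5.0):
    """frame_loglik: (n_frames, 3) log-likelihoods for [laughter, speech, silence]."""
    n_frames, n_states = frame_loglik.shape
    # Log transition weights: staying in a state is free, switching pays a penalty.
    log_trans = np.full((n_states, n_states), -switch_penalty)
    np.fill_diagonal(log_trans, 0.0)

    delta = np.zeros((n_frames, n_states))
    psi = np.zeros((n_frames, n_states), dtype=int)
    delta[0] = frame_loglik[0]
    for t in range(1, n_frames):
        trans_scores = delta[t - 1][:, None] + log_trans       # (from_state, to_state)
        psi[t] = np.argmax(trans_scores, axis=0)
        delta[t] = trans_scores[psi[t], np.arange(n_states)] + frame_loglik[t]

    # Backtrace the most likely state sequence.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(n_frames - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path   # 0 = laughter, 1 = speech, 2 = silence
```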
Conference Paper
Full-text available
Affect bursts play an important role in non-verbal social interaction. Laughter and smiles are among the most important social markers in human-robot social interaction. Not only do they contain affective information, they may also reveal the user's communication strategy. In the context of human-robot interaction, an automatic laughter and smile detection system may thus help the robot adapt its behavior to a given user's profile by adopting a more relevant communication scheme. While many interesting works on laughter and smile detection exist, only a few have focused on elderly people. Data from elderly people are relatively rare and often pose a significant challenge to a laughter and smile detection system due to face wrinkles and an often lower voice quality. In this paper, we address laughter and smile detection in the ROMEO2 corpus, a multimodal (audio and video) corpus of elderly people-robot interaction. We show that combining the two modalities yields a fair improvement over using either modality alone.
... "sounds which would be characterized as laughs by an ordinary person if hears in everyday circumstances"). Gaussian Mixture Models have been used for training PLP features [13]. More recently, 13 MFCC trained with HMM have been used for filler/laughter/speech/silence segmentation [9]. ...
Conference Paper
Full-text available
Social signal processing tasks such as laughter or emotion detection are very important, particularly in the field of human-robot interaction (HRI). At the moment, very few studies exist on elderly people's voices and social markers in real-life HRI situations. This paper presents a cross-corpus study with two realistic corpora featuring elderly people (ROMEO2 and ARMEN) and two corpora collected in laboratory conditions with young adults (JEMO and OFFICE). The goal of this experiment is to assess how well data from one corpus can be used as a training set for another corpus, with a specific focus on elderly people's voices. First, clear differences between elderly people's real-life data and young adults' laboratory data are shown in acoustic feature distributions (such as F0 standard deviation or local jitter). Second, cross-corpus emotion recognition experiments show that elderly people's real-life corpora are much more complex than laboratory corpora. Surprisingly, emotion models trained on one elderly people corpus do not generalize to another elderly people corpus collected in the same acoustic conditions but with different speakers. Our last result is that laboratory laughter is quite homogeneous across corpora, but this is not the case for elderly people's real-life laughter.
... "sounds which would be characterized as laughs by an ordinary person if hears in everyday circumstances"). Gaussian Mixture Models have been used for training PLP features [13]. More recently, 13 MFCC trained with HMM have been used for filler/laughter/speech/silence segmentation [9]. ...
Conference Paper
Full-text available
This paper presents a study of laugh classification using a cross-corpus protocol. It aims at the automatic detection of laughs in real-time human-machine interaction. Positive and negative laughs are tested with different classification tasks and different acoustic feature sets. F-measure results show an improvement in positive laugh classification from 59.5% to 64.5% and in negative laugh recognition from 10.3% to 28.5%. In the context of the Chist-Era JOKER project, positive and negative laugh detection drives the policies of the robot Nao. A measure of engagement will also be provided, using among other cues the number of positive laughs detected during the interaction.
... phrases which are made up of several syllables. Owren and Understanding (2007) recommend the term 'bout' for the longer sequence, and 'call' for the individual syllables; we will adopt that terminology in this study. Some earlier work on the automatic segmentation of laughter has been reported in the literature. Truong et al. (2007) reported automatic laughter segmentation in meetings. They performed laughter vs. speech discrimination experiments comparing traditional spectral features and acoustic phonetic features, and concluded that the performance of laughter segmentation can be improved by incorporating phonetic knowledge into the models. Scherer et al. (2012) ...
Article
We report progress towards developing a sensor module that categorizes types of laughter for application in dialogue systems or social-skills training situations. The module will also function as a component to measure discourse engagement in natural conversational speech. This paper presents the results of an analysis into the sounds of human laughter in a very large corpus of naturally occurring conversational speech and our classification of the laughter types according to social function. Various types of laughter were categorized into either polite or genuinely mirthful categories and the analysis of these laughs forms the core of this report. Statistical analysis of the acoustic features of each laugh was performed and a Principal Component Analysis and Classification Tree analysis were performed to determine the main contributing factors in each case. A statistical model was then trained using a Support Vector Machine to predict the most likely category for each laugh in both speaker-specific and speaker-independent manner. Better than 70% accuracy was obtained in automatic classification tests.
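The analysis pipeline this abstract describes (PCA for the main contributing factors, a classification tree for interpretability, an SVM for prediction) can be sketched as follows. The feature matrix, labels and model settings are placeholder assumptions, not the authors' acoustic feature set or data.

```python
# Hedged sketch of a polite vs. mirthful laughter classification pipeline (assumed data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X: (n_laughs, n_acoustic_features), y: 0 = polite, 1 = mirthful (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = rng.integers(0, 2, size=200)

# Main contributing factors via PCA on standardized features.
pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))
print("explained variance ratios:", pca.explained_variance_ratio_)

# An interpretable tree and an SVM classifier, scored with cross-validation.
# (A speaker-independent evaluation would group folds by speaker instead.)
tree_scores = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=5)
svm_scores = cross_val_score(make_pipeline(StandardScaler(), SVC(kernel="rbf")), X, y, cv=5)
print("tree:", tree_scores.mean(), "svm:", svm_scores.mean())
```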
... A recent study of the occurrence of laughter in meetings [6] has shown that speech-laughs account for less than 4% of all laughter by time, and therefore less than 0.4% of all vocalization effort by time. Although the detection of laughter is currently gaining attention [7] [8], this prior makes the successful acoustic treatment of speech-laughs in the near term unlikely. The current work explores the detection of intervals containing involved speech based on only the non-speech laughter produced by meeting participants. ...
... These findings, summarized in Section 6, present an important opportunity for future acoustic laughter detection efforts. At the present time, detection has focused on those instances of laughter which are transcribed as isolated utterances [7] [8]. ...
Conference Paper
Browsing through collections of audio recordings of conversation nominally relies on the processing of participants' lexical productions. The evolving verbal and non-verbal context of those productions, likely indicative of the degree of participant involvement, is often ignored. The present work explores the relevance of laughter to the retrieval of conversation intervals in which the speech of one or more participants is prosodically or pragmatically marked as involved. Experiments indicate that the relevance of laughter depends on its temporal distance to the laugher's speech. The results suggest that in order to be pertinent to downstream emotion recognition applications, laughter detection systems must first and foremost detect that laughter which is most temporally proximate to the laugher's speech.
... In contrast, a considerable body of research exists on the acoustic detection of laughter in meetings [13], [14], [15], [16], [17], whose co-occurrence with humor-bearing talk appears self-evident but which, to our knowledge, has never been measured. This measurement, via a system which predicts attempts at humor from surrounding laughter, is the main goal of the current work. ...
Conference Paper
Systems designed for the automatic summarization of meetings have considered the propositional content of contributions by each speaker, but not the explicit techniques that speakers use to downgrade the perceived seriousness of those contributions. We analyze one such technique, namely attempts at humor. We find that speech spent on attempts at humor is rare by time but that it correlates strongly with laughter, which is more frequent. Contextual features describing the temporal and multiparticipant distribution of manually transcribed laughter yield error rates for the detection of attempts at humor which are 4 times lower than those obtained using oracle lexical information. Furthermore, we show that similar performance can be achieved by considering only the speaker's laughter, indicating that meeting participants explicitly signal their attempts at humor by laughing themselves. Finally, we present evidence which suggests that, on small time scales, the production of attempts at humor and their ratification via laughter often involves only two participants, belying the allegedly multiparty nature of the interaction.
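The "contextual features describing the temporal and multiparticipant distribution of laughter" mentioned above can be illustrated with a small sketch: for a candidate speech segment, count and time the laughter produced by the speaker and by other participants within a surrounding window. The window size and the specific feature set are illustrative assumptions.

```python
# Hedged sketch of contextual laughter features around a candidate speech segment.
from dataclasses import dataclass

@dataclass
class Laugh:
    speaker: str
    start: float   # seconds
    end: float     # seconds

def laughter_context_features(seg_start, seg_end, speaker, laughs, window=5.0):
    lo, hi = seg_start - window, seg_end + window
    in_window = [l for l in laughs if l.end > lo and l.start < hi]
    own = [l for l in in_window if l.speaker == speaker]
    others = [l for l in in_window if l.speaker != speaker]
    return {
        "own_laugh_count": len(own),
        "other_laugh_count": len(others),
        "n_laughing_participants": len({l.speaker for l in others}),
        "own_laugh_time": sum(min(l.end, hi) - max(l.start, lo) for l in own),
    }
```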
... Its detection in conversational interaction presents an important challenge in meeting understanding, as laughter has been shown to be predictive of both emotional valence [2] and activation/involvement [3, 4]. Group laughter detection was first explored in [5], but its detection on nearfield channels and its correct attribution to specific participants has only recently been attempted [6, 7]. Authors of the latter reported that clearly audible laughter, sufficiently long in duration and temporally distant from the laugher's speech, can be detected with equal error rates below 10% when a priori channel activity knowledge is available. ...
... By way of motivation, we explore this claim further in Section 3. The remainder of this paper is organized as follows. We first describe in Section 2 the data; it is the same as that in [5, 6, 7, 10]. Section 4 describes our baseline laughter detector, whose performance is analyzed in Section 5. Experiments and a discussion of the results are presented in Sections 6 and 7, respectively. ...
... As in other work on laughter detection in naturally occurring meetings [5, 6, 7, 10], we use the ICSI Meeting Corpus [11]. We retain the same division of the Bmr meetings into TRAINSET and DEVSET as proposed therein; we also report numbers for unseen EVALSET data, consisting of all of the Bed and Bro meetings. ...
Conference Paper
The detection of laughter in conversational interaction presents an important challenge in meeting understanding, primarily because laughter is predictive of the emotional state of participants. We present evidence which suggests that ignoring unvoiced laughter improves the prediction of emotional involvement in collocated speech, making a case for the distinction between voiced and unvoiced laughter during laughter detection. Our experiments show that the exclusion of unvoiced laughter during laughter model training, as well as its explicit modeling, leads to detection scores for voiced laughter which are much higher than those otherwise obtained for all laughter. Furthermore, duration modeling is shown to be a more effective means of improving precision than interaction modeling through joint-participant decoding. Taken together, the final detection F-scores we present for voiced laughter on our development set comprise a 20% reduction of error, relative to F-scores for all laughter reported in previous work, and 6% and 22% relative reductions in error on two larger datasets unseen during development.
... Laughter detection in meetings has received some attention, beginning with [2] in which farfield group laughter was detected automatically, but not attributed to specific participants. Subsequent research has focused on laughter/speech classification [8, 9] and laughter/non-laughter segmentation [10, 11]. However, in both cases, only a subset of all laughter instances, those not occurring in the temporal proximity of the laugher's speech, was considered. ...
... However, in both cases, only a subset of all laughter instances, those not occurring in the temporal proximity of the laugher's speech, was considered. Furthermore, in segmentation work, some form of pre-segmentation was assumed to have eliminated long stretches of channel inactivity [10, 11]. These measures have led to significantly higher recall and precision rates than would be obtained by a fully automatic segmenter with no a priori channel activity knowledge. ...
... In constructing the proposed baseline system, we rely on several contrastive aspects of laughter and speech, including acoustics, duration, and the degree of vocalization overlap. This work begins with a description of the meeting data used in our experiments (Section 2), which was selected to be exactly the same as in previous work [2, 10, 8, 9, 11]. However, our aim is to detect all laughter-in-interaction, including laughter which is interspersed among lexical items produced by each participant. ...
Conference Paper
Full-text available
Laughter is a key element of human-human interaction, occurring surprisingly frequently in multi-party conversation. In meetings, laughter accounts for almost 10% of vocalization effort by time, and is known to be relevant for topic segmentation and the automatic characterization of affect. We present a system for the detection of laughter, and its attribution to specific participants, which relies on simultaneously decoding the vocal activity of all participants given multi-channel recordings. The proposed framework allows us to disambiguate laughter and speech not only acoustically, but also by constraining the number of simultaneous speakers and the number of simultaneous laughers independently, since participants tend to take turns speaking but laugh together. We present experiments on 57 hours of meeting data, containing almost 11000 unique instances of laughter.
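The asymmetric constraint described here (participants take turns speaking but laugh together) can be illustrated, in heavily simplified form, by capping the number of active channels per class at each frame. The real system decodes all channels jointly; this greedy per-frame version, with illustrative limits and thresholds, only conveys the idea of independent caps on simultaneous speakers and laughers.

```python
# Hedged, simplified illustration of per-frame constraints on simultaneous speakers/laughers.
import numpy as np

def top_k_active(scores, k, threshold=0.0):
    """Activate at most k channels whose score exceeds the threshold (scores: 1-D array)."""
    active = np.zeros_like(scores, dtype=bool)
    best = np.argsort(scores)[::-1][:k]
    active[best] = scores[best] > threshold
    return active

def constrain_frame(speech_scores, laugh_scores, max_speakers=1, max_laughers=4):
    # Asymmetric caps: usually one speaker at a time, but several participants may laugh together.
    return (top_k_active(speech_scores, max_speakers),
            top_k_active(laugh_scores, max_laughers))
```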
Conference Paper
Full-text available
This article presents experiments on automatic detection of laughter and fillers, two of the most important nonverbal behavioral cues observed in spoken conversations. The proposed approach is fully automatic and segments audio recordings captured with mobile phones into four types of interval: laughter, filler, speech and silence. The segmentation methods rely not only on probabilistic sequential models (in particular Hidden Markov Models), but also on Statistical Language Models aimed at estimating the a priori probability of observing a given sequence of the four classes above. The experiments are speaker independent and performed over a total of 8 hours and 25 minutes of data (120 people in total). The results show that F1 scores up to 0.64 for laughter and 0.58 for fillers can be achieved.
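The "language model over classes" idea mentioned in this abstract can be sketched as a smoothed bigram prior over the four classes, estimated from labeled sequences; such a prior can then weight the transitions of an HMM/Viterbi decoder like the one sketched earlier. The smoothing constant and class ordering are illustrative assumptions.

```python
# Hedged sketch of a bigram prior over laughter/filler/speech/silence sequences.
import numpy as np

CLASSES = ["laughter", "filler", "speech", "silence"]
IDX = {c: i for i, c in enumerate(CLASSES)}

def bigram_log_prior(label_sequences, smoothing=1.0):
    """Estimate smoothed bigram log-probabilities from lists of class label sequences."""
    counts = np.full((len(CLASSES), len(CLASSES)), smoothing)
    for seq in label_sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[IDX[prev], IDX[cur]] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

# The resulting log-probabilities can replace a flat transition penalty in Viterbi decoding,
# biasing it toward class sequences actually observed in training data.
log_trans = bigram_log_prior([["silence", "speech", "filler", "speech", "laughter", "silence"]])
```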
Article
Full-text available
The AVLaughterCycle project aims at developing an audiovisual laughing machine, able to detect and respond to the user's laughs. Laughter is an important cue to reinforce engagement in human-computer interactions. As a first step toward this goal, we have implemented a system capable of recording the laugh of a user and responding to it with a similar laugh. The output laugh is automatically selected from an audiovisual laughter database by analyzing acoustic similarities with the input laugh. It is displayed by an Embodied Conversational Agent, animated using the audio-synchronized facial movements of the subject who originally uttered the laugh. The application is fully implemented, works in real time, and a large audiovisual laughter database has been recorded as part of the project. This paper presents AVLaughterCycle, its underlying components, the freely available laughter database and the application architecture. The paper also includes evaluations of several core components of the application. Objective tests show that the similarity search engine, though simple, significantly outperforms chance for grouping laughs by speaker or type. This result can be considered as a first measurement for computing acoustic similarities between laughs. A subjective evaluation has also been conducted to measure the influence of visual cues on the users' evaluation of similarity between laughs. Keywords: Laughter, Embodied Conversational Agent, Acoustic similarity, Facial motion tracking
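An acoustic similarity search of the kind this abstract describes can be sketched by summarizing each laugh with simple feature statistics and returning the nearest database laugh. The MFCC-statistics embedding and Euclidean distance below are illustrative choices, not the project's actual similarity engine.

```python
# Hedged sketch of nearest-neighbour laughter retrieval by acoustic similarity.
import librosa
import numpy as np

def laugh_embedding(wav_path, sr=16000):
    """Summarize a laugh by the mean and standard deviation of its MFCC frames."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def most_similar_laugh(query_path, database_paths):
    """Return the path of the database laugh acoustically closest to the query laugh."""
    query = laugh_embedding(query_path)
    embeddings = np.stack([laugh_embedding(p) for p in database_paths])
    distances = np.linalg.norm(embeddings - query, axis=1)
    return database_paths[int(np.argmin(distances))]
```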