Conference Paper

Using phonetic patterns for detecting social cues in natural conversations

... Kinect™ motion capture technology was used to capture facial features, gaze direction and depth information (Figure 8.1). Synchronisation was achieved using the Social Signal Interpretation framework (SSI) (Wagner et al. [169]). The participants took turns at telling personal stories which they associated with an enjoyable emotion. ...
... Regarding the recording within a natural and mobile setting, we chose to port the Social Signal Interpretation framework (SSI) (Wagner et al. [169]) to run on Android™-based mobile phones. This way we are able to provide a flexible framework for recording and real-time interpretation of multiple wearable sensing devices without constricting the user's mobility. ...
Thesis
The cues that describe emotional conditions are encoded within multiple modalities, and fusing multi-modal information is a natural way to improve the automated recognition of emotions. Throughout many studies, we see traditional fusion approaches in which decisions are synchronously forced for fixed time segments across all considered modalities and generic combination rules are applied. Varying success is reported; sometimes performance is worse than uni-modal classification. Starting from these premises, this thesis investigates and compares the performance of various synchronous fusion techniques. We enrich the traditional set with custom and emotion-adapted fusion algorithms that are tailored towards the affect recognition domain they are used in. These developments enhance recognition quality to a certain degree, but do not solve the occasionally occurring performance problems. To isolate the issue, we conduct a systematic investigation of synchronous fusion techniques on acted and natural data and conclude that the synchronous fusion approach shows a crucial weakness especially on non-acted emotions: the implicit assumption that relevant affective cues happen at the same time across all modalities only holds if emotions are depicted very coherently and clearly, which we cannot expect in a natural setting. This implies a switch to asynchronous fusion approaches. This change can be realized by applying classification models with memory capabilities (e.g. recurrent neural networks), but these are often data-hungry and non-transparent. We consequently present an alternative approach to asynchronous modality treatment: the event-driven fusion strategy, in which modalities decide when to contribute information to the fusion process in the form of affective events. These events can be used to introduce an additional abstraction layer to the recognition process, as provided events do not necessarily need to match the sought target class but can be cues that indicate the final assessment. Furthermore, we will see that the architecture of an event-driven fusion system is well suited for real-time usage, is very tolerant to temporarily missing input from single modalities, and is therefore a good choice for affect recognition in the wild. We demonstrate the mentioned capabilities in various comparison and prototype studies and present the application of event-driven fusion strategies in multiple European research projects.
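The event-driven strategy summarised in this abstract can be illustrated with a short, hypothetical example: modalities push time-stamped affective events whenever they detect a cue, and the fusion layer combines whatever evidence is currently available, with older events decaying in influence. The class names, the exponential decay and all parameter values below are illustrative assumptions, not the thesis' actual implementation.

# Minimal, illustrative sketch of event-driven fusion; names and parameters are hypothetical.
from dataclasses import dataclass, field
from math import exp

@dataclass
class AffectEvent:
    t: float          # time the cue was detected (seconds)
    modality: str     # e.g. "audio", "video"
    cue: str          # e.g. "laughter", "smile"
    score: float      # evidence for the target class in [-1, 1]
    confidence: float # detector confidence in [0, 1]

@dataclass
class EventFusion:
    half_life_s: float = 5.0            # how quickly old events lose influence
    events: list = field(default_factory=list)

    def add(self, event: AffectEvent) -> None:
        # Modalities push events whenever they detect a cue; no fixed segmentation
        # forces synchronous decisions across modalities.
        self.events.append(event)

    def estimate(self, now: float) -> float:
        """Fuse all events into a single affect estimate at time `now`."""
        num, den = 0.0, 0.0
        for ev in self.events:
            age = now - ev.t
            w = ev.confidence * exp(-age / self.half_life_s)  # exponential decay with age
            num += w * ev.score
            den += w
        return num / den if den > 0 else 0.0

fusion = EventFusion()
fusion.add(AffectEvent(t=0.0, modality="audio", cue="laughter", score=0.9, confidence=0.8))
fusion.add(AffectEvent(t=2.5, modality="video", cue="smile", score=0.6, confidence=0.7))
print(fusion.estimate(now=4.0))

A missing modality simply contributes no events, so the estimate degrades gracefully instead of failing, which matches the tolerance to temporarily missing input described above.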
Article
Full-text available
Scientific disciplines, such as Behavioural Psychology, Anthropology and, recently, Social Signal Processing, are concerned with the systematic exploration of human behaviour. A typical work-flow includes the manual annotation (also called coding) of social signals in multi-modal corpora of considerable size. For the annotators involved, this is an exhausting and time-consuming task. In the article at hand we present a novel method and also provide the tools to speed up the coding procedure. To this end, we suggest and evaluate the use of Cooperative Machine Learning (CML) techniques to reduce manual labelling efforts by combining the power of computational capabilities and human intelligence. The proposed CML strategy starts with a small number of labelled instances and concentrates on predicting local parts first. Afterwards, a session-independent classification model is created to finish the remaining parts of the database. Confidence values are computed to guide the manual inspection and correction of the predictions. To bring the proposed approach into application we introduce NOVA - an open-source tool for collaborative and machine-aided annotations. In particular, it gives labellers immediate access to CML strategies and directly provides visual feedback on the results. Our experiments show that the proposed method has the potential to significantly reduce human labelling efforts. Full text at: https://arxiv.org/abs/1802.02565
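A minimal sketch of the cooperative machine-learning loop described above, assuming scikit-learn and synthetic stand-in data; it is not NOVA's actual API, only an illustration of how confidence values can steer which predictions a human reviews next.

# Illustrative CML loop: train on a small labelled pool, predict the rest,
# and let confidence guide manual inspection. Requires numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                 # stand-in for per-frame features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in for "laughter" labels

labelled = np.zeros(len(X), dtype=bool)
labelled[:100] = True                           # a small, manually coded portion

for round_ in range(3):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labelled], y[labelled])

    proba = model.predict_proba(X[~labelled])
    confidence = proba.max(axis=1)              # low confidence -> inspect first

    # Simulate the human step: review and correct the least confident predictions,
    # then add them to the labelled pool (here we simply take the true labels).
    unl_idx = np.flatnonzero(~labelled)
    to_review = unl_idx[np.argsort(confidence)[:50]]
    labelled[to_review] = True

    acc = (model.predict(X[~labelled]) == y[~labelled]).mean()
    print(f"round {round_}: labelled={labelled.sum()}, accuracy on rest={acc:.3f}")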
Conference Paper
Full-text available
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as task, albeit with a broader range of overall twelve enacted emotional states. In this paper, we describe these four Sub-Challenges, their conditions, baselines, and a new feature set by the openSMILE toolkit, provided to the participants.
Article
Full-text available
It is essential for the advancement of human-centered multimodal interfaces to be able to infer the current user's state or communication state. In order to enable a system to do that, the recognition and interpretation of multimodal social signals (i.e., paralinguistic and nonverbal behavior) in real-time applications is required. Since we believe that laughs are one of the most important and widely understood social nonverbal signals indicating affect and discourse quality, we focus in this work on the detection of laughter in natural multiparty discourses. The conversations are recorded in a natural environment without any specific constraint on the discourses using unobtrusive recording devices. This setup ensures natural and unbiased behavior, which is one of the main foci of this work. To compare results, several methods, namely Gaussian Mixture Model (GMM) supervectors as input to a Support Vector Machine (SVM), so-called Echo State Networks (ESN), and a Hidden Markov Model (HMM) approach, are utilized in online and offline detection experiments. The SVM approach proves very accurate in the offline classification task, but is outperformed by the ESN and HMM approaches in the online detection (F1 scores: GMM SVM 0.45, ESN 0.63, HMM 0.72). Further, we were able to utilize the proposed HMM approach in a cross-corpus experiment without any retraining with respectable generalization capability (F1 score: 0.49). The results and possible reasons for these outcomes are shown and discussed in the article. The proposed methods may be directly utilized in practical tasks such as the labeling or the online detection of laughter in conversational data and affect-aware applications.
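The GMM-supervector-into-SVM idea mentioned above can be sketched roughly as follows: fit a background GMM on all frames, MAP-adapt its means to each segment, stack the adapted means into a fixed-length vector, and classify it with an SVM. The code uses scikit-learn and synthetic stand-in features; it is not the authors' pipeline, and the relevance factor and model sizes are assumptions.

# Sketch of GMM-supervector features classified by an SVM; synthetic data only.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def supervector(ubm: GaussianMixture, frames: np.ndarray, relevance: float = 16.0) -> np.ndarray:
    """MAP-adapt the UBM means to one segment and stack them into a fixed-length vector."""
    gamma = ubm.predict_proba(frames)                 # (n_frames, n_components)
    n_k = gamma.sum(axis=0) + 1e-8
    e_k = (gamma.T @ frames) / n_k[:, None]           # per-component frame means
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * e_k + (1.0 - alpha) * ubm.means_
    return adapted.ravel()

rng = np.random.default_rng(0)
# Stand-in MFCC-like frames: 40 "laughter" and 40 "speech" segments, 100 frames x 13 dims each.
segments = [rng.normal(loc=(0.3 if lab else 0.0), size=(100, 13)) for lab in ([1] * 40 + [0] * 40)]
labels = np.array([1] * 40 + [0] * 40)

ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(np.vstack(segments))                          # background model over all frames

X = np.stack([supervector(ubm, seg) for seg in segments])
clf = SVC(kernel="linear").fit(X[::2], labels[::2])   # train on every other segment
print("held-out accuracy:", clf.score(X[1::2], labels[1::2]))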
Chapter
Full-text available
Laughter as a vocal expressive-communicative signal is one of the least understood and most frequently overlooked human behaviors. The chapter provides an overview of what we know about laughter in terms of respiration, vocalization, facial action, and body movement and attempts to illustrate the mechanisms of laughter and to define its elements. The importance of discriminating between spontaneous and contrived laughter is pointed out and it is argued that unrestrained spontaneous laughter involves inarticulate vocalization. It is argued that we need research integrating the different systems in laughter including experience. Laughter is a conspicuous but frequently overlooked human phenomenon. In ontogenetic development it emerges, later than smiling, around the fourth month; however, cases of gelastic epilepsy (from Greek: gelos = laughter) among neonates demonstrate that all structures are there and functional at the date of birth. Further evidence for its innateness comes from twin studies as well as from the fact that laughter is easily observable among deaf-blind children (even among deaf-blind thalidomide children, who could not "learn" laughter by touching people's faces). Man is not the only animal that laughs. Like smiling, laughter has its equivalent in the repertoire of some nonhuman primates. Beginning with Darwin (1872), many writers have been struck by the notable acoustic, orofacial, and contextual similarities between chimpanzee and human laughter. Especially among juvenile chimpanzees a "play-face" with associated vocalization was noted to accompany actions such as play, tickling, or play-biting (Preuschoft, 1995; van Hooff, 1972).
Conference Paper
Full-text available
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
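To illustrate the low-level-descriptor-plus-functionals idea behind openSMILE without reproducing the toolkit itself, the following sketch computes a few LLDs with librosa and maps them onto a fixed-length vector with simple statistical functionals. The file path is a placeholder, and the choice of descriptors and functionals is an assumption, far smaller than openSMILE's actual feature sets.

# LLDs + statistical functionals, illustrated with librosa (not openSMILE's API).
import numpy as np
import librosa

y, sr = librosa.load("audio.wav", sr=16000)     # placeholder path

# Frame-wise low-level descriptors (LLDs).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
rms = librosa.feature.rms(y=y)                       # (1, n_frames), loudness proxy
delta = librosa.feature.delta(mfcc)                  # delta regression coefficients

llds = np.vstack([mfcc, delta, rms])                 # (27, n_frames)

# Statistical functionals map variable-length LLD contours onto a fixed-length
# feature vector, mirroring how openSMILE builds its static feature sets.
functionals = np.concatenate([
    llds.mean(axis=1),
    llds.std(axis=1),
    llds.min(axis=1),
    llds.max(axis=1),
])
print(functionals.shape)   # (108,) regardless of the audio's duration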
Conference Paper
Full-text available
Past research on automatic laughter detection has focused mainly on audio-based detection. Here we present an audiovisual approach to distinguishing laughter from speech and we show that integrating the information from audio and video channels leads to improved performance over single-modal approaches. Each channel consists of 2 streams (cues): facial expressions and head movements for video, and spectral and prosodic features for audio. We used decision-level fusion to integrate the information from the two channels and experimented using the SUM rule and a neural network as the integration functions. The results indicate that even a simple linear function such as the SUM rule achieves very good performance in audiovisual fusion. We also experimented with different combinations of cues, with the most informative being the facial expressions and the spectral features. The best combination of cues is the integration of facial expressions, spectral and prosodic features when a neural network is used as the fusion method. When tested on 96 audiovisual sequences, depicting spontaneously displayed (as opposed to posed) laughter and speech episodes, in a person-independent way the proposed audiovisual approach achieves over 90% recall rate and over 80% precision.
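Decision-level fusion with the SUM rule, as used above, reduces to adding (or averaging) the per-cue class posteriors and taking the arg max; the probabilities in this sketch are made-up placeholders rather than real classifier outputs.

# SUM-rule decision-level fusion over the four cues named above; placeholder values.
import numpy as np

# Per-cue posterior probabilities for the classes (laughter, speech) on one segment.
cue_posteriors = {
    "facial_expressions": np.array([0.80, 0.20]),
    "head_movements":     np.array([0.55, 0.45]),
    "spectral_features":  np.array([0.70, 0.30]),
    "prosodic_features":  np.array([0.60, 0.40]),
}

# SUM rule: add (equivalently, average) the posteriors and pick the arg max.
fused = np.mean(list(cue_posteriors.values()), axis=0)
label = ["laughter", "speech"][int(np.argmax(fused))]
print(fused, "->", label)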
Conference Paper
Full-text available
Non-verbal vocalisations such as laughter, breathing, hesitation, and consent play an important role in the recognition and understanding of human conversational speech and spontaneous affect. In this contribution we discuss two different strategies for robust discrimination of such events: dynamic modelling by a broad selection of diverse acoustic Low-Level-Descriptors vs. static modelling by projection of these via statistical functionals onto a 0.6k feature space with subsequent de-correlation. As classifiers we employ Hidden Markov Models, Conditional Random Fields, and Support Vector Machines, respectively. For discussion of extensive parameter optimisation test-runs with respect to features and model topology, 2.9k non-verbals are extracted from the spontaneous Audio-Visual Interest Corpus. 80.7% accuracy can be reported with, and 92.6% without a garbage model for the discrimination of the named classes.
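The dynamic-modelling route described above can be sketched by training one HMM per non-verbal class on frame-wise LLD sequences and picking the class with the highest log-likelihood; the sketch assumes the hmmlearn library and synthetic stand-in data rather than the Audio-Visual Interest Corpus or the paper's exact setup.

# One HMM per class over frame-wise LLD sequences, compared by log-likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def make_sequences(offset, n=30):
    """Synthetic (n_frames, n_llds) sequences standing in for one class."""
    return [rng.normal(loc=offset, size=(rng.integers(40, 80), 13)) for _ in range(n)]

classes = {"laughter": make_sequences(0.5), "hesitation": make_sequences(-0.5)}
models = {}
for name, seqs in classes.items():
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    models[name] = GaussianHMM(n_components=3, covariance_type="diag",
                               n_iter=20, random_state=0).fit(X, lengths)

# Classify an unseen sequence by the highest per-class log-likelihood.
test = rng.normal(loc=0.5, size=(60, 13))
scores = {name: m.score(test) for name, m in models.items()}
print(max(scores, key=scores.get))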
Conference Paper
Full-text available
Laughter is a key element of human-human interaction, occurring surprisingly frequently in multi-party conversation. In meetings, laughter accounts for almost 10% of vocalization effort by time, and is known to be relevant for topic segmentation and the automatic characterization of affect. We present a system for the detection of laughter, and its attribution to specific participants, which relies on simultaneously decoding the vocal activity of all participants given multi-channel recordings. The proposed framework allows us to disambiguate laughter and speech not only acoustically, but also by constraining the number of simultaneous speakers and the number of simultaneous laughers independently, since participants tend to take turns speaking but laugh together. We present experiments on 57 hours of meeting data, containing almost 11000 unique instances of laughter.
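A heavily simplified, frame-wise sketch of the constraint idea described above: allow at most one simultaneous speaker but several simultaneous laughers by keeping only the highest-scoring candidates per frame. The real system performs a joint multi-channel decode; the scores, caps and threshold here are illustrative assumptions.

# Frame-wise enforcement of independent caps on simultaneous speakers and laughers.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_participants = 10, 4
speech_scores = rng.random((n_frames, n_participants))   # per-channel speech evidence
laugh_scores = rng.random((n_frames, n_participants))    # per-channel laughter evidence

MAX_SPEAKERS, MAX_LAUGHERS, THRESHOLD = 1, 4, 0.6         # people laugh together, but take turns speaking

speaking = np.zeros_like(speech_scores, dtype=bool)
laughing = np.zeros_like(laugh_scores, dtype=bool)
for t in range(n_frames):
    # Rank candidates by evidence; keep only those within the cap and above threshold.
    for scores, active, cap in ((speech_scores, speaking, MAX_SPEAKERS),
                                (laugh_scores, laughing, MAX_LAUGHERS)):
        order = np.argsort(scores[t])[::-1][:cap]
        active[t, order] = scores[t, order] > THRESHOLD

print("frames with several laughers:", int((laughing.sum(axis=1) > 1).sum()))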
Article
Full-text available
Remarkably little is known about the acoustic features of laughter. Here, acoustic outcomes are reported for 1024 naturally produced laugh bouts recorded from 97 young adults as they watched funny video clips. Analyses focused on temporal features, production modes, source- and filter-related effects, and indexical cues to laugher sex and individual identity. Although a number of researchers have previously emphasized stereotypy in laughter, its acoustics were found here to be variable and complex. Among the variety of findings reported, evident diversity in production modes, remarkable variability in fundamental frequency characteristics, and consistent lack of articulation effects in supralaryngeal filtering are of particular interest. In addition, formant-related filtering effects were found to be disproportionately important as acoustic correlates of laugher sex and individual identity. These outcomes are examined in light of existing data concerning laugh acoustics, as well as a number of hypotheses and conjectures previously advanced about this species-typical vocal signal.
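The kinds of per-bout measures reported above (duration, voicing, F0 level and variability) could be extracted along the following lines with librosa; the file path is a placeholder, and the pitch range and summary statistics are assumptions, not the study's actual measurement procedure.

# Simple per-bout acoustic measures for a recorded laugh; placeholder audio path.
import numpy as np
import librosa

y, sr = librosa.load("laugh_bout.wav", sr=16000)

f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=600, sr=sr)
f0_voiced = f0[~np.isnan(f0)]

features = {
    "duration_s": len(y) / sr,
    "voiced_fraction": float(np.mean(voiced_flag)),
    "f0_mean_hz": float(np.mean(f0_voiced)) if f0_voiced.size else None,
    "f0_range_hz": float(np.ptp(f0_voiced)) if f0_voiced.size else None,
    # Coefficient of variation as a simple index of F0 variability.
    "f0_cv": float(np.std(f0_voiced) / np.mean(f0_voiced)) if f0_voiced.size else None,
}
print(features)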
Conference Paper
The social nature of laughter invites people to laugh together. This joint vocal action often results in overlapping laughter. In this paper, we show that the acoustics of overlapping laughs are different from non-overlapping laughs. We found that overlapping laughs are more strongly prosodically marked than non-overlapping ones, in terms of higher values for duration, mean F0, mean and maximum intensity, and the amount of voicing. This effect is intensified by the number of people joining in the laughter event, which suggests that entrainment is at work. We also found that group size affects the number of overlapping laughs, which illustrates the contagious nature of laughter. Finally, people appear to join in laughter at a delay of approximately 500 ms; a delay that must be considered when developing spoken dialogue systems that are able to respond to users' laughs.
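Given laugh intervals labelled per speaker, overlap and join-in delay, as analysed above, can be derived with a few lines of code; the intervals below are made-up placeholders.

# Overlapping laughs and onset-to-onset join delays from labelled intervals.
from itertools import combinations

laughs = [  # (speaker, start_s, end_s)
    ("A", 10.0, 11.8),
    ("B", 10.5, 12.0),   # joins A after ~0.5 s -> overlapping laugh
    ("C", 25.0, 26.1),   # nobody joins -> non-overlapping laugh
]

overlapping, delays = set(), []
for a, b in combinations(laughs, 2):
    if a[0] != b[0] and a[1] < b[2] and b[1] < a[2]:   # time intervals intersect
        overlapping.update({a, b})
        delays.append(abs(b[1] - a[1]))                # onset-to-onset join delay

print("overlapping laughs:", len(overlapping), "of", len(laughs))
print("mean join delay (s):", sum(delays) / len(delays) if delays else None)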
Article
In Automatic Speech Recognition (ASR), the presence of Out Of Vocabulary (OOV) words or sounds within the speech signal can have a detrimental effect on recognition performance. One common method of solving this problem is to use filler models to absorb the unwanted OOV utterances. A balance between accepting In Vocabulary (IV) words and rejecting OOV words can be achieved by manipulating the values of the Word Insertion Penalty and the Filler Insertion Penalty. This paper investigates the ability of three different classes of HMM filler models, K-Means, Mean and Baum-Welch, to discriminate between IV and OOV words. The results show that using the Baum-Welch trained HMMs, 97.0% accuracy is possible for keyword IV acceptance and OOV rejection. The K-Means filler models provide the highest IV acceptance score of 97.3% but lower overall accuracy. However, the computational complexity of the K-Means algorithm is significantly lower and it requires no additional speech training data.
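In the log domain, the penalty trade-off described above amounts to comparing the keyword path's acoustic score plus the Word Insertion Penalty (WIP) against the filler path's score plus the Filler Insertion Penalty (FIP); the scores and penalty values in this toy sketch are illustrative placeholders, not values from the paper.

# Toy illustration of the WIP/FIP balance between IV acceptance and OOV rejection.
def accept_keyword(keyword_loglik: float, filler_loglik: float,
                   wip: float, fip: float) -> bool:
    return keyword_loglik + wip > filler_loglik + fip

# A clear in-vocabulary utterance: the keyword model fits much better than the filler.
print(accept_keyword(keyword_loglik=-120.0, filler_loglik=-150.0, wip=-10.0, fip=-5.0))  # True

# An OOV utterance: the filler model absorbs it unless the WIP is made very lenient.
print(accept_keyword(keyword_loglik=-148.0, filler_loglik=-140.0, wip=-10.0, fip=-5.0))  # False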
Article
The ability to understand and manage social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence – the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement – in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and similar, design and development of automated systems for social signal processing (SSP) are rather difficult. This paper surveys the past efforts in solving these problems by a computer, it summarizes the relevant findings in social psychology, and it proposes a set of recommendations for enabling the development of the next generation of socially aware computing.
Article
Emotions can be recognized by audible paralinguistic cues in speech. By detecting these paralinguistic cues that can consist of laughter, a trembling voice, coughs, changes in the intonation contour etc., information about the speaker’s state and emotion can be revealed. This paper describes the development of a gender-independent laugh detector with the aim to enable automatic emotion recognition. Different types of features (spectral, prosodic) for laughter detection were investigated using different classification techniques (Gaussian Mixture Models, Support Vector Machines, Multi Layer Perceptron) often used in language and speaker recognition. Classification experiments were carried out with short pre-segmented speech and laughter segments extracted from the ICSI Meeting Recorder Corpus (with a mean duration of approximately 2 s). Equal error rates of around 3% were obtained when tested on speaker-independent speech data. We found that a fusion between classifiers based on Gaussian Mixture Models and classifiers based on Support Vector Machines increases discriminative power. We also found that a fusion between classifiers that use spectral features and classifiers that use prosodic information usually increases the performance for discrimination between laughter and speech. Our acoustic measurements showed differences between laughter and speech in mean pitch and in the ratio of the durations of unvoiced to voiced portions, which indicate that these prosodic features are indeed useful for discrimination between laughter and speech.
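Score-level fusion and equal-error-rate evaluation of the kind described above can be sketched as follows, using two simple stand-in classifiers on synthetic "spectral" and "prosodic" features rather than the paper's GMM/SVM/MLP systems or the ICSI data.

# Score-level fusion of two classifiers and a simple equal-error-rate estimate.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.array([0, 1] * 200)                                  # 0 = speech, 1 = laughter
X_spec = rng.normal(loc=y[:, None] * 0.8, size=(400, 20))   # stand-in spectral features
X_pros = rng.normal(loc=y[:, None] * 0.5, size=(400, 4))    # stand-in prosodic features

tr, te = slice(0, 300), slice(300, 400)
s_spec = SVC(probability=True, random_state=0).fit(X_spec[tr], y[tr]).predict_proba(X_spec[te])[:, 1]
s_pros = LogisticRegression().fit(X_pros[tr], y[tr]).predict_proba(X_pros[te])[:, 1]

fused = 0.5 * (s_spec + s_pros)                             # simple score-level fusion

def eer(scores, labels):
    fpr, tpr, _ = roc_curve(labels, scores)
    i = np.argmin(np.abs(fpr - (1 - tpr)))                  # point where FAR ~= FRR
    return (fpr[i] + (1 - tpr[i])) / 2

for name, s in (("spectral", s_spec), ("prosodic", s_pros), ("fusion", fused)):
    print(name, "EER:", round(eer(s, y[te]), 3))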