Preprint

Predicting Depression and Emotions in the Cross-roads of Cultures, Para-linguistics, and Non-linguistics


Abstract

Cross-language, cross-cultural emotion recognition and accurate prediction of affective disorders are two of the major challenges in affective computing today. In this work, we compare several systems for the Detecting Depression with AI Sub-challenge (DDS) and the Cross-cultural Emotion Sub-challenge (CES), which are organized as part of the Audio/Visual Emotion Challenge (AVEC) 2019. For both sub-challenges, we build on the challenge baselines while introducing our own features and regression models. For the DDS challenge, where ASR transcripts are provided by the organizers, we propose simple linguistic and word-duration features. These ASR-transcript-based features are shown to outperform the state-of-the-art audio-visual features for this task, reaching a test set Concordance Correlation Coefficient (CCC) of 0.344 compared to a challenge baseline of 0.120. Our results show that non-verbal parts of the signal are important for the detection of depression, and that combining them with linguistic information produces the best results. For CES, the proposed systems using unsupervised feature adaptation outperform the challenge baselines on emotional primitives, reaching test set CCC performances of 0.466 and 0.499 for arousal and valence, respectively.
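Since both sub-challenges are evaluated with the Concordance Correlation Coefficient, a minimal Python sketch of the metric (the standard Lin definition, not the authors' evaluation code) may help in interpreting the reported scores:

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()            # population variances
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))  # population covariance
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Example: a perfect prediction gives CCC = 1.0
print(concordance_cc([1, 2, 3, 4], [1, 2, 3, 4]))  # -> 1.0
```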


... Rodrigues et al. [26] used texts transcribed from audio, with hidden embeddings extracted from a pretrained BERT [6] model, and employed CNNs to obtain cross-modality information. Kaya et al. [16] designed new Automatic Speech Recognizer (ASR) transcription-based features, while Ray [24] proposed a multi-layer attention network for estimating depression. Aside from acoustic, visual, and textual features, Kroenke et al. [17] showed that body gestures make a positive contribution to the accuracy of depression estimation. ...
... Comparison on the AVEC 2019 depression task (CCC / MAE): Baseline [25]: 0.111 / 6.37; Adaptive Fusion Transformer [29]: 0.331 / 6.22; EF [16]: 0.344 / –; Bert-CNN & Gated-CNN [26]: 0.403 / 6.11; Multi-scale Temporal Dilated CNN [8]: 0.430 / 4.39; Hierarchical BiLSTM [36]: 0.442 / 5.50; CubeMLP (Ours): 0.583 / 4.37. Results of the existing studies are listed according to their published papers. As is shown in the table, CubeMLP achieves an MAE of 0.770 on CMU-MOSI and 0.529 on CMU-MOSEI, which is competitive with other state-of-the-art approaches. ...
... Sun et al. [29] propose the adaptive fusion Transformer network to adaptively fuse the final predictions. EF [16] introduces simple linguistic and word-duration features to estimate the depression level. Bert-CNN & Gated-CNN [26] employs a gate mechanism to fuse the information obtained from the involved modalities. ...
Preprint
Multimodal sentiment analysis and depression estimation are two important research topics that aim to predict human mental states using multimodal data. Previous research has focused on developing effective fusion strategies for exchanging and integrating mind-related information from different modalities. Some MLP-based techniques have recently achieved considerable success in a variety of computer vision tasks. Inspired by this, we explore multimodal approaches with a feature-mixing perspective in this study. To this end, we introduce CubeMLP, a multimodal feature processing framework based entirely on MLP. CubeMLP consists of three independent MLP units, each of which has two affine transformations. CubeMLP accepts all relevant modality features as input and mixes them across three axes. After extracting the characteristics using CubeMLP, the mixed multimodal features are flattened for task predictions. Our experiments are conducted on sentiment analysis datasets: CMU-MOSI and CMU-MOSEI, and depression estimation dataset: AVEC2019. The results show that CubeMLP can achieve state-of-the-art performance with a much lower computing cost.
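The abstract's idea of mixing multimodal features along three axes with plain MLPs can be illustrated with a toy PyTorch sketch; the tensor shape (batch, sequence, modality, channel), the layer sizes, and the flat regression head below are my own simplifications, not the released CubeMLP code.

```python
import torch
import torch.nn as nn

class AxisMix(nn.Module):
    """Mixes a (batch, L, M, d) tensor along one chosen axis with a 2-layer MLP."""
    def __init__(self, size, hidden, axis):
        super().__init__()
        self.axis = axis  # 1: sequence, 2: modality, 3: channel
        self.mlp = nn.Sequential(nn.Linear(size, hidden), nn.GELU(), nn.Linear(hidden, size))

    def forward(self, x):                 # x: (B, L, M, d)
        x = x.transpose(self.axis, -1)    # move the target axis to the last dimension
        x = self.mlp(x)
        return x.transpose(self.axis, -1)

class ToyCubeMixer(nn.Module):
    """Three independent axis-mixing units followed by a flat regression head."""
    def __init__(self, seq_len=50, n_mod=3, dim=32):
        super().__init__()
        self.seq_mix = AxisMix(seq_len, 64, axis=1)
        self.mod_mix = AxisMix(n_mod, 8, axis=2)
        self.chn_mix = AxisMix(dim, 64, axis=3)
        self.head = nn.Linear(seq_len * n_mod * dim, 1)

    def forward(self, x):                 # x: (B, L, M, d)
        x = self.chn_mix(self.mod_mix(self.seq_mix(x)))
        return self.head(x.flatten(1))    # one severity / sentiment score per sample

x = torch.randn(4, 50, 3, 32)             # a batch of padded multimodal sequences
print(ToyCubeMixer()(x).shape)             # torch.Size([4, 1])
```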
... In addition, for depression prediction, self-assessed PHQ-8 scores are provided. We use the breathing annotations presented in [12] for cross-dataset evaluation of breath signal prediction. A total of 1478 breath events are annotated across 16 recordings. ...
... However, if the size is too small, it becomes difficult to catch breathing events. We evaluated the selected parameters on the annotated breath events introduced by Kaya et al. [12]. ...
... It is arguably the most popular in-the-wild public database available to date, featuring time-continuous, high-resolution labels for multiple dimensions of affect. The participants of the AVEC challenges (Kaya et al., 2019; Zhao, Li, Liang, Chen, & Jin, 2019) compete to correctly predict these affect labels for different cultures, based on the audio, textual, and video features provided. The bag-of-words representation computed using openXBOW has been shown to perform well across all three modalities. ...
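The bag-of-words representation mentioned above is computed with the openXBOW tool in the AVEC baselines; the sketch below is a generic k-means codebook approximation of the same idea (not openXBOW itself), with an arbitrary codebook size.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_words(frame_features, codebook_size=100, seed=0):
    """Quantise frame-level descriptors (e.g. acoustic LLDs) of many clips
    into fixed-length histogram ("bag-of-words") vectors."""
    all_frames = np.vstack(frame_features)  # pool the frames of all clips
    codebook = KMeans(n_clusters=codebook_size, random_state=seed, n_init=10).fit(all_frames)
    bows = []
    for clip in frame_features:
        idx = codebook.predict(clip)        # assign each frame to a "word"
        hist = np.bincount(idx, minlength=codebook_size).astype(float)
        bows.append(hist / max(len(clip), 1))  # normalise by clip length
    return np.array(bows)

# Example: 3 clips with random 23-dimensional frame descriptors
clips = [np.random.randn(np.random.randint(80, 120), 23) for _ in range(3)]
print(bag_of_words(clips, codebook_size=16).shape)  # (3, 16)
```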
Article
A robust value- and time-continuous emotion recognition has enormous potential benefits within healthcare. For example, within mental health, a real-time patient monitoring system capable of accurately inferring a patient's emotional state could help doctors make an appropriate diagnosis and treatment plan. Such interventions could be vital in terms of ensuring a higher quality of life for the patient involved. To make such tools a reality, the associated machine learning systems need to be fast, robust and generalisable. In this regard, we present herein a novel emotion recognition system consisting of the shallowest realisable Convolutional Neural Network (CNN) architecture. We draw insights from visualisations of the trained filter weights and the facial action unit (FAU) activations, i.e., the inputs to the model, of the participants featured in the in-the-wild, spontaneous video-chat sessions of the SEWA corpus. Further, we demonstrate the generalisability of this approach on the German, Hungarian, and Chinese cultures available in this corpus. The obtained cross-cultural performance is a testimony to the universality of FAUs in the expression and understanding of human affective behaviours. These learnings were moderately consistent with the human perception of emotional expression. The practicality of the proposed approach is also demonstrated in another key healthcare application: pain intensity prediction. Key results from these experiments highlight the transparency of the shallow CNN structure. As FAUs can be extracted in near real-time, and because the models we developed are exceptionally shallow, this study paves the way for robust, cross-cultural, end-to-end, in-the-wild, explainable real-time affect and pain prediction that is value- and time-continuous.
Article
Video-based automatic depression analysis provides a fast, objective and repeatable self-assessment solution, which has been widely developed in recent years. While depression cues may be reflected by human facial behaviours at various temporal scales, most existing approaches focus on modelling depression from either short-term or video-level facial behaviours. In this sense, we propose a two-stage framework that models depression severity from multi-scale short-term and video-level facial behaviours. The short-term depressive behaviour modelling stage first deep-learns depression-related facial behavioural features from multiple short temporal scales, where a Depression Feature Enhancement (DFE) module is proposed to enhance the depression-related cues for all temporal scales and remove non-depression-related noise. Two novel graph encoding strategies are proposed in the video-level depressive behaviour modelling stage, i.e., Sequential Graph Representation (SEG) and Spectral Graph Representation (SPG), to re-encode all short-term features of the target video into a video-level graph representation, summarizing depression-related multi-scale video-level temporal information. As a result, the produced graph representations predict depression severity using both short-term and long-term facial behaviour patterns. The experimental results on the AVEC 2013, AVEC 2014 and AVEC 2019 datasets show that the proposed DFE module consistently enhanced the depression severity estimation performance for various CNN models, while the SPG is superior to other video-level modelling methods. More importantly, the result achieved by the proposed two-stage framework shows its promising and solid performance compared to widely used one-stage modelling approaches. Our code is publicly available at https://github.com/jiaqi-pro/Depression-detection-Graph
Article
We systematically analyze a modular deep learning pipeline that uses speech transcriptions as input for depression severity prediction. Through our pipeline, we investigate the role of popular deep learning architectures in creating representations for depression assessment. Evaluation of the proposed architectures is performed on the publicly available Extended Distress Analysis Interview Corpus dataset (E-DAIC). Through the results and discussions, we show that informative representations for depression assessment can be obtained without exploiting the temporal dynamics between descriptive text representations. More specifically, temporal pooling of latent representations outperforms the state of the art, which employs recurrent architectures, by 8.8% in terms of Concordance Correlation Coefficient (CCC).
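The temporal pooling of latent text representations described here can be sketched as simple statistical pooling over per-sentence embeddings; the 768-dimensional embeddings and the mean/max functional set are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def temporal_pool(sentence_embeddings, mode="mean"):
    """Collapse a variable-length sequence of sentence embeddings (T, d)
    into one fixed-size vector, discarding temporal order."""
    E = np.asarray(sentence_embeddings)
    if mode == "mean":
        return E.mean(axis=0)
    if mode == "max":
        return E.max(axis=0)
    if mode == "meanmax":  # concatenate both statistics
        return np.concatenate([E.mean(axis=0), E.max(axis=0)])
    raise ValueError(mode)

# A 12-sentence interview transcript encoded into 768-d vectors (e.g. by a BERT-style encoder)
transcript = np.random.randn(12, 768)
features = temporal_pool(transcript, mode="meanmax")
print(features.shape)  # (1536,) -> input to a regression head predicting severity
```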
Article
Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with depression and those without respond differently to various emotional stimuli, providing valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. Our proposed intra- and inter-emotional stimulus fusion model effectively integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that allows accurate task predictions in the context of depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, and the results outperform state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE) and accuracy.
Article
Full-text available
Multimodal sentiment analysis (MSA) is an emerging field focused on interpreting complex human emotions and expressions by integrating various data types, including text, audio, and visuals. Addressing the challenges in this area, we introduce SentDep, a groundbreaking framework that merges cutting-edge fusion methods with modern deep learning structures. Designed to effectively blend the unique features of textual, acoustic, and visual data, SentDep offers a unified and potent representation of multimodal data. Our extensive tests on renowned datasets like CMU-MOSI and CMU-MOSEI demonstrate that SentDep surpasses current leading models, setting a new standard in MSA performance. We conducted thorough ablation studies and supplementary experiments to identify what drives SentDep’s success. These studies highlight the importance of the size of pre-training data, the effectiveness of various fusion techniques, and the critical role of temporal information in enhancing the model’s capabilities.
Article
Full-text available
As mental health (MH) disorders become increasingly prevalent, their multifaceted symptoms and comorbidities with other conditions introduce complexity to diagnosis, posing a risk of underdiagnosis. While machine learning (ML) has been explored to mitigate these challenges, we hypothesized that multiple data modalities support more comprehensive detection and that non-intrusive collection approaches better capture natural behaviors. To understand the current trends, we systematically reviewed 184 studies to assess feature extraction, feature fusion, and ML methodologies applied to detect MH disorders from passively sensed multimodal data, including audio and video recordings, social media, smartphones, and wearable devices. Our findings revealed varying correlations of modality-specific features in individualized contexts, potentially influenced by demographics and personalities. We also observed the growing adoption of neural network architectures for model-level fusion and as ML algorithms, which have demonstrated promising efficacy in handling high-dimensional features while modeling within and cross-modality relationships. This work provides future researchers with a clear taxonomy of methodological approaches to multimodal detection of MH disorders to inspire future methodological advancements. The comprehensive analysis also guides and supports future researchers in making informed decisions to select an optimal data source that aligns with specific use cases based on the MH disorder of interest.
Article
Full-text available
Bipolar disorder (BD) is one of the most common mental illnesses worldwide. In this study, a smartphone application was developed to collect digital phenotyping data of users, and an ensemble method combining the results from a model pool was established through heterogeneous digital phenotyping. The aim was to predict the severity of bipolar symptoms by using two clinician-administered scales, the Hamilton Depression Rating Scale (HAM-D) and the Young Mania Rating Scale (YMRS). The collected digital phenotype data included the user’s location information (GPS), self-report scales, daily mood, sleep patterns, and multimedia records (text, speech, and video). Each category of digital phenotype data was used for training models and predicting the rating scale scores (HAM-D and YMRS). Seven models were tested and compared, and different combinations of feature types were used to evaluate the performance of heterogeneous data. To address missing data, an ensemble approach was employed to increase flexibility in rating scale score prediction. This study collected heterogeneous digital phenotype data from 84 individuals with BD and 11 healthy controls. Five-fold cross-validation was employed for evaluation. The experimental results revealed that the Lasso and ElasticNet regression models were the most effective in predicting rating scale scores, and heterogeneous data performed better than homogeneous data, with a mean absolute error of 1.36 and 0.55 for HAM-D and YMRS, respectively; this margin of error meets medical requirements.
Article
Sentiment analysis is an important research field aiming to extract and fuse sentimental information from human utterances. Due to the diversity of human sentiment, analyzing multiple modalities is usually more accurate than analyzing a single modality. To complement the information between related modalities, one effective approach is performing cross-modality interactions. Recently, Transformer-based frameworks have shown a strong ability to capture long-range dependencies, leading to the introduction of several Transformer-based approaches for multimodal processing. However, due to the built-in attention mechanism of the Transformers, only two modalities can be engaged at once. As a result, the complementary information flow in these Transformer-based techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that takes into account all relevant modalities for interactions. More precisely, we first construct a tensor utilizing the features extracted from each modality, assuming one modality is the target while the remaining modalities serve as the sources. We can generate the corresponding interacted features by calculating source-target attention. This strategy interacts with all involved modalities and generates complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrated the effectiveness of TensorFormer. In addition, we also evaluated TensorFormer in another related area, depression detection, and the results reveal significant improvements when compared to other state-of-the-art methods.
Article
Bipolar disorder is a mental health disorder that causes mood swings ranging from depression to mania. Clinical diagnosis of bipolar disorder is based on patient interviews and reports obtained from the relatives of the patients. Consequently, the diagnosis depends on the experience of the expert, and there is co-morbidity with other mental disorders. Automated processing in the diagnosis of bipolar disorder can help by providing quantitative indicators and allowing easier observation of patients over longer periods. In this paper, we create a multimodal decision system for three-level mania classification based on recordings of the patients in the acoustic, linguistic, and visual modalities. The system is evaluated on the Turkish Bipolar Disorder corpus we have recently introduced to the scientific community. Comprehensive analyses of unimodal and multimodal systems, as well as fusion techniques, are performed. Using acoustic, linguistic, and visual features in a multimodal fusion system, we achieved a 64.8% unweighted average recall score, which advances the state-of-the-art performance on this dataset.
Article
Computational technologies have revolutionized the archival sciences field, prompting new approaches to process the extensive data in these collections. Automatic speech recognition and natural language processing create unique possibilities for the analysis of oral history (OH) interviews, where otherwise the transcription and analysis of the full recording would be too time consuming. However, many oral historians note the loss of aural information when converting speech into text, pointing out the relevance of subjective cues for a full understanding of the interviewee narrative. In this article, we explore various computational technologies for social signal processing and their potential application space in OH archives, as well as in neighboring domains where qualitative studies are a frequently used method. We also highlight the latest developments in key technologies for multimedia archiving practices such as natural language processing and automatic speech recognition. We discuss the analysis of both visual cues (body language and facial expressions) and non-visual cues (paralinguistics, breathing, and heart rate), stating the specific challenges introduced by the characteristics of OH collections. We argue that applying social signal processing to OH archives will have a wider influence than solely on OH practices, bringing benefits to various fields from the humanities to computer science, as well as to archival science. Looking at human emotions and somatic reactions in extensive interview collections would give scholars from multiple fields the opportunity to focus on feelings, mood, culture, and subjective experiences expressed in these interviews on a larger scale.
Article
Full-text available
As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention in the last two decades. While multimodal systems enjoy high performance on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely "in-the-wild", data. This work investigates audiovisual deep learning approaches to the in-the-wild emotion recognition problem. Inspired by the outstanding performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-specific Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated final predictions using a weighted score fusion scheme. Exploiting the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, which advances the first-runner-up performance.
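The weighted score fusion scheme used to combine unimodal predictions can be sketched as a convex combination whose weight is tuned on a development set; the two-modality grid search below is an illustrative simplification of such a scheme, not the system's actual fusion code.

```python
import numpy as np

def fuse(scores, weights):
    """Convex combination of per-modality score matrices (each of shape (N, C))."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * s for w, s in zip(weights, scores))

def tune_weights(dev_scores, dev_labels, grid=np.arange(0.0, 1.01, 0.1)):
    """Pick fusion weights for two modalities that maximise dev-set accuracy."""
    best, best_acc = (0.5, 0.5), -1.0
    for w in grid:
        fused = fuse(dev_scores, [w, 1.0 - w])
        acc = np.mean(fused.argmax(axis=1) == dev_labels)
        if acc > best_acc:
            best, best_acc = (w, 1.0 - w), acc
    return best, best_acc

audio = np.random.rand(100, 7)   # class posteriors from an audio model (7 expressions)
video = np.random.rand(100, 7)   # class posteriors from a visual model
labels = np.random.randint(0, 7, 100)
print(tune_weights([audio, video], labels))
```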
Article
Full-text available
Currently, AI-based assistive technologies, particularly those involving sensitive data, such as systems for detecting mental illness and emotional disorders, are full of confidentiality, integrity, and security compromises. In the aforesaid context, this work proposes an algorithm for detecting depressive states based on only three never utilized speech markers. This reduced number of markers offers a valuable protection of personal (sensitive) data by not allowing for the retrieval of the speaker’s identity. The proposed speech markers are derived from the analysis of pitch variations measured in speech data obtained through a tale reading task performed by typical and depressed subjects. A sample of 22 subjects (11 depressed and 11 healthy, according to both psychiatric diagnosis and BDI classification) were involved. The reading wave files were listened to and split into a sequence of intervals, each lasting two seconds. For each subject’s reading and each reading interval, the average pitch, the pitch variation (T), the average pitch variation (A), and the inversion percentage (also called the oscillation percentage O) were automatically computed. The values of the triplet (Ti, Ai, Oi) for the i-th subject provide, all together, a 100% correct discrimination between the speech produced by typical and depressed individuals, while requiring a very low computational cost and offering a valuable protection of personal data.
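A rough sketch of the three interval-level pitch markers described above (pitch variation T, average pitch variation A, and inversion/oscillation percentage O), computed from a voiced-frame pitch contour; the exact definitions used in the paper are not reproduced here, so these formulas are approximations for illustration only.

```python
import numpy as np

def interval_markers(pitch_hz):
    """Approximate per-interval markers from a pitch contour (Hz) of one
    two-second reading interval. T: total pitch variation, A: average
    frame-to-frame variation, O: percentage of direction inversions."""
    p = np.asarray(pitch_hz, dtype=float)
    diffs = np.diff(p)
    T = float(np.abs(diffs).sum())                    # total variation
    A = float(np.abs(diffs).mean()) if len(diffs) else 0.0
    signs = np.sign(diffs[diffs != 0])
    inversions = np.sum(signs[1:] != signs[:-1])      # rising <-> falling changes
    O = 100.0 * inversions / max(len(signs) - 1, 1)
    return {"mean_pitch": float(p.mean()), "T": T, "A": A, "O": O}

# One 2-second interval sampled every 10 ms (200 pitch frames)
contour = 180 + 15 * np.sin(np.linspace(0, 6 * np.pi, 200)) + np.random.randn(200)
print(interval_markers(contour))
```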
Article
Full-text available
In recent years the interest in automatic depression detection has grown within the medical and scientific-technical communities. Depression is one of the most widespread mental illnesses affecting human life. In this review we present and analyze the latest research devoted to depression detection. Basic notions related to the definition of depression are specified, and the review covers both unimodal and multimodal corpora containing records of informants diagnosed with depression and of control groups of non-depressed people. Theoretical and practical studies presenting automated systems for depression detection are reviewed, including both unimodal and multimodal systems. Part of the reviewed systems address the regression task of predicting the degree of depression severity (non-depressed, mild, moderate and severe), while another part solves a binary classification problem, predicting the presence of depression (whether a person is depressed or not). An original classification of methods for computing informative features for three communicative modalities (audio, video, text) is presented. New methods for depression detection in each modality, and across all modalities combined, are identified. The most popular methods for depression detection in the reviewed studies are neural networks. The survey has shown that the main features of depression are psychomotor retardation, which affects all communicative modalities, and a strong correlation with the affective dimensions of valence, activation and dominance; an inverse correlation between depression and aggression has also been observed. The discovered correlations confirm the interrelation of affective disorders and human emotional states. A trend observed in many of the reviewed papers is that combining modalities improves the results of depression detection systems.
Article
Continuous affect recognition is becoming an increasingly attractive research topic in affective computing. Previous works mainly focused on modelling the temporal dependency within a sensor modality, or on adopting early or late fusion for multi-modal affective state recognition. However, early fusion suffers from the curse of dimensionality, and late fusion ignores the complementarity and redundancy between multiple modal streams. In this paper, we first introduce the transformer-encoder with a self-attention mechanism and propose a Convolutional Neural Network-Transformer Encoder (CNN-TE) framework to model the temporal dependency for single-modal affect recognition. Further, to effectively consider the complementarity and redundancy between multiple streams, we propose a Transformer Encoder with Multi-modal Multi-head Attention (TEMMA) for multi-modal affect recognition. TEMMA allows progressively and simultaneously refining the inter-modality interactions and intra-modality temporal dependency. The learned multi-modal representations are fed to an Inference Sub-network with fully connected layers to estimate the affective state. The proposed framework is trained in an end-to-end manner and demonstrates its effectiveness on the AVEC 2016 and AVEC 2019 datasets. Compared to state-of-the-art models, our approach obtains remarkable improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC), reaching 0.583 for arousal and 0.564 for valence on the AVEC 2019 test set.
Chapter
In this paper, we study an application of the transfer learning approach to the speaker age and gender recognition task. Recently, speech analysis systems that take images of log Mel-spectrograms or MFCCs as input for classification have been gaining popularity. Therefore, we used pretrained models that showed good performance on the ImageNet task, such as AlexNet, VGG-16, ResNet18, ResNet34, ResNet50, as well as the state-of-the-art EfficientNet-B4 from Google. Additionally, we trained 1D CNN and TDNN models for speaker age and gender recognition. We compared the performance of these models in age (4 classes), gender (3 classes) and joint age and gender (7 classes) recognition. Despite the high performance of pretrained models on the ImageNet task, our TDNN models showed better UAR results in all tasks presented in this study: age (UAR = 51.719%), gender (UAR = 81.746%) and joint age and gender (UAR = 48.969%) recognition.
Conference Paper
Full-text available
The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) 'State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition' is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.
Conference Paper
Full-text available
The ninth Audio-Visual Emotion Challenge and workshop AVEC 2019 was held in conjunction with ACM Multimedia'19. This year, the AVEC series addressed major novelties with three distinct tasks: State-of-Mind Sub-challenge (SoMS), Detecting Depression with Artificial Intelligence Sub-challenge (DDS), and Cross-cultural Emotion Sub-challenge (CES). The SoMS was based on a novel dataset (USoM corpus) that includes self-reported mood (10-point Likert scale) after the narrative of personal stories (two positive and two negative). The DDS was based on a large extension of the DAIC-WOZ corpus (c.f. AVEC 2016) that includes new recordings of patients suffering from depression with the virtual agent conducting the interview being, this time, wholly driven by AI, i.e., without any human intervention. The CES was based on the SEWA dataset (c.f. AVEC 2018) that has been extended with the inclusion of new participants in order to investigate how emotion knowledge of Western European cultures (German, Hungarian) can be transferred to the Chinese culture. In this summary, we mainly describe participation and conditions of the AVEC Challenge.
Article
Full-text available
Natural human-computer interaction and audio-visual human behaviour sensing systems that achieve robust performance in the wild are needed more than ever, as digital devices are becoming an increasingly indispensable part of our lives. Accurately annotated real-world data are the crux in devising such systems. However, existing databases usually consider controlled settings, low demographic variability, and a single task. In this paper, we introduce the SEWA database of more than 2000 minutes of audio-visual data of 398 people coming from six cultures, 50% female, and uniformly spanning the age range of 18 to 65 years old. Subjects were recorded in two different contexts: while watching adverts and while discussing adverts in a video chat. The database includes rich annotations of the recordings in terms of facial landmarks, facial action units (FAU), various vocalisations, mirroring, and continuously valued valence, arousal, liking, agreement, and prototypic examples of (dis)liking. This database aims to be an extremely valuable resource for researchers in affective computing and automatic human sensing and is expected to push forward research in human behaviour analysis, including cultural studies. Along with the database, we provide extensive baseline experiments for automatic FAU detection and automatic valence, arousal and (dis)liking intensity estimation.
Chapter
Full-text available
In the wild emotion recognition requires dealing with large variances in input signals, multiple sources of noise that will distract the learners, as well as difficult annotation and ground truth acquisition conditions. In this chapter, we briefly survey the latest developments in multimodal approaches for video-based emotion recognition in the wild, and describe our approach to the problem. For the visual modality, we propose using summarizing functionals of complementary visual descriptors. For the audio modality, we propose a standard computational pipeline for paralinguistics. We combine audio and visual features with least squares regression based classifiers and weighted score level fusion. We report state-of-the-art results on the EmotiW Challenge for “in the wild” facial expression recognition. Our approach scales to other problems, and ranked top in two challenges; the ChaLearn-LAP First Impressions Challenge (ICPR’2016) and ChaLearn-LAP Job Interview Candidate Screening Challenge (CVPR’2017), respectively.
Conference Paper
Full-text available
The Audio/Visual Emotion Challenge and Workshop (AVEC 2018) "Bipolar disorder, and cross-cultural affect recognition'' is the eighth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings, respectively.
Conference Paper
Full-text available
This paper presents a novel framework for speech-based continuous emotion prediction. The proposed model characterises perceived emotion estimation as time-invariant responses to salient events. Arousal and valence variation over time is then modelled as the output of a parallel array of time-invariant filters, where each filter represents a salient event in this context and its impulse response represents the learned emotion perception response. The proposed model is evaluated by considering vocal affect bursts/non-verbal vocal gestures as salient event candidates. The model is validated on the development set of the AVEC 2018 challenge and achieves the highest accuracy of valence prediction among single-modal methods based on speech or speech transcripts. We tested this model on the cross-cultural settings provided by the AVEC 2018 challenge test set, where it performs reasonably well for an unseen culture and outperforms the speech-based baselines. Further, we explore the inclusion of interlocutor-related cues in the proposed model and decision-level fusion with existing features. Since the proposed model was evaluated solely on laughter and slight-laughter affect bursts, which were nominated as salient by the proposed saliency constraints of the model, the presented results highlight the significance of these gestures in human emotion expression and perception.
Conference Paper
Full-text available
Acoustic emotion recognition is a popular and central research direction in paralinguistic analysis, due to its relation to a wide range of affective states/traits and manifold applications. Developing highly generalizable models still remains a challenge for researchers and engineers because of a multitude of nuisance factors. To ensure generalization, deployed models need to handle spontaneous speech recorded under acoustic conditions different from those of the training set. This requires that the models are tested for cross-corpus robustness. In this work, we first investigate the suitability of Long Short-Term Memory (LSTM) models trained with time- and space-continuously annotated affective primitives for cross-corpus acoustic emotion recognition. We next employ an effective approach that uses the frame-level valence and arousal predictions of LSTM models for utterance-level affect classification, and apply this approach to the ComParE 2018 challenge corpora. The proposed method alone gives motivating results on both the development and test sets of the Self-Assessed Affect Sub-Challenge. On the development set, the cross-corpus prediction based method gives a boost to performance when fused with top components of the baseline system. The results indicate the suitability of the proposed method for both time-continuous and utterance-level cross-corpus acoustic emotion recognition tasks.
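The step of turning frame-level LSTM valence/arousal predictions into utterance-level inputs for affect classification can be sketched as applying statistical functionals to the prediction contours; the particular functional set below is an assumption for illustration, not the paper's exact configuration.

```python
import numpy as np

def utterance_descriptor(arousal_track, valence_track):
    """Summarise frame-level arousal/valence predictions of one utterance with
    a small set of statistical functionals, yielding a fixed-length vector
    that can feed a conventional utterance-level classifier."""
    feats = []
    for track in (np.asarray(arousal_track), np.asarray(valence_track)):
        feats += [track.mean(), track.std(), track.min(), track.max(),
                  np.percentile(track, 25), np.percentile(track, 75)]
    return np.array(feats)

# Frame-level predictions of an LSTM for a 3-second utterance at 25 fps
arousal = np.random.uniform(-1, 1, 75)
valence = np.random.uniform(-1, 1, 75)
print(utterance_descriptor(arousal, valence).shape)  # (12,)
```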
Conference Paper
Full-text available
Automatic emotion recognition is a challenging task which can make great impact on improving natural human computer interactions. In this paper, we present our effort for the Affect Subtask in the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous emotion prediction on three affective dimensions: Arousal, Valence and Likability based on the audiovisual signals. We highlight three aspects of our solutions: 1) we explore and fuse different hand-crafted and deep learned features from all available modalities including acoustic, visual, and textual modalities, and we further consider the interlocutor influence for the acoustic features; 2) we compare the effectiveness of non-temporal model SVR and temporal model LSTM-RNN and show that the LSTM-RNN can not only alleviate the feature engineering efforts such as construction of contextual features and feature delay, but also improve the recognition performance significantly; 3) we apply multi-task learning strategy for collaborative prediction of multiple emotion dimensions with shared representations according to the fact that different emotion dimensions are correlated with each other. Our solutions achieve the CCC of 0.675, 0.756 and 0.509 on arousal, valence, and likability respectively on the challenge testing set, which outperforms the baseline system with corresponding CCC of 0.375, 0.466, and 0.246 on arousal, valence, and likability.
Conference Paper
Full-text available
Major depressive disorder is a common mental disorder that affects almost 7% of the adult U.S. population. The 2017 Audio/Visual Emotion Challenge (AVEC) asks participants to build a model to predict depression levels based on the audio, video, and text of an interview ranging between 7 and 33 minutes. Since averaging features over the entire interview loses most temporal information, discovering, capturing, and preserving useful temporal details for such a long interview is a significant challenge. Therefore, we propose a novel topic-modeling-based approach to perform context-aware analysis of the recording. Our experiments show that the proposed approach outperforms context-unaware methods and the challenge baselines for all metrics.
Conference Paper
Full-text available
The seventh Audio-Visual Emotion Challenge and workshop AVEC 2017 was held in conjunction with ACM Multimedia'17. This year, the AVEC series addresses two distinct sub-challenges: emotion recognition and depression detection. The Affect Sub-Challenge is based on a novel dataset of human-human interactions recorded 'in-the-wild', whereas the Depression Sub-Challenge is based on the same dataset as the one used in AVEC 2016, with human-agent interactions. In this summary, we mainly describe participation and its conditions.
Article
Full-text available
A human being's cognitive system can be simulated by artificially intelligent systems. Machines and robots equipped with cognitive capability can automatically recognize a human's mental state through their gestures and facial expressions. In this paper, an artificial intelligent system is proposed to monitor depression. It can predict the scales of the Beck Depression Inventory (BDI-II) from vocal and visual expressions. Firstly, different visual features are extracted from facial expression images; a deep learning method is utilized to extract key visual features from the facial expression frames. Secondly, spectral Low-level Descriptors (LLDs) and Mel-frequency cepstral coefficients (MFCCs) features are extracted from short audio segments to capture the vocal expressions. Thirdly, a Feature Dynamic History Histogram (FDHH) is proposed to capture the temporal movement in the feature space. Finally, the FDHH and audio features are fused using regression techniques for the prediction of the BDI-II scales. The proposed method has been tested on the public AVEC 2014 dataset, as it is tuned to be more focused on the study of depression. The results outperform all other existing methods on the same dataset.
Article
Full-text available
We collected and Facial Action Coding System (FACS) coded over 2,600 free-response facial and body displays of 22 emotions in China, India, Japan, Korea, and the United States to test 5 hypotheses concerning universals and cultural variants in emotional expression. New techniques enabled us to identify cross-cultural core patterns of expressive behaviors for each of the 22 emotions. We also documented systematic cultural variations of expressive behaviors within each culture that were shaped by the cultural resemblance in values, and identified a gradient of universality for the 22 emotions. Our discussion focused on the science of new expressions and how the evidence from this investigation identifies the extent to which emotional displays vary across cultures.
Conference Paper
Full-text available
The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multi-modal information processing and to bring together the depression and emotion recognition communities, as well as the audio, video and physiological processing communities, to compare the relative merits of the various approaches to depression and emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Conference Paper
Full-text available
The sixth Audio-Visual Emotion Challenge and workshop AVEC 2016 was held in conjunction with ACM Multimedia'16. This year the AVEC series addresses two distinct sub-challenges, multi-modal emotion recognition and audio-visual depression detection. Both sub-challenges are in a way a return to AVEC's past editions: the emotion sub-challenge is based on the same dataset as the one used in AVEC 2015, and depression analysis was previously addressed in AVEC 2013/2014. In this summary, we mainly describe participation and its conditions.
Conference Paper
Full-text available
Depression is a typical mood disorder that causes both mental and even physical problems. People who suffer from depression often behave abnormally in their visual behaviour and voice. In this paper, an audio-visual multimodal depression scale prediction system is proposed. Firstly, features extracted from video and audio are fused at the feature level to represent the audio-visual behaviour. Secondly, a long short-term memory recurrent neural network (LSTM-RNN) is utilized to encode the dynamic temporal information of the abnormal audio-visual behaviour. Thirdly, emotion information is exploited via multi-task learning to boost the performance further. The proposed approach is evaluated on the Audio-Visual Emotion Challenge (AVEC 2014) dataset. Experimental results show that dimensional emotion recognition helps depression scale prediction.
Article
Full-text available
Mental illness is one of the most pressing public health issues of our time. While counseling and psychotherapy can be effective treatments, our knowledge about how to conduct successful counseling conversations has been limited due to lack of large-scale data with labeled outcomes of the conversations. In this paper, we present a large-scale, quantitative study on the discourse of text-message-based counseling conversations. We develop a set of novel computational discourse analysis methods to measure how various linguistic aspects of conversations are correlated with conversation outcomes. Applying techniques such as sequence-based conversation models, language model comparisons, message clustering, and psycholinguistics-inspired word frequency analyses, we discover actionable conversation strategies that are associated with better conversation outcomes.
Article
Full-text available
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence, arousal and dominance. In addition to structured self-report questionnaires, psychologists and psychiatrists use in their evaluation of a patient's level of depression the observation of facial expressions and vocal cues. It is in this context that we present the fourth Audio-Visual Emotion recognition Challenge (AVEC 2014). This edition of the challenge uses a subset of the tasks used in a previous challenge, allowing for more focussed studies. In addition, labels for a third dimension (Dominance) have been added and the number of annotators per clip has been increased to a minimum of three, with most clips annotated by 5. The challenge has two goals logically organised as sub-challenges: the first is to predict the continuous values of the affective dimensions valence, arousal and dominance at each moment in time. The second is to predict the value of a single self-reported severity of depression indicator for each recording in the dataset. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Article
Full-text available
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
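The proposed parameter set (eGeMAPS) can be extracted with the publicly available openSMILE toolkit; the snippet below uses the opensmile Python wrapper and assumes that package's current enum names (e.g. eGeMAPSv02 with 88 functionals), so check the wrapper documentation for the exact identifiers.

```python
# pip install opensmile   (Python wrapper around the openSMILE toolkit)
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # eGeMAPS functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

# One row of acoustic descriptors per input file (returned as a pandas DataFrame)
features = smile.process_file("interview_segment.wav")
print(features.shape)  # expected (1, 88) for the eGeMAPSv02 functionals
```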
Article
Full-text available
This paper is the first review into the automatic analysis of speech for use as an objective predictor of depression and suicidality. Both conditions are major public health concerns; depression has long been recognised as a prominent cause of disability and burden worldwide, whilst suicide is a misunderstood and complex course of death that strongly impacts the quality of life and mental health of the families and communities left behind. Despite this prevalence the diagnosis of depression and assessment of suicide risk, due to their complex clinical characterisations, are difficult tasks, nominally achieved by the categorical assessment of a set of specific symptoms. However many of the key symptoms of either condition, such as altered mood and motivation, are not physical in nature; therefore assigning a categorical score to them introduces a range of subjective biases to the diagnostic procedure. Due to these difficulties, research into finding a set of biological, physiological and behavioural markers to aid clinical assessment is gaining in popularity. This review starts by building the case for speech to be considered a key objective marker for both conditions; reviewing current diagnostic and assessment methods for depression and suicidality including key non-speech biological, physiological and behavioural markers and highlighting the expected cognitive and physiological changes associated with both conditions which affect speech production. We then review the key characteristics; size associated clinical scores and collection paradigm, of active depressed and suicidal speech databases. The main focus of this paper is on how common paralinguistic speech characteristics are affected by depression and suicidality and the application of this information in classification and prediction systems. The paper concludes with an in-depth discussion on the key challenges – improving the generalisability through greater research collaboration and increased standardisation of data collection, and the mitigating unwanted sources of variability – that will shape the future research directions of this rapidly growing field of speech processing research.
Article
Full-text available
In this paper, we propose a novel neural network model called RNN Encoder--Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder--Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
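A minimal sketch of the RNN Encoder-Decoder idea, i.e. encoding a source sequence into a fixed-length vector and decoding a target sequence conditioned on it; this toy GRU implementation with teacher forcing is not the paper's original model or hyper-parameters.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal RNN Encoder-Decoder: encode a source token sequence into a
    fixed-length vector, then decode a target sequence conditioned on it."""
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.src_emb(src))          # h: fixed-length summary (1, B, dim)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)                        # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (8, 15))                   # batch of source sequences
tgt_in = torch.randint(0, 1200, (8, 17))                # teacher-forced target inputs
logits = model(src, tgt_in)                             # (8, 17, 1200)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1200), torch.randint(0, 1200, (8 * 17,)))
print(logits.shape, float(loss))
```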
Article
It is clear that the learning speed of feedforward neural networks is in general far slower than required, and this has been a major bottleneck in their applications for past decades. Two key reasons behind this may be: (1) slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by using such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden-layer feedforward neural networks (SLFNs), which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. The experimental results based on a few artificial and real benchmark function approximation and classification problems, including very large complex applications, show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.
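The core of the ELM recipe, i.e. random hidden-layer weights with output weights obtained in closed form via a pseudo-inverse, can be sketched in a few lines of NumPy; the hidden-layer size and the tanh activation below are arbitrary choices for illustration.

```python
import numpy as np

class ELM:
    """Single-hidden-layer network with random hidden weights; only the output
    weights are solved analytically by least squares, as described above."""
    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)      # random nonlinear projection

    def fit(self, X, Y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ Y        # closed-form output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy regression: y = sum of inputs plus noise
X = np.random.randn(500, 10)
y = X.sum(axis=1) + 0.1 * np.random.randn(500)
model = ELM(n_hidden=100).fit(X, y)
print(np.corrcoef(model.predict(X), y)[0, 1])    # close to 1 on training data
```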
Conference Paper
Many people experience a traumatic event during their lifetime. In some extraordinary situations, such as natural disasters, war, massacres, terrorism or mass migration, the traumatic event is shared by a community and the effects go beyond those directly affected. Today, thanks to recorded interviews and testimonials, many archives and collections exist that are open to researchers of trauma studies, holocaust studies, and historians, among others. These archives act as vital testimonials for oral history, politics and human rights. As such, they are usually either transcribed or meticulously indexed. In this project, we look at the nonverbal signals emitted by victims of various traumatic events and seek to render these into novel representations that are capable of representing the trauma without the explicit (and often highly politicized) content. In particular, we propose to detect breathing and silence patterns during the speech of trauma patients for visualization and sonification. We hope to glean cultural and contextual differences in the bodily expression of trauma through automatic processing of thousands of hours of testimonials from all over the world.
Conference Paper
This paper presents our efforts for the Cross-cultural Emotion Sub-challenge in the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the level of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual and textual modalities. The state-of-the-art regressor for continuous emotion recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. We augment the training data by replacing the original training samples with shorter overlapping samples extracted from them, thus multiplying the number of training samples, which is also beneficial for training the temporal emotion model with the LSTM-RNN. In addition, two strategies are explored to decrease the interlocutor influence and improve the performance. We also compare the performance of feature-level fusion and decision-level fusion. The experimental results show the efficiency of the proposed method, and competitive results are obtained.
Article
Recent evidence in mental health assessment has demonstrated that facial appearance can be highly indicative of depressive disorder. While previous methods based on facial analysis promise to advance clinical diagnosis of depressive disorder in a more efficient and objective manner, challenges in the visual representation of complex depression patterns prevent widespread practice of automated depression diagnosis. In this paper, we present a deep regression network termed DepressNet to learn a depression representation with visual explanation. Specifically, a deep convolutional neural network equipped with a global average pooling layer is first trained with facial depression data, which allows for identifying the salient regions of an input image in terms of its severity score, based on the generated depression activation map (DAM). We then propose a multi-region DepressNet, with which multiple local deep regression models for different face regions are jointly learned and their responses are fused to improve the overall recognition performance. We evaluate our method on two benchmark datasets, and the results show that our method significantly boosts the state-of-the-art performance of visual-based depression recognition. Most importantly, the DAM induced by our learned deep model may help reveal visual depression patterns on faces and understand the insights of automated depression diagnosis.
Conference Paper
Audio/visual and mood disorder cues have recently been explored to assist psychologists and psychiatrists in depression diagnosis. In this paper, we propose a random forest method with a Selected-Text feature, which is based on an analysis of the transcripts at different depression levels. The transcripts cover sleep quality, PTSD/depression diagnostics, treatment history, personal preferences and feelings. Experiments are carried out on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) database [6]. Compared with results obtained with audio-based, video-based or multi-feature cascade decision-level fusion features, the Selected-Text feature based method obtains very promising results on the development and test sets. The root mean square error (RMSE) reaches 4.7 and the mean absolute error (MAE) reaches 3.9, which are better than the baseline results of 7.05/5.66.
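A rough sketch in the spirit of the described transcript-based random forest, with generic TF-IDF features standing in for the paper's hand-selected text content; the toy transcripts and scores below are hypothetical, and real data would come from the DAIC-WOZ corpus.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical (transcript excerpt, depression score) pairs for illustration only
transcripts = [
    "i have trouble sleeping and i lost interest in things",
    "sleep is fine lately and i enjoy seeing my friends",
    "the treatment helped but i still feel down most days",
    "i feel good most of the time and work is going well",
]
scores = np.array([14.0, 2.0, 10.0, 1.0])

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),      # text features from selected topics
    RandomForestRegressor(n_estimators=200, random_state=0),
)
model.fit(transcripts, scores)
print(model.predict(["i cannot sleep and nothing interests me anymore"]))
```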
Article
Despite the widespread use of supervised learning methods for speech emotion recognition, they are severely restricted due to the lack of sufficient amount of labelled speech data for the training. Considering the wide availability of unlabelled speech data, therefore, this paper proposes semi-supervised autoencoders to improve speech emotion recognition. The aim is to reap the benefit from the combination of labelled data and unlabelled data. The proposed model extends a popular unsupervised autoencoder by carefully adjoining a supervised learning objective. We extensively evaluate the proposed model on the INTERSPEECH 2009 Emotion Challenge database and other four public databases in different scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance with a very small number of labelled data on the challenge task and other tasks, and significantly outperforms other alternative methods.
Article
An important research direction in speech technology is robust cross-corpus and cross-language emotion recognition. In this paper, we propose computationally efficient and performance effective feature normalization strategies for the challenging task of cross-corpus acoustic emotion recognition. We particularly deploy a cascaded normalization approach, combining linear speaker level, nonlinear value level and feature vector level normalization to minimize speaker- and corpus-related effects as well as to maximize class separability with linear kernel classifiers. We use extreme learning machine classifiers on five corpora representing five languages from different families, namely Danish, English, German, Russian and Turkish. Using a standard set of suprasegmental features, the proposed normalization strategies show superior performance compared to benchmark normalization approaches commonly used in the literature.
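The cascaded normalization described here, combining speaker-level, value-level and feature-vector-level steps, can be sketched as below; the specific choices (z-normalization per speaker, tanh squashing, L2 length normalization) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def cascaded_normalise(X, speakers):
    """Sketch of a cascaded normalisation: (1) per-speaker z-normalisation,
    (2) nonlinear value-level squashing, (3) unit-length feature vectors."""
    X = np.asarray(X, dtype=float).copy()
    speakers = np.asarray(speakers)
    for spk in np.unique(speakers):                      # 1) speaker-level z-norm
        idx = speakers == spk
        mu, sd = X[idx].mean(axis=0), X[idx].std(axis=0) + 1e-8
        X[idx] = (X[idx] - mu) / sd
    X = np.tanh(X)                                       # 2) value-level nonlinearity
    norms = np.linalg.norm(X, axis=1, keepdims=True) + 1e-8
    return X / norms                                     # 3) feature-vector L2 norm

features = np.random.randn(6, 20)                        # 6 utterances, 20 features
speakers = np.array(["a", "a", "b", "b", "c", "c"])
print(cascaded_normalise(features, speakers).shape)      # (6, 20)
```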
Article
As a severe psychiatric disorder, depression is a state of low mood and aversion to activity that prevents a person from functioning normally in both work and daily life. The study of automated mental health assessment has received increasing attention in recent years. In this paper, we study the problem of automatic diagnosis of depression. A new approach to predict Beck Depression Inventory II (BDI-II) values from video data is proposed based on deep networks. The proposed framework is designed in a two-stream manner, aiming to capture both facial appearance and dynamics. Further, we employ joint tuning layers that implicitly integrate the appearance and dynamic information. Experiments are conducted on two depression databases, AVEC2013 and AVEC2014. The experimental results show that our proposed approach significantly improves depression prediction performance compared to other visual-based approaches.
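An illustrative two-stream layout with shared joint tuning layers might look like the sketch below; the backbones, the 10-channel optical-flow stack, and the layer sizes are assumptions rather than the paper's configuration.

# Illustrative two-stream sketch: one stream for facial appearance (a frame),
# one for dynamics (an optical-flow stack), joined by "joint tuning" layers
# that regress the BDI-II score.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamDepression(nn.Module):
    def __init__(self):
        super().__init__()
        self.appearance = models.resnet18(weights=None)
        self.appearance.fc = nn.Identity()
        self.dynamics = models.resnet18(weights=None)
        self.dynamics.conv1 = nn.Conv2d(10, 64, 7, 2, 3, bias=False)  # flow-stack input (assumed 10 channels)
        self.dynamics.fc = nn.Identity()
        self.joint = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, frame, flow):
        f = torch.cat([self.appearance(frame), self.dynamics(flow)], dim=1)
        return self.joint(f)   # predicted BDI-II score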
Article
In this letter, a novel cross-corpus speech emotion recognition (SER) method using a domain-adaptive least-squares regression (DaLSR) model is proposed. In this method, an additional unlabeled data set from the target speech corpus serves as auxiliary data and is combined with the labeled training data set from the source speech corpus to jointly train the DaLSR model. In contrast to traditional least-squares regression (LSR), the major novelty of DaLSR is that it is able to handle the mismatch between source and target speech corpora. Hence, the proposed DaLSR method is well suited to the cross-corpus SER problem. To evaluate its performance, we conduct extensive experiments on three emotional speech corpora and compare the results with several state-of-the-art transfer learning methods widely used for cross-corpus SER. The experimental results show that the proposed method achieves better recognition accuracies than the state-of-the-art methods.
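One hedged reading of a domain-adaptive least-squares objective, shown below, augments ridge-regularized LSR on the source corpus with a penalty that aligns the projected source and target feature means; this mean-alignment term is an illustrative simplification, not the exact DaLSR formulation from the letter.

# Sketch of domain-adaptive least-squares regression with a mean-alignment
# penalty (illustrative simplification).
import numpy as np

def dalsr_fit(Xs, Ys, Xt, lam=1.0, mu=1.0):
    # Xs: (Ns, D) labeled source features, Ys: (Ns, C) one-hot labels
    # Xt: (Nt, D) unlabeled target features
    d = (Xs.mean(axis=0) - Xt.mean(axis=0)).reshape(-1, 1)       # (D, 1) mean gap
    A = Xs.T @ Xs + lam * np.eye(Xs.shape[1]) + mu * (d @ d.T)   # ridge + mean alignment
    W = np.linalg.solve(A, Xs.T @ Ys)                            # (D, C) closed-form solution
    return W

def dalsr_predict(W, X):
    return np.argmax(X @ W, axis=1)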
Article
Affect detection is an important pattern recognition problem that has inspired researchers from several areas. The field is in need of a systematic review due to the recent influx of Multimodal (MM) affect detection systems that differ in several respects and sometimes yield incompatible results. This article provides such a survey via a quantitative review and meta-analysis of 90 peer-reviewed MM systems. The review indicated that the state of the art mainly consists of person-dependent models (62.2% of systems) that fuse audio and visual (55.6%) information to detect acted (52.2%) expressions of basic emotions and simple dimensions of arousal and valence (64.5%) with feature- (38.9%) and decision-level (35.6%) fusion techniques. However, there were also person-independent systems that considered additional modalities to detect nonbasic emotions and complex dimensions using model-level fusion techniques. The meta-analysis revealed that MM systems were consistently (85% of systems) more accurate than their best unimodal counterparts, with an average improvement of 9.83% (median of 6.60%). However, improvements were three times lower when systems were trained on natural (4.59%) versus acted data (12.7%). Importantly, MM accuracy could be accurately predicted (cross-validated R2 of 0.803) from unimodal accuracies and two system-level factors. Theoretical and applied implications and recommendations are discussed.
Article
Objective: We describe the sampling, initial evaluation, and final diagnostic classification of subjects enrolled in a natural history study of Alzheimer's disease (AD). Design: Volunteer cohort study. Setting: Multidisciplinary behavioral neurology research clinic. Patients or Other Participants: Three hundred nineteen individuals were enrolled in the Alzheimer Research Program between March 1983 and March 1988. Of these, 204 were originally classified with AD, 102 were normal elderly control subjects, and 13 were considered special cases. Main Outcome Measures: Final consensus clinical diagnosis, final neuropathologic diagnosis, and death. Results: Of the 204 patients enrolled in the study, re-review after as many as 5 years of follow-up resulted in a final clinical classification of 188 with probable AD. Seven patients were believed to have a significant vascular component to the dementia, three were found to have developed depression, and six were excluded on other clinical grounds. Neuropathologic examination of 50 brains indicated definite AD in 43. After removing these seven misdiagnosed patients, the final group of probable/definite AD totaled 181 individuals. Accuracy of the baseline clinical diagnosis relative to neuropathology was 86%, and when follow-up clinical data were considered, 91.4%. Detailed neuropsychological testing yielded high sensitivity (0.988) and specificity (0.983) to dementia. Analyses of survival time from study entry until death revealed that older patients were significantly more likely to die during follow-up, but neither sex, years of education, nor pattern of cognitive impairment was related to survival. Conclusions: These data provide the descriptive basis for future studies of this cohort. They indicate that longitudinal follow-up of demented cases increases accuracy of diagnosis, and that detailed cognitive testing aids in early classification.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
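The 16- and 19-layer configurations discussed above are available in common libraries; a minimal usage sketch with torchvision (untrained weights, version-dependent API) follows.

# Illustrative instantiation of the 16- and 19-layer configurations.
import torch
import torchvision.models as models

vgg16 = models.vgg16(weights=None)   # 13 convolutional + 3 fully connected weight layers
vgg19 = models.vgg19(weights=None)   # 16 convolutional + 3 fully connected weight layers
x = torch.randn(1, 3, 224, 224)
print(vgg16(x).shape, vgg19(x).shape)  # both: torch.Size([1, 1000])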
Article
With the availability of speech data obtained from different devices and under varied acquisition conditions, we are often faced with scenarios where the intrinsic discrepancy between the training and test data has an adverse impact on affective speech analysis. To address this issue, this letter introduces an Adaptive Denoising Autoencoder based unsupervised domain adaptation method, in which prior knowledge learned from a target set is used to regularize training on a source set. Our goal is to achieve a matched feature-space representation for the target and source sets while ensuring target-domain knowledge transfer. The method has been successfully evaluated using the 2009 INTERSPEECH Emotion Challenge's FAU Aibo Emotion Corpus as the target corpus and two other publicly available speech emotion corpora as sources. The experimental results show that our method significantly improves over the baseline performance and outperforms related feature-domain adaptation methods.
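One way to realize target-set prior knowledge as a regularizer, sketched below, is to pre-train a denoising autoencoder on the target corpus and then penalize the distance between source-trained and target-trained weights; this is a plausible reading for illustration, not the letter's exact objective.

# Sketch: denoising autoencoder trained on source data, regularized towards
# weights pre-trained on the target corpus.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, dim=384, hidden=128):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x, noise_std=0.1):
        x_noisy = x + noise_std * torch.randn_like(x)   # corrupt, then reconstruct the clean input
        return self.dec(torch.relu(self.enc(x_noisy)))

def adapted_loss(model, target_model, x_src, beta=0.1):
    recon = nn.functional.mse_loss(model(x_src), x_src)
    prior = sum(((p - q.detach()) ** 2).sum()
                for p, q in zip(model.parameters(), target_model.parameters()))
    return recon + beta * prior   # beta weights the target-prior term (assumed value)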
Conference Paper
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework, allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries) such as moments, peaks, and regression parameters. Post-processing of the features includes statistical classifiers such as support vector machine models, or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music, and video features such as Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory-model-based loudness, voice quality, local binary patterns, color, and optical flow histograms. In addition, voice activity detection, pitch tracking, and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture that makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.
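A typical invocation of the SMILExtract binary, wrapped in Python for illustration, is shown below; the -C/-I/-O flags follow the toolkit's documented command line, while the configuration and file paths are assumptions for this example.

# Illustrative call to SMILExtract from Python (paths are assumed).
import subprocess

subprocess.run([
    "SMILExtract",
    "-C", "config/IS09_emotion.conf",   # feature-set configuration shipped with openSMILE
    "-I", "speech.wav",                 # input audio file
    "-O", "features.arff",              # output feature file (e.g., for Weka)
], check=True)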