Conference Paper

Vocal and Facial Biomarkers of Depression based on Motor Incoordination and Timing


Abstract

In individuals with major depressive disorder, neurophysiological changes often alter motor control and thus affect the mechanisms controlling speech production and facial expression. These changes are typically associated with psychomotor retardation, a condition marked by slowed neuromotor output that is behaviorally manifested as altered coordination and timing across multiple motor-based properties. Changes in motor outputs can be inferred from vocal acoustics and facial movements as individuals speak. We derive novel multi-scale correlation structure and timing feature sets from audio-based vocal features and video-based facial action units from recordings provided by the 4th International Audio/Video Emotion Challenge (AVEC). The feature sets enable detection of changes in coordination, movement, and timing of vocal and facial gestures that are potentially symptomatic of depression. Combining complementary features in Gaussian mixture model and extreme learning machine classifiers, our multivariate regression scheme predicts Beck depression inventory ratings on the AVEC test set with a root-mean-square error of 8.12 and mean absolute error of 6.31. Future work calls for continued study into detection of neurological disorders based on altered coordination and timing across audio and video modalities.
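Although the full text is not available here, the abstract describes a multi-scale correlation structure computed across channels of vocal and facial time series and summarized by eigenvalue features. The sketch below illustrates one plausible form of that computation on synthetic multichannel data; the channel count, delay set, and summary statistics are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def channel_delay_correlation_eigenspectrum(signals, delays=(1, 3, 7, 15)):
    """Build a channel-delay correlation matrix from multichannel time series
    (e.g., formant tracks or facial action unit intensities) and return its
    eigenvalue spectrum, sorted from largest to smallest.

    signals : array of shape (n_channels, n_samples)
    delays  : frame offsets used to time-embed each channel; these values are
              illustrative, not the paper's settings.
    """
    n_channels, n_samples = signals.shape
    max_delay = max(delays)
    # Stack delayed copies of every channel so that cross-channel, cross-delay
    # couplings show up as off-diagonal correlations.
    embedded = np.vstack([
        signals[ch, max_delay - d : n_samples - d]
        for ch in range(n_channels)
        for d in delays
    ])
    corr = np.corrcoef(embedded)               # square matrix of size n_channels * len(delays)
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # descending eigenvalue spectrum
    return corr, eigvals

# Toy usage: three synthetic "formant tracks" of 500 frames.
rng = np.random.default_rng(0)
tracks = np.cumsum(rng.standard_normal((3, 500)), axis=1)
_, spectrum = channel_delay_correlation_eigenspectrum(tracks)
print(spectrum[:5])
```

In the literature excerpted below, it is the relative size of the low-rank versus high-rank eigenvalues of such matrices that is repeatedly linked to depression severity.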


... Williamson et al. (23) examined changes in motor output in people with depression from vocal acoustics and facial movements. Using the 4th International Audio/Video Emotion Challenge (AVEC) corpus, which consists of a read passage and a free-response speech segment from subjects with varying depression levels according to their self-reported Beck depression inventory assessment, they developed a multimodal analysis pipeline that leverages complementary information in audio and video signals, including structure and timing features, for estimating depression severity. ...
... Using the 4th International Audio/Video Emotion Challenge (AVEC), which consists of a read passage and free-response speech segment from subjects with varying depression levels according to their self-reported Beck depression inventory assessment, they developed a multimodal analysis pipeline that leverages complementary information in audio and video signals including structure and timing features for estimating depression severity. Using the identified features of changes in coordination, movement, and timing of vocal and facial movements, the developed algorithm was able to predict the Beck depression inventory ratings from the AVEC test set with a root-mean-square error of 8.12 and mean absolute error of 6.31 (23). ...
Full-text available
Article
Affective computing (also referred to as artificial emotion intelligence or emotion AI) is the study and development of systems and devices that can recognize, interpret, process, and simulate emotion or other affective phenomena. With the rapid growth in the aging population around the world, affective computing has immense potential to benefit the treatment and care of late-life mood and cognitive disorders. For late-life depression, affective computing ranging from vocal biomarkers to facial expressions to social media behavioral analysis can be used to address inadequacies of current screening and diagnostic approaches, mitigate loneliness and isolation, provide more personalized treatment approaches, and detect risk of suicide. Similarly, for Alzheimer's disease, eye movement analysis, vocal biomarkers, and driving and behavioral data can provide objective biomarkers for early identification and monitoring, allow more comprehensive understanding of daily life and disease fluctuations, and facilitate an understanding of behavioral and psychological symptoms such as agitation. To optimize the utility of affective computing while mitigating potential risks and ensuring responsible development, attention to the ethical development of affective computing applications for late-life mood and cognitive disorders is needed.
... For speech-based depression detection, many approaches have been proposed, and recent years have witnessed a shift from conventional acoustic features [2,[11][12][13] to deep learning [14,15]. Another category of effective features is based on speech articulation, such as vowel space area [16], vocal tract coordination features [17], and speech landmark-based features [18], because such features are less impacted by environmental noise and handset variability than typical prosodic features [18]. This may be because landmarks are detected based on multiple frequency bands (i.e., of which only some may be affected by noise) and differences in energy from one frame to the next (i.e., so an absolute offset in energy across frames due to noise would have a small effect); however, further experimental work would be needed to confirm this. ...
... Figure 2: Depression severity distributions for training and test partitions per dataset. The PHQ-9 has minimal (0-9), mild (10-14), moderate (15-19), and severe (20-27) ranges, and the BDI-II has minimal (0-13), mild (14-19), moderate (20-28), and severe (29-63) depression severity score ranges. As indicated by the figure key, the lighter the shade the more severe the PHQ-9 depression score. ...
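For reference, the severity bands quoted in the excerpt above map directly onto score cut-offs; the small helper below encodes them (a hypothetical convenience function, not code from any cited work).

```python
def severity_band(score: int, scale: str = "PHQ-9") -> str:
    """Map a questionnaire total score to its severity band, using the
    PHQ-9 and BDI-II cut-offs quoted in the excerpt above."""
    bands = {
        "PHQ-9":  [(9, "minimal"), (14, "mild"), (19, "moderate"), (27, "severe")],
        "BDI-II": [(13, "minimal"), (19, "mild"), (28, "moderate"), (63, "severe")],
    }
    for upper, label in bands[scale]:
        if score <= upper:
            return label
    raise ValueError(f"score {score} is outside the {scale} range")

print(severity_band(12, "PHQ-9"))   # mild
print(severity_band(31, "BDI-II"))  # severe
```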
Conference Paper
Detecting depression from the voice in naturalistic environments is challenging, particularly for short-duration audio recordings. This enhances the need to interpret and make optimal use of elicited speech. The rapid consonant-vowel syllable combination ‘pataka’ has frequently been selected as a clinical motor-speech task. However, there is significant variability in elicited recordings, which remains to be investigated. In this multi-corpus study of over 25,000 ‘pataka’ utterances, it was discovered that speech landmark-based features were sensitive to the number of ‘pataka’ utterances per recording. This landmark feature sensitivity was newly exploited to automatically estimate ‘pataka’ count and rate, achieving root mean square errors nearly three times lower than chance level. Leveraging count-rate knowledge of the elicited speech for depression detection, results show that the estimated ‘pataka’ number and rate are important for normalizing evaluative ‘pataka’ speech data. Count- and/or rate-normalized ‘pataka’ models produced relative reductions in depression classification error of up to 26% compared with non-normalized models.
... The formants are directly related to the articulation of the vocal tract and can therefore be interpreted directly. Formant tracks have also been used successfully as features before, for example when modelling emotion and depression [47]. This method was adopted in our previous work for cognitive workload monitoring in [16]. ...
... The formant features were extracted using the Kalman-based auto-regressive moving average smoothing (KARMA) algorithm [48] as in [47]. The main advantage of using KARMA is that the algorithm produces smoother formant tracks than other methods and it provides a sensible interpolation during non-voiced periods. ...
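KARMA couples Kalman filtering with an autoregressive moving-average model of formant dynamics; the snippet below is only a minimal constant-velocity Kalman filter over a single noisy formant track, meant to show why state-space tracking yields smooth contours and sensible interpolation across unvoiced (missing-measurement) frames. The motion model and noise settings are assumptions, not KARMA's.

```python
import numpy as np

def kalman_smooth_formant(measurements, dt=0.01, q=50.0, r=200.0):
    """Causal Kalman filter over a noisy formant track (Hz).
    State = [formant value, formant velocity]; NaN measurements (e.g.,
    unvoiced frames) are bridged by prediction only. q and r are assumed
    process/measurement noise levels, not KARMA's tuned values."""
    F = np.array([[1.0, dt], [0.0, 1.0]])     # constant-velocity state transition
    H = np.array([[1.0, 0.0]])                # we observe the formant value only
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])

    x = np.array([measurements[~np.isnan(measurements)][0], 0.0])
    P = np.eye(2) * 1e4
    out = np.empty_like(measurements)

    for k, z in enumerate(measurements):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update only when a measurement is available
        if not np.isnan(z):
            y = z - H @ x
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + (K @ y).ravel()
            P = (np.eye(2) - K @ H) @ P
        out[k] = x[0]
    return out

# Toy track: a slowly varying F1 contour with noise and an "unvoiced" gap.
t = np.arange(300)
truth = 600 + 80 * np.sin(2 * np.pi * t / 150)
noisy = truth + np.random.default_rng(1).normal(0, 40, t.size)
noisy[120:150] = np.nan
print(kalman_smooth_formant(noisy)[118:124])   # values bridging the gap
```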
... The features used to characterise the cardiovascular system are the well-defined blood pressure measures obtained from a Finometer from Finapress [43,44]. The novel voice features presented in this work are derived from the formant track features developed in previous work [16,47] and characterise the vocal tract shape and change in shape. The reason why the formant track features were chosen in contrast to voice source features (e.g. ...
Full-text available
Article
Monitoring cognitive workload has the potential to improve both the performance and fidelity of human decision making. However, previous efforts towards discriminating further than binary levels (e.g., low/high or neutral/high) in cognitive workload classification have not been successful. This lack of sensitivity in cognitive workload measurements might be due to individual differences as well as inadequate methodology used to analyse the measured signal. In this paper, a method that combines the speech signal with cardiovascular measurements for screen and heartbeat classification is introduced. For validation, speech and cardiovascular signals from 97 university participants and 20 airline pilot participants were collected while cognitive stimuli of varying difficulty level were induced with the Stroop colour/word test. For the trinary classification scheme (low, medium, high cognitive workload), the prominent result using classifiers trained on each participant achieved 15.17 ± 0.79% and 17.38 ± 1.85% average misclassification rates, indicating good discrimination at three levels of cognitive workload. Combining cardiovascular and speech measures synchronized to each heartbeat and consolidated with short-term dynamic measures might therefore provide enhanced sensitivity in cognitive workload monitoring. The results show that the influence of individual differences is a limiting factor for a generic classification and highlight the need for research to focus on methods that incorporate individual differences to achieve even better results. This method can potentially be used to measure and monitor workload in real time in operational environments.
... Features derived from short speech samples based on voice characteristics (21,22) were collected via smartphones. The details of the assessment schedule can be found in the Supplementary Material. ...
... Voice analytics has shown promise for detecting symptoms of depression (21,22). Study participants entered sound data through the smartphone application twice per week. ...
... kg/m². The mean (range) of HAM-D total score at screening was 20.4 (17-25) for the patients, and it was 1.2 (0-3) for the healthy controls. The mean (range) of C-SSRS total score at screening was 32.1 (0-77) for the patients, and it was 1.1 (0-8) for the healthy controls. ...
Full-text available
Article
Background: Digital technologies have the potential to provide objective and precise tools to detect depression-related symptoms. Deployment of digital technologies in clinical research can enable collection of large volumes of clinically relevant data that may not be captured using conventional psychometric questionnaires and patient-reported outcomes. Rigorous methodology studies to develop novel digital endpoints in depression are warranted. Objective: We conducted an exploratory, cross-sectional study to evaluate several digital technologies in subjects with major depressive disorder (MDD) and persistent depressive disorder (PDD), and healthy controls. The study aimed at assessing utility and accuracy of the digital technologies as potential diagnostic tools for unipolar depression, as well as correlating digital biomarkers to clinically validated psychometric questionnaires in depression. Methods: A cross-sectional, non-interventional study of 20 participants with unipolar depression (MDD and PDD/dysthymia) and 20 healthy controls was conducted at the Centre for Human Drug Research (CHDR), the Netherlands. Eligible participants attended three in-clinic visits (days 1, 7, and 14), at which they underwent a series of assessments, including conventional clinical psychometric questionnaires and digital technologies. Between the visits, there was at-home collection of data through mobile applications. In all, seven digital technologies were evaluated in this study. Three technologies were administered via mobile applications: an interactive tool for the self-assessment of mood, and a cognitive test; a passive behavioral monitor to assess social interactions and global mobility; and a platform to perform voice recordings and obtain vocal biomarkers. Four technologies were evaluated in the clinic: a neuropsychological test battery; an eye motor tracking system; a standard high-density electroencephalogram (EEG)-based technology to analyze the brain network activity during cognitive testing; and a task quantifying bias in emotion perception. Results: Our data analysis was organized by technology – to better understand individual features of various technologies. In many cases, we obtained simple, parsimonious models that have reasonably high diagnostic accuracy and potential to predict standard clinical outcome in depression. Conclusion: This study generated many useful insights for future methodology studies of digital technologies and proof-of-concept clinical trials in depression and possibly other indications.
... Beyond security, emotion classification is important in computer vision applications used for video indexing and retrieval, robot motion, entertainment, monitoring of smart home systems [16,17], and neuro-physiological and psychological studies. For instance, emotion classification is important in monitoring the psychological and neuro-physiological condition of individuals with personality trait disorders [18,19], and to monitor and identify people with autism spectral disorders [20]. ...
Full-text available
Article
Emotion classification is a research area in which there has been very intensive literature production concerning natural language processing, multimedia data, semantic knowledge discovery, social network mining, and text and multimedia data mining. This paper addresses the issue of emotion classification and proposes a method for classifying the emotions expressed in multimodal data extracted from videos. The proposed method models multimodal data as a sequence of features extracted from facial expressions, speech, gestures, and text, using a linguistic approach. Each sequence of multimodal data is correctly associated with the emotion by a method that models each emotion using a hidden Markov model. The trained model is evaluated on samples of multimodal sentences associated with seven basic emotions. The experimental results demonstrate a good classification rate for emotions.
... Region Units (RUs) are used to represent the regions of the face that enclose AUs. Various works have adopted AUs to estimate the severity of depression and obtained promising performance [93,94,95,107,108,109,110,111,112,113,114,115,116,117,118,119]. Later, it was found [102] that head pose and movement also contained discriminative patterns for assessing the severity of depression [93,94,95,17,99,89,110,96,97,101,103,104,120,121,122,123]. ...
... Regarding modality, speech and video samples [14,94,113,17,96,97,99,100,101,104,121,122,123,149,150,16,151,152,153], physiological signals [100,154,154,155,156,157,158,159,160], and text [100,108] have been employed to improve the performance of depression assessment. However, the modalities available were determined by the devices used in the data collection stage. ...
Full-text available
Preprint
With the acceleration of the pace of work and life, people have to face more and more pressure, which increases the possibility of suffering from depression. However, many patients may fail to get a timely diagnosis due to the serious imbalance in the doctor-patient ratio in the world. Promisingly, physiological and psychological studies have indicated some differences in speech and facial expression between patients with depression and healthy individuals. Consequently, to improve current medical care, many scholars have used deep learning to extract a representation of depression cues in audio and video for automatic depression detection. To sort out and summarize these works, this review introduces the databases and describes objective markers for automatic depression estimation (ADE). Furthermore, we review the deep learning methods for automatic depression detection to extract the representation of depression from audio and video. Finally, this paper discusses challenges and promising directions related to the automatic diagnosis of depression using deep learning technologies.
... Acoustic Representations for Depression Detection: Depression is shown to degrade cognitive planning and psycho-motor functioning thus affecting the human speech production mechanism [10]. These effects manifest as variations in the speech voice quality [20] and several features were proposed to capture these variations in speech for depression detection. Spectral features such as formants, mel-frequency cepstral coefficients (MFCCs), prosodic features such as F0, jitter, shimmer and glottal features were initially considered for depression detection [9,21,22]. ...
... Spectral, prosodic and other voice quality related features extracted using OpenSMILE [23] and COVAREP [24] toolkits were also used for depression analysis [12,17]. Further, features developed based on speech articulation such as vocal tract coordination features were considered for depression detection [20,18,25]. Recently, sentiment and emotion embeddings, representing non-verbal characteristics of speech, are considered for depression severity estimation [26]. ...
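As a concrete illustration of the conventional spectral and prosodic feature families mentioned above, the hedged sketch below computes MFCCs and an F0 contour with librosa and pools them into per-recording statistics. The file name, frame settings, and pooling are placeholders, and this is not the pipeline of any specific cited system.

```python
import librosa
import numpy as np

# Assumed input: any mono speech recording saved as speech.wav.
y, sr = librosa.load("speech.wav", sr=16000)

# Spectral features: 13 MFCCs per 25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# Prosodic feature: frame-level F0 via the YIN estimator.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                 frame_length=2 * int(0.025 * sr), hop_length=int(0.010 * sr))

# Simple per-recording summary statistics, as used by many baseline systems.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [f0.mean(), f0.std()]])
print(features.shape)   # (28,) summary vector for this recording
```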
Full-text available
Preprint
Depression detection from speech has attracted a lot of attention in recent years. However, the significance of speaker-specific information in depression detection has not yet been explored. In this work, we analyze the significance of speaker embeddings for the task of depression detection from speech. Experimental results show that the speaker embeddings provide important cues to achieve state-of-the-art performance in depression detection. We also show that combining conventional OpenSMILE and COVAREP features, which carry complementary information, with speaker embeddings further improves the depression detection performance. The significance of temporal context in the training of deep learning models for depression detection is also analyzed in this paper.
... Another set of features that have been proposed are formants, which are resonant frequencies resulting from the shape of the vocal tract. Statistics of the first three formants and their coordination have also been demonstrated as a method of detecting depression severity [48,68,144,211]. Additionally, estimates of the glottal source signal have been used as biomarkers successfully [144,182]. ...
... Deviations from the baseline vowel pronunciations have also shown to correlate with depression in vowel space analysis [181]. Several methods have shown that the frequency domain may carry important information with power spectral density analysis and MFCC statistics and their coordination as features [12,48,49,68,211]. The use of GMMs, SVMs, ANNs, and Hierarchical Fuzzy Signatures (HFS) have been used to classify depressed mood [11]. ...
Thesis
Bipolar disorder is a chronic mental illness, affecting 4% of Americans, that is characterized by periodic mood changes ranging from severe depression to extreme compulsive highs. Both mania and depression profoundly impact the behavior of affected individuals, resulting in potentially devastating personal and social consequences. Bipolar disorder is managed clinically with regular interactions with care providers, who assess mood, energy levels, and the form and content of speech. Recent work has proposed smartphones for automatically monitoring mood using speech. Much of the early work in speech-centered mood detection has been done in the laboratory or clinic and is not reflective of the variability found in real-world conversations and conditions. Outside of these settings, automatic mood detection is hard, as the recordings include environmental noise, differences in recording devices, and variations in subject speaking patterns. Without addressing these issues, it is difficult to move towards a passive mobile health system. My research works to address this variability present in speech so that such a system can be created, allowing for interventions to mitigate the life-changing effects of mood transitions. However detecting mood directly from speech is difficult, as mood varies over the course of days or weeks, while speech fluctuates rapidly. To address this, my thesis explores how an intermediate step can be used to aid in this prediction. For example, one of the major symptoms of bipolar disorder is emotion dysregulation - changes in the way emotions are perceived and a lack of inhibition in their expression. My work has supported the relationship between automatically extracted emotion estimates and mood. Because of this, my thesis explores how to mitigate the variability found when detecting emotion from speech. The remainder of my thesis is focused on employing these emotion-based features, as well as features based on language content, to real-world applications. This dissertation is divided into the following parts: Part I: I address the direct classification of mood from speech. This is accomplished by addressing variability due to recording device using preprocessing and multi-task learning. I then show how both subject-specific and population-general information can be combined to significantly improve mood detection. Part II: I explore the automatic detection of emotion from speech and how to control for the other factors of variability present in the speech signal. I use progressive networks as a method to augment emotion with other paralinguistic data including gender and speaker, as well as other datasets. Additionally, I introduce a novel domain generalization method for cross-corpus detection. Part III: I demonstrate real-world applications of speech mood monitoring using everyday conversations. I show how the previously introduced generalized model can predict emotion from the speech of individuals with suicidal ideation, demonstrating its effectiveness across domains. Furthermore, I use these predictions to distinguish individuals with suicidal thoughts from healthy controls. Lastly, I introduce a novel framework for intervention detection in individuals with bipolar disorder. I then create a natural speech mood monitoring system based on features derived from measures of emotion and automatic speech recognition (ASR) transcripts and show effective intervention detection. 
I conclude this dissertation with the following future directions: (1) Extending my emotion generalization system to include multiple modalities and factors of variability; (2) Expanding natural speech mood monitoring by including more devices, exploring other data besides speech, and investigating mood rating causality.
... Vocal Tract Coordination (VTC) features are strong candidates for this task by quantifying articulatory coordination using correlations across low-level acoustic features. Although previous research has used these features to capture psychomotor articulation patterns in depression [12,13], these features have not been explored for capturing the broad range of motor symptoms present in HD speech. ...
... The second format applies eigendecomposition across the channels of FVTC, which we call eigen-VTC (EVTC). EVTC was originally used in previous works [22,12] and results in a N×D feature vector. Our feature extraction code will be made publicly available. ...
Preprint
Huntington Disease (HD) is a progressive disorder which often manifests in motor impairment. Motor severity (captured via motor score) is a key component in assessing overall HD severity. However, motor score evaluation involves in-clinic visits with a trained medical professional, which are expensive and not always accessible. Speech analysis provides an attractive avenue for tracking HD severity because speech is easy to collect remotely and provides insight into motor changes. HD speech is typically characterized as having irregular articulation. With this in mind, acoustic features that can capture vocal tract movement and articulatory coordination are particularly promising for characterizing motor symptom progression in HD. In this paper, we present an experiment that uses Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor score. When using an elastic-net regression model, we find that VTC features significantly outperform other acoustic features across varied-length audio segments, which highlights the effectiveness of these features for both short- and long-form reading tasks. Lastly, we analyze the F-value scores of VTC features to visualize which channels are most related to motor score. This work enables future research efforts to consider VTC features for acoustic analyses which target HD motor symptomatology tracking.
... Therefore, in the early stage of studies on SDR, the main work is to learn acoustic features related to depression and explore feature sets for better performance [22,23]. In the meantime, traditional machine learning algorithms are employed in SDR, such as Support Vector Machine (SVM) [24][25][26][27], Hidden Markov Model [28], Gaussian Mixture Model (GMM) [27,29,30], K-means [31,32], Boosting Logistic Regression [33][34][35], multi-layer perceptron [30,35], etc. ...
... As a clustering algorithm, it is employed in early research on SDR [27,30,59,69]. Moreover, GMM-based regression methods such as Gaussian Staircase Regression (GSR) have been proposed, where each GMM consists of an ensemble of Gaussian classifiers [29,54,61,62]. Specifically, speech features are first mapped to different partitions of the clinical depression score, and the mapping results are then used as the basis of the regression analysis. ...
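The Gaussian Staircase Regression idea in the excerpt (an ensemble of GMM classifiers, each trained on a different partition of the clinical score range, whose outputs feed a regression stage) might look roughly like the following sketch; the thresholds, mixture sizes, and final linear mapping are assumptions for illustration, not the published configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

def _staircase_scores(ensembles, X):
    # One log-likelihood ratio ("high" vs "low" score partition) per threshold.
    return np.column_stack([high.score_samples(X) - low.score_samples(X)
                            for low, high in ensembles])

def fit_gaussian_staircase(X, y, thresholds=(5, 10, 15, 20), n_components=2):
    """Train one GMM pair per score threshold, then regress the resulting
    log-likelihood-ratio vector onto the clinical score."""
    ensembles = []
    for t in thresholds:
        low = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=0).fit(X[y <= t])
        high = GaussianMixture(n_components=n_components, covariance_type="diag",
                               random_state=0).fit(X[y > t])
        ensembles.append((low, high))
    reg = LinearRegression().fit(_staircase_scores(ensembles, X), y)
    return ensembles, reg

def predict_gaussian_staircase(ensembles, reg, X):
    return reg.predict(_staircase_scores(ensembles, X))

# Toy usage with synthetic features loosely tied to a synthetic score.
rng = np.random.default_rng(0)
scores = rng.integers(0, 25, size=300)
X = rng.standard_normal((300, 6)) + scores[:, None] * 0.05
ens, reg = fit_gaussian_staircase(X, scores)
pred = predict_gaussian_staircase(ens, reg, X)
print("in-sample RMSE of the sketch:", np.sqrt(np.mean((pred - scores) ** 2)))
```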
Full-text available
Article
Depression has become one of the most common mental illnesses in the world. For better prediction and diagnosis, methods of automatic depression recognition based on speech signals are constantly proposed and updated, with a transition from early traditional methods based on hand-crafted features to the application of deep learning architectures. This paper systematically and precisely outlines the most prominent and up-to-date research on automatic depression recognition by intelligent speech signal processing. Furthermore, methods for acoustic feature extraction, algorithms for classification and regression, as well as end-to-end deep models are investigated and analysed. Finally, general trends are summarised and key unresolved issues are identified to be considered in future studies of automatic speech depression recognition.
... Time-delay embedded correlation (TDEC) analysis has shown promising results in assessing neuromotor coordination in Major Depressive Disorder (MDD), and the eigenspectra derived from the correlation matrices have been used effectively for classification of MDD subjects from healthy [17,22,24]. Recently, new multi-scale full vocal tract coordination (FVTC) features generated with a dilated CNN have shown further improvement in classification for selected datasets of MDD subjects [9]. ...
... The averaged eigenspectra and difference plots in figure 1 show that the low-rank eigenvalues are smaller for schizophrenic subjects relative to the healthy controls, and this trend is reversed towards the high-rank eigenvalues. A key observation associated with depression severity [6,22,24] is that low-rank eigenvalues are larger for MDD subjects relative to healthy controls, whereas they are smaller for high-rank eigenvalues. The magnitude of high-rank eigenvalues indicates the dimensionality of the time-delay embedded feature space. ...
Full-text available
Preprint
This study investigates the speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using two distinct channel-delay correlation methods. We show that schizophrenic subjects with strong positive symptoms who are markedly ill exhibit more complex articulatory coordination patterns in facial and speech gestures than those observed in healthy subjects. This observation is in contrast to what previous studies have shown in Major Depressive Disorder (MDD), where subjects with MDD show a simpler coordination pattern with respect to healthy controls or subjects in remission. This distinction in speech coordination pattern is used to train a multimodal convolutional neural network (CNN) which uses video and audio data during speech to distinguish schizophrenic patients with strong positive symptoms from healthy subjects. We also show that the vocal tract variables (TVs) which correspond to place of articulation and glottal source outperform the Mel-frequency Cepstral Coefficients (MFCCs) when fused with Facial Action Units (FAUs) in the proposed multimodal network. For the clinical dataset we collected, our best performing multimodal network improves the mean F1 score for detecting schizophrenia by around 18% with respect to the full vocal tract coordination (FVTC) baseline method implemented by fusing FAUs and MFCCs.
... Acoustic Representations for Depression Detection: Depression is shown to degrade cognitive planning and psychomotor functioning, thus affecting the human speech production mechanism (Cummins et al. 2015). These effects manifest as variations in the speech voice quality (Williamson et al. 2014) and several features have been proposed to capture these variations in speech for depression detection. Spectral features such as formants and mel-frequency cepstral coefficients (MFCCs), prosodic features such as F 0 , jitter, shimmer and glottal features were initially used for depression detection (Low et al. 2010;Cummins et al. 2011;Simantiraki et al. 2017). ...
... Spectral, prosodic and other voice quality related features extracted using OpenSMILE (Eyben, Wöllmer, and Schuller 2010) and COVAREP (Degottex et al. 2014) toolkits were also used for depression analysis (Valstar et al. 2016;Al Hanai, Ghassemi, and Glass 2018). Further, features developed based on speech articulation such as vocal tract coordination features were analyzed for depression detection (Williamson et al. 2014;Huang, Epps, and Joachim 2020;Seneviratne et al. 2020). Recently, sentiment and emotion embeddings, representing non-verbal characteristics of speech, were used for depression severity estimation (Dumpala et al. 2021a). ...
Full-text available
Conference Paper
Depression detection from speech has attracted a lot of attention in recent years. However, the significance of speaker-specific information in depression detection has not yet been explored. In this work, we introduce—and analyze the significance of—speaker embeddings in a temporal context for the task of depression detection from speech. Experimental results show that the speaker embeddings provide important cues to achieve state-of-the-art performance in depression detection. We also show that combining conventional OpenSMILE and COVAREP features, which carry complementary information, with speaker embeddings further improves the depression detection performance. The significance of the temporal context in the training of deep learning models for depression detection is also analyzed in this paper.
... More recently, case-control studies of adult populations have used audio-based feature extraction techniques to classify cognitive state following head injury [10]- [13]. Related studies have used acoustic features to examine similar types of speech degradation due to ALS [7], [8], [14], depression [15]- [18], Parkinson's disease [7], cognitive load [19] and diagnosed dysarthria and dysphonia [7], [20]- [22]. ...
... Changes over time in the coupling strengths among the formant tracks cause changes in the eigenvalue spectra of the resulting correlation matrices; weakly coupled formant-tracks may indicate more complex interactions between the articulators. Williamson et al. first applied this multivariate correlation approach to epileptic seizure prediction from multichannel EEG [33] and subsequently to the tracking and prediction of major depressive disorder from audio-based vocal signals [15], [34]. ...
Full-text available
Preprint
Recommendations for common outcome measures following pediatric traumatic brain injury (TBI) support the integration of instrumental measurements alongside perceptual assessment in recovery and treatment plans. A comprehensive set of sensitive, robust and non-invasive measurements is therefore essential in assessing variations in speech characteristics over time following pediatric TBI. In this article, we study the changes in the acoustic speech patterns of a pediatric cohort of ten subjects diagnosed with severe TBI. We extract a diverse set of both well-known and novel acoustic features from child speech recorded throughout the year after the child produced intelligible words. These features are analyzed individually and by speech subsystem, within-subject and across the cohort. As a group, older children exhibit highly significant (p<0.01) increases in pitch variation and phoneme diversity, shortened pause length, and steadying articulation rate variability. Younger children exhibit similar steadied rate variability alongside an increase in formant-based articulation complexity. Correlation analysis of the feature set with age and comparisons to normative developmental data confirm that age at injury plays a significant role in framing the recovery trajectory. Nearly all speech features significantly change (p<0.05) for the cohort as a whole, confirming that acoustic measures supplementing perceptual assessment are needed to identify efficacious treatment targets for speech therapy following TBI.
... In voice assessment, these parameters are often used because they are measures of perturbation and noise related to the production of sound at the glottal source. 47,48 Regarding the CPPS and the spectral tilt, reduced values were found, which was also a discriminant factor between depressed patients in the CAG and COG groups. In the literature, CPPS detected changes in laryngeal coordination due to its strong relationship with the perception of roughness and breathiness, which suggests that the change in voice production may be among the symptoms of depression. ...
Article
Objective: To analyze whether voice acoustic parameters are discriminant and predictive in patients with and without depression. Methods: Observational case-control study. The following instruments were administered to the participants: Self-Reporting Questionnaire (SRQ-20), Beck Depression Inventory-Second Edition (BDI-II), Voice Symptom Scale (VoiSS) and voice collection for subsequent extraction of the following acoustic parameters: mean, mode and standard deviation (SD) of the fundamental frequency (F0); jitter; shimmer; glottal to noise excitation ratio (GNE); cepstral peak prominence-smoothed (CPPS); and spectral tilt. A total of 144 individuals participated in the study: 54 patients diagnosed with depression (case group) and 90 without a diagnosis of depression (control group). Results: The means of the acoustic parameters showed differences between the groups: F0 (SD), jitter, and shimmer values were high, while values for GNE, CPPS and spectral tilt were lower in the case group than in the control group. There was a significant association between BDI-II and jitter, shimmer, CPPS, and spectral tilt and between CPPS and the class of antidepressants used. The multiple linear regression model showed that jitter and CPPS were predictors of depression, as measured by the BDI-II. Conclusion: Acoustic parameters were able to discriminate between patients with and without depression and were associated with BDI-II scores. The class of antidepressants used was associated with CPPS, and the jitter and CPPS parameters were able to predict the presence of depression, as measured by the BDI-II clinical score.
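The reported model (multiple linear regression with jitter and CPPS predicting BDI-II) can be reproduced in form, though not in numbers, with a short scikit-learn sketch on synthetic data; every coefficient and scale below is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-ins for the acoustic measures named in the abstract.
rng = np.random.default_rng(42)
n = 144                                      # same cohort size as the study; data invented
jitter = rng.normal(1.0, 0.4, n)             # jitter (%), illustrative scale
cpps = rng.normal(12.0, 2.5, n)              # CPPS (dB), illustrative scale
bdi_ii = np.clip(14 + 6.0 * jitter - 1.5 * (cpps - 12.0)
                 + rng.normal(0, 4, n), 0, 63)

X = np.column_stack([jitter, cpps])
model = LinearRegression().fit(X, bdi_ii)
print("coefficients (jitter, CPPS):", model.coef_)
print("in-sample R^2:", r2_score(bdi_ii, model.predict(X)))
```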
... Specifically, in the areas of mental wellbeing and health, speech, text, and visual information have been used for screening different types of disorders, including developmental, cognitive, behavioral, emotional, and psychological. Moreover, these modalities have also been used as assistive technologies to screen for depression [21,23,25,26], schizophrenia [27][28][29], Alzheimer's disease [18,20,30,31], bipolar disorder [32,33], and autism spectrum disorders (ASD) [16,[34][35][36]. ...
Full-text available
Article
This paper explores the automatic prediction of public trust in politicians through the use of speech, text, and visual modalities. It evaluates the effectiveness of each modality individually, and it investigates fusion approaches for integrating information from each modality for prediction using a multimodal setting. A database was created consisting of speech recordings, twitter messages, and images representing fifteen American politicians, and labeling was carried out per a publicly available ranking system. The data were distributed into three trust categories, i.e., the low-trust category, mid-trust category, and high-trust category. First, unimodal prediction using each of the three modalities individually was performed using the database; then, using the outputs of the unimodal predictions, a multimodal prediction was later performed. Unimodal prediction was performed by training three independent logistic regression (LR) classifiers, one each for speech, text, and images. The prediction vectors from the individual modalities were then concatenated before being used to train a multimodal decision-making LR classifier. We report that the best performing modality was speech, which achieved a classification accuracy of 92.81%, followed by the images, achieving an accuracy of 77.96%, whereas the best performing model for text-modality achieved a 72.26% accuracy. With the multimodal approach, the highest classification accuracy of 97.53% was obtained when all three modalities were used for trust prediction. Meanwhile, in a bimodal setup, the best performing combination was that combining the speech and image visual modalities by achieving an accuracy of 95.07%, followed by the speech and text combination, showing an accuracy of 94.40%, whereas the text and images visual modal combination resulted in an accuracy of 83.20%.
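The two-stage fusion described above (independent unimodal logistic-regression classifiers whose prediction vectors are concatenated and passed to a decision-level classifier) can be sketched with scikit-learn as follows; the modality dimensions, class labels, and features are synthetic placeholders, and a rigorous version would generate the stage-one predictions on held-out folds rather than on the training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_classes = 300, 3                     # e.g., low/mid/high trust (placeholder data)
y = rng.integers(0, n_classes, n)
modalities = {                            # synthetic speech, text, and image features
    "speech": rng.standard_normal((n, 40)) + y[:, None] * 0.8,
    "text":   rng.standard_normal((n, 20)) + y[:, None] * 0.4,
    "image":  rng.standard_normal((n, 30)) + y[:, None] * 0.5,
}
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)

# Stage 1: one logistic-regression classifier per modality.
probas_train, probas_test = [], []
for X in modalities.values():
    clf = LogisticRegression(max_iter=1000).fit(X[idx_train], y[idx_train])
    probas_train.append(clf.predict_proba(X[idx_train]))
    probas_test.append(clf.predict_proba(X[idx_test]))

# Stage 2: concatenate the unimodal prediction vectors and train the fusion classifier.
fusion = LogisticRegression(max_iter=1000).fit(np.hstack(probas_train), y[idx_train])
print("fused test accuracy:", fusion.score(np.hstack(probas_test), y[idx_test]))
```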
... To overcome potential concealments of real emotional status in rating depression and other emotions using common scales, novel techniques on facial expression and text content are commonly used. Severity of depression was correlated with facial expression features which were identified via movements of expression muscles (7). A machine learning model for the assessment of depression severity was built by extracting facial expression features (8). ...
Full-text available
Article
Background: Emotional disturbance is an important risk factor for suicidal behaviors. For speech emotion recognition (SER), a novel technique to evaluate the emotional characteristics of speech, precision in labeling emotional words is a prerequisite. Currently, a list of suicide-related emotional words is lacking. The aims of this study were to establish an Emotional Words List for Suicidal Risk Assessment (EWLSRA) and test the reliability and validity of the list in a suicide-related SER task. Methods: Suicide-related emotion words were nominated and discussed by 10 suicide prevention professionals. Sixty-five tape-recordings of calls to a large psychological support hotline in China were selected to test psychometric characteristics of the EWLSRA. Results: The results show that the EWLSRA consists of 11 emotion words which were highly associated with suicide risk scores and suicide attempts. Results of exploratory factor analysis support a one-factor model of this list. The Fleiss’ Kappa value of 0.42 indicated good inter-rater reliability of the list. In terms of criterion validity, indices of despair (Spearman ρ = 0.54, P < 0.001), sadness (ρ = 0.37, P = 0.006), helplessness (ρ = 0.45, P = 0.001), and numbness (ρ = 0.35, P = 0.009) were significantly associated with suicidal risk scores. The index of the emotional word numbness in callers who attempted suicide during the 12-month follow-up was significantly higher than that in callers without a suicide attempt during the follow-up (P = 0.049). Conclusion: This study demonstrated that the EWLSRA has adequate psychometric performance in identifying suicide-related emotional words in recordings of callers to a nationwide suicide prevention hotline. This list can be useful for SER in future studies on suicide prevention.
... In the healthcare context, the extraction of behavioural features can be important for monitoring individuals' health conditions with personality trait disorders. These people often present changes in vocal acoustics and facial movements associated with psychomotor problems, which are behaviorally expressed as altered coordination and timing across motor-based properties [Williamson, 2014]. Clinically, attention paid to these behavioural changes can help monitor the disorder's course and responses to treatment, with a relatively low computational cost. ...
Full-text available
Article
The paper aims to provide a method to analyse and observe the characteristics that distinguish the individual communication style such as the voice intonation, the size and slant used in handwriting and the trait, pressure and dimension used for sketching. These features are referred to as Communication Extensional Features. Observing from the Communication Extensional Features, the user’s behavioural features, such as the communicative intention, the social style and personality traits can be extracted. These behavioural features are referred to as Communication Intentional Features. For the extraction of Communication Intentional Features, a method based on Hidden Markov Models is provided in the paper. The Communication Intentional Features have been extracted at the modal and multimodal level; this represents an important novelty provided by the paper. The accuracy of the method was tested both at modal and multimodal levels. The evaluation process results indicate an accuracy of 93.3% for the Modal layer (handwriting layer) and 95.3% for the Multimodal layer.
... To be more precise, the interest is focused on images, facial landmark points (Stratou et al., 2014; Morency et al., 2015; Nasir et al., 2016; Pampouchidou et al., 2016b), and/or facial action units (AUs) (Cohn et al., 2009; McIntyre et al., 2009; Williamson et al., 2014). However, the methods that adopt image analysis (the essence of video-based methods is still image analysis, where videos are converted into images) are affected by environmental factors and instrument parameters, such as illumination, angle, skin color, and resolution power. ...
Full-text available
Article
The proportion of individuals with depression has rapidly increased along with the growth of the global population. Depression has been the currently most prevalent mental health disorder. An effective depression recognition system is especially crucial for the early detection of potential depression risk. A depression-related dataset is also critical while evaluating the system for depression or potential depression risk detection. Due to the sensitive nature of clinical data, availability and scale of such datasets are scarce. To our knowledge, there are few extensively practical depression datasets for the Chinese population. In this study, we first create a large-scale dataset by asking subjects to perform five mood-elicitation tasks. After each task, subjects' audio and video are collected, including 3D information (depth information) of facial expressions via a Kinect. The constructed dataset is from a real environment, i.e., several psychiatric hospitals, and has a specific scale. Then we propose a novel approach for potential depression risk recognition based on two kinds of different deep belief network (DBN) models. One model extracts 2D appearance features from facial images collected by an optical camera, while the other model extracts 3D dynamic features from 3D facial points collected by a Kinect. The final decision result comes from the combination of the two models. Finally, we evaluate all proposed deep models on our built dataset. The experimental results demonstrate that (1) our proposed method is able to identify patients with potential depression risk; (2) the recognition performance of combined 2D and 3D features model outperforms using either 2D or 3D features model only; (3) the performance of depression recognition is higher in the positive and negative emotional stimulus, and females' recognition rate is generally higher than that for males. Meanwhile, we compare the performance with other methods on the same dataset. The experimental results show that our integrated 2D and 3D features DBN is more reasonable and universal than other methods, and the experimental paradigm designed for depression is reasonable and practical.
... Articulatory Coordination Features (ACFs) have yielded successful results in distinguishing depressed speech from nondepressed speech by quantifying the changes in timing of speech gestures [5,6,7,8]. These changes in articulatory coordination happen as a result of a neurological condition called psychomotor slowing, a necessary feature of MDD that is used to evaluate the severity of MDD [9,10,11]. ...
Preprint
Speech based depression classification has gained immense popularity over the recent years. However, most of the classification studies have focused on binary classification to distinguish depressed subjects from non-depressed subjects. In this paper, we formulate the depression classification task as a severity level classification problem to provide more granularity to the classification outcomes. We use articulatory coordination features (ACFs) developed to capture the changes of neuromotor coordination that happens as a result of psychomotor slowing, a necessary feature of Major Depressive Disorder. The ACFs derived from the vocal tract variables (TVs) are used to train a dilated Convolutional Neural Network based depression classification model to obtain segment-level predictions. Then, we propose a Recurrent Neural Network based approach to obtain session-level predictions from segment-level predictions. We show that strengths of the segment-wise classifier are amplified when a session-wise classifier is trained on embeddings obtained from it. The model trained on ACFs derived from TVs show relative improvement of 27.47% in Unweighted Average Recall (UAR) at the session-level classification task, compared to the ACFs derived from Mel Frequency Cepstral Coefficients (MFCCs).
... These articulatory coordination features can be used to characterize the level of articulatory coordination and timing. To measure the coordination, assessments of the multi-scale structure of correlations among the time series signals were used [12,13,14]. This was extensively done using acoustic features consisting of the first three resonances of the vocal tract (formants). ...
... The efficacy of speech-based diagnostic models has been explored in a range of conditions, including Parkinson's disease (Benba et al., 2015;Orozco-Arroyave et al., 2016;Williamson et al., 2015), cognitive impairment (Garrard et al., 2014;Orimaye et al., 2017;Roark et al., 2011;Yu et al., 2014), and depression (Cummins et al., 2011;Sturim et al., 2011;Williamson et al., 2014). Generally speaking, these models collect some type of sensor data (acoustic, kinematic, etc.), apply signal processing to extract salient information from the raw signals, then use machine learning to map from the extracted features to clinically relevant information (e.g., whether or not a person has a given disease). ...
Full-text available
Article
Purpose: Kinematic measurements of speech have demonstrated some success in automatic detection of early symptoms of amyotrophic lateral sclerosis (ALS). In this study, we examined how the region of symptom onset (bulbar vs. spinal) affects the ability of data-driven models to detect ALS. Method: We used a correlation structure of articulatory movements combined with a machine learning model (i.e., artificial neural network) to detect differences between people with ALS and healthy controls. The performance of this system was evaluated separately for participants with bulbar onset and spinal onset to examine how region of onset affects classification performance. We then performed a regression analysis to examine how different severity measures and region of onset affect model performance. Results: The proposed model was significantly more accurate in classifying the bulbar-onset participants, achieving an area under the curve of 0.809 relative to the 0.674 achieved for spinal-onset participants. The regression analysis, however, found that differences in classifier performance across participants were better explained by their speech performance (intelligible speaking rate), and no significant differences were observed based on region of onset when intelligible speaking rate was accounted for. Conclusions: Although we found a significant difference in the model's ability to detect ALS depending on the region of onset, this disparity can be primarily explained by observable differences in speech motor symptoms. Thus, when the severity of speech symptoms (e.g., intelligible speaking rate) was accounted for, symptom onset location did not affect the proposed computational model's ability to detect ALS.
... Abnormal SCG distinguishes MDD from control subjects [110]. Vocal and facial biomarkers: less reliable in comparison to EEG biomarkers [90,89]. 3p25-26: this gene region has been found in more than 800 depressed families, but the evidence is still immature. ...
Full-text available
Article
Mental disorders represent critical public health challenges as they are leading contributors to the global burden of disease and intensely influence the social and financial welfare of individuals. The present comprehensive review concentrates on two mental disorders, Major Depressive Disorder (MDD) and Bipolar Disorder (BD), with noteworthy publications during the last ten years. There is a great need nowadays for phenotypic characterization of psychiatric disorders with biomarkers. Electroencephalography (EEG) signals could offer a rich signature for MDD and BD and could thereby improve understanding of the pathophysiological mechanisms underlying these mental disorders. In this review, we focus on the literature works adopting neural networks fed by EEG signals. Among those studies using EEG and neural networks, we have discussed a variety of EEG-based protocols, biomarkers and public datasets for depression and bipolar disorder detection. We conclude with a discussion and valuable recommendations that will help to improve the reliability of developed models and lead to more accurate and more deterministic computational intelligence based systems in psychiatry. This review will prove to be a structured and valuable initial point for researchers working on depression and bipolar disorder recognition using EEG signals.
... Figure 2 shows that the low-rank eigenvalues are larger for MDD subjects relative to the schizophrenic patients and the healthy controls, and this trend is reversed towards the high-rank eigenvalues. This pattern is a key observation associated with depression severity (Williamson, Quatieri, et al. 2014; Williamson, Young, et al. 2019; Espy-Wilson et al. 2019). The magnitude of high-rank eigenvalues indicates the dimensionality of the time-delay embedded feature space. ...
Full-text available
Preprint
This study investigates the speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using a time delay embedded correlation analysis. We show that schizophrenia subjects with strong positive symptoms who are markedly ill exhibit more complex coordination patterns in facial and speech gestures than those observed in healthy subjects. This observation is in contrast to what previous studies have shown in Major Depressive Disorder (MDD), where subjects with MDD show a simpler coordination pattern with respect to healthy controls or subjects in remission. This difference is not surprising given that MDD is necessarily accompanied by psychomotor slowing (i.e., negative symptoms), which affects speech, ideation and motility. With respect to speech, psychomotor slowing results in slowed speech with more and longer pauses than what occurs in speech from the same speaker when they are in remission and from a healthy subject. Time delay embedded correlation analysis has been used to quantify the differences in coordination patterns of speech articulation. The current study is based on 17 Facial Action Units (FAUs) extracted from video data and 6 Vocal Tract Variables (TVs) obtained from simultaneously recorded audio data. The TVs are extracted using a speech inversion system based on articulatory phonology that maps the acoustic signal to vocal tract variables. The high-level time delay embedded correlation features computed from TVs and FAUs are used to train a stacking ensemble classifier fusing audio and video modalities. The results show that there is a promising distinction between healthy and schizophrenia subjects (with strong positive symptoms) in terms of neuromotor coordination in speech.
... Figure 2 shows that the low-rank eigenvalues are larger for MDD subjects relative to the schizophrenic patients and the healthy controls, and this trend is reversed towards the high-rank eigenvalues. This pattern is a key observation associated with depression severity (Williamson, Quatieri, et al. 2014; Williamson, Young, et al. 2019; Espy-Wilson et al. 2019). The magnitude of high-rank eigenvalues indicates the dimensionality of the time-delay embedded feature space. ...
Full-text available
Conference Paper
This study investigates the speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using a time delay embedded correlation analysis. We show that schizophrenia subjects with strong positive symptoms who are markedly ill exhibit more complex coordination patterns in facial and speech gestures than those observed in healthy subjects. This observation is in contrast to what previous studies have shown in Major Depressive Disorder (MDD), where subjects with MDD show a simpler coordination pattern with respect to healthy controls or subjects in remission. This difference is not surprising given that MDD is necessarily accompanied by psychomotor slowing (i.e., negative symptoms), which affects speech, ideation and motility. With respect to speech, psychomotor slowing results in slowed speech with more and longer pauses than what occurs in speech from the same speaker when they are in remission and from a healthy subject. Time delay embedded correlation analysis has been used to quantify the differences in coordination patterns of speech articulation. The current study is based on 17 Facial Action Units (FAUs) extracted from video data and 6 Vocal Tract Variables (TVs) obtained from simultaneously recorded audio data. The TVs are extracted using a speech inversion system based on articulatory phonology that maps the acoustic signal to vocal tract variables. The high-level time delay embedded correlation features computed from TVs and FAUs are used to train a stacking ensemble classifier fusing audio and video modalities. The results show that there is a promising distinction between healthy and schizophrenia subjects (with strong positive symptoms) in terms of neuromotor coordination in speech.
... In previous research, we examined the recognition of these three disease classes (depression, Parkinson's disease, and dysphonia) using auto- and cross-correlation structures from a limited set of acoustic-phonetic features [10]. The correlation structure was created following the work of Williamson et al., who have successfully applied this approach in several studies [11][12]. The eigenvalues of the structures were used as input in the classification process created in RapidMiner Studio. ...
Full-text available
Article
There is already a considerable body of research on the binary separation of healthy people and people with an illness that affects speech. However, only a few studies have attempted to recognize several illnesses together. Examining the latter is justified by the fact that a person may suffer from several illnesses at the same time, to varying degrees. In the present study, multiclass classification of depression, Parkinson's disease, and general voice disorders (organic and functional dysphonia) was performed using speech samples. First, several acoustic features were examined as input (such as Mel-Frequency Cepstral Coefficients (MFCCs), mel-band energy values, formants and their bandwidths). Using these inputs, auto- and cross-correlation structures were formed as image representations and fed to a convolutional neural network (CNN). Parameter optimization of the correlation structures and the CNN model was applied to achieve the highest accuracy, and the result of the tuned process was compared to that of a baseline process. Finally, multiclass (5- and 4-class) classification was performed with the best parameters. The most prominent feature set was the MFCCs (55.9% accuracy, 52.2% macro F1-score) for 5-class classification; 64.3% accuracy and 60.0% macro F1-score were obtained for 5 classes after parameter optimization. For classifying 4 classes (merging the dysphonic classes), 74.9% accuracy and 71.7% macro F1-score were achieved.
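A simplified sketch of the "correlation structure as image" idea follows: pairwise channel correlations of MFCCs are computed over a range of delays and arranged as a 2-D array that can be fed to a CNN. The delay range, the tiny network, and the placeholder features are illustrative assumptions, not the tuned configuration reported above.

```python
# Rough sketch: build a correlation-structure "image" from MFCC channels
# and pass it through a minimal CNN. Shapes and layers are assumptions.
import numpy as np
import torch
import torch.nn as nn

def correlation_image(mfcc, n_delays=32):
    """Auto/cross-correlation structure: correlations between every pair of
    MFCC channels at delays 0..n_delays-1, arranged as a 2-D image."""
    n_ch, n_frames = mfcc.shape
    img = np.zeros((n_ch * n_ch, n_delays), dtype=np.float32)
    for d in range(n_delays):
        a = mfcc[:, : n_frames - n_delays]
        b = mfcc[:, d : n_frames - n_delays + d]
        # Pearson correlation between each channel of `a` and each of `b`.
        a_z = (a - a.mean(1, keepdims=True)) / a.std(1, keepdims=True)
        b_z = (b - b.mean(1, keepdims=True)) / b.std(1, keepdims=True)
        img[:, d] = (a_z @ b_z.T / a.shape[1]).reshape(-1)
    return img

cnn = nn.Sequential(                        # minimal classifier over the image
    nn.Conv2d(1, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(8 * 4 * 4, 5))  # 5 classes, as in the study

mfcc = np.random.randn(13, 500).astype(np.float32)     # placeholder features
x = torch.from_numpy(correlation_image(mfcc)).unsqueeze(0).unsqueeze(0)
logits = cnn(x)
```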
... Several recent studies found that successful results can be achieved by quantifying the changes in articulatory coordination to distinguish depressed speech from non-depressed speech [7,8,9,10]. These differences in the timing of speech gestures are caused by a neurological phenomenon called psychomotor slowing, which is identified as a major characteristic of depression [11]. ...
... In recent years, the problem of automatically detecting and monitoring depression using behavioural markers, such as speech, extracted using affective computing and social signal processing techniques [4]–[6], has gained interest. These systems are based on the analysis of depression-induced changes in muscle tension and control, for example laryngeal coordination [4], vocal tract behaviour [5] and facial activity and head movements [6]. ...
... Williamson et al. [64] derived facial coordination features from the facial action unit signals. The dimensionality of the obtained features was reduced using Principal Component Analysis (PCA) to enhance the prediction of the BDI score for depression. ...
Full-text available
Article
Presently, while automated depression diagnosis has made great progress, most of the recent works have focused on combining multiple modalities rather than strengthening a single one. In this research work, we present a unimodal framework for depression detection based on facial expressions and facial motion analysis. We investigate a wide set of visual features extracted from different facial regions. Due to high dimensionality of the obtained feature sets, identification of informative and discriminative features is a challenge. This paper suggests a hybrid dimensionality reduction approach which leverages the advantages of the filter and wrapper methods. First, we use a univariate filter method, Fisher Discriminant Ratio, to initially reduce the size of each feature set. Subsequently, we propose an Incremental Linear Discriminant Analysis (ILDA) approach to find an optimal combination of complementary and relevant feature sets. We compare the performance of the proposed ILDA with the batch-mode LDA and also the Composite Kernel based Support Vector Machine (CKSVM) method. The experiments conducted on the Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset demonstrate that the best depression classification performance is obtained by using different feature extraction methods in combination rather than individually. ILDA generates better depression classification results in comparison to the CKSVM. Moreover, ILDA based wrapper feature selection incurs lower computational cost in comparison to the CKSVM and the batch-mode LDA methods. The proposed framework significantly improves the depression classification performance, with an F1 Score of 0.805, which is better than all the video based depression detection models suggested in literature, for the DAIC-WOZ dataset. Salient facial regions and well performing visual feature extraction methods are also identified.
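The Fisher Discriminant Ratio filter used above as a first screening stage can be sketched in a few lines; the synthetic feature matrix, binary labels, and number of retained features below are assumptions for illustration only.

```python
# Hedged sketch of a univariate Fisher Discriminant Ratio filter.
import numpy as np

def fisher_discriminant_ratio(X, y):
    """X: (n_samples, n_features); y: binary labels. Returns one FDR score
    per feature: (mu1 - mu0)^2 / (var1 + var0)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X1.mean(0) - X0.mean(0)) ** 2
    den = X1.var(0) + X0.var(0) + 1e-12        # guard against zero variance
    return num / den

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 200))                 # hypothetical visual features
y = rng.integers(0, 2, size=80)
scores = fisher_discriminant_ratio(X, y)
keep = np.argsort(scores)[::-1][:50]           # keep the 50 top-ranked features
X_reduced = X[:, keep]
```

A wrapper stage (such as the incremental LDA described above) would then search over combinations of the surviving features.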
... More importantly, most of the predicted values approached the true labels. (Gupta et al., 2014; Jain, Crowley, Dey, & Lux, 2014; Jan, Meng, Gaus, Zhang, & Turabzadeh, 2014; Mitra et al., 2014; Pérez Espinosa et al., 2014; Sidorov & Minker, 2014; Williamson, Quatieri, Helfer, Ciccarelli, & Mehta, 2014). A denotes audio modality, and V represents the video modality. ...
Full-text available
Article
Depression has been considered the most dominant mental disorder over the past few years. To help clinicians effectively and efficiently estimate the severity scale of depression, various automated systems based on deep learning have been proposed. To estimate the severity of depression, i.e., the depression severity score (Beck Depression Inventory–II), various deep architectures have been designed to perform regression using the Euclidean loss. However, they do not consider the label distribution, and they do not learn the relationships between the facial images and BDI–II scores, which can result in noisy labels for automatic depression estimation (ADE). To mitigate this problem, we propose an automated deep architecture, namely the self-adaptation network (SAN), to improve this uncertain labeling for ADE. Specifically, the architecture consists of four modules: (1) ResNet-18 and ResNet-50 are adopted in the deep feature extraction module (DFEM) to extract informative deep features; (2) a self-attention module (SAM) is adopted to learn the weights from the mini-batch; (3) a square ranking regularization module (SRRM) is proposed to create high partitions and low partitions; and (4) a re-label module (RM) is used to re-label the uncertain annotations for ADE in the low partitions. We conduct extensive experiments on depression databases (i.e., AVEC2013 and AVEC2014) and obtain a performance comparable to that of other ADE methods in assessing the severity of depression. More importantly, the proposed method can learn valuable depression patterns from facial videos and obtain a performance comparable to that of other methods for depression recognition.
... Williamson et al. [59] utilized feature sets derived from facial movements and acoustic verbal cues to detect psychomotor retardation. They employed principal component analysis for dimensionality reduction and then applied a Gaussian mixture model to classify the combination of principal feature vectors. ...
Full-text available
Article
Depression has become a global concern, and COVID-19 also has caused a big surge in its incidence. Broadly, there are two primary methods of detecting depression: task-based and Mobile Crowd Sensing (MCS) based methods. These two approaches, when integrated, can complement each other. This paper proposes a novel approach for depression detection that combines real-time MCS and task-based mechanisms. We aim to design an end-to-end machine learning pipeline, which involves multimodal data collection, feature extraction, feature selection, fusion, and classification to distinguish between depressed and non-depressed subjects. For this purpose, we created a real-world dataset of depressed and non-depressed subjects. We experimented with various features from multiple modalities, feature selection techniques, fused features, and machine learning classifiers such as Logistic Regression and Support Vector Machines (SVM) for classification. Our findings suggest that combining features from multiple modalities performs better than any single data modality, and the best classification accuracy is achieved when features from all three data modalities are fused. A feature selection method based on Pearson's correlation coefficients improved the accuracy in comparison with other methods, and SVM yielded the best accuracy of 86%. Our proposed approach was also applied to a benchmarking dataset, and the results demonstrated that the multimodal approach compares favourably in performance with state-of-the-art depression recognition techniques.
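The selection-then-classification step described above can be sketched as follows. Features are ranked by the absolute Pearson correlation with the label and the strongest ones are passed to an SVM; the fused feature matrix, threshold, and cross-validation setup are placeholders, not values reported by the authors.

```python
# Sketch of Pearson-correlation feature selection followed by an SVM.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_fused = rng.normal(size=(100, 120))      # features fused from all modalities
y = rng.integers(0, 2, size=100)           # depressed vs non-depressed

# Rank features by |Pearson r| with the label and keep the strongest ones.
r = np.array([pearsonr(X_fused[:, j], y)[0] for j in range(X_fused.shape[1])])
selected = np.argsort(np.abs(r))[::-1][:30]

acc = cross_val_score(SVC(kernel="rbf"), X_fused[:, selected], y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")
```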
... Speech analysis is also widely used to study and precisely measure mental illness, including depression, suicide, and anxiety (Alghowinem et al., 2013; Alonso et al., 2015; Bedi et al., 2015; Cummins et al., 2013, 2015; France et al., 2000; Helfer et al., 2013; Hönig et al., 2014; Laske et al., 2015; Lopez-de-Ipina & Barroso, 2017; Lopez-de-Ipina et al., 2013; Moore et al., 2004; Mundt et al., 2007; Ozdas et al., 2004; Quatieri & Malyska, 2012; Williamson et al., 2014). ...
Full-text available
Article
Secondary prisonization refers to the impact of the incarceration of a relative on the members of their family. This study aimed to analyze the psychological effects of secondary prisonization on older parents. Specifically, levels of depression, anxiety, stress, and well-being (emotional, psychological, and social) were analyzed by means of quantitative and automatic speech analysis methods in a sample of over 65-year-old parents of Basque prisoners incarcerated in remote prisons. The statistical analysis of data and the automatic spontaneous speech analysis showed that secondary prisonization has a negative impact on older parents’ levels of depression, anxiety, stress, and well-being. These results lead us to conclude that remote imprisonment of adult children has negative psychological effects on older parents.
... Image processing and text analysis and understanding are cornerstones of computer vision, and speech recognition and pattern recognition underpin applications such as autonomous cars (Nasir et al., 2016; Williamson et al., 2013). This is just one example of a potentially exciting advance in predicting mental disorders, alongside work that studies the relationships between brain structures using neuroimaging data on brain function. ...
Full-text available
Article
Depression is a serious mental health condition that may lead to poor mental and emotional functioning at work, at school and in the family, causing mental imbalance. In the worst scenarios, depression may lead to severe anxiety or suicide. Hence, it is necessary to diagnose depression at early stages. This paper describes the development of a novel approach for a convolutional neural network model that can examine facial images from recorded interview sessions to discover facial patterns that could indicate depression level. The user-generated data help to distinguish between different depressive groups, since depression symptoms can manifest in different ways in people with various mental illnesses. In particular, we want to automatically predict the depression scale and differentiate depression from other mental disorders using the patient's psychiatric illness history and dynamic textual descriptions extracted from the user inputs. We apply the k-nearest neighbour algorithm to the dynamic textual descriptors to perform a linguistic analysis for classifying mental illness into different classes. We apply dimensionality reduction and regression using the Random Forest algorithm to predict the depression scale. The proposed framework is an extension of pre-existing frameworks, replacing the handcrafted feature extraction technique with deep feature extraction. The model performs 2.7% better than existing frameworks in facial detection and feature extraction.
... Articulatory Coordination Features (ACFs) have yielded successful results in distinguishing depressed speech from nondepressed speech by quantifying the changes in timing of speech gestures [5,6,7,8]. These changes in articulatory coordination happen as a result of a neurological condition called psychomotor slowing, a necessary feature of MDD that is used to evaluate its severity [9,10,11]. ...
Chapter
Through the advancement of a new generation of information and communication technologies, such as 5G, the Internet of Medical Things (IoMT), machine learning, etc., the scientific community has already extensively explored the possibilities of utilizing such technologies in varied healthcare processes. From the perspective of process management, it could be argued that every single process of the cycle of health is evolving with such trends, from health monitoring and online health consultation to in-hospital diagnosis and surgery, and eventually follow-up examinations and rehabilitation. For example, the process of health monitoring and assessment could be enhanced with wearable or non-contact devices to achieve 24/7 monitoring.
Full-text available
Article
Mental disorders are closely related to deficits in cognitive control. Such cognitive impairments may result in aberrations in mood, thinking, work, body functions, emotions, social engagements and general behaviour. Mental disorders may affect phenotypic behaviour such as eye movements, facial expressions and speech. Furthermore, a close association has been observed between mental disorders and physiological responses emanating from the brain, muscles, heart, eyes, skin, etc. Mental disorders disrupt higher cognitive function, social cognition, control of complex behaviours and regulation of emotion. Cognitive computation may help understand such disruptions for improved decision-making with the help of computers. This study presents a systematic literature review of state-of-the-art computational methods and technologies facilitating automated detection of mental disorders. For this survey, the relevant literature between 2010 and 2021 has been studied. Recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) model were adopted for identification, screening, validation and inclusion of research literature. Self-diagnosis tools for the detection of mental disorders, such as questionnaires and rating scales, are inconsistent and static in nature. They cannot encompass the diversity of mental disorders, inter-individual variability and the impact of an individual's emotional state. Furthermore, there are no standard baselines for mental disorders. This situation mandates a multi-faceted approach which may utilise data from physiological signals, behavioural patterns and even data obtained from various online portals such as social media to efficiently and effectively detect the prevalence, type and severity of mental disorders.
Article
Effective and efficient automatic depression diagnosis is a challenging subject in the field of affective computing. Since speech signals provide useful information for diagnosing depression, in this paper, we propose to extract deep speaker recognition (SR) and speech emotion recognition (SER) features using pretrained models, and combine the two deep speech features to take advantage of the complementary information between the vocal and emotional differences of speakers. In addition, due to the small amount of data for depression recognition and the cost sensitivity of the diagnosis results, we propose a hierarchical depression detection model, in which multiple classifiers are set up prior to a regressor to guide the prediction of depression severity. We test our method on the AVEC 2013 and AVEC 2014 benchmark databases. The results demonstrate that the fusion of deep SR and SER features can improve the prediction performance of the model. The proposed method, using only audio features, can avoid the overfitting problem and achieves better performance than the previous audio-based methods on both databases. It also provides results comparable to those of video-based and multimodal-based methods for depression detection.
Article
Background In contrast to all other areas of medicine, psychiatry is still nearly entirely reliant on subjective assessments such as patient self-report and clinical observation. The lack of objective information on which to base clinical decisions can contribute to reduced quality of care. Behavioral health clinicians need objective and reliable patient data to support effective targeted interventions. Objective We aimed to investigate whether reliable inferences—psychiatric signs, symptoms, and diagnoses—can be extracted from audiovisual patterns in recorded evaluation interviews of participants with schizophrenia spectrum disorders and bipolar disorder. Methods We obtained audiovisual data from 89 participants (mean age 25.3 years; male: 48/89, 53.9%; female: 41/89, 46.1%): individuals with schizophrenia spectrum disorders (n=41), individuals with bipolar disorder (n=21), and healthy volunteers (n=27). We developed machine learning models based on acoustic and facial movement features extracted from participant interviews to predict diagnoses and detect clinician-coded neuropsychiatric symptoms, and we assessed model performance using area under the receiver operating characteristic curve (AUROC) in 5-fold cross-validation. Results The model successfully differentiated between schizophrenia spectrum disorders and bipolar disorder (AUROC 0.73) when aggregating face and voice features. Facial action units including cheek-raising muscle (AUROC 0.64) and chin-raising muscle (AUROC 0.74) provided the strongest signal for men. Vocal features, such as energy in the frequency band 1 to 4 kHz (AUROC 0.80) and spectral harmonicity (AUROC 0.78), provided the strongest signal for women. Lip corner–pulling muscle signal discriminated between diagnoses for both men (AUROC 0.61) and women (AUROC 0.62). Several psychiatric signs and symptoms were successfully inferred: blunted affect (AUROC 0.81), avolition (AUROC 0.72), lack of vocal inflection (AUROC 0.71), asociality (AUROC 0.63), and worthlessness (AUROC 0.61). Conclusions This study represents advancement in efforts to capitalize on digital data to improve diagnostic assessment and supports the development of a new generation of innovative clinical tools by employing acoustic and facial data analysis.
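The evaluation protocol described above, AUROC under 5-fold cross-validation over acoustic and facial-movement features, can be sketched as follows; the placeholder features, labels, and logistic-regression classifier are assumptions, not the study's models.

```python
# Minimal sketch of AUROC estimated with 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
X = rng.normal(size=(89, 40))          # acoustic + facial-movement features
y = rng.integers(0, 2, size=89)        # e.g. schizophrenia spectrum vs bipolar

proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
print("AUROC:", roc_auc_score(y, proba))
```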
Full-text available
Article
Background Major Depressive Disorder (MDD) is prevalent, often chronic, and requires ongoing monitoring of symptoms to track response to treatment and identify early indicators of relapse. Remote Measurement Technologies (RMT) provide an opportunity to transform the measurement and management of MDD, via data collected from inbuilt smartphone sensors and wearable devices alongside app-based questionnaires and tasks. A key question for the field is the extent to which participants can adhere to research protocols and the completeness of the data collected. We aimed to describe drop out and data completeness in a naturalistic multimodal longitudinal RMT study in people with a history of recurrent MDD. We further aimed to determine whether those experiencing a depressive relapse at baseline contributed less complete data. Methods Remote Assessment of Disease and Relapse – Major Depressive Disorder (RADAR-MDD) is a multi-centre, prospective observational cohort study conducted as part of the Remote Assessment of Disease and Relapse – Central Nervous System (RADAR-CNS) program. People with a history of MDD were provided with a wrist-worn wearable device and smartphone apps designed to: a) collect data from smartphone sensors; and b) deliver questionnaires, speech tasks, and cognitive assessments. Participants were followed up for a minimum of 11 months and a maximum of 24 months. Results Individuals with a history of MDD (n = 623) were enrolled in the study. We report 80% completion rates for primary outcome assessments across all follow-up timepoints. 79.8% of people participated for the maximum amount of time available and 20.2% withdrew prematurely. We found no evidence of an association between the severity of depression symptoms at baseline and the availability of data. In total, 110 participants had > 50% data available across all data types. Conclusions RADAR-MDD is the largest multimodal RMT study in the field of mental health. Here, we have shown that collecting RMT data from a clinical population is feasible. We found comparable levels of data availability in active and passive forms of data collection, demonstrating that both are feasible in this patient group.
Full-text available
Article
Major depressive disorder (MDD) is one of the most common modern ailments, affecting a huge population throughout the world. The electroencephalogram (EEG) signal is widely used to screen for MDD. Manual diagnosis of MDD using EEG is time-consuming, subjective and may cause human errors. Therefore, various automated systems have been developed to diagnose MDD accurately and rapidly. In this work, we have proposed a novel automated MDD detection system using EEG signals. Our proposed model has three steps: (i) melamine pattern and discrete wavelet transform (DWT) based multilevel feature generation, (ii) selection of the most relevant features using neighborhood component analysis (NCA) and (iii) classification using support vector machine (SVM) and k nearest neighbor (kNN) classifiers. The novelty of this work is the application of the melamine pattern. The molecular structure of melamine (drawn from ChemSpider, the "chemistry spider" database) is used to generate 1536 features. Also, various statistical features are extracted from the DWT coefficients. The NCA is used to select the most relevant features and these selected features are classified using SVM and kNN classifiers. The presented model attained greater than 95% accuracy using all channels with the quadratic SVM classifier. Our highest classification accuracies were 99.11% and 99.05% using weighted kNN and quadratic SVM, respectively, on the A2A1 EEG channel. We have developed the automated depression model using a large dataset and obtained high classification accuracies. These results indicate that our presented model can be used in mental health clinics to confirm the manual diagnosis of psychiatrists.
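A hedged sketch of the DWT-statistics, dimensionality-reduction, and kNN chain described above follows. It omits the melamine-pattern features entirely, and note that scikit-learn's NeighborhoodComponentsAnalysis learns a linear transform rather than selecting original features, so it stands in only loosely for the NCA feature selection the paper describes; wavelet, statistics, and classifier settings are assumptions.

```python
# Sketch: DWT sub-band statistics per EEG epoch, NCA-style transform, kNN.
import numpy as np
import pywt
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def dwt_stat_features(eeg_channel, wavelet="db4", level=4):
    """Statistical summary of each DWT sub-band of one EEG channel."""
    coeffs = pywt.wavedec(eeg_channel, wavelet, level=level)
    feats = []
    for c in coeffs:
        feats += [c.mean(), c.std(), np.abs(c).max(), np.sum(c ** 2)]
    return np.array(feats)

rng = np.random.default_rng(4)
epochs = rng.normal(size=(120, 2048))               # placeholder EEG epochs
y = rng.integers(0, 2, size=120)                    # MDD vs healthy
X = np.vstack([dwt_stat_features(e) for e in epochs])

clf = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=10, random_state=0),
    KNeighborsClassifier(n_neighbors=5, weights="distance"))
clf.fit(X, y)
```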
Article
In this paper, we have focused to improve the performance of a speech-based uni-modal depression detection system, which is non-invasive, involves low cost and computation time in comparison to multi-modal systems. The performance of a decision system mainly depends on the choice of feature selection method and the classifier. We have investigated the combination of four well-known multivariate filter methods (minimum Redundancy Maximum Relevance, Scatter Ratio, Mahalanobis Distance, Fast Correlation Based feature selection) and four well-known classifiers (k-Nearest Neighbour, Linear Discriminant classifier, Decision Tree, Support Vector Machine) to obtain a minimal set of relevant and non-redundant features to improve the performance. This will speed up the acquisition of features from speech and build the decision system with low cost and complexity. Experimental results on the high and low-level features of recent work on the DAICWOZ dataset demonstrate the superior performance of the combination of Scatter Ratio and LDC as well as that of Mahalanobis Distance and LDC, in comparison to other combinations and existing speech-based depression results, for both gender independent and gender-based studies. Further, these combinations have also outperformed a few multimodal systems. It was noted that low-level features are more discriminatory and provide a better f1 score.
Article
With the acceleration of the pace of work and life, people are facing more and more pressure, which increases the probability of suffering from depression. However, many patients may fail to get a timely diagnosis due to the serious imbalance in the doctor-patient ratio in the world. A promising development is that physiological and psychological studies have found some differences in speech and facial expression between patients with depression and healthy individuals. Consequently, to improve current medical care, Deep Learning (DL) has been used to extract a representation of depression cues from audio and video for automatic depression detection. To classify and summarize such research, we introduce the databases and describe objective markers for automatic depression estimation. We also review the DL methods for automatic detection of depression to extract a representation of depression from audio and video. Lastly, we discuss challenges and promising directions related to the automatic diagnosis of depression using DL.
Full-text available
Preprint
Background Major Depressive Disorder (MDD) is prevalent, often chronic, and requires ongoing monitoring of symptoms to track response to treatment and identify early indicators of relapse. Remote Measurement Technologies (RMT) provide an exciting opportunity to transform the measurement and management of MDD, via data collected from inbuilt smartphone sensors and wearable devices alongside app-based questionnaires and tasks. Our aim is to describe the amount of data collected during a multimodal longitudinal RMT study in an MDD population. Methods The Remote Assessment of Disease and Relapse – Central Nervous System (RADAR-CNS) program explores the potential to use RMT across a range of central nervous system disorders. Remote Assessment of Disease and Relapse – Major Depressive Disorder (RADAR-MDD) is a multi-centre, prospective observational cohort study conducted as part of the RADAR-CNS program. People with a history of MDD were provided with a wrist-worn wearable, and several apps designed to: a) collect data from smartphone sensors; and b) deliver questionnaires, speech tasks and cognitive assessments. Participants were followed up for a maximum of 2 years. Results A total of 623 individuals with a history of MDD were enrolled in the study. We report 80% completion rates for primary outcome assessments across all follow-up timepoints. 79.8% of people participated for the maximum amount of time available and 20.2% withdrew prematurely. Data availability across all RMT data types varied depending on the source of data and the participant burden for each data type. We found no evidence of an association between the severity of depression symptoms at baseline and the availability of data. In total, 110 participants had > 50% data available across all data types, and were thus able to contribute to multiparametric analyses. Conclusions RADAR-MDD is the largest multimodal RMT study in the field of mental health. Here, we have shown that collecting RMT data from a clinical population is feasible. We found comparable levels of data availability in active and passive forms of data collection, demonstrating that both are feasible in this patient group. Our next steps are to illustrate the predictive value of these data, which will be the focus of our future data analysis aims.
Article
As a common mental disorder, depression has attracted many researchers from affective computing field to estimate the depression severity. However, existing approaches based on Deep Learning (DL) are mainly focused on single facial image without considering the sequence information for predicting the depression scale. In this paper, an integrated framework, termed DepNet, for automatic diagnosis of depression that adopts facial images sequence from videos is proposed. Specifically, several pretrained models are adopted to represent the low‐level features, and Feature Aggregation Module is proposed to capture the high‐level characteristic information for depression analysis. More importantly, the discriminative characteristic of depression on faces can be mined to assist the clinicians to diagnose the severity of the depressed subjects. Multiscale experiments carried out on AVEC2013 and AVEC2014 databases have shown the excellent performance of the intelligent approach. The root mean‐square error between the predicted values and the Beck Depression Inventory‐II scores is 9.17 and 9.01 on the two databases, respectively, which are lower than those of the state‐of‐the‐art video‐based depression recognition methods.
Article
There is an urgent need to detect depression using a non-intrusive approach that is reliable and accurate. In this paper, a simple and efficient unimodal depression detection approach based on speech is proposed, which is non-invasive, cost-effective and computationally inexpensive. A set of spectral, temporal and spectro-temporal features is derived from the speech signal of healthy and depressed subjects. To select a minimal subset of relevant and non-redundant speech features for detecting depression, a two-phase approach based on the nature-inspired wrapper-based feature selection Quantum-based Whale Optimization Algorithm (QWOA) is proposed. Experiments are performed on the publicly available Distress Analysis Interview Corpus Wizard-of-Oz (DAICWOZ) dataset and compared with three established univariate filtering techniques for feature selection and four well-known evolutionary algorithms. The proposed model outperforms all the univariate filter feature selection techniques and the evolutionary algorithms, and has low computational complexity in comparison to traditional wrapper-based evolutionary methods. The performance of the proposed approach is superior to existing unimodal and multimodal automated depression detection models. The combination of spectral, temporal and spectro-temporal speech features gave the best result with the LDA classifier. The proposed approach achieves F1-scores of 0.846 and 0.932 for the depressed and non-depressed classes, respectively, with an error of 0.094. Statistical tests demonstrate that the acoustic features selected using the proposed approach are non-redundant and discriminatory, and that the performance of the proposed approach is significantly better than that of traditional wrapper-based evolutionary methods.
Full-text available
Article
Depression is a widespread mental health problem around the world with a significant burden on economies. Its early diagnosis and treatment are critical to reduce the costs and even save lives. One key aspect to achieve that goal is to use technology and monitor depression remotely and relatively inexpensively using automated agents. There have been numerous efforts to automatically assess depression levels using audiovisual features as well as text analysis of conversational speech transcriptions. However, difficulty in data collection and the limited amounts of data available for research present challenges that hamper the success of the algorithms. One of the two novel contributions in this paper is to exploit databases from multiple languages for acoustic feature selection. Since a large number of features can be extracted from speech, given the small amounts of training data available, effective data selection is critical for success. Our proposed multi-lingual method was effective at selecting better features than the baseline algorithms, which significantly improved the depression assessment accuracy. The second contribution of the paper is to extract text-based features for depression assessment and use a novel algorithm to fuse the text- and speech-based classifiers, which further boosted the performance.
Full-text available
Article
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence, arousal and dominance. In addition to structured self-report questionnaires, psychologists and psychiatrists use in their evaluation of a patient's level of depression the observation of facial expressions and vocal cues. It is in this context that we present the fourth Audio-Visual Emotion recognition Challenge (AVEC 2014). This edition of the challenge uses a subset of the tasks used in a previous challenge, allowing for more focussed studies. In addition, labels for a third dimension (Dominance) have been added and the number of annotators per clip has been increased to a minimum of three, with most clips annotated by 5. The challenge has two goals logically organised as sub-challenges: the first is to predict the continuous values of the affective dimensions valence, arousal and dominance at each moment in time. The second is to predict the value of a single self-reported severity of depression indicator for each recording in the dataset. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Full-text available
Article
Neurophysiological changes in the brain associated with early dementia can disrupt articulatory timing and precision in speech production. Motivated by this observation, we address the hypothesis that speaking rate and articulatory coordination, as manifested through formant frequency tracks, can predict performance on an animal fluency task administered to the elderly. Specifically, using phoneme-based measures of speaking rate and articulatory coordination derived from formant cross-correlation measures, we investigate the capability of speech features, estimated from paragraph-recall and naturalistic free speech, to predict animal fluency assessment scores. Using a database consisting of audio from elderly subjects over a 4-year period, we develop least-squares regression models of our cognitive performance measures. The best performing model combined speaking rate and formant features, resulting in a correlation (R) of 0.61 and a root mean squared error (RMSE) of 5.07 with respect to a 9–34 score range. Vocal features thus provide a reduction of about 30% in MSE relative to a baseline (mean score) in predicting cognitive performance derived from the animal fluency assessment.
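A minimal sketch of the least-squares model described above is shown below: an animal-fluency score is regressed on speaking-rate and formant-derived features. The feature values, their ranges, and the session count are placeholders, not data from the study.

```python
# Sketch: ordinary least-squares regression of a fluency score on
# speaking-rate and formant-coordination summaries (synthetic data).
import numpy as np

rng = np.random.default_rng(5)
n_sessions = 40
X = np.column_stack([
    rng.uniform(2.0, 6.0, n_sessions),     # speaking rate (phones/sec), assumed
    rng.normal(size=(n_sessions, 4)),      # formant cross-correlation summaries
])
y = rng.uniform(9, 34, n_sessions)         # animal fluency score range 9-34

A = np.column_stack([np.ones(n_sessions), X])        # add intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
rmse = np.sqrt(np.mean((pred - y) ** 2))
r = np.corrcoef(pred, y)[0, 1]
print(f"R = {r:.2f}, RMSE = {rmse:.2f}")
```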
Full-text available
Article
The purpose of this study was to evaluate the effectiveness of several acoustic measures in predicting breathiness ratings. Recordings were made of eight normal men and seven normal women producing normally phonated, moderately breathy, and very breathy sustained vowels. Twenty listeners rated the degree of breathiness using a direct magnitude estimation procedure. Acoustic measures were made of: (a) signal periodicity, (b) first harmonic amplitude, and (c) spectral tilt. Periodicity measures provided the most accurate predictions of perceived breathiness, accounting for approximately 80% of the variance in breathiness ratings. The relative amplitude of the first harmonic correlated moderately with breathiness ratings, and two measures of spectral tilt correlated weakly with perceived breathiness.
Full-text available
Article
In Major Depressive Disorder (MDD), neurophysiologic changes can alter motor control [1, 2] and therefore alter speech production by influencing the characteristics of the vocal source, tract, and prosodics. Clinically, many of these characteristics are associated with psychomotor retardation, where a patient shows sluggishness and motor disorder in vocal articulation, affecting coordination across multiple aspects of production [3, 4]. In this paper, we exploit such effects by selecting features that reflect changes in coordination of vocal tract motion associated with MDD. Specifically, we investigate changes in correlation that occur at different time scales across formant frequencies and also across channels of the delta-mel-cepstrum. Both feature domains provide measures of coordination in vocal tract articulation while reducing effects of a slowly-varying linear channel, which can be introduced by time-varying microphone placements. With these two complementary feature sets, using the AVEC 2013 depression dataset, we design a novel Gaussian mixture model (GMM)-based multivariate regression scheme, referred to as Gaussian Staircase Regression, that provides a root-mean-squared-error (RMSE) of 7.42 and a mean-absolute-error (MAE) of 5.75 on the standard Beck depression rating scale. We are currently exploring coordination measures of other aspects of speech production, derived from both audio and video signals.
Full-text available
Article
Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
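The phone-level speech-rate measures described above can be sketched from a forced alignment. The snippet below assumes a phone-level alignment is available; the phone labels, durations, and the grouping into a global rate plus phone-specific mean durations are illustrative.

```python
# Sketch: global speaking rate and phone-specific mean durations
# computed from a hypothetical forced-alignment output.
import numpy as np
from collections import defaultdict

# Hypothetical alignment: (phone label, duration in seconds).
alignment = [("dh", 0.04), ("ah", 0.09), ("k", 0.07), ("ae", 0.15),
             ("t", 0.06), ("sil", 0.40), ("s", 0.11), ("ae", 0.13), ("t", 0.08)]

speech = [(p, d) for p, d in alignment if p != "sil"]
total_dur = sum(d for _, d in alignment)

global_rate = len(speech) / total_dur                  # phones per second
per_phone = defaultdict(list)
for p, d in speech:
    per_phone[p].append(d)
phone_mean_dur = {p: float(np.mean(ds)) for p, ds in per_phone.items()}

print(f"global rate: {global_rate:.2f} phones/s")
print("mean phone durations:", phone_mean_dur)
```

Phone-specific durations (rather than a single global rate) are what the abstract reports as uncovering the stronger relationship with depression severity.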
Full-text available
Article
A seizure prediction algorithm is proposed that combines novel multivariate EEG features with patient-specific machine learning. The algorithm computes the eigenspectra of space-delay correlation and covariance matrices from 15-s blocks of EEG data at multiple delay scales. The principal components of these features are used to classify the patient's preictal or interictal state. This is done using a support vector machine (SVM), whose outputs are averaged using a running 15-minute window to obtain a final prediction score. The algorithm was tested on 19 of 21 patients in the Freiburg EEG data set who had three or more seizures, predicting 71 of 83 seizures, with 15 false predictions and 13.8 h in seizure warning during 448.3 h of interictal data. The proposed algorithm scales with the number of available EEG signals by discovering the variations in correlation structure among any given set of signals that correlate with seizure risk.
Full-text available
Conference Paper
A patient-specific seizure prediction algorithm is proposed that extracts novel multivariate signal coherence features from ECoG recordings and classifies a patient's pre-seizure state. The algorithm uses space-delay correlation and covariance matrices at several delay scales to extract the spatiotemporal correlation structure from multichannel ECoG signals. Eigenspectra and amplitude features are extracted from the correlation and covariance matrices, followed by dimensionality reduction using principal components analysis, classification using a support vector machine, and temporal integration to produce a seizure prediction score. Evaluation on the Freiburg EEG database produced a sensitivity of 90.8% and a false positive rate of 0.094.
Full-text available
Conference Paper
In 2000, the Cohn-Kanade (CK) database was released for the purpose of promoting research into automatically detecting individual facial expressions. Since then, the CK database has become one of the most widely used test-beds for algorithm development and evaluation. During this period, three limitations have become apparent: 1) While AU codes are well validated, emotion labels are not, as they refer to what was requested rather than what was actually performed, 2) The lack of a common performance metric against which to evaluate new algorithms, and 3) Standard protocols for common databases have not emerged. As a consequence, the CK database has been used for both AU and emotion detection (even though labels for the latter have not been validated), comparison with benchmark algorithms is missing, and use of random subsets of the original database makes meta-analyses difficult. To address these and other concerns, we present the Extended Cohn-Kanade (CK+) database. The number of sequences is increased by 22% and the number of subjects by 27%. The target expression for each sequence is fully FACS coded and emotion labels have been revised and validated. In addition to this, non-posed sequences for several types of smiles and their associated metadata have been added. We present baseline results using Active Appearance Models (AAMs) and a linear support vector machine (SVM) classifier using a leave-one-out subject cross-validation for both AU and emotion detection for the posed data. The emotion and AU labels, along with the extended image data and tracked landmarks will be made available July 2010.
Full-text available
Conference Paper
In this paper, we report the influence that classification accuracies have in speech analysis from a clinical dataset by adding acoustic low-level descriptors (LLD) belonging to prosodic (i.e. pitch, formants, energy, jitter, shimmer) and spectral features (i.e. spectral flux, centroid, entropy and roll-off) along with their delta (Δ) and delta-delta (Δ-Δ) coefficients to two baseline features of Mel frequency cepstral coefficients and Teager energy critical-band based autocorrelation envelope. Extracted acoustic low-level descriptors (LLD) that display an increase in accuracy after being added to these baseline features were finally modeled together using Gaussian mixture models and tested. A clinical data set of speech from 139 adolescents, including 68 (49 girls and 19 boys) diagnosed as clinically depressed, was used in the classification experiments. For male subjects, the combination of (TEO-CB-Auto-Env + Δ + Δ-Δ) + F0 + (LogE + Δ + Δ-Δ) + (Shimmer + Δ) + Spectral Flux + Spectral Roll-off gave the highest classification rate of 77.82% while for the female subjects, using TEO-CB-Auto-Env gave an accuracy of 74.74%.
Full-text available
Conference Paper
We present the Computer Expression Recognition Toolbox (CERT), a software tool for fully automatic real-time facial expression recognition, and officially release it for free academic use. CERT can automatically code the intensity of 19 different facial actions from the Facial Action Unit Coding System (FACS) and 6 different prototypical facial expressions. It also estimates the locations of 10 facial features as well as the 3-D orientation (yaw, pitch, roll) of the head. On a database of posed facial expressions, Extended Cohn-Kanade (CK+ (1)), CERT achieves an average recognition performance (probability of correctness on a two-alternative forced choice (2AFC) task between one positive and one negative example) of 90.1% when analyzing facial actions. On a spontaneous facial expression dataset, CERT achieves an accuracy of nearly 80%. On a standard dual-core laptop, CERT can process 320 × 240 video images in real time at approximately 10 frames per second.
Full-text available
Article
In an earlier study, we evaluated the effectiveness of several acoustic measures in predicting breathiness ratings for sustained vowels spoken by nonpathological talkers who were asked to produce nonbreathy, moderately breathy, and very breathy phonation (Hillenbrand, Cleveland, & Erickson, 1994). The purpose of the present study was to extend these results to speakers with laryngeal pathologies and to conduct tests using connected speech in addition to sustained vowels. Breathiness ratings were obtained from a sustained vowel and a 12-word sentence spoken by 20 pathological and 5 nonpathological talkers. Acoustic measures were made of (a) signal periodicity, (b) first harmonic amplitude, and (c) spectral tilt. For the sustained vowels, a frequency domain measure of periodicity provided the most accurate predictions of perceived breathiness, accounting for 92% of the variance in breathiness ratings. The relative amplitude of the first harmonic and two measures of spectral tilt correlated moderately with breathiness ratings. For the sentences, both signal periodicity and spectral tilt provided accurate predictions of breathiness ratings, accounting for 70%-85% of the variance.
Full-text available
Article
A decomposition algorithm that uses a pitch-scaled harmonic filter was evaluated using synthetic signals and applied to mixed-source speech, spoken by three subjects, to separate the voiced and unvoiced parts. Pulsing of the noise component was observed in voiced frication, which was analyzed by complex demodulation of the signal envelope. The timing of the pulsation, represented by the phase of the anharmonic modulation coefficient, showed a step change during a vowel-fricative transition corresponding to the change in location of the noise source within the vocal tract. Analysis of fricatives [see text] demonstrated a relationship between steady-state phase and place, and f0 glides confirmed that the main cause was a place-dependent delay.
Full-text available
Article
Quantification of perceptual voice characteristics allows the assessment of voice changes. Acoustic measures of jitter, shimmer, and noise-to-harmonic ratio (NHR) are often unreliable. Measures of cepstral peak prominence (CPP) may be more reliable predictors of dysphonia. Trained listeners analyzed voice samples from 281 patients. The NHR, amplitude perturbation quotient, smoothed pitch perturbation quotient, percent jitter, and CPP were obtained from sustained vowel phonation, and the CPP was obtained from running speech. For the first time, normal and abnormal values of CPP were defined, and they were compared with other acoustic measures used to predict dysphonia. The CPP for running speech is a good predictor and a more reliable measure of dysphonia than are acoustic measures of jitter, shimmer, and NHR.
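A cepstral peak prominence (CPP) measure of the kind discussed above can be sketched as follows. The window, pitch search range, and the choice to fit the trend line only over the search range are common simplifications assumed here, not necessarily the procedure of the cited study.

```python
# Hedged sketch of cepstral peak prominence (CPP) for one frame.
import numpy as np

def cepstral_peak_prominence(frame, fs, f0_min=60.0, f0_max=300.0):
    """CPP: height (dB) of the cepstral peak in the expected pitch range
    above a straight line fitted to the cepstrum over that range."""
    x = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)
    cep_db = 20.0 * np.log10(np.abs(np.fft.ifft(log_mag)) + 1e-12)
    quef = np.arange(len(cep_db)) / fs               # quefrency in seconds
    lo, hi = int(fs / f0_max), int(fs / f0_min)      # pitch-period search range
    peak = lo + np.argmax(cep_db[lo:hi])
    slope, intercept = np.polyfit(quef[lo:hi], cep_db[lo:hi], 1)
    return cep_db[peak] - (slope * quef[peak] + intercept)

# Toy usage: a noisy 120 Hz tone should give a clear cepstral peak.
fs = 16000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(len(t))
print("CPP (dB):", cepstral_peak_prominence(frame, fs))
```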
Full-text available
Article
Among the many clinical decisions that psychiatrists must make, assessment of a patient's risk of committing suicide is definitely among the most important, complex, and demanding. When reviewing his clinical experience, one of the authors observed that successful predictions of suicidality were often based on the patient's voice independent of content. The voices of suicidal patients judged to be high-risk near-term exhibited unique qualities, which distinguished them from nonsuicidal patients. We investigated the discriminating power of two excitation-based speech parameters, vocal jitter and glottal flow spectrum, for distinguishing among high-risk near-term suicidal, major depressed, and nonsuicidal patients. Our sample consisted of ten high-risk near-term suicidal patients, ten major depressed patients, and ten nondepressed control subjects. As a result of two sample statistical analyses, mean vocal jitter was found to be a significant discriminator only between suicidal and nondepressed control groups (p < 0.05). The slope of the glottal flow spectrum, on the other hand, was a significant discriminator between all three groups (p < 0.05). A maximum likelihood classifier, developed by combining the a posteriori probabilities of these two features, yielded correct classification scores of 85% between near-term suicidal patients and nondepressed controls, 90% between depressed patients and nondepressed controls, and 75% between near-term suicidal patients and depressed patients. These preliminary classification results support the hypothesized link between phonation and near-term suicidal risk. However, validation of the proposed measures on a larger sample size is necessary.
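The two excitation features discussed above can be illustrated with a short sketch. The jitter definition (mean absolute period-to-period difference over the mean period) and the straight-line fit to the glottal flow spectrum are standard choices used here as assumptions, not the exact procedure of the cited study.

```python
# Sketch of vocal jitter and glottal-spectrum slope measures.
import numpy as np

def vocal_jitter(period_lengths):
    """Relative jitter from consecutive pitch-period lengths (seconds)."""
    p = np.asarray(period_lengths, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def glottal_spectrum_slope(freqs_hz, levels_db):
    """Slope (dB/octave) of a line fitted to the glottal flow spectrum."""
    return np.polyfit(np.log2(freqs_hz), levels_db, 1)[0]

# Toy examples: a period track with mild cycle-to-cycle perturbation,
# and a toy glottal spectrum sampled at octave-spaced frequencies.
rng = np.random.default_rng(6)
periods = 1.0 / 110.0 + rng.normal(0, 2e-4, size=50)
freqs = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
levels = np.array([0.0, -8.0, -20.0, -33.0, -45.0])
print("jitter:", vocal_jitter(periods))
print("spectral slope (dB/octave):", glottal_spectrum_slope(freqs, levels))
```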
Full-text available
Conference Paper
The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech signals into their voiced and unvoiced constituents. In this paper, we evaluate its ability to reconstruct the time series of the two components accurately using a variety of synthetic, speech-like signals, and discuss its performance. These results determine the degree of confidence that can be expected for real speech signals: typically, 5 dB improvement in the signal-to-noise ratio of the harmonic component and approximately 5 dB more than the initial harmonics-to-noise ratio (HNR) in the anharmonic component. A selection of the analysis opportunities that the decomposition offers is demonstrated on speech recordings, including dynamic HNR estimation and separate linear prediction analyses of the two components. These new capabilities provided by the PSHF can facilitate discovering previously hidden features and investigating interactions of unvoiced sources, such as frication, with voicing.
Full-text available
Article
Almost all speech contains simultaneous contributions from more than one acoustic source within the speaker's vocal tract. In this paper, we propose a method, the pitch-scaled harmonic filter (PSHF), which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach. The PSHF outputs periodic and aperiodic components that are estimates of the respective contributions of the different types of acoustic source. It produces four reconstructed time series signals by decomposing the original speech signal, first, according to amplitude, and then according to power of the Fourier coefficients. Thus, one pair of periodic and aperiodic signals is optimized for subsequent time-series analysis, and another pair for spectral analysis. The performance of the PSHF algorithm is tested on synthetic signals, using three forms of disturbance (jitter, shimmer and additive noise), and the results were used to predict the performance on real speech. Processing recorded speech examples elicited latent features from the signals, demonstrating the PSHF's potential for analysis of mixed-source speech.
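A much-simplified, single-frame illustration of the pitch-scaled idea follows: with an analysis window of exactly b pitch periods, the harmonics of f0 land on DFT bins that are multiples of b, so the frame can be split into harmonic and anharmonic parts by separating those bins. Frame-by-frame processing, windowing, the amplitude/power variants, and the optimisation of f0 in the real PSHF are all omitted; the values below are assumptions.

```python
# Greatly simplified sketch of a pitch-scaled harmonic/anharmonic split.
import numpy as np

def pitch_scaled_split(frame, fs, f0, b=4):
    n = int(round(b * fs / f0))            # window of b pitch periods
    x = frame[:n]
    X = np.fft.fft(x)
    harmonic_bins = np.zeros(n, dtype=bool)
    harmonic_bins[::b] = True              # bins at multiples of b carry f0 harmonics
    Xh = np.where(harmonic_bins, X, 0.0)
    Xa = np.where(harmonic_bins, 0.0, X)
    return np.fft.ifft(Xh).real, np.fft.ifft(Xa).real   # voiced, noise parts

fs, f0 = 16000, 125.0
t = np.arange(2048) / fs
frame = (np.sin(2 * np.pi * f0 * t) + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)
         + 0.05 * np.random.randn(len(t)))
voiced, noise = pitch_scaled_split(frame, fs, f0)
```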
Full-text available
Article
We present a straightforward and robust algorithm for periodicity detection, working in the lag (autocorrelation) domain. When it is tested for periodic signals and for signals with additive noise or jitter, it proves to be several orders of magnitude more accurate than the methods commonly used for speech analysis. This makes our method capable of measuring harmonics-to-noise ratios in the lag domain with an accuracy and reliability much greater than that of any of the usual frequency-domain methods. By definition, the best candidate for the acoustic pitch period of a sound can be found from the position of the maximum of the autocorrelation function of the sound, while the degree of periodicity (the harmonics-to-noise ratio) of the sound can be found from the relative height of this maximum. However, sampling and windowing cause problems in accurately determining the position and height of the maximum. These problems have led to inaccurate time-domain and cepstral methods for p...
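A hedged sketch of the lag-domain periodicity measure described above: the frame autocorrelation is divided by the autocorrelation of the analysis window, the maximum in the pitch range gives the period candidate, and its height gives a harmonics-to-noise ratio. The parabolic interpolation and candidate tracking of the original method are omitted; window and search-range settings are assumptions.

```python
# Sketch: window-corrected autocorrelation pitch and HNR estimate.
import numpy as np

def pitch_and_hnr(frame, fs, f0_min=75.0, f0_max=500.0):
    w = np.hanning(len(frame))
    x = (frame - frame.mean()) * w
    # Normalised autocorrelation of the signal and of the window itself.
    r_x = np.correlate(x, x, mode="full")[len(x) - 1:]
    r_x /= r_x[0]
    r_w = np.correlate(w, w, mode="full")[len(w) - 1:]
    r_w /= r_w[0]
    r = r_x / (r_w + 1e-12)                # correct for the window taper
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + np.argmax(r[lo:hi])
    r_max = float(np.clip(r[lag], 1e-6, 0.999))   # keep the log finite
    hnr_db = 10.0 * np.log10(r_max / (1.0 - r_max))
    return fs / lag, hnr_db

fs = 16000
t = np.arange(2048) / fs
frame = np.sin(2 * np.pi * 140 * t) + 0.05 * np.random.randn(len(t))
print(pitch_and_hnr(frame, fs))            # approx. (140.0, high HNR)
```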
Article
It is clear that the learning speed of feedforward neural networks is in general far slower than required, and this has been a major bottleneck in their applications for past decades. Two key reasons behind this may be: (1) the slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by using such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. The experimental results based on a few artificial and real benchmark function approximation and classification problems, including very large complex applications, show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.
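The core ELM idea described above, random hidden-layer weights with a closed-form (pseudoinverse) solution for the output weights, fits in a short sketch. The layer sizes, sigmoid activation, and toy regression task are illustrative assumptions.

```python
# Minimal extreme learning machine sketch: random hidden layer,
# analytic output weights via the Moore-Penrose pseudoinverse.
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, T):
        n_features = X.shape[1]
        # Hidden weights and biases are drawn at random and never tuned.
        self.W = self.rng.normal(size=(n_features, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid layer
        self.beta = np.linalg.pinv(H) @ T                  # analytic output weights
        return self

    def predict(self, X):
        H = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))
        return H @ self.beta

# Toy regression: learn y = sin(x) from noisy samples.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)
model = ELM(n_hidden=50).fit(X, y)
print("train RMSE:", np.sqrt(np.mean((model.predict(X) - y) ** 2)))
```

The absence of iterative weight tuning is what gives the training speed advantage claimed in the abstract.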
Article
Neurophysiological changes in the brain associated with major depression disorder can disrupt articulatory precision in speech production. Motivated by this observation, we address the hypothesis that articulatory features, as manifested through formant frequency tracks, can help in automatically classifying depression state. Specifically, we investigate the relative importance of vocal tract formant frequencies and their dynamic features from sustained vowels and conversational speech. Using a database consisting of audio from 35 subjects with clinical measures of depression severity, we explore the performance of Gaussian mixture model (GMM) and support vector machine (SVM) classifiers. With only formant frequencies and their dynamics given by velocity and acceleration, we show that depression state can be classified with an optimal sensitivity/specificity/area under the ROC curve of 0.86/0.64/0.70 and 0.77/0.77/0.73 for GMMs and SVMs, respectively. Future work will involve merging our formant-based characterization with vocal source and prosodic features.
Conference Paper
Speech analysis has shown potential for identifying neurological impairment. With brain trauma, changes in brain structure or connectivity may result in changes in source, prosodic, or articulatory aspects of voice. In this work, we examine the articulatory components of speech reflected in formant tracks, and how changes in track dynamics and coordination map to cognitive decline. We address a population of athletes regularly receiving impacts to the head and showing signs of preclinical mild traumatic brain injury (mTBI), a state indicated by impaired cognitive performance occurring prior to concussion. We hypothesize that this preclinical damage results in 1) changes in average vocal tract dynamics measured by formant frequencies, their velocities, and acceleration, and 2) changes in articulatory coordination measured by a novel formant-frequency cross-correlation characterization. These features allow machine learning algorithms to detect preclinical mTBI identified by a battery of cognitive tests. A comparison is performed of the effectiveness of vocal tract dynamics features versus articulatory coordination features. This evaluation is done using receiver operating characteristic (ROC) curves along with confidence bounds. The articulatory dynamics features achieve area under the ROC curve (AUC) values between 0.72 and 0.98, whereas the articulatory coordination features achieve AUC values between 0.94 and 0.97.
Article
A hypothesis in characterizing human depression is that change in the brain's basal ganglia results in a decline of motor coordination [6][8][14]. Such a neurophysiological change may therefore affect laryngeal control and dynamics. Under this hypothesis, toward the goal of objective monitoring of depression severity, we investigate vocal-source biomarkers for depression; specifically, source features that may relate to precision in motor control, including vocal-fold shimmer and jitter, degree of aspiration, fundamental frequency dynamics, and frequency-dependence of variability and velocity of energy. We use a 35-subject database collected by Mundt et al. [1] in which subjects were treated over a six-week period, and investigate correlation of our features with clinical (HAMD), as well as self-reported (QIDS), Total subject assessment scores. To explicitly address the motor aspect of depression, we compute correlations with the Psychomotor Retardation component of clinical and self-reported Total assessments. For our longitudinal database, most correlations point to statistical relationships of our vocal-source biomarkers with psychomotor activity, as well as with depression severity.
Article
35 right-handed White females (18–35 yrs) viewed positive and stress-inducing motion picture films and then reported on their subjective experience. Spontaneous facial expressions provided accurate information about more specific aspects of emotional experience than just the pleasant vs unpleasant distinction. The facial action coding system (P. Ekman and W. V. Friesen, 1978) isolated a particular type of smile that was related to differences in reported happiness between Ss who showed this action and Ss who did not, to the intensity of happiness, and to which of 2 happy experiences was reported as happiest. Ss who showed a set of facial actions hypothesized to be signs of various negative affects reported experiencing more negative emotion than Ss who did not show these actions. How much these facial actions were shown was related to the reported intensity of negative affect. Specific facial actions associated with the experience of disgust are identified.
Article
Vocal tract resonance characteristics in acoustic speech signals are classically tracked using frame-by-frame point estimates of formant frequencies followed by candidate selection and smoothing using dynamic programming methods that minimize ad hoc cost functions. The goal of the current work is to provide both point estimates and associated uncertainties of center frequencies and bandwidths in a statistically principled state-space framework. Extended Kalman (K) algorithms take advantage of a linearized mapping to infer formant and antiformant parameters from frame-based estimates of autoregressive moving average (ARMA) cepstral coefficients. Error analysis of KARMA, wavesurfer, and praat is accomplished in the all-pole case using a manually marked formant database and synthesized speech waveforms. KARMA formant tracks exhibit lower overall root-mean-square error relative to the two benchmark algorithms with the ability to modify parameters in a controlled manner to trade off bias and variance. Antiformant tracking performance of KARMA is illustrated using synthesized and spoken nasal phonemes. The simultaneous tracking of uncertainty levels enables practitioners to recognize time-varying confidence in parameters of interest and adjust algorithmic settings accordingly.
Conference Paper
Understanding how someone is speaking can be equally important to what they are saying when evaluating emotional disorders, such as depression. In this study, we use the acoustic speech signal to analyze variations in prosodic feature statistics for subjects suffering from a depressive disorder. A new sample database of subjects with and without a depressive disorder is collected and pitch, energy, and speaking rate feature statistics are generated at a sentence level and grouped into a series of observations (subset of sentences) for analysis. A common technique in quantifying an observation had been to simply use the average of the feature statistic for the subset of sentences within an observation. However, we investigate the merit of a series of statistical measures as a means of quantifying a subset of feature statistics to capture emotional variations from sentence to sentence within a single observation. Comparisons with the exclusive use of the average show an improvement in overall separation accuracy for other quantifying statistics.
Article
Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing 10 (2000), 19–41. In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance is also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
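A very reduced sketch of the GMM-UBM verification idea described above follows: a universal background model is trained on pooled data, a target model is derived from it (here simply re-fitted with the UBM parameters as initialisation rather than the full Bayesian/MAP adaptation of the paper), and the test score is a per-frame average log-likelihood ratio. Component counts, feature dimensions, and the synthetic data are assumptions.

```python
# Sketch of GMM-UBM style likelihood-ratio scoring (no MAP adaptation).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
background = rng.normal(size=(5000, 20))          # pooled background features
target = rng.normal(0.3, 1.0, size=(800, 20))     # enrollment features
test = rng.normal(0.3, 1.0, size=(300, 20))       # test utterance features

ubm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(background)

# Target model initialised from the UBM parameters, then re-fitted;
# the paper instead uses Bayesian (MAP) adaptation of the UBM.
speaker = GaussianMixture(n_components=16, covariance_type="diag",
                          means_init=ubm.means_,
                          weights_init=ubm.weights_,
                          precisions_init=ubm.precisions_,
                          random_state=0).fit(target)

# Average per-frame log-likelihood ratio: positive favours the target model.
llr = speaker.score(test) - ubm.score(test)
print("log-likelihood ratio:", llr)
```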