Conference Paper

Vocal and Facial Biomarkers of Depression based on Motor Incoordination and Timing

Authors: Williamson, Quatieri, Helfer, Ciccarelli, and Mehta

Abstract

In individuals with major depressive disorder, neurophysiological changes often alter motor control and thus affect the mechanisms controlling speech production and facial expression. These changes are typically associated with psychomotor retardation, a condition marked by slowed neuromotor output that is behaviorally manifested as altered coordination and timing across multiple motor-based properties. Changes in motor outputs can be inferred from vocal acoustics and facial movements as individuals speak. We derive novel multi-scale correlation structure and timing feature sets from audio-based vocal features and video-based facial action units from recordings provided by the 4th International Audio/Video Emotion Challenge (AVEC). The feature sets enable detection of changes in coordination, movement, and timing of vocal and facial gestures that are potentially symptomatic of depression. Combining complementary features in Gaussian mixture model and extreme learning machine classifiers, our multivariate regression scheme predicts Beck depression inventory ratings on the AVEC test set with a root-mean-square error of 8.12 and mean absolute error of 6.31. Future work calls for continued study into detection of neurological disorders based on altered coordination and timing across audio and video modalities.
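The regression scheme described above pools Gaussian mixture model and extreme learning machine (ELM) predictors. As a rough illustration of the ELM component only, the Python sketch below implements a generic single-hidden-layer ELM regressor (random fixed hidden weights, ridge-regression readout); the hidden-layer size, activation, regularization, and synthetic stand-in data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def fit_elm(X, y, n_hidden=200, reg=1e-2, seed=0):
    """Train a basic extreme learning machine (ELM) regressor.

    X: (n_samples, n_features) features, y: (n_samples,) targets such as
    Beck depression inventory scores. Hidden weights are random and fixed;
    only the linear readout is learned via ridge regression.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # hidden activations
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Usage sketch with synthetic features standing in for vocal/facial features.
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(100, 40)), rng.uniform(0, 45, 100)
W, b, beta = fit_elm(X_train, y_train)
y_pred = predict_elm(rng.normal(size=(20, 40)), W, b, beta)
# The abstract's error metrics: RMSE = sqrt(mean((y_pred - y_true)**2)),
# MAE = mean(|y_pred - y_true|).
```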


... The formants are directly related to the articulation of the vocal tract and can therefore be interpreted directly. Formant tracks have also been used successfully as features before, for example when modelling emotion and depression [47]. This method was adopted in our previous work for cognitive workload monitoring in [16]. ...
... The formant features were extracted using the Kalman-based auto-regressive moving average smoothing (KARMA) algorithm [48] as in [47]. The main advantage of using KARMA is that the algorithm produces smoother formant tracks than other methods and it provides a sensible interpolation during non-voiced periods. ...
... The features used to characterise the cardiovascular system are the well-defined blood pressure measures obtained from a Finometer from Finapress [43,44]. The novel voice features presented in this work are derived from the formant track features developed in previous work [16,47] and characterise the vocal tract shape and change in shape. The reason why the formant track features were chosen in contrast to voice source features (e.g. ...
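The excerpts above rely on KARMA (Kalman-based autoregressive moving average) smoothing of formant tracks. KARMA itself is more elaborate, but its basic benefit, smooth tracks with sensible interpolation through unvoiced gaps, can be illustrated with a plain constant-velocity Kalman filter; the noise parameters below are arbitrary assumptions, not the KARMA settings.

```python
import numpy as np

def kalman_smooth_track(z, dt=0.01, q=1e4, r=1e3):
    """Forward Kalman filter over a noisy formant track (Hz).

    z: 1-D array of formant measurements with np.nan marking unvoiced
    frames. State is [frequency, frequency velocity]; missing frames skip
    the measurement update, so the filter interpolates through gaps.
    This is a simple stand-in, not the KARMA algorithm itself.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])            # constant-velocity model
    H = np.array([[1.0, 0.0]])                       # only frequency is observed
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],        # process noise
                      [dt**2 / 2, dt]])
    R = np.array([[r]])                              # measurement noise
    x = np.array([np.nanmean(z), 0.0])               # initial state
    P = np.eye(2) * 1e6
    out = np.empty(len(z))
    for t, zt in enumerate(z):
        x = F @ x                                    # predict
        P = F @ P @ F.T + Q
        if not np.isnan(zt):                         # update only when voiced
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ (np.array([zt]) - H @ x)
            P = (np.eye(2) - K @ H) @ P
        out[t] = x[0]
    return out
```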
Article
Full-text available
Monitoring cognitive workload has the potential to improve both the performance and fidelity of human decision making. However, previous efforts towards discriminating beyond binary levels (e.g., low/high or neutral/high) in cognitive workload classification have not been successful. This lack of sensitivity in cognitive workload measurements might be due to individual differences as well as inadequate methodology used to analyse the measured signal. In this paper, a method that combines the speech signal with cardiovascular measurements for screen and heartbeat classification is introduced. For validation, speech and cardiovascular signals from 97 university participants and 20 airline pilot participants were collected while cognitive stimuli of varying difficulty level were induced with the Stroop colour/word test. For the trinary classification scheme (low, medium, high cognitive workload), the prominent result using classifiers trained on each participant achieved 15.17 ± 0.79% and 17.38 ± 1.85% average misclassification rates, indicating good discrimination at three levels of cognitive workload. Combining cardiovascular and speech measures synchronized to each heartbeat and consolidated with short-term dynamic measures might therefore provide enhanced sensitivity in cognitive workload monitoring. The results show that the influence of individual differences is a limiting factor for a generic classification and highlights the need for research to focus on methods that incorporate individual differences to achieve even better results. This method can potentially be used to measure and monitor workload in real time in operational environments.
... Features derived from short speech samples based on voice characteristics (21,22) were collected via smartphones. The details of the assessment schedule can be found in the Supplementary Material. ...
... Voice analytics has shown promise for detecting symptoms of depression (21,22). Study participants entered sound data through the smartphone application twice per week. ...
... kg/m². The mean (range) of HAM-D total score at screening was 20.4 (17-25) for the patients, and it was 1.2 (0-3) for the healthy controls. The mean (range) of C-SSRS total score at screening was 32.1 (0-77) for the patients, and it was 1.1 (0-8) for the healthy controls. ...
... Beyond security, emotion classification is important in computer vision applications used for video indexing and retrieval, robot motion, entertainment, monitoring of smart home systems [16,17], and neuro-physiological and psychological studies. For instance, emotion classification is important in monitoring the psychological and neuro-physiological condition of individuals with personality trait disorders [18,19], and to monitor and identify people with autism spectral disorders [20]. ...
Article
Full-text available
Emotion classification is a research area in which there has been very intensive literature production concerning natural language processing, multimedia data, semantic knowledge discovery, social network mining, and text and multimedia data mining. This paper addresses the issue of emotion classification and proposes a method for classifying the emotions expressed in multimodal data extracted from videos. The proposed method models multimodal data as a sequence of features extracted from facial expressions, speech, gestures, and text, using a linguistic approach. Each sequence of multimodal data is correctly associated with the emotion by a method that models each emotion using a hidden Markov model. The trained model is evaluated on samples of multimodal sentences associated with seven basic emotions. The experimental results demonstrate a good classification rate for emotions.
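The abstract above models each emotion with a hidden Markov model over multimodal feature sequences. A minimal sketch of that per-class HMM scheme, assuming the hmmlearn package and placeholder emotion labels and feature dimensions:

```python
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def train_emotion_hmms(train_seqs, n_states=4):
    """Fit one Gaussian HMM per emotion.

    train_seqs: dict mapping emotion -> list of (T_i, D) feature sequences,
    e.g., frame-wise facial, speech, gesture and text features concatenated.
    """
    models = {}
    for emo, seqs in train_seqs.items():
        X = np.vstack(seqs)                 # stack all sequences of this emotion
        lengths = [len(s) for s in seqs]    # sequence boundaries for hmmlearn
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[emo] = m
    return models

def classify(models, seq):
    """Return the emotion whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda emo: models[emo].score(seq))
```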
... Previous studies have used a variety of data to determine depression, including voice [3,4], facial expressions [5][6][7][8], behavior [9,10], Electroencephalography (EEG) [11,12] and handwriting [13,14]. However, the above methods involve several financial costs. ...
... However, speech signals are susceptible to external variability factors, which can affect feature reliability [3]. Other speech-related studies, including work by Williamson et al. [4], have examined motor coordination in speech as a potential indicator for inferring depression severity. Facial expression analysis and behavior pattern analysis approaches have also yielded promising results. ...
... In the field of biomarkers, initial attention was focused on hand-crafted features such as Local Phase Quantization (LPQ) [10], Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) [11], and Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) [12] extracted from facial videos. Other features explored include speech, facial action units (FAUs), facial landmarks, head poses, and gazes [13,14]. The development of these hand-crafted features heavily relies on specific knowledge related to the depression recognition task. ...
... Fortunately, with the advent of deep learning [16], a new pathway has been opened for depression recognition tasks. Researchers can train end-to-end deep neural networks using depression-related data including facial video [17][18][19][20], speech [13,21,22] and other sources. These deep neural networks can discover subtle, indistinguishable depression-related features and implicitly discriminate among them for improved prediction [23]. ...
Preprint
While existing depression recognition methods based on deep learning show promise, their practical application is hindered by the lack of trustworthiness, as these deep models are often deployed as black-box models, leaving us uncertain about the confidence of the model predictions. For high-risk clinical applications like depression recognition, uncertainty quantification is essential in decision-making. In this paper, we introduce conformal depression prediction (CDP), a depression recognition method with uncertainty quantification based on conformal prediction (CP), giving valid confidence intervals with theoretical coverage guarantees for the model predictions. CDP is a plug-and-play module that requires neither model retraining nor an assumption about the depression data distribution. As CDP provides only an average performance guarantee across all inputs rather than a per-input performance guarantee, we propose CDP-ACC, an improved conformal prediction method with approximate conditional coverage. CDP-ACC first estimates the prediction distribution through neighborhood relaxation, and then introduces a conformal score function by constructing nested sequences, so as to provide a tighter prediction interval for each specific input. We empirically demonstrate the application of uncertainty quantification in depression recognition, and the effectiveness and superiority of CDP and CDP-ACC on the AVEC 2013 and AVEC 2014 datasets.
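CDP builds on conformal prediction; as background, the sketch below shows the generic split-conformal recipe for regression with an absolute-residual score, which gives marginal (not per-input) coverage. It is not the CDP-ACC refinement described in the preprint, and the regressor `f` in the usage note is a placeholder.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction intervals for a regression model.

    resid_cal: absolute residuals |y - f(x)| on a held-out calibration set;
    y_pred_test: point predictions on new inputs. Returns (lower, upper)
    bounds with roughly (1 - alpha) marginal coverage.
    """
    n = len(resid_cal)
    # Finite-sample-corrected quantile of the nonconformity scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(resid_cal, q_level, method="higher")
    return y_pred_test - q, y_pred_test + q

# Usage sketch: depression-score regressor f, calibration split (X_cal, y_cal):
#   resid_cal = np.abs(y_cal - f(X_cal))
#   lo, hi = split_conformal_interval(resid_cal, f(X_test), alpha=0.1)
```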
... By using machine learning methods, the accuracy of detecting depression and predicting severity scores has been compared. Williamson achieved a correlation coefficient of 0.56 for BDI scores with an analysis centered on formant frequencies, 0.53 with an analysis centered on Mel-cepstrum frequency differences, and 0.7 using both [14]. ...
... Based on prior studies about depression and speech [7,8,11,13,14], we selected the acoustic features used to train the prediction model in Table 3. NAQ and QOQ are features that can quantify the state of the vocal cords from the voice. Fast Fourier transform (FFT) band power is a feature that shows the power of the voice in each of three bands, namely: 0-500 Hz, 500-1000 Hz, and 1000-4000 Hz. ...
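To make the band-power feature in the excerpt concrete, here is a minimal numpy sketch of FFT band power over the three stated bands (0-500, 500-1000, and 1000-4000 Hz); the windowing choice is an assumption, not the cited study's exact recipe.

```python
import numpy as np

def fft_band_powers(frame, sr, bands=((0, 500), (500, 1000), (1000, 4000))):
    """Power of a mono speech frame in fixed frequency bands via the FFT.

    frame: 1-D samples, sr: sample rate in Hz. Returns one power value per
    band, here the three bands mentioned in the excerpt above.
    """
    windowed = frame * np.hanning(len(frame))
    power_spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return [power_spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
```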
Article
Full-text available
Voice-based depression detection methods have been studied worldwide as an objective and easy way to detect depression. Conventional studies estimate the presence or severity of depression. However, estimating symptoms is a necessary technique not only to treat depression, but also to relieve patients’ distress. Hence, we studied a method for clustering symptoms from the HAM-D scores of depressed patients and for assigning patients to the resulting symptom groups based on acoustic features of their speech. We could separate the different symptom groups with an accuracy of 79%. The results suggest that acoustic features of speech can be used to estimate the symptoms associated with depression.
... More recently, case-control studies of adult populations have used audio-based feature extraction techniques to classify cognitive state following head injury [10]- [13]. Related studies have used acoustic features to examine similar types of speech degradation due to ALS [7], [8], [14], depression [15]- [18], Parkinson's disease [7], cognitive load [19] and diagnosed dysarthria and dysphonia [7], [20]- [22]. ...
... Changes over time in the coupling strengths among the formant tracks cause changes in the eigenvalue spectra of the resulting correlation matrices; weakly coupled formant-tracks may indicate more complex interactions between the articulators. Williamson et al. first applied this multivariate correlation approach to epileptic seizure prediction from multichannel EEG [33] and subsequently to the tracking and prediction of major depressive disorder from audio-based vocal signals [15], [34]. ...
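The correlation-structure idea summarized in the excerpt can be sketched in a few lines: stack time-delayed copies of each channel (formant tracks, or facial action unit time series), form the channel-delay correlation matrix, and use its sorted eigenvalue spectrum as the feature vector. The delay count and spacing below are illustrative, not the settings used in the paper.

```python
import numpy as np

def channel_delay_eigenspectrum(signals, n_delays=15, delay_step=1):
    """Eigenvalue spectrum of a channel-delay correlation matrix.

    signals: (n_channels, n_frames) array, e.g., formant tracks or facial
    action unit series. Each channel is augmented with time-delayed copies;
    the correlation matrix of the stacked set summarizes coupling across
    channels and time scales.
    """
    n_ch, n_frames = signals.shape
    max_shift = (n_delays - 1) * delay_step
    rows = []
    for ch in range(n_ch):
        for d in range(n_delays):
            shift = d * delay_step
            rows.append(signals[ch, shift:n_frames - max_shift + shift])
    stacked = np.vstack(rows)                  # (n_ch * n_delays, n_valid_frames)
    corr = np.corrcoef(stacked)                # channel-delay correlation matrix
    return np.linalg.eigvalsh(corr)[::-1]      # eigenvalues, largest first

# A larger spread between the biggest and smallest eigenvalues indicates
# stronger (lower-dimensional) coupling among the channels.
```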
Preprint
Full-text available
Recommendations for common outcome measures following pediatric traumatic brain injury (TBI) support the integration of instrumental measurements alongside perceptual assessment in recovery and treatment plans. A comprehensive set of sensitive, robust and non-invasive measurements is therefore essential in assessing variations in speech characteristics over time following pediatric TBI. In this article, we study the changes in the acoustic speech patterns of a pediatric cohort of ten subjects diagnosed with severe TBI. We extract a diverse set of both well-known and novel acoustic features from child speech recorded throughout the year after the child produced intelligible words. These features are analyzed individually and by speech subsystem, within-subject and across the cohort. As a group, older children exhibit highly significant (p<0.01) increases in pitch variation and phoneme diversity, shortened pause length, and steadying articulation rate variability. Younger children exhibit similar steadied rate variability alongside an increase in formant-based articulation complexity. Correlation analysis of the feature set with age and comparisons to normative developmental data confirm that age at injury plays a significant role in framing the recovery trajectory. Nearly all speech features significantly change (p<0.05) for the cohort as a whole, confirming that acoustic measures supplementing perceptual assessment are needed to identify efficacious treatment targets for speech therapy following TBI.
... Acoustic Representations for Depression Detection: Depression is shown to degrade cognitive planning and psychomotor functioning, thus affecting the human speech production mechanism (Cummins et al. 2015). These effects manifest as variations in the speech voice quality (Williamson et al. 2014) and several features have been proposed to capture these variations in speech for depression detection. Spectral features such as formants and mel-frequency cepstral coefficients (MFCCs), prosodic features such as F0, jitter, shimmer and glottal features were initially used for depression detection (Low et al. 2010; Cummins et al. 2011; Simantiraki et al. 2017). ...
... Spectral, prosodic and other voice quality related features extracted using OpenSMILE (Eyben, Wöllmer, and Schuller 2010) and COVAREP (Degottex et al. 2014) toolkits were also used for depression analysis (Valstar et al. 2016; Al Hanai, Ghassemi, and Glass 2018). Further, features developed based on speech articulation such as vocal tract coordination features were analyzed for depression detection (Williamson et al. 2014; Huang, Epps, and Joachim 2020; Seneviratne et al. 2020). Recently, sentiment and emotion embeddings, representing non-verbal characteristics of speech, were used for depression severity estimation (Dumpala et al. 2021a). ...
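As a concrete, lightweight stand-in for the spectral and prosodic feature families named above (not the OpenSMILE or COVAREP feature definitions themselves), the sketch below extracts MFCC statistics and an F0 contour with librosa; the sampling rate and pitch range are assumptions.

```python
import numpy as np
import librosa

def basic_spectral_prosodic_features(wav_path):
    """Per-utterance spectral (MFCC) and prosodic (F0) summary features."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, n_frames)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)           # F0 contour in Hz
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),                # spectral shape stats
        [f0.mean(), f0.std()],                              # pitch statistics
    ])
```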
Conference Paper
Full-text available
Depression detection from speech has attracted a lot of attention in recent years. However, the significance of speaker-specific information in depression detection has not yet been explored. In this work, we introduce—and analyze the significance of—speaker embeddings in a temporal context for the task of depression detection from speech. Experimental results show that the speaker embeddings provide important cues to achieve state-of-the-art performance in depression detection. We also show that combining conventional OpenSMILE and COVAREP features, which carry complementary information, with speaker embeddings further improves the depression detection performance. The significance of the temporal context in the training of deep learning models for depression detection is also analyzed in this paper.
... Therefore, in the early stage of the studies of SDR, the main work is to learn acoustic features related to depression and explore feature sets for better performance [22,23]. In the meantime, traditional machine learning algorithms are employed in SDR such as Support Vector Machine (SVM) [24][25][26][27], Hidden Markov Model [28], Gaussian Mixture Model (GMM) [27,29,30], K-means [31,32], Boosting Logistic Regression [33][34][35], multi-layer perceptron [30,35], etc. ...
... As a clustering algorithm, it is employed in early research of SDR [27,30,59,69]. Moreover, GMM-based regression methods such as Gaussian Staircase Regression (GSR) have been proposed, where each GMM consists of an ensemble of Gaussian classifiers [29,54,61,62]. Specifically, speech features are first mapped to different partitions of the clinical depression score, and then the mapping results are used as the basis of the regression analysis. ...
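A rough sketch of the Gaussian-staircase idea as summarized above: one low/high pair of Gaussian mixture classifiers per partition boundary of the clinical score, with the per-boundary posteriors pooled into a score estimate. The thresholds, mixture sizes, and pooling rule here are illustrative assumptions, not the published configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gaussian_staircase(X, y, thresholds=(5, 10, 15, 20, 25, 30), n_comp=2):
    """Fit one low/high GMM pair per threshold of the depression score.

    Assumes each side of every threshold has enough training samples.
    """
    ensemble = []
    for t in thresholds:
        gm_low = GaussianMixture(n_components=n_comp).fit(X[y < t])
        gm_high = GaussianMixture(n_components=n_comp).fit(X[y >= t])
        ensemble.append((gm_low, gm_high))
    return ensemble

def predict_gaussian_staircase(X, ensemble, y_min=0, y_max=45):
    """Pool per-threshold 'high' posteriors into a depression-score estimate."""
    votes = []
    for gm_low, gm_high in ensemble:
        ll = np.column_stack([gm_low.score_samples(X), gm_high.score_samples(X)])
        ll -= ll.max(axis=1, keepdims=True)              # numerical stability
        votes.append(np.exp(ll[:, 1]) / np.exp(ll).sum(axis=1))
    frac_high = np.mean(votes, axis=0)   # fraction of staircase steps exceeded
    return y_min + frac_high * (y_max - y_min)
```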
Article
Full-text available
Depression has become one of the most common mental illnesses in the world. For better prediction and diagnosis, methods of automatic depression recognition based on speech signals are constantly proposed and updated, with a transition from the early traditional methods based on hand-crafted features to the application of deep learning architectures. This paper systematically and precisely outlines the most prominent and up-to-date research on automatic depression recognition by intelligent speech signal processing so far. Furthermore, methods for acoustic feature extraction, algorithms for classification and regression, as well as end-to-end deep models are investigated and analysed. Finally, general trends are summarised and key unresolved issues are identified to be considered in future studies of automatic speech depression recognition.
... Notably, while the audio-visual multimodal method presented in paper 41 was the winning entry of the 2014 AVEC Challenge, its performance does not exceed that of the single-modal prediction method we proposed. This finding further underscores the superiority of our approach. ...
Article
Full-text available
The World Health Organization predicts that by 2030, depression will be the most common mental disorder, significantly affecting individuals, families, and society. Speech, as a sensitive indicator, reveals noticeable acoustic changes linked to physiological and cognitive variations, making it a crucial behavioral marker for detecting depression. However, existing studies often overlook the separation of speaker-related and emotion-related features in speech when recognizing depression. To tackle this challenge, we propose a Mixture-of-Experts (MoE) method that integrates speaker-related and emotion-related features for depression recognition. Our approach begins with a Time Delay Neural Network to pre-train a speaker-related feature extractor using a large-scale speaker recognition dataset while simultaneously pre-training a speaker’s emotion-related feature extractor with a speech emotion dataset. We then apply transfer learning to extract both features from a depression dataset, followed by fusion. A multi-domain adaptation algorithm trains the MoE model for depression recognition. Experimental results demonstrate that our method achieves 74.3% accuracy on a self-built Chinese localized depression dataset and an MAE of 6.32 on the AVEC2014 dataset. Thus, it outperforms state-of-the-art deep learning methods that use speech features. Additionally, our approach shows strong performance across Chinese and English speech datasets, highlighting its effectiveness in addressing cultural variations.
... For example, Meng et al. introduced a layered system utilizing Motion History Histogram features, and Nasir et al. employed a multi-resolution model combining audio and video features to diagnose depression more effectively [15,17]. Williamson et al. proposed a system that harnesses speech, prosody, and facial action units to assess depression severity, illustrating the value of multimodal integration [25]. ...
Preprint
Full-text available
In this work, we introduce the TriFusion Network, an innovative deep learning framework designed for the simultaneous analysis of auditory, visual, and textual data to accurately assess emotional states. The architecture of the TriFusion Network is uniquely structured, featuring both independent processing pathways for each modality and integrated layers that harness the combined strengths of these modalities to enhance emotion recognition capabilities. Our approach addresses the complexities inherent in multimodal data integration, with a focus on optimizing the interplay between modality-specific features and their joint representation. Extensive experimental evaluations on the challenging AVEC Sentiment Analysis in the Wild dataset highlight the TriFusion Network's robust performance. It significantly outperforms traditional models that rely on simple feature-level concatenation or complex score-level fusion techniques. Notably, the TriFusion Network achieves Concordance Correlation Coefficients (CCC) of 0.606, 0.534, and 0.170 for the arousal, valence, and liking dimensions respectively, demonstrating substantial improvements over existing methods. These results not only confirm the effectiveness of the TriFusion Network in capturing and interpreting complex emotional cues but also underscore its potential as a versatile tool in real-world applications where accurate emotion recognition is critical.
... Gaussian Mixture Model (GMM)-based approaches marked a significant advancement, with Williamson et al. [35], [36] using formant frequencies and delta-mel-cepstra to depict vocal tract shape and dynamics and applying a Gaussian Staircase Model for regression. Cummins et al. introduced a GMM-UBM model to amalgamate audio and visual information [37], and Jain et al. utilized GMM (Fisher Vector) to merge features extracted from multiple video segments [38]. ...
Preprint
Mood disorders, including depression and anxiety, often manifest through facial expressions. While previous research has explored the connection between facial features and emotions, machine learning algorithms for estimating mood disorder severity have been hindered by small datasets and limited real-world application. To address this gap, we analyzed facial videos of 11,427 participants, a dataset two orders of magnitude larger than previous studies. This comprehensive collection includes standardized facial expression videos from reading tasks, along with a detailed psychological scale that measures depression, anxiety, and stress. By examining the relationships among these emotional states and employing clustering analysis, we identified distinct subgroups embodying different emotional profiles. We then trained tree-based classifiers and deep learning models to estimate emotional states from facial features. Results indicate that models previously effective on small datasets experienced decreased performance when applied to our large dataset, highlighting the importance of data scale and mitigating overfitting in practical settings. Notably, our study identified subtle shifts in pupil dynamics and gaze orientation as potential markers of mood disorders, providing valuable information on the interaction between facial expressions and mental health. This research marks the first large-scale and comprehensive investigation of facial expressions in the context of mental health, laying the groundwork for future data-driven advancements in this field.
... Our choice of a standardized feature set worked well in this setting, but may fail to work for differential voice disorder diagnosis or when generalizing to larger datasets, which may bring in additional sources of variance unaccounted for in this dataset. With the availability of more data, additional features could be extracted that better capture changes in coordination (e.g., XCORR [69]). ...
Article
Full-text available
Detecting voice disorders from voice recordings could allow for frequent, remote, and low-cost screening before costly clinical visits and a more invasive laryngoscopy examination. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction to increase trust, and to determine model performance relative to clinician performance. Patients with confirmed UVFP through endoscopic examination (N = 77) and controls with normal voices matched for age and sex (N = 77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel "a". Four machine learning models of differing complexity were used. SHapley Additive exPlanations (SHAP) was used to identify important features. The highest median bootstrapped ROC AUC score was 0.87 and beat clinicians’ performance (range: 0.74–0.81) based on the recordings. Recording durations differed between UVFP recordings and controls due to how the data were originally processed for storage, and we show that this difference alone can classify the two groups. Counterintuitively, many UVFP recordings had higher intensity than controls, even though UVFP patients tend to have weaker voices, revealing a dataset-specific bias which we mitigate in an additional analysis. We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which the models used to improve classification. Furthermore, clinicians’ ratings provide further evidence that patients were over-projecting their voices and being recorded at a higher amplitude signal than controls. Interestingly, after matching audio duration and removing variables associated with intensity in order to mitigate the biases, the models were able to achieve a similarly high performance. We provide a set of recommendations to avoid bias when building and evaluating machine learning models for screening in laryngology.
... Subsequent works have used paralinguistic speech processing (PSP) algorithms that applied machine learning in order to combine a multitude of speech parameters and thereby improve the assessment of depressive symptom severity (e.g., [31][32][33][34][35]) and diagnostic status of MDD (e.g., [36][37][38]). Machine learning enables the development and utilization of complex, nonlinear algorithmic models that have been trained to predict output variables ("labels," e.g., diagnostic status) by a large number of input variables ("features," e.g., speech parameters; see [25]). ...
Article
Full-text available
New developments in machine learning-based analysis of speech can be hypothesized to facilitate the long-term monitoring of major depressive disorder (MDD) during and after treatment. To test this hypothesis, we collected 550 speech samples from telephone-based clinical interviews with 267 individuals in routine care. With this data, we trained and evaluated a machine learning system to identify the absence/presence of a MDD diagnosis (as assessed with the Structured Clinical Interview for DSM-IV) from paralinguistic speech characteristics. Our system classified diagnostic status of MDD with an accuracy of 66% (sensitivity: 70%, specificity: 62%). Permutation tests indicated that the machine learning system classified MDD significantly better than chance. However, deriving diagnoses from cut-off scores of common depression scales was superior to the machine learning system with an accuracy of 73% for the Hamilton Rating Scale for Depression (HRSD), 74% for the Quick Inventory of Depressive Symptomatology–Clinician version (QIDS-C), and 73% for the depression module of the Patient Health Questionnaire (PHQ-9). Moreover, training a machine learning system that incorporated both speech analysis and depression scales resulted in accuracies between 73 and 76%. Thus, while findings of the present study demonstrate that automated speech analysis shows the potential of identifying patterns of depressed speech, it does not substantially improve the validity of classifications from common depression scales. In conclusion, speech analysis may not yet be able to replace common depression scales in clinical practice, since it cannot yet provide the necessary accuracy in depression detection. This trial is registered with DRKS00023670.
... The study of speech biomarkers in mental health holds great potential, offering a non-invasive and easily accessible avenue to capture significant motor, cognitive and behavioral changes due to mental health disorders such as depression and anxiety [10][11][12][13]. Clinical evidence and research studies have increasingly linked specific automated extracted speech features, such as prosody, articulation, and fluency, with various mental health conditions, including depression [10,14], anxiety [15], suicide-risk assessment [16], fatigue [17,18], or sleep deprivation [19]. The complexity of human speech extends beyond the intricate motor coordination involved. ...
Preprint
Full-text available
Background: While speech analysis holds promise for mental health assessment, research often focuses on single symptoms, despite symptom co-occurrences and interactions. In addition, predictive models in Mental Health do not properly assess speech-based systems' limitations, such as uncertainty, or fairness for a safe clinical deployment. Objective: We investigated the predictive potential of mobile-collected speech data for detecting and estimating depression, anxiety, fatigue, and insomnia, focusing beyond mere accuracy, in the general population. Methods: We included n=435 healthy adults and recorded their answers concerning their perceived mental and sleep states. We asked them how they felt and if they had slept well lately. Clinically validated questionnaires measured depression, anxiety, insomnia, and fatigue severity. We developed a novel speech and machine learning pipeline involving voice activity detection, feature extraction, and model training. We detected voice activity automatically with a bidirectional neural network and examined participants' speech with a fully ML automatic pipeline to capture speech variability. Then, we modelled speech with a ThinResNet model that was pre-trained on a large open free database. Based on this speech modelling, we evaluated clinical threshold detection, individual score prediction, model uncertainty estimation, and performance fairness across demographics (age, sex, education). We employed a train-validation-test split for all evaluations: to develop our models, select the best ones and assess the generalizability of held-out data. Results: Our methods achieved high detection performance for all symptoms, particularly depression (PHQ-9 AP=0.77, BDI AP=0.83), insomnia (AIS AP=0.86), and fatigue (MFI Total Score AP=0.88). These strengths were maintained while ensuring high abstention rates for uncertain cases (Risk-Coverage AUCs < 0.1). Individual symptom scores were predicted with good accuracy (Correlations were all significant, with Pearson strengths between 0.59 and 0.74). Fairness analysis revealed that models were consistent for sex (average Disparity Ratio (DR) = 0.77), to a lesser extent for education level (average Disparity Ratio (DR) = 0.44) and worse for age groups (average Disparity Ratio (DR) = 0.26). Conclusions: This study demonstrates the potential of speech-based systems for multifaceted mental health assessment in the general population, not only for detecting clinical thresholds but also for estimating their severity. Addressing fairness and incorporating uncertainty estimation with selective classification are key contributions that can enhance the clinical utility and responsible implementation of such systems. This approach offers promise for more accurate and nuanced mental health assessments, potentially benefiting both patients and clinicians.
... The researchers combined Gaussian Staircase Regression with extreme learning machine (ELM) classifiers and obtained a test RMSE of 8.12. Two approaches were used for feature extraction: hand-crafted feature extraction and deep learning-based feature extraction [25]. 1. Hand-crafted Feature Extraction: in this approach, two kinds of descriptors were adopted. ...
Research
Full-text available
One in 24 people suffers from a critical mental illness such as Schizophrenia, Psychosis, Clinical Depression, Anxiety Disorder, Obsessive Compulsive Disorder (OCD), Autism, Bipolar Disorder, or Attention Deficit Hyperactivity Disorder (ADHD). Prior work found that the average vector similarity between adjacent sentences in free speech, along with other variables such as the number of words/phrases, pauses, tone, intensity, frequency, and other low-level descriptors from the raw audio recording, could be used to identify clinically high-risk patients with great accuracy. Audio and visual hallucinations and thought insertion appear to be the top symptoms in patients suffering from Schizophrenia [3]. Acoustic studies comparing healthy and depressed individuals [4] show that the top audio features that help identify depression in mental illnesses are Loudness, MFCC5 and MFCC7. One of the studies dealing with "Automated Depression Detection using Audio Features" [5] suggests that the lack of objective clinical depression assessment methods is a key reason that many patients cannot be treated appropriately on time. This study aims to find an optimal approach to calculating depression scores among people suffering from mental illnesses using Artificial Intelligence techniques.
... Current research focuses on applying deep learning [8] to depression detection, which is more effective than hand-crafted [6], [39] feature extraction. Most of the research data on depression detection is not publicly available, and only a few small datasets in laboratory scenarios are publicly available (AVEC2019 [40], etc.). ...
Article
Full-text available
Depression is one of the most common mental illnesses, but few of the currently proposed deep models based on social media data take into account both temporal and spatial information in the data for the detection of depression. In this paper, we present an efficient, low-covariance multimodal integrated spatio-temporal transformer framework called DepMSTAT, which aims to detect depression using acoustic and visual features in social media data. The framework consists of four modules: a data preprocessing module, a token generation module, a Spatial-Temporal Attentional Transformer (STAT) module, and a depression classifier module. To efficiently capture spatial and temporal correlations in multimodal social media depression data, a plug-and-play STAT module is proposed. The module is capable of extracting unimodal spatio-temporal features and fusing unimodal information, playing a key role in the analysis of acoustic and visual features in social media data. Through extensive experiments on a depression database (D-Vlog), the method in this paper shows high accuracy (71.53%) in depression detection, achieving a performance that exceeds most models. This work provides a scaffold for studies based on multimodal data that assist in the detection of depression.
... Analysis of clinical situations in which sufferers attempted to conceal their suicidal intent by claiming they were not depressed, yet revealed micro-expressions, shows significant emotions of pessimism that the patient was trying to conceal. By observing the patient's micro-expressions, the doctor can accurately assess the patient's psychological condition, which enables the doctor to assess the patient's physical pain as well as any mental trauma and further implement effective treatment [13]. Consequently, the research community has used different methods such as normalized optical flow [14], facial action units [15], local binary patterns, edge orientation histograms, and motion history histograms to recognize facial expressions, which in turn help in the detection of depressive disorders [16]. ...
... In addition to analysing speech and voice alone, there are also studies that combine, for example, the measurement of voice and facial expression features 37 . One such study 38 used the OpenFace software library, whose computer version is based on Ekman and Friesen's Facial Action Coding System (FACS) 39,40 . For prosodic parameters, they also chose the mean fundamental frequency, variability in fundamental frequency and loudness as acoustic features. ...
Article
Full-text available
This explorative study of patients with chronic schizophrenia aimed to clarify whether group art therapy followed by a therapist-guided picture review could influence patients’ communication behaviour. Data on voice and speech characteristics were obtained via objective technological instruments, and these characteristics were selected as indicators of communication behaviour. Seven patients were recruited to participate in weekly group art therapy over a period of 6 months. Three days after each group meeting, they talked about their last picture during a standardized interview that was digitally recorded. The audio recordings were evaluated using validated computer-assisted procedures, the transcribed texts were evaluated using the German version of the LIWC2015 program, and the voice recordings were evaluated using the audio analysis software VocEmoApI. The dual methodological approach was intended to form an internal control of the study results. An exploratory factor analysis of the complete sets of output parameters was carried out with the expectation of obtaining typical speech and voice characteristics that map barriers to communication in patients with schizophrenia. The parameters of both methods were thus processed into five factors each, i.e., into a quantitative digitized classification of the texts and voices. The factor scores were subjected to a linear regression analysis to capture possible process-related changes. Most patients continued to participate in the study. This resulted in high-quality datasets for statistical analysis. To answer the study question, two results were summarized: First, text analysis factor called Presence proved to be a potential surrogate parameter for positive language development. Second, quantitative changes in vocal emotional factors were detected, demonstrating differentiated activation patterns of emotions. These results can be interpreted as an expression of a cathartic healing process. The methods presented in this study make a potentially significant contribution to quantitative research into the effectiveness and mode of action of art therapy.
... For instance, it has been proposed by Thomas Quatieri's group that the vocal tract coordination information can be very informative for different speaker traits such as depression (Williamson et al., 2014), Parkinson's Disease (Smith et al., 2017) or cognitive load (Quatieri et al., 2015). This method consists in the computation of auto-correlations and cross-correlations of MFCCs of a given speech sequence. ...
Thesis
Neurodegenerative diseases are a major social issue and public health priority worldwide. Huntington Disease (HD) is a rare disease of genetic origin that causes cognitive, behavioural and motor disorders due to brain lesions, in particular in the striatum. People with the genetic mutation of HD have a pre-symptomatic phase of several decades during which they have no neurological disorder before the symptomatic phase occurs. The symptoms of this disease have many implications in the life activities of the patient, with a gradual loss of autonomy, until the death of the patient. This makes HD a potential model of neurodegenerative diseases that could lead to the development of new clinical monitoring tools. The current medical monitoring in HD is expensive and requires the patient to travel regularly to the hospital, generating a significant human and financial burden. The purpose of this thesis is to develop and validate new computational methods for automatically monitoring Huntington’s Disease individuals, thanks to the analysis of their spoken language productions. Spoken language production invokes various cognitive, social and motor skills, and its realisation is influenced by the mental state of the individual. Our hypothesis is that through the inspection of the produced speech and its content we can assess these different skills and states. To this date, the analysis of spoken language disorders in HD is only performed in a few clinical departments and specialised research teams, at a small scale without classic clinical validation. In addition, the potential of spoken language markers to predict the different symptoms in HD has not been explored.

Therefore in this thesis, we designed a comprehensive spoken language battery, along with a complete annotation protocol that is parsable by a computer program. This battery measures different parameters to obtain a wide clinical picture of spoken language in HD, varying the linguistic target, the cognitive load, the emotional content, the topics and the materials of the discourse. To speed up the annotation protocol, we designed and developed open-source software to manage linguistic annotation campaigns. This allowed us to collect what is, to the best of our knowledge, the largest database of fine-grained annotated spoken language productions in HD, with 125 annotated interviews of 3 groups of individuals: healthy controls, premanifest individuals carrying the gene that causes HD, and manifest HD at different stages. Besides, we also formalized and implemented the tracks of communication introduced by H. Clark, which allow analyzing the use of spoken language in spontaneous exchanges for HD individuals. Then, to speed up and automate the annotation process, we developed and validated machine learning methods to recognise turn-takings and identify these tracks of communication directly from speech. Finally, thanks to this new database, we assessed the capabilities of spoken language markers to predict the different symptoms in HD. We especially found out that rhythm and articulatory markers extracted from tasks with a cognitive load can predict accurately the global, motor, functional and cognitive components of the disease. We additionally found significant correlations between silence statistics and the volume of the striatum, the neuro-anatomical hallmark of the disease progress. In spontaneous productions, we found that the ratio of tracks of communication was different between HD individuals and other groups. The primary track was diminished, the timing ratio of secondary presentation (filled pauses) also decreased, and the timing of incidental elements (e.g., vocal noises, audible respiration) greatly increased. We also proposed new methodologies to examine emotional speech production in HD. Finally, we found that manifest individuals with HD have both vocal and linguistic impairments during emotional speech production.
... The cohort of individuals in this dataset, both individuals with a history of TBI and controls, have also been assessed for additional comorbidities, such as post traumatic stress disorder, depression, mood disorders, and sleep disorders. As with the current paper comparing individuals with and without TBI, MDD can manifest as low complexity in speech motor coordination [20,26]. Previous studies have primarily focused on individuals who have a single diagnosis, without documented comorbidities. ...
... Automatic depression detection (ADD) has gained popularity with the advent of publicly available data sets [21] and the power of ML techniques to learn complex patterns. Among speech-based methods, previous studies have focused more on using handcrafted acoustic features, such as prosody [13], formant [22], and cepstral [23] features, and then classifying patterns using ML algorithms, such as support vector machine (SVM) [24], logistic regression [25], and random forest (RF) [26]. These studies have suggested that acoustic features are closely related to depression. ...
Article
Full-text available
Background: Automatic diagnosis of depression based on speech can complement mental health treatment methods in the future. Previous studies have reported that acoustic properties can be used to identify depression. However, few studies have attempted a large-scale differential diagnosis of patients with depressive disorders using acoustic characteristics of non-English speakers. Objective: This study proposes a framework for automatic depression detection using large-scale acoustic characteristics based on the Korean language. Methods: We recruited 153 patients who met the criteria for major depressive disorder and 165 healthy controls without current or past mental illness. Participants' voices were recorded on a smartphone while performing the task of reading predefined text-based sentences. Three approaches were evaluated and compared to detect depression using data sets with text-dependent read speech tasks: conventional machine learning models based on acoustic features, a proposed model that trains and classifies log-Mel spectrograms by applying a deep convolutional neural network (CNN) with a relatively small number of parameters, and models that train and classify log-Mel spectrograms by applying well-known pretrained networks. Results: The acoustic characteristics of the predefined text-based sentence reading automatically detected depression using the proposed CNN model. The highest accuracy achieved with the proposed CNN on the speech data was 78.14%. Our results show that the deep-learned acoustic characteristics lead to better performance than those obtained using the conventional approach and pretrained models. Conclusions: Checking the mood of patients with major depressive disorder and detecting the consistency of objective descriptions are very important research topics. This study suggests that the analysis of speech data recorded while reading text-dependent sentences could help predict depression status automatically by capturing the characteristics of depression. Our method is smartphone based, is easily accessible, and can contribute to the automatic identification of depressive states.
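The CNN input described above is a log-Mel spectrogram; a minimal librosa sketch of producing one from a read-speech recording follows, with frame and mel settings that are illustrative defaults rather than the study's parameters.

```python
import numpy as np
import librosa

def log_mel_spectrogram(wav_path, sr=16000, n_mels=64, n_fft=512, hop_length=160):
    """Compute a log-Mel spectrogram (n_mels x n_frames, in dB) as CNN input."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```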
... Williamson et al. [59] utilized feature sets derived from facial movements and acoustic verbal cues to detect psychomotor retardation. They employed Principal component analysis for dimensionality reduction and then applied the Gaussian mixture model to classify the combination of principal feature vectors. ...
Article
The problem of detecting depression is multi-faceted because of variability in depressive symptoms caused by individual differences. The variations can be seen in historical information (like decreased physical activity, etc.) and also in verbal/non-verbal behaviors (like lower pitch, downward eye gaze, etc.). The primary goal of this research is to develop a novel classification system for diagnosing depression that considers both historical information and verbal/non-verbal behaviors. For this purpose, we created a real-world multimodal dataset of depressed and non-depressed subjects with fourteen-day real-time smartphone usage records and audio-visual recordings. We extracted numerous features related to physiological/physical activity from smartphone usage records to capture historical information, and features like pitch and eye gaze (verbal and non-verbal manifestations) from audio-visual cues. We experimented with early fusion using a Decision Tree classifier (along with several feature selection strategies) and a Support Vector Machine (SVM) classifier with several late fusion methods. Then, we conducted a comparative study of both fusion strategies. Our findings showed that the SVM classifier using the late fusion strategy achieves the best accuracy of 89%. In addition, a popular benchmarking multimodal dataset (the DAIC-WOZ database) is used to further validate the effectiveness of our approach by fusing multi-faceted feature vectors for depression detection.
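A minimal sketch of the late-fusion strategy described in the abstract: one SVM per modality, with the per-modality probabilities averaged before thresholding. The modality names, weights, and label convention (1 = depressed) are placeholders, not the study's exact setup.

```python
import numpy as np
from sklearn.svm import SVC

def train_late_fusion(modality_features, labels):
    """Train one SVM per modality (e.g., smartphone usage, audio, visual)."""
    return {name: SVC(kernel="rbf", probability=True).fit(X, labels)
            for name, X in modality_features.items()}

def predict_late_fusion(models, modality_features, weights=None):
    """Average per-modality probabilities of the positive (depressed) class."""
    names = list(models)
    weights = weights or {n: 1.0 / len(names) for n in names}
    p = sum(weights[n] * models[n].predict_proba(modality_features[n])[:, 1]
            for n in names)
    return (p >= 0.5).astype(int)
```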
... To overcome potential concealments of real emotional status in rating depression and other emotions using common scales, novel techniques on facial expression and text content are commonly used. Severity of depression was correlated with facial expression features which were identified via movements of expression muscles (7). A machine learning model for the assessment of depression severity was built by extracting facial expression features (8). ...
Article
Full-text available
Background Emotional disturbance is an important risk factor of suicidal behaviors. To ensure speech emotion recognition (SER), a novel technique to evaluate emotional characteristics of speech, precision in labeling emotional words is a prerequisite. Currently, a list of suicide-related emotional word is absent. The aims of this study were to establish an Emotional Words List for Suicidal Risk Assessment (EWLSRA) and test the reliability and validity of the list in a suicide-related SER task. Methods Suicide-related emotion words were nominated and discussed by 10 suicide prevention professionals. Sixty-five tape-recordings of calls to a large psychological support hotline in China were selected to test psychometric characteristics of the EWLSRA. Results The results shows that the EWLSRA consists of 11 emotion words which were highly associated with suicide risk scores and suicide attempts. Results of exploratory factor analysis support one-factor model of this list. The Fleiss’ Kappa value of 0.42 indicated good inter-rater reliability of the list. In terms of criteria validities, indices of despair (Spearman ρ = 0.54, P < 0.001), sadness (ρ = 0.37, P = 0.006), helplessness (ρ = 0.45, P = 0.001), and numbness (ρ = 0.35, P = 0.009) were significantly associated with suicidal risk scores. The index of the emotional word of numbness in callers with suicide attempt during the 12-month follow-up was significantly higher than that in callers without suicide attempt during the follow-up (P = 0.049). Conclusion This study demonstrated that the EWLSRA has adequate psychometric performance in identifying suicide-related emotional words of recording of hotline callers to a national wide suicide prevention line. This list can be useful for SER in future studies on suicide prevention.
... More importantly, most of the predicted values approached the true labels. (Gupta et al., 2014;Jain, Crowley, Dey, & Lux, 2014;Jan, Meng, Gaus, Zhang, & Turabzadeh, 2014;Mitra et al., 2014;Pérez Espinosa et al., 2014;Sidorov & Minker, 2014;Williamson, Quatieri, Helfer, Ciccarelli, & Mehta, 2014). A denotes audio modality, and V represents the video modality. ...
Article
Depression has been considered the most dominant mental disorder over the past few years. To help clinicians effectively and efficiently estimate the severity scale of depression, various automated systems based on deep learning have been proposed. To estimate the severity of depression, i.e., the depression severity score (Beck Depression Inventory–II), various deep architectures have been designed to perform regression using the Euclidean loss. However, they do not consider the label distribution, and they do not learn the relationships between the facial images and BDI–II scores, which can result in noisy labeling for automatic depression estimation (ADE). To mitigate this problem, we propose an automated deep architecture, namely the self-adaptation network (SAN), to improve this uncertain labeling for ADE. Specifically, the architecture consists of four modules: (1) ResNet-18 and ResNet-50 are adopted in the deep feature extraction module (DFEM) to extract informative deep features; (2) a self-attention module (SAM) is adopted to learn the weights from the mini-batch; (3) a square ranking regularization module (SRRM) is proposed to create high partitions and low partitions; and (4) a re-label module (RM) is used to re-label the uncertain annotations for ADE in the low partitions. We conduct extensive experiments on depression databases (i.e., AVEC2013 and AVEC2014) and obtain a performance comparable to the performances of other ADE methods in assessing the severity of depression. More importantly, the proposed method can learn valuable depression patterns from facial videos and obtain a performance comparable to the performances of other methods for depression recognition.
... Williamson et al. [59] utilized feature sets derived from facial movements and acoustic verbal cues to detect psychomotor retardation. They employed Principal component analysis for dimensionality reduction and then applied the Gaussian mixture model to classify the combination of principal feature vectors. ...
Article
Full-text available
Depression has become a global concern, and COVID-19 also has caused a big surge in its incidence. Broadly, there are two primary methods of detecting depression: Task-based and Mobile Crowd Sensing (MCS) based methods. These two approaches, when integrated, can complement each other. This paper proposes a novel approach for depression detection that combines real-time MCS and task-based mechanisms. We aim to design an end-to-end machine learning pipeline, which involves multimodal data collection, feature extraction, feature selection, fusion, and classification to distinguish between depressed and non-depressed subjects. For this purpose, we created a real-world dataset of depressed and non-depressed subjects. We experimented with: various features from multi-modalities, feature selection techniques, fused features, and machine learning classifiers such as Logistic Regression, Support Vector Machines (SVM), etc. for classification. Our findings suggest that combining features from multiple modalities perform better than any single data modality, and the best classification accuracy is achieved when features from all three data modalities are fused. Feature selection method based on Pearson’s correlation coefficients improved the accuracy in comparison with other methods. Also, SVM yielded the best accuracy of 86%. Our proposed approach was also applied on benchmarking dataset, and results demonstrated that the multimodal approach is advantageous in performance with state-of-the-art depression recognition techniques.
... Williamson et al. [64] derived the facial coordination features from the facial action unit signal. The dimensionality of obtained features was reduced using Principal Component Analysis (PCA) to enhance the prediction of the BDI score for depression. ...
Article
Full-text available
Presently, while automated depression diagnosis has made great progress, most of the recent works have focused on combining multiple modalities rather than strengthening a single one. In this research work, we present a unimodal framework for depression detection based on facial expressions and facial motion analysis. We investigate a wide set of visual features extracted from different facial regions. Due to high dimensionality of the obtained feature sets, identification of informative and discriminative features is a challenge. This paper suggests a hybrid dimensionality reduction approach which leverages the advantages of the filter and wrapper methods. First, we use a univariate filter method, Fisher Discriminant Ratio, to initially reduce the size of each feature set. Subsequently, we propose an Incremental Linear Discriminant Analysis (ILDA) approach to find an optimal combination of complementary and relevant feature sets. We compare the performance of the proposed ILDA with the batch-mode LDA and also the Composite Kernel based Support Vector Machine (CKSVM) method. The experiments conducted on the Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset demonstrate that the best depression classification performance is obtained by using different feature extraction methods in combination rather than individually. ILDA generates better depression classification results in comparison to the CKSVM. Moreover, ILDA based wrapper feature selection incurs lower computational cost in comparison to the CKSVM and the batch-mode LDA methods. The proposed framework significantly improves the depression classification performance, with an F1 Score of 0.805, which is better than all the video based depression detection models suggested in literature, for the DAIC-WOZ dataset. Salient facial regions and well performing visual feature extraction methods are also identified.
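For reference, the filter stage mentioned above (Fisher Discriminant Ratio ranking of individual features before the ILDA wrapper step) reduces to a one-line statistic per feature; the sketch below is a generic two-class version, not the paper's exact pipeline.

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Per-feature FDR for a two-class problem.

    FDR_j = (mu1_j - mu0_j)^2 / (var1_j + var0_j); larger values indicate
    features that better separate depressed from non-depressed samples.
    """
    X0, X1 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X0.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X0.var(axis=0) + 1e-12    # avoid division by zero
    return num / den

def top_k_features(X, y, k=100):
    """Indices of the k features with the highest Fisher Discriminant Ratio."""
    return np.argsort(fisher_discriminant_ratio(X, y))[::-1][:k]
```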
Article
Full-text available
Background: While speech analysis holds promise for mental health assessment, research often focuses on single symptoms, despite symptom co-occurrences and interactions. In addition, predictive models in mental health do not properly assess speech-based systems' limitations, such as uncertainty, or fairness for a safe clinical deployment. Objective: We investigated the predictive potential of mobile-collected speech data for detecting and estimating depression, anxiety, fatigue, and insomnia, focusing on factors beyond mere accuracy, in the general population. Methods: We included n=865 healthy adults and recorded their answers regarding their perceived mental and sleep states. We asked how they felt and if they had slept well lately. Clinically validated questionnaires measuring depression, anxiety, insomnia, and fatigue severity were also used. We developed a novel speech and machine learning pipeline involving voice activity detection, feature extraction, and model training. We automatically analyzed participants' speech with a fully automatic ML pipeline to capture speech variability. Then, we modelled speech with deep learning models pre-trained on a large open free database, and we selected the best one on the validation set. Based on the best speech modelling approach, we evaluated clinical threshold detection, individual score prediction, model uncertainty estimation, and performance fairness across demographics (age, sex, education). We employed a train-validation-test split for all evaluations: to develop our models, select the best ones, and assess generalizability on held-out data. Results: The best model was WhisperM with max pooling and an oversampling method. Our methods achieved good detection performance for all symptoms: depression (PHQ-9 AUC=0.76, F1=0.49; BDI AUC=0.78, F1=0.65), anxiety (GAD-7 F1=0.50, AUC=0.77), insomnia (AIS AUC=0.73, F1=0.62), and fatigue (MFI Total Score F1=0.88, AUC=0.68). These strengths were maintained for depression detection with BDI and fatigue at the reported abstention rates for uncertain cases (Risk-Coverage AUCs < 0.4). Individual symptom scores were predicted with good accuracy (correlations were all significant, with Pearson strengths between 0.31 and 0.49). Fairness analysis revealed that models were consistent for sex (average Disparity Ratio (DR) = 0.86), to a lesser extent for education level (average DR = 0.47), and worse for age groups (average DR = 0.33). Conclusions: This study demonstrates the potential of speech-based systems for multifaceted mental health assessment in the general population, not only for detecting clinical thresholds but also for estimating their severity. Addressing fairness and incorporating uncertainty estimation with selective classification are key contributions that can enhance the clinical utility and responsible implementation of such systems. This approach offers promise for more accurate and nuanced mental health assessments, benefiting both patients and clinicians.
Conference Paper
Full-text available
Depression recognition (DR) using facial images, audio signals, or language text recordings has achieved remarkable performance. Recently, multimodal DR has shown improved performance over single-modal methods by leveraging information from a combination of these modalities. However, collecting high-quality data containing all modalities poses a challenge. In particular, these methods often encounter performance degradation when certain modalities are either missing or degraded. To tackle this issue, we present a generalizable multimodal framework for DR by aggregating feature disentanglement and privileged knowledge distillation. In detail, our approach aims to disentangle homogeneous and heterogeneous features within multimodal signals while suppressing noise, thereby adaptively aggregating the most informative components for high-quality DR. Subsequently, we leverage knowledge distillation to transfer privileged knowledge from complete modalities to the observed input with limited information, thereby significantly improving the tolerance and compatibility. These strategies form our novel Feature Disentanglement and Privileged knowledge Distillation Network for DR, dubbed Dis2DR. Experimental evaluations on AVEC 2013, AVEC 2014, AVEC 2017, and AVEC 2019 datasets demonstrate the effectiveness of our Dis2DR method. Remarkably, Dis2DR achieves superior performance even when only a single modality is available, surpassing existing state-of-the-art multimodal DR approaches AVA-DepressNet by up to 9.8% on the AVEC 2013 dataset.
Article
Video-based automatic depression analysis provides a fast, objective and repeatable self-assessment solution, which has been widely developed in recent years. While depression cues may be reflected by human facial behaviours of various temporal scales, most existing approaches focus on modelling depression from either short-term or video-level facial behaviours. In this sense, we propose a two-stage framework that models depression severity from multi-scale short-term and video-level facial behaviours. The short-term depressive behaviour modelling stage first deep learns depression-related facial behavioural features from multiple short temporal scales, where a Depression Feature Enhancement (DFE) module is proposed to enhance the depression-related cues for all temporal scales and remove non-depression related noise. Two novel graph encoding strategies are proposed in the video-level depressive behaviour modelling stage, i.e., Sequential Graph Representation (SEG) and Spectral Graph Representation (SPG), to re-encode all short-term features of the target video into a video-level graph representation, summarizing depression-related multi-scale video-level temporal information. As a result, the produced graph representations predict depression severity using both short-term and long-term facial behaviour patterns. The experimental results on the AVEC 2013, AVEC 2014 and AVEC 2019 datasets show that the proposed DFE module consistently enhanced the depression severity estimation performance for various CNN models, while the SPG is superior to other video-level modelling methods. More importantly, the result achieved for the proposed two-stage framework shows its promising and solid performance compared to widely-used one-stage modelling approaches. Our code is publicly available at https://github.com/jiaqi-pro/Depression-detection-Graph
Article
Depression is a major mental health disorder that is rapidly affecting lives worldwide. Early detection and intervention are crucial for effective treatment and management. Depression is frequently associated with thoughts of suicide, and significant depression can produce a range of social and physical symptoms, including changes in sleep, appetite, energy level, concentration, daily behaviour, or self-esteem. In recent years, deep learning approaches based on neural networks have shown superior performance to hand-crafted approaches in various areas, and deep-learned models that address the above issues may precisely assess the degree of depression from voice and face. This work presents an artificial intelligence (AI) system designed for automatic depression level analysis from visual and vocal expressions. Leveraging advances in computer vision and natural language processing, the proposed system extracts relevant features from facial expressions and speech patterns to assess depression severity levels accurately. In the proposed method, Convolutional Neural Networks (CNNs) are developed to learn deep features from raw waveforms and visual expressions. Keywords: Artificial Intelligence, Depression, Vocal, Facial Expressions, Deep Learning
Chapter
Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin. Our code is publicly available on GitHub (https://github.com/cosmaadrian/multimodal-depression-from-video).
Article
Full-text available
Automatic depression diagnosis is a challenging problem, that requires integrating spatial-temporal information and extracting features from audiovisual signals. In terms of privacy protection, the development trend of recognition algorithms based on facial landmarks has created additional challenges and difficulties. In this paper, we propose an audiovisual attention network (AVA-DepressNet) for depression recognition. It is a novel multimodal framework with facial privacy protection, and uses attention-based modules to enhance audiovisual spatial and temporal features. In addition, an adversarial multistage (AMS) training strategy is developed to optimize the encoder-decoder structure. Additionally, facial structure prior knowledge is creatively used in AMS training. Our AVA-DepressNet is evaluated on popular audiovisual depression datasets: AVEC 2013, AVEC 2014, and AVEC 2017. The results show that our approach reaches the state-of-the-art performance or competitive results for depression recognition.
Chapter
In the literature, many feature extraction methods have been suggested to compute features from videos in order to diagnose depression. However, using a single feature extraction method does not yield good performance. Many works have combined features extracted using different feature extraction methods and applied feature selection using Evolutionary Algorithms (EA) to improve the depression detection accuracy. However, with high-dimensional features, the search space and computational complexity for an EA increase and it converges to a sub-optimal solution. In order to reduce the search space and computational complexity of an EA, we suggest a two-phase evolutionary approach based on the Quantum Whale Optimization Algorithm (QWOA). In the first phase, QWOA is used to reduce and select the optimum combination of feature extraction methods. In the second phase, features computed using the feature extraction methods selected in the first phase are concatenated, and QWOA is used to select the relevant features. Experiments performed on the DAIC-WOZ dataset demonstrate that the proposed approach significantly reduces the computational complexity and converges to a score of 0.8726 (0.9353) for the F1 Depressed (F1 non-Depressed) category. The obtained depression detection performance exceeds the state-of-the-art results. The optimum combination of selected features is statistically significant for detecting depression. Keywords: Depression detection, Facial features, Feature selection, Quantum Whale Optimization Algorithm
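The QWOA itself is a quantum-inspired metaheuristic and is not reproduced here; as a hedged stand-in for the two-phase idea only, the sketch below runs a greedy forward wrapper selection over feature-extraction methods (phase one), scored by cross-validated F1. A second pass over individual features within the chosen blocks would play the role of phase two. The block names, classifier, and data are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_f1(X, y):
    """Cross-validated F1 used as the wrapper's fitness score."""
    return cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=3, scoring="f1").mean()

def phase1_select_methods(feature_blocks, y):
    """Greedy forward selection over feature-extraction methods (blocks)."""
    chosen, best = [], 0.0
    improved = True
    while improved:
        improved, best_name = False, None
        for name in sorted(set(feature_blocks) - set(chosen)):
            trial = chosen + [name]
            score = cv_f1(np.hstack([feature_blocks[n] for n in trial]), y)
            if score > best:
                best, best_name, improved = score, name, True
        if improved:
            chosen.append(best_name)
    return chosen, best

# Toy usage: three hypothetical feature-extraction methods.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)
blocks = {"lbp": rng.normal(size=(60, 20)),
          "hog": rng.normal(size=(60, 30)),
          "au":  rng.normal(size=(60, 10))}
print(phase1_select_methods(blocks, y))
```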
Article
Mental disorders are rapidly increasing each year and have become a major challenge affecting the social and financial well-being of individuals. There is a need for phenotypic characterization of psychiatric disorders with biomarkers to provide a rich signature for Major Depressive Disorder, improving the understanding of the pathophysiological mechanisms underlying these mental disorders. This comprehensive review focuses on depression and relapse detection modalities such as self-questionnaires, audiovisuals, and EEG, highlighting noteworthy publications in the last ten years. The article concentrates on the literature that adopts machine learning by audiovisual and EEG signals. It also outlines preprocessing, feature extraction, and public datasets for depression detection. The review concludes with recommendations that will help improve the reliability of developed models and the determinism of computational intelligence-based systems in psychiatry. To the best of our knowledge, this survey is the first comprehensive review on depression and relapse prediction by self-questionnaires, audiovisual, and EEG-based approaches. The findings of this review will serve as a useful and structured starting point for researchers studying clinical and non-clinical depression recognition and relapse through machine learning-based approaches.
Article
Depression, or major depressive disorder, is a mental illness that significantly affects psychosocial functioning and reduces the quality of one’s life. The annual incidence of depression throughout the globe is around 6%. The disorder should be diagnosed at a particular stage for the treatment to be designed. Biomarkers can help to do so with objective pieces of evidence. Various biomarkers like Imaging biomarkers, Molecular biomarkers, Transcriptomic biomarkers, Genetic biomarkers, Neuroendocrine, and Inflammatory biomarkers can be used to diagnose depression. The use of digital sensors has also been reported recently for the determination of depression. This review summarizes various biomarkers to diagnose depression. Further recent updates and related clinical trials are included.
Article
There is an urgent need to detect depression using a non-intrusive approach that is reliable and accurate. In this paper, a simple and efficient unimodal depression detection approach based on speech is proposed, which is non-invasive, cost-effective and computationally inexpensive. A set of spectral, temporal and spectro-temporal features is derived from the speech signal of healthy and depressed subjects. To select a minimal subset of relevant and non-redundant speech features for detecting depression, a two-phase approach based on the nature-inspired, wrapper-based Quantum-based Whale Optimization Algorithm (QWOA) for feature selection is proposed. Experiments are performed on the publicly available Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset and compared with three established univariate filtering techniques for feature selection and four well-known evolutionary algorithms. The proposed model outperforms all the univariate filter feature selection techniques and the evolutionary algorithms. It has low computational complexity in comparison to traditional wrapper-based evolutionary methods. The performance of the proposed approach is superior in comparison to existing unimodal and multimodal automated depression detection models. The combination of spectral, temporal and spectro-temporal speech features gave the best result with the LDA classifier. The performance achieved with the proposed approach, in terms of F1-score for the depressed class, F1-score for the non-depressed class, and error, is 0.846, 0.932 and 0.094, respectively. Statistical tests demonstrate that the acoustic features selected using the proposed approach are non-redundant and discriminatory. Statistical tests also establish that the performance of the proposed approach is significantly better than that of the traditional wrapper-based evolutionary methods.
Chapter
Through the advancement of a new generation of information and communication technologies, such as 5G, the Internet of Medical Things (IoMT), machine learning, etc., the scientific community has already extensively explored the possibilities of utilizing such technologies in varied healthcare processes. From the perspective of process management, it could be argued that every single process of the cycle of health is evolving with such trends, from health monitoring and online health consultation to in-hospital diagnosis and surgery, and eventually follow-up examinations and rehabilitation. For example, the process of health monitoring and assessment could be enhanced with wearable or non-contact devices to achieve 24/7 monitoring.
Article
Full-text available
Mental disorders are closely related to deficits in cognitive control. Such cognitive impairments may result in aberrations in mood, thinking, work, body functions, emotions, social engagements and general behaviour. Mental disorders may affect phenotypic behaviours such as eye movements, facial expressions and speech. Furthermore, a close association has been observed between mental disorders and physiological responses emanating from the brain, muscles, heart, eyes, skin, etc. Mental disorders disrupt higher cognitive function, social cognition, control of complex behaviours and regulation of emotion. Cognitive computation may help understand such disruptions for improved decision-making with the help of computers. This study presents a systematic literature review to promulgate state-of-the-art computational methods and technologies facilitating automated detection of mental disorders. For this survey, the relevant literature between 2010 and 2021 has been studied. Recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) model were adopted for identification, screening, validation and inclusion of research literature. The self-diagnosis tools for detection of mental disorders, like questionnaires and rating scales, are inconsistent and static in nature. They cannot encompass the diversity of mental disorders, inter-individual variability and the impact of an individual's emotional state. Furthermore, there are no standard baselines for mental disorders. This situation mandates a multi-faceted approach which may utilise data from physiological signals, behavioural patterns and even data obtained from various online portals like social media to efficiently and effectively detect the prevalence, type and severity of mental disorders.
Article
Full-text available
Background Major Depressive Disorder (MDD) is prevalent, often chronic, and requires ongoing monitoring of symptoms to track response to treatment and identify early indicators of relapse. Remote Measurement Technologies (RMT) provide an opportunity to transform the measurement and management of MDD, via data collected from inbuilt smartphone sensors and wearable devices alongside app-based questionnaires and tasks. A key question for the field is the extent to which participants can adhere to research protocols and the completeness of data collected. We aimed to describe dropout and data completeness in a naturalistic multimodal longitudinal RMT study, in people with a history of recurrent MDD. We further aimed to determine whether those experiencing a depressive relapse at baseline contributed less complete data. Methods Remote Assessment of Disease and Relapse – Major Depressive Disorder (RADAR-MDD) is a multi-centre, prospective observational cohort study conducted as part of the Remote Assessment of Disease and Relapse – Central Nervous System (RADAR-CNS) program. People with a history of MDD were provided with a wrist-worn wearable device and smartphone apps designed to: a) collect data from smartphone sensors; and b) deliver questionnaires, speech tasks, and cognitive assessments. Participants were followed up for a minimum of 11 months and a maximum of 24 months. Results Individuals with a history of MDD (n = 623) were enrolled in the study. We report 80% completion rates for primary outcome assessments across all follow-up timepoints. 79.8% of people participated for the maximum amount of time available and 20.2% withdrew prematurely. We found no evidence of an association between the severity of depression symptoms at baseline and the availability of data. In total, 110 participants had > 50% data available across all data types. Conclusions RADAR-MDD is the largest multimodal RMT study in the field of mental health. Here, we have shown that collecting RMT data from a clinical population is feasible. We found comparable levels of data availability in active and passive forms of data collection, demonstrating that both are feasible in this patient group.
Article
Full-text available
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence, arousal and dominance. In addition to structured self-report questionnaires, psychologists and psychiatrists use in their evaluation of a patient's level of depression the observation of facial expressions and vocal cues. It is in this context that we present the fourth Audio-Visual Emotion recognition Challenge (AVEC 2014). This edition of the challenge uses a subset of the tasks used in a previous challenge, allowing for more focussed studies. In addition, labels for a third dimension (Dominance) have been added and the number of annotators per clip has been increased to a minimum of three, with most clips annotated by 5. The challenge has two goals logically organised as sub-challenges: the first is to predict the continuous values of the affective dimensions valence, arousal and dominance at each moment in time. The second is to predict the value of a single self-reported severity of depression indicator for each recording in the dataset. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Article
Full-text available
Neurophysiological changes in the brain associated with early dementia can disrupt articulatory timing and precision in speech production. Motivated by this observation, we address the hypothesis that speaking rate and articulatory coordination, as manifested through formant frequency tracks, can predict performance on an animal fluency task administered to the elderly. Specifically, using phoneme-based measures of speaking rate and articulatory coordination derived from formant cross-correlation measures, we investigate the capability of speech features, estimated from paragraph-recall and naturalistic free speech, to predict animal fluency assessment scores. Using a database consisting of audio from elderly subjects over a 4-year period, we develop least-squares regression models of our cognitive performance measures. The best-performing model combined speaking rate and formant features, resulting in a correlation (R) of 0.61 and a root mean squared error (RMSE) of 5.07 with respect to a 9–34 score range. Vocal features thus provide a reduction by about 30% in MSE from a baseline (mean score) in predicting cognitive performance derived from the animal fluency assessment.
Article
Full-text available
In an earlier study, we evaluated the effectiveness of several acoustic measures in predicting breathiness ratings for sustained vowels spoken by nonpathological talkers who were asked to produce nonbreathy, moderately breathy, and very breathy phonation (Hillenbrand, Cleveland, & Erickson, 1994). The purpose of the present study was to extend these results to speakers with laryngeal pathologies and to conduct tests using connected speech in addition to sustained vowels. Breathiness ratings were obtained from a sustained vowel and a 12-word sentence spoken by 20 pathological and 5 nonpathological talkers. Acoustic measures were made of (a) signal periodicity, (b) first harmonic amplitude, and (c) spectral tilt. For the sustained vowels, a frequency domain measure of periodicity provided the most accurate predictions of perceived breathiness, accounting for 92% of the variance in breathiness ratings. The relative amplitude of the first harmonic and two measures of spectral tilt correlated moderately with breathiness ratings. For the sentences, both signal periodicity and spectral tilt provided accurate predictions of breathiness ratings, accounting for 70%-85% of the variance.
Article
Full-text available
In Major Depressive Disorder (MDD), neurophysiologic changes can alter motor control [1, 2] and therefore alter speech production by influencing the characteristics of the vocal source, tract, and prosodics. Clinically, many of these characteristics are associated with psychomotor retardation, where a patient shows sluggishness and motor disorder in vocal articulation, affecting coordination across multiple aspects of production [3, 4]. In this paper, we exploit such effects by selecting features that reflect changes in coordination of vocal tract motion associated with MDD. Specifically, we investigate changes in correlation that occur at different time scales across formant frequencies and also across channels of the delta-mel-cepstrum. Both feature domains provide measures of coordination in vocal tract articulation while reducing effects of a slowly-varying linear channel, which can be introduced by time-varying microphone placements. With these two complementary feature sets, using the AVEC 2013 depression dataset, we design a novel Gaussian mixture model (GMM)-based multivariate regression scheme, referred to as Gaussian Staircase Regression, that provides a root-mean-squared-error (RMSE) of 7.42 and a mean-absolute-error (MAE) of 5.75 on the standard Beck depression rating scale. We are currently exploring coordination measures of other aspects of speech production, derived from both audio and video signals.
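A minimal sketch of the multi-scale correlation-structure idea described above: parallel feature tracks (e.g., formant frequencies or delta-mel-cepstral channels) are stacked at several relative time delays, and the eigenvalue spectrum of the resulting channel-delay correlation matrix serves as a feature vector per delay scale. The specific delay spacings, number of shifts, and synthetic tracks are assumptions, and the Gaussian Staircase Regression stage is not included.

```python
import numpy as np

def channel_delay_correlation_eigs(tracks, delays=(1, 3, 7, 15)):
    """Multi-scale correlation-structure features from parallel feature tracks.

    tracks : array of shape (C, T), e.g., C formant-frequency tracks over T frames.
    For each delay scale d, stack the channels with relative time shifts
    0, d, 2d, 3d, build the correlation matrix of the stacked signals, and
    keep its sorted eigenvalue spectrum as the feature vector for that scale.
    """
    C, T = tracks.shape
    feats = []
    for d in delays:
        shifts = [0, d, 2 * d, 3 * d]
        span = T - max(shifts)
        stacked = np.vstack([tracks[:, s:s + span] for s in shifts])  # (4C, span)
        R = np.corrcoef(stacked)
        eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
        feats.append(eigvals)
    return np.concatenate(feats)

# Toy usage: three synthetic "formant" tracks, 500 frames each.
rng = np.random.default_rng(0)
tracks = np.cumsum(rng.normal(size=(3, 500)), axis=1)
print(channel_delay_correlation_eigs(tracks).shape)  # 4 scales x 12 eigenvalues = (48,)
```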
Article
Full-text available
Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
Article
Full-text available
35 right-handed White females (18–35 yrs) viewed positive and stress-inducing motion picture films and then reported on their subjective experience. Spontaneous facial expressions provided accurate information about more specific aspects of emotional experience than just the pleasant vs unpleasant distinction. The facial action coding system (P. Ekman and W. V. Friesen, 1978) isolated a particular type of smile that was related to differences in reported happiness between Ss who showed this action and Ss who did not, to the intensity of happiness, and to which of 2 happy experiences was reported as happiest. Ss who showed a set of facial actions hypothesized to be signs of various negative affects reported experiencing more negative emotion than Ss who did not show these actions. How much these facial actions were shown was related to the reported intensity of negative affect. Specific facial actions associated with the experience of disgust are identified.
Article
Full-text available
A seizure prediction algorithm is proposed that combines novel multivariate EEG features with patient-specific machine learning. The algorithm computes the eigenspectra of space-delay correlation and covariance matrices from 15-s blocks of EEG data at multiple delay scales. The principal components of these features are used to classify the patient's preictal or interictal state. This is done using a support vector machine (SVM), whose outputs are averaged using a running 15-minute window to obtain a final prediction score. The algorithm was tested on 19 of 21 patients in the Freiburg EEG data set who had three or more seizures, predicting 71 of 83 seizures, with 15 false predictions and 13.8h in seizure warning during 448.3h of interictal data. The proposed algorithm scales with the number of available EEG signals by discovering the variations in correlation structure among any given set of signals that correlate with seizure risk.
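The classification and temporal-integration stages described above can be sketched roughly as follows (principal components of per-block features, an SVM, and a running-average prediction score). The feature dimensionality, window length in blocks, and synthetic data are assumptions, and training and scoring on the same blocks here is only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def smoothed_prediction_scores(block_features, labels, window_blocks=60):
    """Train PCA+SVM on per-block features and smooth the decision scores.

    With 15-s blocks, a 60-block running mean corresponds to a 15-minute window.
    """
    model = make_pipeline(StandardScaler(), PCA(n_components=10),
                          SVC(kernel="rbf"))
    model.fit(block_features, labels)
    scores = model.decision_function(block_features)
    kernel = np.ones(window_blocks) / window_blocks
    return np.convolve(scores, kernel, mode="same")  # running-average score

# Toy usage: 600 blocks of 40-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 40))
y = (np.arange(600) > 450).astype(int)  # pretend the last blocks are preictal
print(smoothed_prediction_scores(X, y)[:5])
```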
Conference Paper
Full-text available
A patient-specific seizure prediction algorithm is proposed that extracts novel multivariate signal coherence features from ECoG recordings and classifies a patient's pre-seizure state. The algorithm uses space-delay correlation and covariance matrices at several delay scales to extract the spatiotemporal correlation structure from multichannel ECoG signals. Eigenspectra and amplitude features are extracted from the correlation and covariance matrices, followed by dimensionality reduction using principal components analysis, classification using a support vector machine, and temporal integration to produce a seizure prediction score. Evaluation on the Freiburg EEG database produced a sensitivity of 90.8% and a false positive rate of 0.094.
Conference Paper
Full-text available
In 2000, the Cohn-Kanade (CK) database was released for the purpose of promoting research into automatically detecting individual facial expressions. Since then, the CK database has become one of the most widely used test-beds for algorithm development and evaluation. During this period, three limitations have become apparent: 1) While AU codes are well validated, emotion labels are not, as they refer to what was requested rather than what was actually performed, 2) The lack of a common performance metric against which to evaluate new algorithms, and 3) Standard protocols for common databases have not emerged. As a consequence, the CK database has been used for both AU and emotion detection (even though labels for the latter have not been validated), comparison with benchmark algorithms is missing, and use of random subsets of the original database makes meta-analyses difficult. To address these and other concerns, we present the Extended Cohn-Kanade (CK+) database. The number of sequences is increased by 22% and the number of subjects by 27%. The target expression for each sequence is fully FACS coded and emotion labels have been revised and validated. In addition to this, non-posed sequences for several types of smiles and their associated metadata have been added. We present baseline results using Active Appearance Models (AAMs) and a linear support vector machine (SVM) classifier using a leave-one-out subject cross-validation for both AU and emotion detection for the posed data. The emotion and AU labels, along with the extended image data and tracked landmarks will be made available July 2010.
Conference Paper
Full-text available
In this paper, we report the influence that classification accuracies have in speech analysis from a clinical dataset by adding acoustic low-level descriptors (LLD) belonging to prosodic (i.e. pitch, formants, energy, jitter, shimmer) and spectral features (i.e. spectral flux, centroid, entropy and roll-off) along with their delta (Δ) and delta-delta (Δ-Δ) coefficients to two baseline features of Mel frequency cepstral coefficients and Teager energy critical-band based autocorrelation envelope. Extracted acoustic low-level descriptors (LLD) that display an increase in accuracy after being added to these baseline features were finally modeled together using Gaussian mixture models and tested. A clinical data set of speech from 139 adolescents, including 68 (49 girls and 19 boys) diagnosed as clinically depressed, was used in the classification experiments. For male subjects, the combination of (TEO-CB-Auto-Env + Δ + Δ-Δ) + F0 + (LogE + Δ + Δ-Δ) + (Shimmer + Δ) + Spectral Flux + Spectral Roll-off gave the highest classification rate of 77.82% while for the female subjects, using TEO-CB-Auto-Env gave an accuracy of 74.74%.
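The delta (Δ) and delta-delta (Δ-Δ) coefficients appended to the low-level descriptors above are typically computed with the standard regression formula; below is a minimal sketch, assuming a (frames × dimensions) NumPy feature matrix and a window parameter N=2.

```python
import numpy as np

def delta(features, N=2):
    """Standard delta (Δ) coefficients over a (T, D) feature matrix.

    delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with edge frames padded by repetition.
    """
    T = features.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

# Toy usage: append Δ and Δ-Δ to a (T, D) low-level-descriptor matrix.
rng = np.random.default_rng(0)
lld = rng.normal(size=(100, 13))
d1 = delta(lld)
d2 = delta(d1)
features = np.hstack([lld, d1, d2])  # shape (100, 39)
```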
Conference Paper
Full-text available
We present the Computer Expression Recognition Toolbox (CERT), a software tool for fully automatic real-time facial expression recognition, and officially release it for free academic use. CERT can automatically code the intensity of 19 different facial actions from the Facial Action Coding System (FACS) and 6 different prototypical facial expressions. It also estimates the locations of 10 facial features as well as the 3-D orientation (yaw, pitch, roll) of the head. On a database of posed facial expressions, the Extended Cohn-Kanade (CK+), CERT achieves an average recognition performance (probability of correctness on a two-alternative forced choice (2AFC) task between one positive and one negative example) of 90.1% when analyzing facial actions. On a spontaneous facial expression dataset, CERT achieves an accuracy of nearly 80%. On a standard dual-core laptop, CERT can process 320 × 240 video images in real time at approximately 10 frames per second.
Article
Full-text available
A decomposition algorithm that uses a pitch-scaled harmonic filter was evaluated using synthetic signals and applied to mixed-source speech, spoken by three subjects, to separate the voiced and unvoiced parts. Pulsing of the noise component was observed in voiced frication, which was analyzed by complex demodulation of the signal envelope. The timing of the pulsation, represented by the phase of the anharmonic modulation coefficient, showed a step change during a vowel-fricative transition corresponding to the change in location of the noise source within the vocal tract. Analysis of fricatives [see text] demonstrated a relationship between steady-state phase and place, and f0 glides confirmed that the main cause was a place-dependent delay.
Article
Full-text available
Quantification of perceptual voice characteristics allows the assessment of voice changes. Acoustic measures of jitter, shimmer, and noise-to-harmonic ratio (NHR) are often unreliable. Measures of cepstral peak prominence (CPP) may be more reliable predictors of dysphonia. Trained listeners analyzed voice samples from 281 patients. The NHR, amplitude perturbation quotient, smoothed pitch perturbation quotient, percent jitter, and CPP were obtained from sustained vowel phonation, and the CPP was obtained from running speech. For the first time, normal and abnormal values of CPP were defined, and they were compared with other acoustic measures used to predict dysphonia. The CPP for running speech is a good predictor and a more reliable measure of dysphonia than are acoustic measures of jitter, shimmer, and NHR.
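A simplified sketch of a cepstral peak prominence computation consistent with the description above: the real cepstrum of a log-magnitude spectrum is searched for a peak in the pitch quefrency range, and the peak height is measured relative to a regression line over that range. The windowing choices, regression span, and synthetic frame are assumptions; published CPP implementations differ in these details.

```python
import numpy as np

def cepstral_peak_prominence(frame, fs, fmin=60.0, fmax=300.0):
    """Cepstral peak prominence (dB) of one speech frame -- a simplified sketch.

    The cepstrum is the inverse FFT of the log-magnitude spectrum; CPP is the
    height of the cepstral peak in the pitch quefrency range above a linear
    regression line fitted to the cepstrum over that quefrency range.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spec = 20.0 * np.log10(spectrum + 1e-12)
    cepstrum = np.fft.irfft(log_spec)
    quefrency = np.arange(len(cepstrum)) / fs
    lo, hi = int(fs / fmax), int(fs / fmin)
    peak_idx = lo + np.argmax(cepstrum[lo:hi])
    coeffs = np.polyfit(quefrency[lo:hi], cepstrum[lo:hi], 1)  # regression line
    baseline = np.polyval(coeffs, quefrency[peak_idx])
    return cepstrum[peak_idx] - baseline

# Toy usage: a synthetic 200 Hz vowel-like frame at 16 kHz.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))
print(cepstral_peak_prominence(frame, fs))
```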
Article
Full-text available
Among the many clinical decisions that psychiatrists must make, assessment of a patient's risk of committing suicide is definitely among the most important, complex, and demanding. When reviewing his clinical experience, one of the authors observed that successful predictions of suicidality were often based on the patient's voice independent of content. The voices of suicidal patients judged to be high-risk near-term exhibited unique qualities, which distinguished them from nonsuicidal patients. We investigated the discriminating power of two excitation-based speech parameters, vocal jitter and glottal flow spectrum, for distinguishing among high-risk near-term suicidal, major depressed, and nonsuicidal patients. Our sample consisted of ten high-risk near-term suicidal patients, ten major depressed patients, and ten nondepressed control subjects. As a result of two sample statistical analyses, mean vocal jitter was found to be a significant discriminator only between suicidal and nondepressed control groups (p < 0.05). The slope of the glottal flow spectrum, on the other hand, was a significant discriminator between all three groups (p < 0.05). A maximum likelihood classifier, developed by combining the a posteriori probabilities of these two features, yielded correct classification scores of 85% between near-term suicidal patients and nondepressed controls, 90% between depressed patients and nondepressed controls, and 75% between near-term suicidal patients and depressed patients. These preliminary classification results support the hypothesized link between phonation and near-term suicidal risk. However, validation of the proposed measures on a larger sample size is necessary.
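Of the two excitation-based parameters above, vocal jitter has a particularly simple definition; here is a minimal sketch, assuming a sequence of already-estimated glottal period lengths (period estimation itself is not shown).

```python
import numpy as np

def local_jitter(period_lengths):
    """Local jitter (%): mean absolute difference between consecutive glottal
    periods divided by the mean period."""
    periods = np.asarray(period_lengths, dtype=float)
    return 100.0 * np.abs(np.diff(periods)).mean() / periods.mean()

# Toy usage: periods around 5 ms (200 Hz) with small cycle-to-cycle variation.
rng = np.random.default_rng(0)
periods = 0.005 + rng.normal(scale=5e-5, size=50)
print(round(local_jitter(periods), 2))  # on the order of 1% for this noise level
```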
Conference Paper
Full-text available
The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech signals into their voiced and unvoiced constituents. In this paper, we evaluate its ability to reconstruct the time series of the two components accurately using a variety of synthetic, speech-like signals, and discuss its performance. These results determine the degree of confidence that can be expected for real speech signals: typically, 5 dB improvement in the signal-to-noise ratio of the harmonic component and approximately 5 dB more than the initial harmonics-to-noise ratio (HNR) in the anharmonic component. A selection of the analysis opportunities that the decomposition offers is demonstrated on speech recordings, including dynamic HNR estimation and separate linear prediction analyses of the two components. These new capabilities provided by the PSHF can facilitate discovering previously hidden features and investigating interactions of unvoiced sources, such as frication, with voicing
Article
Full-text available
Almost all speech contains simultaneous contributions from more than one acoustic source within the speaker's vocal tract. In this paper, we propose a method-the pitch-scaled harmonic filter (PSHF)-which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach. The PSHF outputs periodic and aperiodic components that are estimates of the respective contributions of the different types of acoustic source. It produces four reconstructed time series signals by decomposing the original speech signal, first, according to amplitude, and then according to power of the Fourier coefficients. Thus, one pair of periodic and aperiodic signals is optimized for subsequent time-series analysis, and another pair for spectral analysis. The performance of the PSHF algorithm is tested on synthetic signals, using three forms of disturbance (jitter, shimmer and additive noise), and the results were used to predict the performance on real speech. Processing recorded speech examples elicited latent features from the signals, demonstrating the PSHF's potential for analysis of mixed-source speech
Article
Full-text available
We present a straightforward and robust algorithm for periodicity detection, working in the lag (autocorrelation) domain. When it is tested for periodic signals and for signals with additive noise or jitter, it proves to be several orders of magnitude more accurate than the methods commonly used for speech analysis. This makes our method capable of measuring harmonics-to-noise ratios in the lag domain with an accuracy and reliability much greater than that of any of the usual frequency-domain methods. By definition, the best candidate for the acoustic pitch period of a sound can be found from the position of the maximum of the autocorrelation function of the sound, while the degree of periodicity (the harmonics-to-noise ratio) of the sound can be found from the relative height of this maximum. However, sampling and windowing cause problems in accurately determining the position and height of the maximum. These problems have led to inaccurate time-domain and cepstral methods for p...
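A hedged sketch of the lag-domain idea described above: the position of the autocorrelation maximum gives the pitch period and its relative height gives a harmonics-to-noise estimate. The window-correction and interpolation steps of the full method are omitted, and the parameters and test signal are assumptions.

```python
import numpy as np

def autocorr_pitch_hnr(frame, fs, fmin=75.0, fmax=500.0):
    """Lag-domain pitch and harmonics-to-noise estimate -- a simplified sketch.

    The normalized autocorrelation peak r at the pitch lag gives
    F0 = fs / lag and HNR ~= 10*log10(r / (1 - r)).  Dividing out the window
    autocorrelation and parabolic interpolation are omitted here.
    """
    frame = (frame - frame.mean()) * np.hanning(len(frame))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                      # normalize so ac[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    r = float(np.clip(ac[lag], 1e-6, 1 - 1e-6))
    return fs / lag, 10.0 * np.log10(r / (1.0 - r))

# Toy usage: a 150 Hz harmonic frame with additive noise, 16 kHz.
fs = 16000
t = np.arange(int(0.05 * fs)) / fs
frame = np.sin(2 * np.pi * 150 * t) + 0.05 * np.random.default_rng(0).normal(size=t.size)
f0, hnr = autocorr_pitch_hnr(frame, fs)
print(round(f0, 1), round(hnr, 1))
```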
Article
It is clear that the learning speed of feedforward neural networks is in general far slower than required, and it has been a major bottleneck in their applications for past decades. Two key reasons behind this may be: (1) the slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by using such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden-layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. The experimental results based on a few artificial and real benchmark function approximation and classification problems, including very large complex applications, show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.
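A minimal ELM sketch consistent with the description above: hidden-layer weights are random and fixed, and only the output weights are computed analytically with a pseudoinverse. The class interface, tanh activation, and toy regression data are assumptions for the example.

```python
import numpy as np

class ELM:
    """Minimal single-hidden-layer extreme learning machine (sketch).

    Hidden-layer weights are drawn at random and never trained; only the
    output weights are solved with a least-squares fit (pseudoinverse).
    """
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, T):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T   # analytic output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy regression usage.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
model = ELM(n_hidden=50).fit(X, y)
print(np.mean((model.predict(X) - y) ** 2))
```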
Article
Neurophysiological changes in the brain associated with major depression disorder can disrupt articulatory precision in speech production. Motivated by this observation, we address the hypothesis that articulatory features, as manifested through formant frequency tracks, can help in automatically classifying depression state. Specifically, we investigate the relative importance of vocal tract formant frequencies and their dynamic features from sustained vowels and conversational speech. Using a database consisting of audio from 35 subjects with clinical measures of depression severity, we explore the performance of Gaussian mixture model (GMM) and support vector machine (SVM) classifiers. With only formant frequencies and their dynamics given by velocity and acceleration, we show that depression state can be classified with an optimal sensitivity/specificity/area under the ROC curve of 0.86/0.64/0.70 and 0.77/0.77/0.73 for GMMs and SVMs, respectively. Future work will involve merging our formant-based characterization with vocal source and prosodic features.
Conference Paper
Speech analysis has shown potential for identifying neurological impairment. With brain trauma, changes in brain structure or connectivity may result in changes in source, prosodic, or articulatory aspects of voice. In this work, we examine the articulatory components of speech reflected in formant tracks, and how changes in track dynamics and coordination map to cognitive decline. We address a population of athletes regularly receiving impacts to the head and showing signs of preclinical mild traumatic brain injury (mTBI), a state indicated by impaired cognitive performance occurring prior to concussion. We hypothesize that this preclinical damage results in 1) changes in average vocal tract dynamics measured by formant frequencies, their velocities, and acceleration, and 2) changes in articulatory coordination measured by a novel formant-frequency cross-correlation characterization. These features allow machine learning algorithms to detect preclinical mTBI identified by a battery of cognitive tests. A comparison is performed of the effectiveness of vocal tract dynamics features versus articulatory coordination features. This evaluation is done using receiver operating characteristic (ROC) curves along with confidence bounds. The articulatory dynamics features achieve area under the ROC curve (AUC) values between 0.72 and 0.98, whereas the articulatory coordination features achieve AUC values between 0.94 and 0.97.
Article
A hypothesis in characterizing human depression is that change in the brain's basal ganglia results in a decline of motor coordination [6][8][14]. Such a neuro-physiological change may therefore affect laryngeal control and dynamics. Under this hypothesis, toward the goal of objective monitoring of depression severity, we investigate vocal-source biomarkers for depression; specifically, source features that may relate to precision in motor control, including vocal-fold shimmer and jitter, degree of aspiration, fundamental frequency dynamics, and frequency-dependence of variability and velocity of energy. We use a 35-subject database collected by Mundt et al. [1] in which subjects were treated over a six-week period, and investigate correlation of our features with clinical (HAMD), as well as self-reported (QIDS) Total subject assessment scores. To explicitly address the motor aspect of depression, we compute correlations with the Psychomotor Retardation component of clinical and self-reported Total assessments. For our longitudinal database, most correlations point to statistical relationships of our vocal-source biomarkers with psychomotor activity, as well as with depression severity.
Article
Vocal tract resonance characteristics in acoustic speech signals are classically tracked using frame-by-frame point estimates of formant frequencies followed by candidate selection and smoothing using dynamic programming methods that minimize ad hoc cost functions. The goal of the current work is to provide both point estimates and associated uncertainties of center frequencies and bandwidths in a statistically principled state-space framework. Extended Kalman (K) algorithms take advantage of a linearized mapping to infer formant and antiformant parameters from frame-based estimates of autoregressive moving average (ARMA) cepstral coefficients. Error analysis of KARMA, WaveSurfer, and Praat is accomplished in the all-pole case using a manually marked formant database and synthesized speech waveforms. KARMA formant tracks exhibit lower overall root-mean-square error relative to the two benchmark algorithms with the ability to modify parameters in a controlled manner to trade off bias and variance. Antiformant tracking performance of KARMA is illustrated using synthesized and spoken nasal phonemes. The simultaneous tracking of uncertainty levels enables practitioners to recognize time-varying confidence in parameters of interest and adjust algorithmic settings accordingly.
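KARMA itself operates on ARMA cepstral coefficients in a state-space framework and is not reproduced here; as a hedged illustration of the general point that state-space tracking yields both a point estimate and an uncertainty, the sketch below applies a generic scalar random-walk Kalman filter to a noisy formant track. The noise variances and synthetic track are assumptions.

```python
import numpy as np

def kalman_smooth_track(measurements, process_var=50.0**2, meas_var=150.0**2):
    """Scalar random-walk Kalman filter over a noisy formant-frequency track.

    Returns the filtered track and its per-frame variance, so uncertainty is
    carried along with the point estimate.  (This is a generic filter, not the
    KARMA cepstral-domain algorithm itself.)
    """
    x, P = measurements[0], meas_var
    est, var = [], []
    for z in measurements:
        P = P + process_var            # predict: random-walk state model
        K = P / (P + meas_var)         # Kalman gain
        x = x + K * (z - x)            # update with the new measurement
        P = (1.0 - K) * P
        est.append(x)
        var.append(P)
    return np.array(est), np.array(var)

# Toy usage: a slowly varying F1 track observed with measurement noise.
rng = np.random.default_rng(0)
true_f1 = 500 + 100 * np.sin(np.linspace(0, 2 * np.pi, 200))
noisy = true_f1 + rng.normal(scale=150, size=200)
smoothed, variance = kalman_smooth_track(noisy)
```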
Conference Paper
Understanding how someone is speaking can be equally important to what they are saying when evaluating emotional disorders, such as depression. In this study, we use the acoustic speech signal to analyze variations in prosodic feature statistics for subjects suffering from a depressive disorder. A new sample database of subjects with and without a depressive disorder is collected and pitch, energy, and speaking rate feature statistics are generated at a sentence level and grouped into a series of observations (subset of sentences) for analysis. A common technique in quantifying an observation had been to simply use the average of the feature statistic for the subset of sentences within an observation. However, we investigate the merit of a series of statistical measures as a means of quantifying a subset of feature statistics to capture emotional variations from sentence to sentence within a single observation. Comparisons with the exclusive use of the average show an improvement in overall separation accuracy for other quantifying statistics.
Article
Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing 10 (2000), 19–41. In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance is also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
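A rough sketch of the UBM-plus-MAP-adaptation pattern described above, using scikit-learn's GaussianMixture as the UBM and mean-only relevance MAP adaptation. The relevance factor, feature dimensionality, and toy data are assumptions, and this is not the Lincoln Laboratory implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, speaker_frames, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to one speaker's feature frames."""
    post = ubm.predict_proba(speaker_frames)          # (T, K) responsibilities
    n_k = post.sum(axis=0)                            # soft counts per mixture
    f_k = post.T @ speaker_frames                     # first-order statistics
    alpha = n_k / (n_k + relevance)                   # adaptation coefficients
    new_means = (alpha[:, None] * (f_k / np.maximum(n_k[:, None], 1e-8))
                 + (1.0 - alpha)[:, None] * ubm.means_)
    return new_means

# Toy usage: train a small "UBM" on pooled background frames, adapt to a speaker.
rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 12))
speaker = rng.normal(loc=0.5, size=(300, 12))
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)
speaker_means = map_adapt_means(ubm, speaker)
# A verification score can then be formed as the log-likelihood ratio between
# the adapted speaker model and the UBM on test frames.
```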
Conference Paper
In this paper we examine an alternative interface for phonetic search, namely query-by-example, that avoids OOV issues associated with both standard word-based and phonetic search methods. We develop three methods that compare query lattices derived from example audio against a standard ngram-based phonetic index and we analyze factors affecting the performance of these systems. We show that the best systems under this paradigm are able to achieve 77% precision when retrieving utterances from conversational telephone speech and returning 10 results from a single query (performance that is better than a similar dictionary-based approach), suggesting significant utility for applications requiring high precision. We also show that these systems can be further improved using relevance feedback: by incorporating four additional queries the precision of the best system can be improved by 13.7% relative. Our systems perform well despite high phone recognition error rates (> 40%) and make use of no pronunciation or letter-to-sound resources.
Conference Paper
Of increasing importance in the civilian and military population is the recognition of Major Depressive Disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we investigate automatic classifiers of depression state, that have the important property of mitigating nuisances due to data variability, such as speaker and channel effects, unrelated to levels of depression. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a six-week duration, along with standard clinical HAMD depression ratings. Preliminary experiments indicate that by mitigating nuisances, thus focusing on depression severity as a class, we can significantly improve classification accuracy over baseline Gaussian-mixture-model-based classifiers.
Conference Paper
Vocal tract resonances play a central role in the perception and analysis of speech. Here we consider the canonical task of estimating such resonances from an observed acoustic waveform, and formulate it as a statistical model-based tracking problem. In this vein, Deng and colleagues recently showed that a robust linearization of the formant-to-cepstrum map enables the effective use of a Kalman filtering framework. We extend this model both to account for the uncertainty of speech presence by way of a censored likelihood formulation, as well as to explicitly model formant cross-correlation via a vector autoregression, and in doing so retain a conditionally linear and Gaussian framework amenable to efficient estimation schemes. We provide evaluations using a recently introduced public database of formant trajectories, for which results indicate improvements from twenty to over 30% per formant in terms of root mean square error, relative to a contemporary benchmark formant analysis tool. Index Terms: formant tracking, speech analysis, Kalman filtering, vocal tract resonances, system identification
Article
Due to the simplicity of their implementations, least square support vector machine (LS-SVM) and proximal support vector machine (PSVM) have been widely used in binary classification applications. The conventional LS-SVM and PSVM cannot be used in regression and multiclass classification applications directly, although variants of LS-SVM and PSVM have been proposed to handle such cases. This paper shows that both LS-SVM and PSVM can be simplified further and a unified learning framework of LS-SVM, PSVM, and other regularization algorithms referred to extreme learning machine (ELM) can be built. ELM works for the "generalized" single-hidden-layer feedforward networks (SLFNs), but the hidden layer (or called feature mapping) in ELM need not be tuned. Such SLFNs include but are not limited to SVM, polynomial network, and the conventional feedforward neural networks. This paper shows the following: 1) ELM provides a unified learning platform with a widespread type of feature mappings and can be applied in regression and multiclass classification applications directly; 2) from the optimization method point of view, ELM has milder optimization constraints compared to LS-SVM and PSVM; 3) in theory, compared to ELM, LS-SVM and PSVM achieve suboptimal solutions and require higher computational complexity; and 4) in theory, ELM can approximate any target continuous function and classify any disjoint regions. As verified by the simulation results, ELM tends to have better scalability and achieve similar (for regression and binary class cases) or much better (for multiclass cases) generalization performance at much faster learning speed (up to thousands times) than traditional SVM and LS-SVM.
Article
Efforts to develop more effective depression treatments are limited by assessment methods that rely on patient-reported or clinician judgments of symptom severity. Depression also affects speech. Research suggests several objective voice acoustic measures affected by depression can be obtained reliably over the telephone. Thirty-five physician-referred patients beginning treatment for depression were assessed weekly, using standard depression severity measures, during a six-week observational study. Speech samples were also obtained over the telephone each week using an IVR system to automate data collection. Several voice acoustic measures correlated significantly with depression severity. Patients responding to treatment had significantly greater pitch variability, paused less while speaking, and spoke faster than at baseline. Patients not responding to treatment did not show similar changes. Telephone standardization for obtaining voice data was identified as a critical factor influencing the reliability and quality of speech data. This study replicates and extends previous research with a larger sample of patients assessing clinical change associated with treatment. The feasibility of obtaining voice acoustic measures reflecting depression severity and response to treatment using computer-automated telephone data collection techniques is also established. Insight and guidance for future research needs are also identified.
Article
To improve ecological validity, perceptual and instrumental assessment of disordered voice, including overall voice quality, should ideally sample both sustained vowels and continuous speech. This investigation assessed the utility of combining both voice contexts for the purpose of auditory-perceptual ratings as well as acoustic measurement of overall voice quality. Sustained vowel and continuous speech samples from 251 subjects with (n=229) or without (n=22) various voice disorders were concatenated and perceptually rated on overall voice quality by five experienced voice clinicians. After removing the nonvoiced segments within the continuous speech samples, the concatenated samples were analyzed using 13 acoustic measures based on fundamental frequency perturbation, amplitude perturbation, spectral and cepstral analyses. Stepwise multiple regression analysis yielded a six-variable acoustic model for the multiparametric measurement of overall voice quality of the concatenated samples (with a cepstral measure as the main contributor to the prediction of overall voice quality). The correlation of this model with mean ratings of overall voice quality resulted in r(s)=0.78. A cross-validation approach involving the iterated internal cross-correlations with 30 subgroups of 100, 50, and 10 samples confirmed a comparable degree of association. Furthermore, the ability of the model to distinguish voice-disordered from vocally normal participants was assessed using estimates of diagnostic precision including receiver operating characteristic (ROC) curve analysis, sensitivity, and specificity, as well as likelihood ratios (LRs), which adjust for base-rate differences between the groups. Depending on the cutoff criteria employed, the analyses revealed an impressive area under ROC=0.895 as well as respectable sensitivity, specificity, and LR. The results support the diagnostic utility of combining voice samples from both continuous speech and sustained vowels in acoustic and perceptual analysis of disordered voice. The findings are discussed in relation to the extant literature and the need for further refinement of the acoustic algorithm.
Article
When subjects are instructed to self-generate happy, sad, and angry imagery, discrete patterns of facial muscle activity can be detected using electromyographic (EMG) procedures. Prior research from this laboratory suggests that depressed subjects show attenuated facial EMG patterns during imagery conditions, particularly during happy imagery. In the present experiment, 12 depressed subjects and 12 matched normals were requested to generate happy and sad imagery, first with the instruction to simply "think" about the imagery, and then to self-regulate the affective state by "reexperiencing the feelings" associated with the imagery. Continuous recordings of facial EMG were obtained from the corrugator, zygomatic major, depressor anguli oris, and mentalis muscle regions. It was hypothesized that (a) these muscle sites would reliably differentiate between happy and sad imagery. (b) the instruction to self-generate the affective feeling state would produce greater EMG differences than the "think" instructions, and (c) the "think" instructions would be a more sensitive indicator of the difference between depressed and nondepressed subjects, especially for happy imagery. All three hypotheses were confirmed. The application of facial electromyography to the assessment of normal and clinical mood states, and the role of facial muscle patterning in the subjective experience of emotion, are discussed.
Article
Twenty-three acute schizophrenics, 21 acute major depressives (Research Diagnostic Criteria), and 15 normal controls participated in a study on facial expression and emotional face recognition. Under clinical conditions, spontaneous facial expression was assessed according to the affective flattening section of the Scale for the Assessment of Negative Symptoms. Under experimental laboratory conditions, involuntary (emotion-eliciting interview) and voluntary facial expression (imitation and simulation of six basic emotions) were recorded on videotape, from which a rater-based analysis of intensity or correctness of facial activity was obtained. Emotional face recognition was also assessed under experimental conditions using the same stimulus material. All subjects were assessed twice (within 4 weeks), controlling for change of the psychopathological status in the patient groups. In schizophrenics, neuroleptic drug influence was controlled by random allocation to treatment with either haloperidol or perazine. The main findings were that schizophrenics and depressives are characterized by different quantitative, qualitative, and temporal patterns of affect-related dysfunctions. In particular, schizophrenics demonstrated a trait-like deficit in affect recognition and in their spontaneous and voluntary facial activity, irrespective of medication, drug type and dosage, or extrapyramidal side-effects. In depressives a stable deficit could be demonstrated only in their involuntary expression under emotion-eliciting interview conditions, whereas in the postacute phase a reduction in their voluntary expression became apparent. Differences in patterns of affect-related behavioral deficits may reflect dysfunctions in different underlying psychobiological systems.
Article
This pilot study tests one model for interdisciplinary research between speech science and psychiatry. Strengths and weaknesses of the model are noted. Thirteen depressed subjects were evaluated before and after treatment with antidepressant medication. Subjects were rated on scales for severity of depression and speech deviations. Scores on a depressed voice scale, comprising seven of the speech dimensions found to be most consistently altered in depression, showed significant improvement after treatment for depression. The constellation of speech signs found in depression suggested a hypokinetic disturbance of the extrapyramidal system. Several directions for further inquiry into this potential relationship are suggested.
Article
Clinicians and researchers lack accuracy in assessing the psychomotor functions of patients. Several objective monitoring techniques have recently been proposed with the goal of obtaining accurate determinations. These include electromyographic determinations of facial expressions of emotion, measurement of speech phonation and pause times, and use of movement-activated recording monitors to quantify motility. Objective psychomotor assessments may improve classification, longitudinal monitoring, treatment selection, and prediction of outcome for patients with depression and mania.
Article
A principal components analysis was performed on a set of 10 acoustic, aerodynamic, perceptual, and laryngoscopic measures obtained from 87 dysphonic patients. Two principal components were clearly identified: the first represents in some way the glottal air leakage, resulting in turbulent noise that is particularly obvious in the higher spectral frequencies and gives the perceptual impression of breathiness; the second accounts rather for the degree of aperiodicity in vocal fold oscillation, reflected in jitter measurements and with a perceptual correlate of harshness or roughness. Morphological changes of the vocal folds correlate more closely with this second principal component. Among the acoustic parameters, the harmonics-to-noise ratio in the formant zone and the magnitude of the dominant cepstrum peak seem to integrate to some extent the effects of both principal components.
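The sketch below shows the general shape of such an analysis: standardising a small multiparametric data set and extracting the first two principal components. It is illustrative only; the synthetic measures, the scikit-learn tooling, and the variable names are assumptions, not the study's data or code.

```python
# Minimal sketch (not the authors' pipeline): PCA of a multiparametric dysphonia data set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# 87 patients x 10 measures (acoustic, aerodynamic, perceptual, laryngoscopic) -- synthetic stand-in
measures = rng.normal(size=(87, 10))

# Standardise so each measure contributes on a comparable scale, then extract
# two components (interpreted in the study as roughly "air leakage/breathiness"
# and "aperiodicity/roughness").
z = StandardScaler().fit_transform(measures)
pca = PCA(n_components=2).fit(z)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("PC1 loadings:", pca.components_[0])   # which measures load on the first component
print("PC2 loadings:", pca.components_[1])
```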
Article
Acoustic properties of speech have previously been identified as possible cues to depression, and there is evidence that certain vocal parameters may be used further to objectively discriminate between depressed and suicidal speech. Studies were performed to analyze and compare the speech acoustics of separate male and female samples composed of normal individuals and individuals carrying diagnoses of depression and high-risk, near-term suicidality. The female sample consisted of ten control subjects, 17 dysthymic patients, and 21 major depressed patients. The male sample contained 24 control subjects, 21 major depressed patients, and 22 high-risk suicidal patients. Acoustic analyses of voice fundamental frequency (F0), amplitude modulation (AM), formants, and power distribution were performed on speech samples extracted from audio recordings collected from the sample members. Multivariate feature and discriminant analyses were performed on feature vectors representing the members of the control and disordered classes. Features derived from the formant and power spectral density measurements were found to be the best discriminators of class membership in both the male and female studies. AM features emerged as strong discriminators of the male classes. Features describing F0 were generally ineffective discriminators in both studies. The results support theories that identify psychomotor disturbances as central elements in depression and suicidality.
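A minimal sketch of the overall strategy (multivariate discriminant analysis of per-speaker acoustic feature vectors) follows. Only the male-sample class sizes are taken from the abstract; the synthetic features, the 12-dimensional feature layout, and the 5-fold cross-validation scheme are assumptions for illustration.

```python
# Hedged sketch of discriminant analysis on acoustic feature vectors; synthetic data only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
# One feature vector per speaker: e.g., F0 statistics, AM features, formant and
# power-spectral-density summaries (here 12 synthetic dimensions).
X = rng.normal(size=(67, 12))
y = np.repeat([0, 1, 2], [24, 21, 22])   # control / major depressed / high-risk suicidal (male sample sizes)

lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, X, y, cv=5)   # jackknife or leave-one-out schemes are also common
print(f"cross-validated accuracy: {acc.mean():.2f} +/- {acc.std():.2f}")
```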
Article
In summary, MDD is a highly prevalent major medical illness whose pathophysiology is still poorly understood. MDD is often recurrent or chronic, and evidence suggests that genetic factors partially influence overall risk of illness but also influence the sensitivity of individuals to the depressogenic effects of environmental adversity. Treatment with antidepressants, ECT, or certain forms of psychotherapy is fairly effective, but a substantial proportion of patients do not respond adequately, thereby requiring subsequent interventions. There are still many unanswered questions about MDD: (1) What are the susceptibility genes and their environmental modifiers? (2) What are the pathophysiologies of the neural systems underlying this complex disorder? (3) How do we understand the therapeutic mechanisms underlying the currently available pharmacological and ECT approaches? And (4) how do we improve our success rate in treating this highly disabling medical condition, and how can we develop more rational algorithms for those who do not respond to standard treatments? Future investigations need to keep these issues in mind.
Article
Traditional measures of dysphonia vary in their reliability and in their correlations with perceptions of grade. Measurements of cepstral peak prominence (CPP) have been shown to correlate well with perceptions of breathiness. Because it is a measure of periodicity, CPP should also predict roughness. The ability of CPP and other acoustic measures to predict overall dysphonia and the subcategories of breathiness and roughness in pathological voice samples is explored. Preoperative and postoperative speech samples from 19 patients with unilateral recurrent laryngeal nerve paralysis who underwent operative intervention were analyzed by trained listeners and by measures of smoothed CPP (CPPS), noise-to-harmonic ratio (NHR), amplitude perturbation quotient (APQ), relative average perturbation (RAP), and smoothed pitch perturbation quotient (sPPQ). The data were analyzed with bivariate Pearson correlation statistics. Grade of dysphonia and breathiness ratings correlated better with measurements of CPPS than with the other measures. CPPS from samples of connected speech (CPPS-s) best predicted overall dysphonia. None of the measures were useful in predicting roughness.
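For readers unfamiliar with the measure, the following is a simplified, single-frame cepstral peak prominence computation. It assumes a Hamming window, a 60-300 Hz pitch-search range, and a regression trend line fitted only over that search region; published CPP/CPPS implementations differ in these details and additionally smooth across frames and quefrency bins.

```python
# Simplified single-frame CPP sketch; not a validated clinical implementation.
import numpy as np

def cpp(frame, fs, fmin=60.0, fmax=300.0):
    """Cepstral peak prominence of one speech frame (approximate dB units)."""
    windowed = frame * np.hamming(len(frame))
    log_mag = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)                 # real cepstrum of the frame
    quefrency = np.arange(len(cepstrum)) / fs        # quefrency in seconds
    lo, hi = int(fs / fmax), int(fs / fmin)          # plausible pitch-period range
    peak = lo + int(np.argmax(cepstrum[lo:hi]))
    trend = np.polyfit(quefrency[lo:hi], cepstrum[lo:hi], 1)   # linear trend line
    return cepstrum[peak] - np.polyval(trend, quefrency[peak]) # peak height above the trend

fs = 16000
t = np.arange(int(0.04 * fs)) / fs                   # one 40 ms frame
frame = np.sin(2 * np.pi * 120 * t) + 0.05 * np.random.default_rng(3).normal(size=t.size)
print(f"CPP of a strongly periodic frame: {cpp(frame, fs):.1f} dB")
```

A clearly periodic frame, as in the toy example above, produces a prominent cepstral peak and hence a high CPP, whereas breathy or aperiodic voices flatten the peak and lower the value.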
Article
Sample complexity results from computational learning theory, when applied to neural network learning for pattern classification problems, suggest that for good generalization performance the number of training examples should grow at least linearly with the number of adjustable parameters in the network. Results in this paper show that if a large neural network is used for a pattern classification problem and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. For example, consider a two-layer feedforward network of sigmoid units in which the sum of the magnitudes of the weights associated with each unit is bounded by A and the input dimension is n. We show that the misclassification probability is no more than a certain error estimate (that is related to squared error on the training set) plus A³√((log n)/m) (ignoring log A and log m factors), where m is the number of training patterns. This may explain the generalization performance of neural networks, particularly when the number of training examples is considerably smaller than the number of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training. The proof techniques appear to be useful for the analysis of other pattern classifiers: when the input domain is a totally bounded metric space, we use the same approach to give upper bounds on the misclassification probability for classifiers with decision boundaries that are far from the training examples.
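Restated in display form (with constants and the log A, log m factors suppressed, as in the abstract), the quoted bound reads roughly as follows; the symbol used for the training-set error estimate is our own shorthand.

```latex
% A is the per-unit bound on the sum of weight magnitudes, n the input
% dimension, m the number of training patterns, and \widehat{\mathrm{err}}_m
% an error estimate related to the squared error on the training set.
\[
  \Pr[\text{misclassification}]
  \;\le\;
  \widehat{\mathrm{err}}_m
  \;+\;
  O\!\left( A^{3}\sqrt{\frac{\log n}{m}} \right).
\]
```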
Extreme Learning Machine for Regression and Multiclass Classification (2012)
  • Guang-Bin Huang
  • Hongming Zhou
  • Xiaojian Ding
  • Rui Zhang
The computer expression recognition toolbox (CERT). Automatic Face & Gesture Recognition and Workshops
  • G Littlewort
  • J Whitehill
  • T Wu
  • I Fasel
  • M Frank
  • J Movellan
  • M Bartlett
Influence of acoustic low-level descriptors in the detection of clinical depression in adolescents. Acoustics Speech and Signal Processing (ICASSP)
  • L.-S Low
  • M Maddage
  • M Lech
  • L Sheeber
  • N Allen