Conference Paper

Ensembles of BERT for Depression Classification

Abstract

Depression is among the most prevalent mental health disorders, and its prevalence continues to increase worldwide. While early detection is critical for the prognosis of depression treatment, detecting depression is challenging. Previous deep learning research has thus begun to detect depression from the transcripts of clinical interview questions. Since approaches using Bidirectional Encoder Representations from Transformers (BERT) have demonstrated particular promise, we hypothesize that ensembles of BERT variants will improve depression detection. In this research, we therefore compare the depression classification abilities of three BERT variants and four ensembles of BERT variants on the transcripts of responses to 12 clinical interview questions. Specifically, we implement the ensembles with different ensemble strategies, numbers of model components, and architectural layer combinations. Our results demonstrate that ensembles increase mean F1 scores and robustness across clinical interview data. Clinical relevance: This research highlights the potential of ensembles to detect depression from text, which is important for guiding the future development of healthcare application ecosystems.
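The abstract's ensemble strategies can be illustrated with the two standard schemes, hard (majority) voting over predicted labels and soft voting over averaged class probabilities. The model outputs below are hypothetical, since the paper's exact configurations are not given here; this is a minimal sketch, not the authors' implementation:

```python
from collections import Counter

def majority_vote(labels):
    """Hard-voting ensemble: each model casts one label; the mode wins."""
    return Counter(labels).most_common(1)[0][0]

def average_probs(prob_lists):
    """Soft-voting ensemble: average each model's class probabilities,
    then pick the class with the highest mean probability."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# Hypothetical outputs of three BERT-variant classifiers for one transcript
votes = ["depressed", "control", "depressed"]
probs = [[0.4, 0.6], [0.55, 0.45], [0.3, 0.7]]  # [P(control), P(depressed)]

print(majority_vote(votes))   # depressed
print(average_probs(probs))   # 1 (the "depressed" class index)
```

Soft voting only needs class probabilities and tends to be more robust when the component models are well calibrated, while hard voting needs nothing more than predicted labels.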

... questions from patients with depression, making the process more accessible and less time-consuming while reducing the already huge burden on healthcare workers. Text sources used for depression screening include social media posts, handwritten notes (either from individuals or discharge summaries from medical professionals), and written text transcripts of speech [22][23][24][25][26][27][28][29][30][31]. ...
... This makes it easy to overfit the model and difficult to generalize when there are insufficient data points [32]. To address this issue, there has been a growing interest in leveraging the "data-centric AI movement" for depression classification [25][26][27][28][29][30][31]. ...
... They utilized transfer learning from four variants of the BERT model to identify depression in discharge summaries, and RoBERTa yielded the highest classification F1-score of 0.830 [28]. Another example was the work by Senn et al. (2022), which used transcripts of 12 clinical interview questions and a transfer learning approach with three variants of BERT, as well as ensembles with extra LSTM and attention layers. The results indicated that RoBERTa, used individually with a basic architecture and no ensemble, tended to perform better. ...
Article
Full-text available
Depression is a serious mental health disorder that poses a major public health concern in Thailand and has a profound impact on individuals’ physical and mental health. In addition, limited access to mental health services and the limited number of psychiatrists in Thailand make depression particularly challenging to diagnose and treat, leaving many individuals with the condition untreated. Recent studies have explored the use of natural language processing to improve access to depression classification, particularly with a trend toward transfer learning from pre-trained language models. In this study, we attempted to evaluate the effectiveness of using XLM-RoBERTa, a pre-trained multi-lingual language model supporting the Thai language, for the classification of depression from a limited set of text transcripts from speech responses. Twelve Thai depression assessment questions were developed to collect text transcripts of speech responses to be used with XLM-RoBERTa in transfer learning. The results of transfer learning with text transcriptions from the speech responses of 80 participants (40 with depression and 40 normal controls) showed that when only one question (Q1), “How are you these days?”, was used, the recall, precision, specificity, and accuracy were 82.50%, 84.65%, 85.00%, and 83.75%, respectively. When utilizing the first three questions from the Thai depression assessment tasks (Q1 − Q3), the values increased to 87.50%, 92.11%, 92.50%, and 90.00%, respectively. The local interpretable model explanations were analyzed to determine which words contributed the most to the model’s predictions, presented as word cloud visualizations. Our findings were consistent with previously published literature and provide similar explanations for clinical settings.
It was discovered that the classification model for individuals with depression relied heavily on negative terms such as ‘not,’ ‘sad,’ ‘mood,’ ‘suicide,’ ‘bad,’ and ‘bore,’ whereas normal control participants used neutral to positive terms such as ‘recently,’ ‘fine,’ ‘normally,’ ‘work,’ and ‘working.’ The findings of the study suggest that screening for depression can be facilitated by eliciting just three questions from patients with depression, making the process more accessible and less time-consuming while reducing the already huge burden on healthcare workers.
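The four reported metrics follow directly from confusion-matrix counts. The counts below are not taken from the paper; they are back-calculated assumptions that happen to reproduce the reported Q1 − Q3 figures for 80 participants:

```python
def screening_metrics(tp, fn, tn, fp):
    """Recall, precision, specificity, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)              # sensitivity: depressed cases caught
    precision = tp / (tp + fp)           # positive predictions that are correct
    specificity = tn / (tn + fp)         # controls correctly ruled out
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, precision, specificity, accuracy

# Hypothetical counts for 80 participants (40 depressed, 40 control)
r, p, s, a = screening_metrics(tp=35, fn=5, tn=37, fp=3)
print(f"recall={r:.2%} precision={p:.2%} specificity={s:.2%} accuracy={a:.2%}")
# recall=87.50% precision=92.11% specificity=92.50% accuracy=90.00%
```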
... Most of the above counterpart models utilize a hybrid or ensemble approach in the classifier part of the architecture. To assess the performance of the hybrid model on the text representation part, we presented a study that proposed an ensemble model using the combination of BERT and RoBERTa [39], to be compared with our proposed model. This strategy took the pooler output from the pre-trained models, i.e., BERT and RoBERTa, and then concatenated the outputs going to the LSTM cells and an extra attention layer before the final classification. ...
... However, this work [39] only performed a simple hybrid approach by concatenating the output of each text representation to become the input for the classifier section; it did not propose a hybrid approach in the classifier section itself. ...
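The fusion strategy described in this excerpt, concatenating the pooled sentence embeddings of two encoders and applying attention before classification, can be sketched with toy vectors. Real pooler outputs are 768-dimensional and the attention layer is learned; the additive scoring below is a simplified stand-in, not the cited architecture:

```python
import math

def concat(u, v):
    """Fuse two pooled sentence embeddings by concatenation."""
    return u + v

def attention_pool(states):
    """Toy attention: score each state by the sum of its features,
    softmax the scores, and return the weighted average state."""
    scores = [sum(h) for h in states]
    m = max(scores)                               # stabilize the softmax
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(states[0])
    return [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]

# Toy "pooler outputs" from two pre-trained encoders
bert_out = [0.2, -0.1]
roberta_out = [0.5, 0.3]
fused = concat(bert_out, roberta_out)             # 4-d input for the LSTM stage
pooled = attention_pool([fused, [0.0, 0.0, 0.1, 0.1]])
print(len(pooled))                                # 4
```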
Article
Full-text available
Various attempts have been made to improve the performance of text-based sentiment analysis. These attempts have focused on text representation and model classifiers. This paper introduces a hybrid model based on both the text representation and the classifier models to address sentiment classification across various topics. The combination of BERT and a distilled version of BERT (DistilBERT) was selected to produce the representative vectors of the input sentences, while the combination of long short-term memory and temporal convolutional networks was adopted to enhance the proposed model’s understanding of the semantics and context of each word. The experiment results showed that the proposed model outperformed various counterpart schemes in the considered metrics. The reliability of the proposed model was confirmed on a mixed dataset containing nine topics.
... DCNNs are used to learn local feature representations for each modality, while DNNs integrate the various features for the final prediction [32]. They integrated tweet and user behavioral features, encoding user tweets using a hierarchical attention network [33], and investigated the depression classification capability of three bidirectional encoder representations from transformers (BERT) variants and four combinations of BERT variants on the text responses to 12 clinical interview questions. They found that ensemble methods could improve both F1-scores and robustness [34], and proposed a multimodal fusion method for depression detection, where BERT is used to obtain the sentence representation and LSTM and CNN are employed to capture the representation of speech. ...
Article
Background Depression represents a pressing global public health concern, impacting the physical and mental well-being of hundreds of millions worldwide. Notwithstanding advances in clinical practice, an alarming number of individuals at risk for depression continue to face significant barriers to timely diagnosis and effective treatment, thereby exacerbating a burgeoning social health crisis. Objective This study seeks to develop a novel online depression risk detection method using natural language processing technology to identify individuals at risk of depression on the Chinese social media platform Sina Weibo. Methods First, we collected approximately 527,333 posts publicly shared over 1 year from 1600 individuals with depression and 1600 individuals without depression on the Sina Weibo platform. We then developed a hierarchical transformer network for learning user-level semantic representations, which consists of 3 primary components: a word-level encoder, a post-level encoder, and a semantic aggregation encoder. The word-level encoder learns semantic embeddings from individual posts, while the post-level encoder explores features in user post sequences. The semantic aggregation encoder aggregates post sequence semantics to generate a user-level semantic representation that can be classified as depressed or nondepressed. Next, a classifier is employed to predict the risk of depression. Finally, we conducted statistical and linguistic analyses of the post content from individuals with and without depression using the Chinese Linguistic Inquiry and Word Count. Results We divided the original data set into training, validation, and test sets. The training set consisted of 1000 individuals with depression and 1000 individuals without depression. Similarly, each validation and test set comprised 600 users, with 300 individuals from each cohort (depression and nondepression).
Our method achieved an accuracy of 84.62%, precision of 84.43%, recall of 84.50%, and F1-score of 84.32% on the test set without employing sampling techniques. However, by applying our proposed retrieval-based sampling strategy, we observed significant improvements in performance: an accuracy of 95.46%, precision of 95.30%, recall of 95.70%, and F1-score of 95.43%. These outstanding results clearly demonstrate the effectiveness and superiority of our proposed depression risk detection model and retrieval-based sampling technique. This breakthrough provides new insights for large-scale depression detection through social media. Through language behavior analysis, we discovered that individuals with depression are more likely to use negation words (the value of “swear” is 0.001253). This may indicate the presence of negative emotions, rejection, doubt, disagreement, or aversion in individuals with depression. Additionally, our analysis revealed that individuals with depression tend to use negative emotional vocabulary in their expressions (“NegEmo”: 0.022306; “Anx”: 0.003829; “Anger”: 0.004327; “Sad”: 0.005740), which may reflect their internal negative emotions and psychological state. This frequent use of negative vocabulary could be a way for individuals with depression to express negative feelings toward life, themselves, or their surrounding environment. Conclusions The research results indicate the feasibility and effectiveness of using deep learning methods to detect the risk of depression. These findings provide insights into the potential for large-scale, automated, and noninvasive prediction of depression among online social media users.
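The word-to-post-to-user aggregation described above can be sketched with mean pooling standing in for the learned word-level, post-level, and semantic aggregation encoders; the 2-d word vectors below are toy values, not actual embeddings:

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def user_representation(posts):
    """posts: list of posts, each a list of word vectors.
    Word vectors -> post vector -> user vector, each step by mean pooling
    (a stand-in for the hierarchical transformer encoders)."""
    post_vecs = [mean_pool(words) for words in posts]
    return mean_pool(post_vecs)

# Two toy posts with 2-d "word embeddings"
user = user_representation([
    [[1.0, 0.0], [0.0, 1.0]],   # post 1 -> [0.5, 0.5]
    [[1.0, 1.0]],               # post 2 -> [1.0, 1.0]
])
print(user)  # [0.75, 0.75]
```

The resulting user-level vector is what a downstream classifier would label as depressed or nondepressed.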
... The different features chosen in these works include profile information, semantic relationships, psycho-linguistic markers, etc. On the other hand, numerous studies [3,4,12] have focused on the identification of psychological issues such as personality disorders, as well as their symptoms and consequences, including suicide. In general, these works are based on the concept of deep learning in various stages of the process, including: (i) the representation of the different features from the raw data, (ii) the automatic extraction of the relevant features, (iii) the classification task using a combination of different layers. ...
Conference Paper
In this paper, we propose a data warehouse model that enables the analysis of social media data related to people with personality disorders (PD). This model aims to assist decision-makers in making relevant decisions for people with personality disorders worldwide, based on the analysis of their activities and content on social media. The uniqueness of this model in comparison to other models proposed for analyzing social media data is that: (i) it is primarily addressed to people with PD, and therefore takes their specific information into account; (ii) it considers both behavior and writing-style analysis at the same time, employing various artificial intelligence (AI) and natural language processing (NLP) techniques. Overall, the proposed model may predict the global distribution of people with PD around the world. In addition, it can reveal the linguistic specificity of these patients, which may enhance the psychiatrist’s expertise. The proposed model is implemented and the results achieved are compared to those of ASHA.
... Currently, an increasing number of studies are applying ML and deep learning (DL) models to the fields of personality (Dai & Wang, 2023), work and organizational psychology (Kwon et al., 2021) and clinical practice (Senn et al., 2022). In recent years, there has been an exponential increase in the application of these ML models to the study of engagement. ...
Article
Full-text available
Engagement has been defined as an attitude toward work, as a positive, satisfying, work‐related state of mind characterized by high levels of vigour, dedication, and absorption. Both its definition and its assessment have been controversial; however, new methods for its assessment, including artificial intelligence (AI), have been introduced in recent years. Therefore, this research aims to determine the state of the art of AI in the study of engagement. To this end, we conducted a systematic review in accordance with PRISMA to analyse the publications to date on the use of AI for the analysis of engagement. The search, carried out in six databases, was filtered, and 15 papers were finally analysed. The results show that AI has been used mainly to assess and predict engagement levels, as well as to understand the relationships between engagement and other variables. The most commonly used AI techniques are machine learning (ML) and natural language processing (NLP), and all publications use structured and unstructured data, mainly from self‐report instruments, social networks, and datasets. The accuracy of the models varies from 22% to 87%, and its main benefit has been to help both managers and HR staff understand employee engagement, although it has also contributed to research. Most of the articles have been published since 2015, and the geography has been global, with publications predominantly in India and the US. In conclusion, this study highlights the state of the art in AI for the study of engagement and concludes that the number of publications is increasing, indicating that this is possibly a new field or area of research in which important advances can be made in the study of engagement through new and novel techniques.
... The majority of word-based expressions result not only from the direct use of emotional words but also from the interpretation of the meaning of concepts, and interactions of concepts, described in the textual document. Researchers have recently focused specifically on evaluating emotion from word-based expressions to detect depression from social media text [5][6][7][8]. ...
Chapter
Emotions are significant aspects of human existence; they influence interaction between individuals and groups and shape how we think and behave. In this research, we aim to use conventional and neural network models to identify emotions from textual data and compare which performed best. The Go Emotions dataset contained 27 different emotions across 58,000 samples. The approach involves training the conventional machine learning models and the neural network-based models, comparing the results over the test dataset, and choosing the best model. Upon comparing the classification reports for the conventional and neural network-based models on the Ekman taxonomy, conventional machine learning algorithms were outperformed by neural network-based models, which gained almost 10% more than conventional models. Conventional models averaged around 50% macro-average F1-score, except for the KNN classifier, which performed poorly with a macro-average F1-score of 21%. The BERT classifier with the Ekman taxonomy including the neutral emotion had a macro-average precision of 55% and a sensitivity of 68%; this classifier also achieved a macro-average F1-score of 61%. While the RoBERTa classifier had a macro-average precision of 65%, the recall, or sensitivity, was found to be 53%. This study clearly shows that neural network-based models outperformed conventional models. Our proposed BERT model achieved a macro-average F1-score of 0.50 across the Go Emotions taxonomy.
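Macro-averaging, used for the F1 scores throughout this abstract, weights every emotion class equally regardless of its frequency, so rare emotions count as much as common ones. A minimal sketch with hypothetical per-class counts (not taken from the study):

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 scores (macro-averaging)."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

# Hypothetical counts for three emotion classes: (tp, fp, fn)
counts = [(50, 10, 10), (20, 30, 20), (5, 5, 15)]
print(round(macro_f1(counts), 2))  # 0.54
```

A micro-average would instead pool the counts before computing F1, letting the most frequent class dominate.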
... Very apparent growth has been noticeable since the release of BERT (Bidirectional Encoder Representations from Transformers) by Devlin et al. (2019), whose groundbreaking architecture significantly advanced the field of natural language processing (NLP). Since then, plenty of researchers have employed the BERT architecture for the detection of mental health issues (Villatoro-Tello et al., 2021; Rodrigues Makiuchi et al., 2019; Senn et al., 2022). In the near future, it is reasonable to anticipate an even greater focus of researchers on the area of NLP, considering the recent release of the GPT model by OpenAI, which has generated significant enthusiasm and attention among both scholars and practitioners. ...
... Comparison on the DAIC-WOZ dataset (precision / recall / F1 score): Villatoro-Tello [33]: 0.53; Villatoro-Tello [34]: 0.59 / 0.59 / 0.59; Senn et al. [35]: 0 ... It's important to note that some pertinent studies were excluded from this comparison because they used different datasets, training methods, or models, or they presented metrics that are not comparable to our scores. ...
... Text and video/audio information were often used as the main components. Examples of text-based screening include using social media text (Jain et al. 2019;Lin et al. 2020) or transcribed interviews (Devlin et al. 2018;Senn et al. 2022). Similarly, when using video/audio information, basic features such as sound quality and spectral analysis were often used (Yalamanchili et al. 2020), and pre-trained models such as Face Action Units (FAU) (Williamson et al. 2016;Valstar et al. 2016) or emotional classifier models are used to extract audio information (Williamson et al. 2016). ...
Article
Depression is a major societal issue. However, depression can be hard to self-diagnose, and people suffering from depression often hesitate to consult with professionals. We discuss the design and initial testing of our prototype application that performs depression detection using multi-modal information such as questionnaires, speech, and face landmarks. The application has an animated avatar ask questions concerning the users’ well-being. To perform screening, we opt for a 2-stage method which first predicts individual HAM-D ratings for better explainability, which may help facilitate the referral process to medical professionals if required. Initial results show that our system achieves a macro-F1 of 0.85 for the depression detection task.
... We obtained F-measures of 82, 72, and 75, respectively, for the evaluation of the different tasks: (i) a deep learning approach for the classification of tweets with the same topic, and (ii) a hybrid approach based on an ontology and deep learning techniques for the classification of tweets as to be filtered or to be recommended. We note that, in general, our architecture achieves the most relevant results compared to other works in the same field (processing textual data related to psychology) (Saha et al. 2021; Sampath and Durairaj 2022; Senn et al. 2022), which can validate our hypothesis about the performance of BERT as a popular attention model, whose bidirectionally trained architecture provides a deeper sense of language context, and of BiLSTM in preserving the dependencies between the different units. We can notice from the results obtained by using different transformer models that the results are varied and that no single model outperforms the others. ...
... Senn et al. (2022) explore the effectiveness of different BERT models and ensembles in classifying depression from transcripts of clinical interviews. AudiBERT was used for depression classification from audio recordings, but the ablation study shows that BERT was the most influential component. ...
Article
Background and objective Depression is a substantial public health issue, with global ramifications. While initial literature reviews explored the intersection between artificial intelligence (AI) and mental health, they have not yet critically assessed the specific contributions of Large Language Models (LLMs) in this domain. The objective of this systematic review was to examine the usefulness of LLMs in diagnosing and managing depression, as well as to investigate their incorporation into clinical practice. Methods This review was based on a thorough search of the PubMed, Embase, Web of Science, and Scopus databases for the period January 2018 through March 2024. The search used PROSPERO and adhered to PRISMA guidelines. Original research articles, preprints, and conference papers were included, while non-English and non-research publications were excluded. Data extraction was standardized, and the risk of bias was evaluated using the ROBINS-I, QUADAS-2, and PROBAST tools. Results Our review included 34 studies that focused on the application of LLMs in detecting and classifying depression through clinical data and social media texts. LLMs such as RoBERTa and BERT demonstrated high effectiveness, particularly in early detection and symptom classification. Nevertheless, the integration of LLMs into clinical practice is in its nascent stage, with ongoing concerns about data privacy and ethical implications. Conclusion LLMs exhibit significant potential for transforming strategies for diagnosing and treating depression. Nonetheless, full integration of LLMs into clinical practice requires rigorous testing, ethical considerations, and enhanced privacy measures to ensure their safe and effective use.
Article
Background Depression is a prevalent global mental health disorder with substantial individual and societal impact. Natural language processing (NLP), a branch of artificial intelligence, offers the potential for improving depression screening by extracting meaningful information from textual data, but there are challenges and ethical considerations. Objective This literature review aims to explore existing NLP methods for detecting depression, discuss successes and limitations, address ethical concerns, and highlight potential biases. Methods A literature search was conducted using Semantic Scholar, PubMed, and Google Scholar to identify studies on depression screening using NLP. Keywords included “depression screening,” “depression detection,” and “natural language processing.” Studies were included if they discussed the application of NLP techniques for depression screening or detection. Studies were screened and selected for relevance, with data extracted and synthesized to identify common themes and gaps in the literature. Results NLP techniques, including sentiment analysis, linguistic markers, and deep learning models, offer practical tools for depression screening. Supervised and unsupervised machine learning models and large language models like transformers have demonstrated high accuracy in a variety of application domains. However, ethical concerns related to privacy, bias, interpretability, and lack of regulations to protect individuals arise. Furthermore, cultural and multilingual perspectives highlight the need for culturally sensitive models. Conclusions NLP presents opportunities to enhance depression detection, but considerable challenges persist. Ethical concerns must be addressed, governance guidance is needed to mitigate risks, and cross-cultural perspectives must be integrated. 
Future directions include improving interpretability, personalization, and increased collaboration with domain experts, such as data scientists and machine learning engineers. NLP’s potential to enhance mental health care remains promising, depending on overcoming obstacles and continuing innovation.
Article
Background Large language models (LLMs) are advanced artificial neural networks trained on extensive datasets to accurately understand and generate natural language. While they have received much attention and demonstrated potential in digital health, their application in mental health, particularly in clinical settings, has generated considerable debate. Objective This systematic review aims to critically assess the use of LLMs in mental health, specifically focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings. By systematically collating and assessing the evidence from current studies, our work analyzes models, methodologies, data sources, and outcomes, thereby highlighting the potential of LLMs in mental health, the challenges they present, and the prospects for their clinical use. Methods Adhering to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, this review searched 5 open-access databases: MEDLINE (accessed by PubMed), IEEE Xplore, Scopus, JMIR, and ACM Digital Library. Keywords used were (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). This study included articles published between January 1, 2017, and April 30, 2024, and excluded articles published in languages other than English. Results In total, 40 articles were evaluated, including 15 (38%) articles on mental health conditions and suicidal ideation detection through text analysis, 7 (18%) on the use of LLMs as mental health conversational agents, and 18 (45%) on other applications and evaluations of LLMs in mental health. LLMs show good effectiveness in detecting mental health issues and providing accessible, destigmatized eHealth services. However, assessments also indicate that the current risks associated with clinical use might surpass their benefits. 
These risks include inconsistencies in generated text; the production of hallucinations; and the absence of a comprehensive, benchmarked ethical framework. Conclusions This systematic review examines the clinical applications of LLMs in mental health, highlighting their potential and inherent risks. The study identifies several issues: the lack of multilingual datasets annotated by experts, concerns regarding the accuracy and reliability of generated content, challenges in interpretability due to the “black box” nature of LLMs, and ongoing ethical dilemmas. These ethical concerns include the absence of a clear, benchmarked ethical framework; data privacy issues; and the potential for overreliance on LLMs by both physicians and patients, which could compromise traditional medical practices. As a result, LLMs should not be considered substitutes for professional mental health services. However, the rapid development of LLMs underscores their potential as valuable clinical aids, emphasizing the need for continued research and development in this area. Trial Registration PROSPERO CRD42024508617; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=508617
Article
Depression is a significant global health challenge. Still, many people suffering from depression remain undiagnosed. Furthermore, the assessment of depression can be subject to human bias. Natural Language Processing (NLP) models offer a promising solution. We investigated the potential of four NLP models (BERT, Llama2-13B, GPT-3.5, and GPT-4) for depression detection in clinical interviews. Participants (N = 82) underwent clinical interviews and completed a self-report depression questionnaire. NLP models inferred depression scores from interview transcripts. Questionnaire cut-off values for depression were used as a classifier for depression. GPT-4 showed the highest accuracy for depression classification (F1 score 0.73), while zero-shot GPT-3.5 initially performed with low accuracy (0.34), improved to 0.82 after fine-tuning, and achieved 0.68 with clustered data. GPT-4 estimates of symptom severity PHQ-8 score correlated strongly (r = 0.71) with true symptom severity. These findings demonstrate the potential of AI models for depression detection. However, further research is necessary before widespread deployment can be considered.
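The reported r = 0.71 is a Pearson correlation between model-estimated and questionnaire PHQ-8 scores. A minimal sketch with hypothetical score pairs (not the study's data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical PHQ-8 scores: model estimates vs questionnaire scores
pred = [3, 8, 12, 15, 20]
actual = [2, 9, 10, 17, 19]
print(round(pearson_r(pred, actual), 2))  # 0.97
```

A strong positive r means the model ranks symptom severity consistently with the self-report questionnaire, even if individual scores differ.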
Article
Personality analysis has a positive influence on humanity as it aids in identifying personality traits and disorders. In addition, it facilitates the monitoring of cases and enriches doctors’ knowledge bases, particularly in decision-making processes. This study includes a comprehensive literature review on personality analysis approaches from social media, aiming to gain a thorough understanding of the current studies on personality therapy. Moreover, the objective of this study is to identify various limitations present in these studies and explore potential avenues for enhancement. More specifically, this research begins with an introduction that discusses the main concepts of traits and personality disorders, as well as the importance of psychological analysis. Following that, four cluster studies related to personality analysis on social media are presented: personality traits, personality disorders, detection of links between diseases, and monitoring patient status. Then, the majority of the currently available works for each cluster are exposed. Afterward, a comparative study of the different presented works is proposed. Finally, an outline of plans for further research in this area is provided, detailing potential paths for exploration.
Article
Full-text available
In the face of rising depression rates, the urgency of early and accurate diagnosis has never been more paramount. Traditional diagnostic methods, while invaluable, can sometimes be limited in access and susceptible to biases, potentially leading to underdiagnoses. This paper explores the innovative potential of AI technology, specifically machine learning, as a diagnostic tool for depression. Drawing from prior research, we note the success of machine learning in discerning depression indicators on social media platforms and through automated interviews. A particular focus is given to the BERT-based NLP transformer model, previously shown to be effective in detecting depression from simulated interview data. Our study assessed this model's capability to identify depression from transcribed, semi-structured clinical interviews within a general population sample. While the BERT model displayed an accuracy of 0.71, it was surpassed by an untrained GPT-3.5 model, which achieved an impressive accuracy of 0.88. These findings emphasise the transformative potential of NLP transformer models in the realm of depression detection. However, given the relatively small dataset (N=17) utilised, we advise a measured interpretation of the results. This paper is designed as a pilot study, and further studies will incorporate bigger datasets.
Chapter
In recent years, depression has caused severe social and psychological problems. The purpose of the paper is to automatically identify users with depressive tendencies to facilitate early intervention and prevent the progression of depression into more severe consequences. The paper proposes a Depression Prediction model based on Multi-feature Fusion (DPMFF), which extracts contextual semantic features and deep emotional features from user documents to predict depression risk. The behavioral and linguistic features of depressed users were examined through statistical analysis. Experiments on micro-blog datasets demonstrate that DPMFF can effectively identify users with depressive tendencies and outperform other models. The data analysis found that compared with normal users, users with depressive tendencies were usually active on social networks late at night, and the proportion of content containing absolute words and negative words was significantly higher than average.
Chapter
Depression is a common mental health disorder with large social and economic consequences. It can be costly and difficult to detect, traditionally requiring hours of assessment by a trained clinician. Recently, machine learning models have been trained to screen for depression with patient voice recordings collected during an interview with a virtual agent. To engage the patient in a conversation and increase the quantity of responses, the virtual interviewer asks a series of follow-up questions. However, asking fewer questions would reduce the time burden of screening for the participant. We therefore assess whether these follow-up questions have a tangible impact on the performance of deep learning models for depression classification. Specifically, we study the effect of including the vocal and transcribed replies to one, two, three, four, five, or all follow-up questions in the depression screening models. We notably achieve this using unimodal and multimodal pre-trained transfer learning models. Our findings reveal that follow-up questions can help increase F1 scores for the majority of the interview questions. This research can be leveraged for the design of future mental illness screening applications by providing important information about both question selection and the best number of follow-up questions.
Chapter
Mental illnesses are often undiagnosed, highlighting the need for an effective alternative to traditional screening surveys. We propose our Early Mental Health Uncovering (EMU) framework that conducts rapid mental illness screening with active and passive modalities. We designed, deployed, and evaluated the EMU app to passively collect retrospective digital phenotype data and actively collect short voice recordings. The EMU app also administered a depression screening survey to label the data. We collected data from crowdsourced and student populations, both of whom shared sufficient voice recordings for modeling. We thus assess the classification ability of machine learning and deep learning models trained with scripted and unscripted voice recordings. For the crowdsourced participants, machine learning models screened for depression with an AUC of 0.78 and suicidal ideation with an AUC of 0.73. For the student participants, deep learning models screened for depression with an AUC of 0.70 and suicidal ideation with an AUC of 0.72. Combining datasets did not improve screening capabilities, though the best performing models on the combined dataset notably required voice transcripts. This research facilitates a better understanding of modality selection for mobile screening. We will make the features publicly available to further advance mental illness screening research.
Article
Given that depression is one of the most prevalent mental illnesses, developing effective and unobtrusive diagnosis tools is of great importance. Recent work that screens for depression with text messages leverages models relying on lexical category features. Given the colloquial nature of text messages, the performance of these models may be limited by formal lexicons. We thus propose a strategy to automatically construct alternative lexicons that contain more relevant and colloquial terms. Specifically, we generate 36 lexicons from fiction, forum, and news corpora. These lexicons are then used to extract lexical category features from the text messages. We utilize machine learning models to compare the depression screening capabilities of these lexical category features. Out of our 36 constructed lexicons, 14 achieved statistically significantly higher average F1 scores than the pre-existing formal lexicon and a basic bag-of-words approach. In comparison to the pre-existing lexicon, our best performing lexicon increased the average F1 score by 10%. We thus confirm our hypothesis that less formal lexicons can improve the performance of classification models that screen for depression with text messages. By providing our automatically constructed lexicons, we aid future machine learning research that leverages less formal text.
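The lexical category features described above can be sketched as follows: for each lexicon category, compute the proportion of a message's tokens that fall in that category. The category names and word lists below are invented for illustration; the paper's lexicons are constructed automatically from corpora.

```python
def lexical_category_features(text, lexicon):
    """Return, for each lexicon category, the fraction of tokens
    in `text` that belong to that category's word set."""
    tokens = text.lower().split()
    features = {}
    for category, words in lexicon.items():
        hits = sum(1 for t in tokens if t in words)
        features[category] = hits / len(tokens) if tokens else 0.0
    return features

# A tiny hypothetical colloquial lexicon of the kind the paper constructs.
lexicon = {
    "negative_affect": {"sad", "tired", "awful", "meh"},
    "social": {"friend", "talk", "hang"},
}
feats = lexical_category_features("feeling sad and tired today", lexicon)
print(feats)  # → {'negative_affect': 0.4, 'social': 0.0}
```

Feature vectors of this form can then be fed to a conventional classifier, which is how the compared machine learning models consume them.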
Conference Paper
Full-text available
The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multi-modal information processing and to bring together the depression and emotion recognition communities, as well as the audio, video and physiological processing communities, to compare the relative merits of the various approaches to depression and emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Conference Paper
Full-text available
We present SimSensei Kiosk, an implemented virtual human interviewer designed to create an engaging face-to-face interaction where the user feels comfortable talking and sharing information. SimSensei Kiosk is also designed to create interactional situations favorable to the automatic assessment of distress indicators, defined as verbal and nonverbal behaviors correlated with depression, anxiety or post-traumatic stress disorder (PTSD). In this paper, we summarize the design methodology, performed over the past two years, which is based on three main development cycles: (1) analysis of face-to-face human interactions to identify potential distress indicators, dialogue policies and virtual human gestures, (2) development and analysis of a Wizard-of-Oz prototype system where two human operators were deciding the spoken and gestural responses, and (3) development of a fully automatic virtual interviewer able to engage users in 15-25 minute interactions. We show the potential of our fully automatic virtual human interviewer in a user study, and situate its performance in relation to the Wizard-of-Oz prototype. Copyright © 2014, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
Article
Full-text available
This paper is the first review into the automatic analysis of speech for use as an objective predictor of depression and suicidality. Both conditions are major public health concerns; depression has long been recognised as a prominent cause of disability and burden worldwide, whilst suicide is a misunderstood and complex cause of death that strongly impacts the quality of life and mental health of the families and communities left behind. Despite this prevalence, the diagnosis of depression and assessment of suicide risk, due to their complex clinical characterisations, are difficult tasks, nominally achieved by the categorical assessment of a set of specific symptoms. However, many of the key symptoms of either condition, such as altered mood and motivation, are not physical in nature; therefore assigning a categorical score to them introduces a range of subjective biases to the diagnostic procedure. Due to these difficulties, research into finding a set of biological, physiological and behavioural markers to aid clinical assessment is gaining in popularity. This review starts by building the case for speech to be considered a key objective marker for both conditions, reviewing current diagnostic and assessment methods for depression and suicidality, including key non-speech biological, physiological and behavioural markers, and highlighting the expected cognitive and physiological changes associated with both conditions which affect speech production. We then review the key characteristics (size, associated clinical scores, and collection paradigm) of existing depressed and suicidal speech databases. The main focus of this paper is on how common paralinguistic speech characteristics are affected by depression and suicidality and the application of this information in classification and prediction systems.
The paper concludes with an in-depth discussion of the key challenges that will shape the future research directions of this rapidly growing field of speech processing research: improving generalisability through greater research collaboration and increased standardisation of data collection, and mitigating unwanted sources of variability.
Article
Full-text available
Diagnostic and treatment delay in depression are due to physician and patient factors. Patients vary in awareness of their depressive symptoms and ability to bring depression-related concerns to medical attention. To inform interventions to improve recognition and management of depression in primary care by understanding patients' inner experiences prior to and during the process of seeking treatment. Focus groups, analyzed qualitatively. One hundred and sixteen adults (79% response) with personal or vicarious history of depression in Rochester NY, Austin TX and Sacramento CA. Neighborhood recruitment strategies achieved sociodemographic diversity. Open-ended questions developed by a multidisciplinary team and refined in three pilot focus groups explored participants' "lived experiences" of depression, depression-related beliefs, influences of significant others, and facilitators and barriers to care-seeking. Then, 12 focus groups stratified by gender and income were conducted, audio-recorded, and analyzed qualitatively using coding/editing methods. Participants described three stages leading to engaging in care for depression - "knowing" (recognizing that something was wrong), "naming" (finding words to describe their distress) and "explaining" (seeking meaningful attributions). "Knowing" is influenced by patient personality and social attitudes. "Naming" is affected by incongruity between the personal experience of depression and its narrow clinical conceptualizations, colloquial use of the word depression, and stigma. "Explaining" is influenced by the media, socialization processes and social relations. Physical/medical explanations can appear to facilitate care-seeking, but may also have detrimental consequences. Other explanations (characterological, situational) are common, and can serve to either enhance or reduce blame of oneself or others. 
To improve recognition of depression, primary care physicians should be alert to patients' ill-defined distress and heterogeneous symptoms, help patients name their distress, and promote explanations that comport with patients' lived experience, reduce blame and stigma, and facilitate care-seeking.
Chapter
In the paper, we present a 'pre-train' + 'post-train' + 'fine-tune' three-stage paradigm, a supplementary framework for the standard 'pre-train' + 'fine-tune' language model approach. Furthermore, based on this three-stage paradigm, we present a language model named PPBERT. Compared with the original BERT architecture, which is based on the standard two-stage paradigm, we do not fine-tune the pre-trained model directly, but rather post-train it on a domain- or task-related dataset first, which helps to better incorporate task-aware and domain-aware knowledge within the pre-trained model and to reduce bias from the training dataset. Extensive experimental results indicate that the proposed model improves the performance of the baselines on 24 NLP tasks, which include eight GLUE benchmarks, eight SuperGLUE benchmarks, and six extractive question answering benchmarks. More remarkably, our proposed model is a more flexible and pluggable model, whose post-training approach can be plugged into other PLMs that are based on BERT. Extensive ablations further validate its effectiveness and state-of-the-art (SOTA) performance. The open-source code, pre-trained models and post-trained models are publicly available.
Chapter
Reviewers often fail to appreciate a researcher's novel ideas and provide generic feedback. Thus, proper assignment of reviewers based on their area of expertise is necessary. Moreover, reading each and every paper from end to end in order to assign it to a reviewer is a tedious task. In this paper, we describe the system that our team FideLIPI submitted to the shared task of SDPRA-2021 (https://sdpra-2021.github.io/website/ (accessed January 25, 2021)) [14]. It comprises four independent sub-systems capable of classifying abstracts of scientific literature into one of the given seven classes. The first is a RoBERTa [10] based model built over these abstracts. Adding topic model/Latent Dirichlet Allocation (LDA) [2] based features to the first model yields the second sub-system. The third is a sentence-level RoBERTa [10] model. The fourth is a Logistic Regression model built using Term Frequency Inverse Document Frequency (TF-IDF) features. We ensemble the predictions of these four sub-systems using majority voting to develop the final system, which achieves an F1 score of 0.93 on the test and validation sets. This outperforms the existing State Of The Art (SOTA) model SciBERT [1] in terms of F1 score on the validation set. Our codebase is available at https://github.com/SDPRA-2021/shared-task/tree/main/FideLIPI.
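The majority-voting combination of the four sub-systems can be sketched as follows; the class labels and per-model predictions are invented for illustration, and ties are broken by whichever label was voted first.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label predictions from several sub-systems by hard
    majority voting. `predictions` is a list of per-model prediction
    lists, all of the same length."""
    n = len(predictions[0])
    combined = []
    for i in range(n):
        votes = [model_preds[i] for model_preds in predictions]
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Four hypothetical sub-systems classifying three abstracts:
subsystems = [
    ["cs.CL", "cs.LG", "cs.CV"],   # RoBERTa on full abstracts
    ["cs.CL", "cs.LG", "cs.CL"],   # RoBERTa + LDA features
    ["cs.CL", "cs.CV", "cs.CV"],   # sentence-level RoBERTa
    ["cs.LG", "cs.LG", "cs.CV"],   # TF-IDF + Logistic Regression
]
print(majority_vote(subsystems))  # → ['cs.CL', 'cs.LG', 'cs.CV']
```

The same hard-voting strategy applies unchanged to the binary depressed/not-depressed labels used in the ensembles of the main paper.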
Conference Paper
Depression is a common but serious mental disorder that affects people all over the world. Besides providing an easier way of diagnosing the disorder, a computer-aided automatic depression assessment system is needed to reduce subjective bias in the diagnosis. We propose a multimodal fusion of speech and linguistic representations for depression detection. We train our model to infer the Patient Health Questionnaire (PHQ) score of subjects from the AVEC 2019 DDS Challenge database, the E-DAIC corpus. For the speech modality, we use deep spectrum features extracted from a pretrained VGG-16 network and employ a Gated Convolutional Neural Network (GCNN) followed by an LSTM layer. For the textual embeddings, we extract BERT textual features and employ a Convolutional Neural Network (CNN) followed by an LSTM layer. We achieved CCC scores of 0.497 and 0.608 on the E-DAIC corpus development set using the unimodal speech and linguistic models, respectively. We further combine the two modalities using a feature fusion approach in which the final representation of each single-modality model is fed to a fully-connected layer to estimate the PHQ score. With this multimodal approach, we achieve CCC scores of 0.696 on the development set and 0.403 on the testing set of the E-DAIC corpus, an absolute improvement of 0.283 points over the challenge baseline.
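The feature-fusion step described above amounts to concatenating the final representation from each unimodal branch and regressing the PHQ score with a fully-connected layer. A minimal NumPy sketch, with illustrative dimensions and untrained weights (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Final per-subject representations from the two unimodal branches
# (dimensions are illustrative, not from the paper).
speech_repr = rng.standard_normal(128)   # GCNN + LSTM branch output
text_repr = rng.standard_normal(64)      # CNN + LSTM branch output

# Feature fusion: concatenate, then a fully-connected layer
# regresses the PHQ score from the joint representation.
fused = np.concatenate([speech_repr, text_repr])   # shape (192,)
W = rng.standard_normal((1, fused.size)) * 0.01    # untrained weights
b = np.zeros(1)
phq_estimate = W @ fused + b                       # scalar PHQ estimate
print(fused.shape, phq_estimate.shape)             # (192,) (1,)
```

In practice the fully-connected layer is trained jointly with (or on top of) the two branches so the fusion weights learn how much each modality contributes.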
Article
OBJECTIVE: While considerable attention has focused on improving the detection of depression, assessment of severity is also important in guiding treatment decisions. Therefore, we examined the validity of a brief, new measure of depression severity. MEASUREMENTS: The Patient Health Questionnaire (PHQ) is a self-administered version of the PRIME-MD diagnostic instrument for common mental disorders. The PHQ-9 is the depression module, which scores each of the 9 DSM-IV criteria as “0” (not at all) to “3” (nearly every day). The PHQ-9 was completed by 6,000 patients in 8 primary care clinics and 7 obstetrics-gynecology clinics. Construct validity was assessed using the 20-item Short-Form General Health Survey, self-reported sick days and clinic visits, and symptom-related difficulty. Criterion validity was assessed against an independent structured mental health professional (MHP) interview in a sample of 580 patients. RESULTS: As PHQ-9 depression severity increased, there was a substantial decrease in functional status on all 6 SF-20 subscales. Also, symptom-related difficulty, sick days, and health care utilization increased. Using the MHP reinterview as the criterion standard, a PHQ-9 score ≥10 had a sensitivity of 88% and a specificity of 88% for major depression. PHQ-9 scores of 5, 10, 15, and 20 represented mild, moderate, moderately severe, and severe depression, respectively. Results were similar in the primary care and obstetrics-gynecology samples. CONCLUSION: In addition to making criteria-based diagnoses of depressive disorders, the PHQ-9 is also a reliable and valid measure of depression severity. These characteristics plus its brevity make the PHQ-9 a useful clinical and research tool.
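The PHQ-9 scoring rule described above (nine items rated 0 to 3, with severity cut points at 5, 10, 15, and 20 and a screening threshold of 10) can be sketched as follows; the function name and example item scores are illustrative only.

```python
def phq9_severity(item_scores):
    """Score a PHQ-9 questionnaire: nine items, each rated 0 ("not at
    all") to 3 ("nearly every day"); map the total to the severity
    bands reported in the abstract."""
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    if total >= 20:
        band = "severe"
    elif total >= 15:
        band = "moderately severe"
    elif total >= 10:
        band = "moderate"
    elif total >= 5:
        band = "mild"
    else:
        band = "minimal"
    # A total >= 10 had 88% sensitivity and 88% specificity
    # for major depression against the MHP reinterview.
    return total, band, total >= 10

print(phq9_severity([2, 1, 2, 1, 1, 2, 1, 1, 1]))  # → (12, 'moderate', True)
```

A positive screen (total >= 10) flags a patient for further clinical evaluation; it is not itself a diagnosis.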
Article
Depression has a profound impact on patient health, individual and family quality of life, activities of daily living, and daily functioning, as well as on healthcare providers, payers, and employers. Persons with depression tend to have multiple comorbidities that compound the negative effects and increase costs. The economic burden of the disease is significant, with direct medical costs estimated at $3.5 million per 1000 plan members with depression. Depression is significantly underdiagnosed and undertreated, particularly in primary care where the majority of patients with depression seek care. Effective strategies to achieve remission have been identified and have proven effective in clinical trials. Early detection, intervention, and appropriate treatment can promote remission, prevent relapse, and reduce the emotional and financial burden of the disease.
Detection and Classification of mental illnesses on social media using RoBERTa
  • A Murarka
  • B Radhakrishnan
  • S Ravichandran
Ensemble BERT for Classifying Medication-mentioning Tweets
  • H Dang
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • V Sanh
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • J Devlin