Article

DPre: Effective preprocessing techniques for social media depressive text

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Depression has become one of the most common public health issues. Many people with depression rely on social media to express their grief. The text these users generate can be used to advance research in this area, with the aim of detecting early-stage depression and providing support. However, social media text cannot be used directly to develop a reliable automatic depression detection system, because it contains a great deal of irrelevant, inaccurate, and noisy information. Moreover, the basic preprocessing steps used with most machine learning models have limited functionality and therefore cause substantial information loss. Such loss is unaffordable, especially in affective computing for mental health text. In this paper, we present DPre, a set of preprocessing techniques for depressive text that obtains readable text from raw and noisy tweets. This method can help minimize the loss of information and expressions hidden in the raw tweet. Moreover, the processed, clean text is ready to be input into any machine learning algorithm. The readability of the processed text is evaluated and compared with that of raw tweets using four readability scores: the Flesch Reading Ease score, the Flesch-Kincaid score, the Coleman-Liau Index, and the Dale-Chall score. Compared with basic state-of-the-art preprocessing methods, the proposed method significantly improved the readability scores.
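To make the evaluation step more concrete, the following is a minimal sketch of how three of the readability scores named in the abstract could be computed for a piece of text, using the standard published formulas. It is not the authors' implementation: the function names `count_syllables` and `readability_scores` are hypothetical, the vowel-group syllable counter is a rough assumption, and the Dale-Chall score is omitted because it requires a list of familiar words.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_scores(text: str) -> dict:
    """Compute Flesch Reading Ease, Flesch-Kincaid grade, and Coleman-Liau index."""
    # Naive sentence and word splitting; real tweets would need a sturdier tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    n_sentences = len(sentences)
    n_words = len(words)
    n_syllables = sum(count_syllables(w) for w in words)
    n_letters = sum(len(w) for w in words)

    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    letters_per_100_words = 100 * n_letters / n_words
    sentences_per_100_words = 100 * n_sentences / n_words

    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        "coleman_liau_index": 0.0588 * letters_per_100_words
        - 0.296 * sentences_per_100_words
        - 15.8,
    }


if __name__ == "__main__":
    for name, value in readability_scores("I feel sad. I cannot sleep.").items():
        print(f"{name}: {value:.2f}")
```

Under this sketch, a raw tweet and its preprocessed version could each be passed to `readability_scores` and the results compared; a higher Flesch Reading Ease and lower grade-level indices would indicate text that is easier to read.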

Chapter
Full-text available
Depression is presently one of society's main psychological disorders. The emergence of coronavirus disease 2019 (COVID-19) has intensified public concern about mental health. Research on representing human emotional states has shifted from basic emotions to a large number of emotions in a continuous three-dimensional space, owing to the complexity of describing and evaluating a vast number of emotions within a single framework. Considering the 3D continuous valence, arousal and dominance (VAD) space is important when addressing mental health issues, as these dimensions relate to the expression of emotion and behavioural reactions. The goal of this research is to design a machine learning regressor model to estimate the continuous valence, arousal and dominance scores that result from interpreting emotion in text. To pursue this goal, we use the EmoBank dataset, which contains text along with valence–arousal–dominance values, and, for validation, ISEAR, a labelled corpus of categorical emotions. We learn an embedding using three pre-trained word embeddings: word2vec, Doc2vec and BERT, and find that BERT significantly outperforms the others. In a future study, the regressor model will be adopted in depression detection by distributing the categorical negative emotions in terms of VAD.
Article
Full-text available
Objectives: The study sought to test the feasibility of using Twitter data to assess determinants of consumers' health behavior toward human papillomavirus (HPV) vaccination informed by the Integrated Behavior Model (IBM). Materials and methods: We used 3 Twitter datasets spanning from 2014 to 2018. We preprocessed and geocoded the tweets, and then built a rule-based model that classified each tweet into either promotional information or consumers' discussions. We applied topic modeling to discover major themes and subsequently explored the associations between the topics learned from consumers' discussions and the responses of HPV-related questions in the Health Information National Trends Survey (HINTS). Results: We collected 2 846 495 tweets and analyzed 335 681 geocoded tweets. Through topic modeling, we identified 122 high-quality topics. The most discussed consumer topic is "cervical cancer screening"; while in promotional tweets, the most popular topic is to increase awareness of "HPV causes cancer." A total of 87 of the 122 topics are correlated between promotional information and consumers' discussions. Guided by IBM, we examined the alignment between our Twitter findings and the results obtained from HINTS. Thirty-five topics can be mapped to HINTS questions by keywords, 112 topics can be mapped to IBM constructs, and 45 topics have statistically significant correlations with HINTS responses in terms of geographic distributions. Conclusions: Mining Twitter to assess consumers' health behaviors can not only obtain results comparable to surveys, but also yield additional insights via a theory-driven approach. Limitations exist; nevertheless, these encouraging results impel us to develop innovative ways of leveraging social media in the changing health communication landscape.
Article
Full-text available
Readability indices have been widely used to measure textual difficulty. They can be useful for the automatic classification of texts, especially in language teaching. Among other applications, they allow the difficulty level of a text to be determined in advance, without the need to read it through. The aim of this research is twofold: first, to examine the accuracy of the six most commonly used readability indices, and second, to present a new optimized measure. The main problem is that these readability indices may give disparate results, and this is precisely what has motivated our attempt to unite their potential. A discriminant analysis of all the variables under examination has enabled the creation of a much more precise model, improving the previous best results by 15%. Furthermore, errors and disparities in the difficulty level of the analyzed texts have been detected.
Article
Full-text available
Background Mobile apps have become popular resources for mental health support. Availability of information about developers' data security procedures for health apps, specifically those targeting mental health, has not been thoroughly investigated. If people are to use and trust these tools for their mental health, it is crucial we evaluate the transparency and quality around the data practices of these apps. The present study reviewed data security and privacy policies of mobile apps for depression. Methods We reviewed mobile apps retrieved from iTunes and Google Play stores in October 2017, using the term “depression”, and evaluated the transparency of data handling procedures of those apps. Results We identified 116 eligible mobile phone apps. Of those, 4% (5/116) received a transparency score of acceptable, 28% (32/116) questionable, and 68% (79/116) unacceptable. Only a minority of the apps (49%) had a privacy policy. The availability of policies differed significantly by platform, with apps from iTunes more likely to have a policy than from the Google Play store. Mobile apps collecting identifiable information were significantly more likely to have a privacy policy (79%) compared to those collecting only non-identifiable information (34%). Conclusion The majority of apps reviewed were not sufficiently transparent with information regarding data security. Apps have great potential to scale mental health resources, providing resources to people unable or reluctant to access traditional face-to-face care, or as an adjunct to treatment. However, if they are to be a reasonable resource, they must be safe, secure, and responsible.
Article
Full-text available
Mental health detection in Online Social Networks (OSNs) has been widely studied in recent years. OSNs have encouraged new ways to communicate and share information, and they are used regularly by millions of people. They generate a large amount of information that can be utilised to develop mental health detection. The rich content provided by OSNs should not be overlooked, as it could add value to the data explored by researchers. The main purpose of this study is to extract and scrutinise related work from the literature on the detection of mental health using OSNs. The study focuses on the methods used, the machine learning algorithms, the OSN sources, and the languages used for mental health detection. The basic design of this study is a survey of the literature related to current research in mental health. Major findings revealed that the most frequently used methods in mental health detection are machine learning techniques, with Support Vector Machine (SVM) as the most frequently chosen algorithm. Meanwhile, Twitter is the major OSN data source, and English is the language most used for mental health detection. The researcher found several challenges in the previous studies and analyses, including language barriers, account privacy in OSNs, reliance on a single type of OSN, text analysis, and limited feature selection. Based on these limitations, the researcher outlined a future direction for mental health detection using language based on the user’s geo-location and mother tongue. The use of pictorial, audio and video formats in OSNs could become one of the potential areas to be explored in future research. Extracting data from multiple OSN sources with new feature selection will probably improve mental health detection in the future. In conclusion, this research has great potential to be explored further in the future.
Article
Full-text available
Psychological stress is threatening people’s health. It is non-trivial to detect stress in time for proactive care. With the popularity of social media, people are used to sharing their daily activities and interacting with friends on social media platforms, making it feasible to leverage online social network data for stress detection. In this paper, we find that a user’s stress state is closely related to that of his/her friends in social media, and we employ a large-scale dataset from real-world social platforms to systematically study the correlation between users’ stress states and social interactions. We first define a set of stress-related textual, visual, and social attributes from various aspects, and then propose a novel hybrid model, a factor graph model combined with a Convolutional Neural Network, to leverage tweet content and social interaction information for stress detection. Experimental results show that the proposed model can improve the detection performance by 6-9% in F1-score. By further analyzing the social interaction data, we also discover several intriguing phenomena, i.e. the number of social structures of sparse connections (i.e. with no delta connections) of stressed users is around 14% higher than that of non-stressed users, indicating that the social structure of stressed users’ friends tends to be less connected and less complicated than that of non-stressed users.
Conference Paper
Full-text available
Prescription drug abuse is one of the fastest growing public health problems in the USA. To address this epidemic, a near real-time monitoring strategy, instead of one relying on retrospective health records, may improve detection of the prevalence and patterns of abuse of both illegal drugs and prescription medications. In this paper, our primary goal is to demonstrate the possibility of utilizing social media, e.g., Twitter, for automatic monitoring of illegal drug and prescription medication abuse. We use machine learning methods for automatic classification that can identify tweets indicative of drug abuse. We collected tweets associated with well-known illegal and prescription drugs. We manually annotated 300 tweets that are likely to be related to drug abuse. Our experiment compares a set of classification algorithms; a decision tree classifier (J48) and an SVM outperform the others in determining whether tweets contain signals of drug abuse. The results of this automatic supervised classification study illustrate the utility of Twitter in examining patterns of abuse, and show the feasibility of building a drug abuse detection system that can process large volumes of data from social media sources in near real time.
Article
Full-text available
The ubiquitous nature of online social media and the ever-expanding usage of short text messages make them a potential source for crowd wisdom extraction, especially in terms of sentiment; sentiment classification and analysis is therefore a significant task in current research. A major challenge in this area is to tame the data in terms of noise, relevance, emoticons, folksonomies and slang. This work examines the effect of pre-processing on Twitter data for strengthening sentiment classification, especially with respect to slang words. The proposed pre-processing method relies on the bindings of slang words to other coexisting words to check the significance and sentiment translation of the slang word. We have used n-grams to find the bindings and conditional random fields to check the significance of slang words. Experiments carried out to observe the effect of the proposed method on sentiment classification clearly indicate improvements in classification accuracy.
Conference Paper
Full-text available
We describe the Sentiment Analysis in Twitter task, run as part of SemEval-2014. It is a continuation of last year’s task, which ran successfully as part of SemEval-2013. As in 2013, this was the most popular SemEval task; a total of 46 teams contributed 27 submissions for subtask A (21 teams) and 50 submissions for subtask B (44 teams). This year, we introduced three new test sets: (i) regular tweets, (ii) sarcastic tweets, and (iii) LiveJournal sentences. We further tested on (iv) 2013 tweets, and (v) 2013 SMS messages. The highest F1-score on (i) was achieved by NRC-Canada at 86.63 for subtask A and by TeamX at 70.96 for subtask B.
Conference Paper
Full-text available
In this paper, we extensively evaluate the effectiveness of using a user's social media activities for estimating degree of depression. As ground truth data, we use the results of a web-based questionnaire for measuring degree of depression of Twitter users. We extract several features from the activity histories of Twitter users. By leveraging these features, we construct models for estimating the presence of active depression. Through experiments, we show that (1) features obtained from user activities can be used to predict depression of users with an accuracy of 69%, (2) topics of tweets estimated with a topic model are useful features, (3) approximately two months of observation data are necessary for recognizing depression, and longer observation periods do not contribute to improving the accuracy of estimation for current depression; sometimes, longer periods worsen the accuracy.
Conference Paper
Full-text available
For INEX 2011 QA track, we wanted to measure the impact of two generic measures of readability in the selection of sentences related to topics. This is a step towards adaptive information retrieval approaches that take into account the reading skills of users and their level of expertise. We show that Flesch and Dale-Chall measures do not allow to filter sentences for obtaining a satisfactory readability level for INEX QA 2011 track and that the corresponding scores are not correlated to human assessment.
Conference Paper
Full-text available
This paper investigates the impact of misspelled words in statistical machine translation and proposes an extension of the translation engine for handling misspellings. The enhanced system decodes a word-based confusion network representing spelling variations of the input text. We present extensive experimental results on two translation tasks of increasing complexity which show how misspellings of different types do affect performance of a statistical machine translation decoder and to what extent our enhanced system is able to recover from such errors.
Chapter
Recognition of mental state (stress, anxiety, or depression) of a person is an important subject of research to avoid any unfortunate happening. Factors such as the declining economy, fear of the virus, and social alienation have recently affected the spike in depression and anxiety that followed the onset of pandemic. There is mounting evidence that people with mental disorders use social media at a high pace. We therefore explore the prospects of social media in online personas. In this paper, we present a thorough review of various approaches used in literature for detecting depression. It is followed by a discussion on identified gaps and challenges. The presented studies can provide new direction to the researchers who are working in the field of depression detection.
Article
In spite of the growing opportunities and demands for using social media to assist government decision-making, few studies have investigated social media sentiments toward public services due to the large volume and noisy nature of big data. Taking a design science approach, this paper suggests a systematic method to assign tweets to each of the SERVQUAL dimensions to identify sentiments and track perceived service quality of healthcare services for policy makers. The method consists of (1) identifying more reliable topic sets through repeated latent Dirichlet allocation (LDA) and clustering; and (2) classifying tweets using topics based on an existing theory for service quality. The method is applied to tweets on the quality of the NHS of the UK to demonstrate its usability. We measured social perceptions of healthcare service quality and identified keywords for each SERVQUAL dimension. Moreover, a comparison between the social perceptions derived from the tweets and traditional survey results on the same service quality shows a similarity that confirms the usability of the proposed method. The method has practical value as a complementary tool for the more expensive national-scale surveys, as well as academic value as a novel method integrating text mining with a theoretically sound quality framework, SERVQUAL.
Chapter
Personality and character have major effects on certain behavioural outcomes. As advancements in technology occur, more people these days are using social media such as Facebook, Twitter and Instagram. Due to the increase in social media’s popularity, types of behaviours are now easier to group and study as this is important to know the behaviour of users via social networking in order to analyse similarities of certain behaviour types and this can be used to predict what they post as well as what they comment, share and like on social networking sites. However, very few review studies have undertaken grouping according to similarities and differences to predict the personality and behaviour of individuals with the help of social networking sites such as Facebook, Twitter and Instagram. Therefore, the purpose of this research is to collect data from previous researches and to analyse the methods they have used. This paper reviewed 30 research studies on the topic of behavioural analysis using social media from 2015 to 2017. This research is based upon the method of previous publications and analyzed the results, limitations and number of users to draw conclusions. Our results indicated that the percentage of completed research on the Facebook, Twitter and Instagram show that 50% of the studies were done on Twitter, 27% on Facebook and 23% on Instagram. Twitter seems to be more popular and recent than the other two spheres as there are more studies on it. Further, we extracted the studies based on the year and graphs in 2015 indicated that more research has been done on Facebook to analyze the behavior of users and the trends are decreasing in the following year. However, more studies have been done on Twitter in 2016 than any other social media. The results also show the classifications based on different methods to analyze individual behavior. 
However, most of studies have been done on Twitter as it is more popular and newer than Facebook and Instagram particularly from 2015 to 2017, and more research needs to be done on other social media spheres in order to analyze the trending behaviors of users. This study should be useful to get knowledge about the methods used to analyze user behavior with description, limitations and results. Although some researchers collect demographic information on users’ gender on Facebook, others on Twitter do not. This lack of demographic data, which is typically available in more traditional sources such as surveys, has created a new focus on developing methods to work out these traits as a means of expanding Big Data research.
Conference Paper
Depression is among the most commonly diagnosed mental disorders around the world. With the increasing popularity of online social network platforms and the advances in data science, more research efforts have been spent on understanding mental disorders through social media by analysing linguistic style, sentiment, online social networks and other activity traces. However, the role of basic emotions and their changes over time, have not yet been fully explored in extant work. In this paper, we proposed a novel approach for identifying users with or at risk of depression by incorporating measures of eight basic emotions as features from Twitter posts over time, including a temporal analysis of these features. The results showed that emotion-related expressions can reveal insights of individuals' psychological states and emotions measured from such expressions show predictive power of identifying depression on Twitter. We also demonstrated that the changes in an individual's emotions as measured over time bear additional information and can further improve the effectiveness of emotions as features, hence, improve the performance of our proposed model in this task.
Article
This study evaluated 56 documents developed by 11 nonprofit and public social service agencies to provide information to clients. The author used the Flesch Reading Ease, Simple Measure of Gobbledygook (SMOG), and Gunning Fog Index formulas to assess reading grade levels and the Suitability Assessment of Material to evaluate overall suitability for readers with limited literacy skills. Findings: All documents but one were above the recommended fifth grade level for low-literacy materials. Suitability Assessment of Material scores indicated 44.6% (n = 26) were not appropriately formatted for readers with limited literacy skills. Applications: Findings suggest a need for improving social service agencies’ and practitioners’ knowledge and awareness of the importance of assessing the readability and suitability of print materials, especially those intended for clients.
Article
Objectives: Wikipedia is the largest online encyclopedia with over 40 million articles, and generating 500 million visits per month. The aim of this study is to assess the readability and quality of Wikipedia pages on neurosurgical related topics. Patients and methods: We selected the neurosurgical related Wikipedia pages based on the series of online patient information articles that are published by the American Association of Neurological Surgeons (AANS). We assessed readability of Wikipedia pages using five different readability scales (Flesch Reading Ease, Flesch Kincaid Grade Level, Gunning Fog Index, SMOG Grade Level, and Coleman-Liau Index). We used the Center for Disease Control (CDC) Clear Communication Index as well as the DISCERN Instrument to evaluate the quality of each Wikipedia article. Results: We identified a total of fifty-five Wikipedia articles that corresponded with patient information articles published by the AANS. This constitutes 77.46% of the AANS topics. The mean Flesch Kincaid reading ease score for all of the Wikipedia articles we analyzed is 31.10, which indicates that a college-level education is necessary to understand them. In comparison to the readability analysis for the AANS articles, the Wikipedia articles were more difficult to read across every scale. None of the Wikipedia articles meet the CDC criterion for clear communications. Conclusion: Our analyses demonstrated that Wikipedia articles related to neurosurgical topics are associated with higher grade levels for reading and also below the expected levels of clear communications for patients. Collaborative efforts from the neurosurgical community are needed to enhance the readability and quality of Wikipedia pages related to neurosurgery.
Conference Paper
Depression is a major contributor to the overall global burden of diseases. Traditionally, doctors diagnose depressed people face to face via referring to clinical depression criteria. However, more than 70% of the patients would not consult doctors at early stages of depression, which leads to further deterioration of their conditions. Meanwhile, people are increasingly relying on social media to disclose emotions and sharing their daily lives, thus social media have successfully been leveraged for helping detect physical and mental diseases. Inspired by these, our work aims to make timely depression detection via harvesting social media data. We construct well-labeled depression and non-depression dataset on Twitter, and extract six depression-related feature groups covering not only the clinical depression criteria, but also online behaviors on social media. With these feature groups, we propose a multimodal depressive dictionary learning model to detect the depressed users on Twitter. A series of experiments are conducted to validate this model, which outperforms (+3% to +10%) several baselines. Finally, we analyze a large-scale dataset on Twitter to reveal the underlying online behaviors between depressed and non-depressed users.
Article
The study of emotion dynamics involves the study of the trajectories, patterns, and regularities with which emotions (or rather, the experiential, physiological, and behavioral elements that constitute an emotion) fluctuate across time, their underlying processes, and downstream consequences. Here, we formulate some of the basic principles underlying emotional change over time, discuss methods to study emotion dynamics, their relevance for psychological well-being, and a number of challenges and opportunities for the future.
Conference Paper
Classification refers to the computational techniques for classifying whether the sentiments of text are positive or negative. Sentiment Classification being a specialized domain of text mining is expected to benefit after preprocessing such as removing stopwords. Stopwords are frequently occurring words that hardly carry any information and orientation. In this paper the effect of stopwords removal on various sentiment classification models was analyzed. Sentiment Classification models were evaluated using the movie document dataset. Accuracy increased from unprocessed dataset to stopwords removed dataset for Traditional Sentiment Classifiers. Our classifiers had hardly any impact of stopwords removal which indicates that they handled stopwords at the time of classification itself. Our classifiers also displayed accuracy better than traditional classifier and another surveyed classifier based on term weighting technique.
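As an illustration of the stopword-removal step discussed in this abstract, here is a minimal sketch. The stopword list is a small hypothetical sample chosen for the example, not the list used in that study, and the function name `remove_stopwords` is likewise an assumption.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "of", "to", "and", "in", "this", "it"}


def remove_stopwords(text: str, stopwords: set = STOPWORDS) -> list:
    """Lowercase the text, split on whitespace, and drop tokens in the stopword list."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    print(remove_stopwords("This movie is a waste of time"))
```

Which words belong in the list is a design choice: negations such as "not" appear in some general-purpose stopword lists, and removing them can change the sentiment a sentence carries.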
Article
The degree of similarity between sentences is assessed by sentence similarity methods. Sentence similarity methods play an important role in areas such as summarization, search, and categorization of texts, machine translation, etc. The current methods for assessing sentence similarity are based only on the similarity between the words in the sentences. Such methods either represent sentences as bag of words vectors or are restricted to the syntactic information of the sentences. Two important problems in language understanding are not addressed by such strategies: the word order and the meaning of the sentence as a whole. The new sentence similarity assessment measure presented here largely improves and refines a recently published method that takes into account the lexical, syntactic and semantic components of sentences. The new method was benchmarked using Li–McLean, showing that it outperforms the state of the art systems and achieves results comparable to the evaluation made by humans. Besides that, the method proposed was extensively tested using the SemEval 2012 sentence similarity test set and in the evaluation of the degree of similarity between summaries using the CNN-corpus. In both cases, the measure proposed here was proved effective and useful.
Article
The integration of multiword expressions in a parsing procedure has been shown to improve accuracy in an artificial context where such expressions have been perfectly pre-identified. This paper evaluates two empirical strategies to integrate multiword units in a real constituency parsing context and shows that the results are not as promising as has sometimes been suggested. Firstly, we show that pre-grouping multiword expressions before parsing with a state-of-the-art recognizer improves multiword recognition accuracy and unlabeled attachment score. However, it has no statistically significant impact in terms of F-score as incorrect multiword expression recognition has important side effects on parsing. Secondly, integrating multiword expressions in the parser grammar followed by a reranker specific to such expressions slightly improves all evaluation metrics.
Article
Emotions are viewed as having evolved through their adaptive value in dealing with fundamental life-tasks. Each emotion has unique features: signal, physiology, and antecedent events. Each emotion also has characteristics in common with other emotions: rapid onset, short duration, unbidden occurrence, automatic appraisal, and coherence among responses. These shared and unique characteristics are the product of our evolution, and distinguish emotions from other affective phenomena.