Article

Utilizing a multi-class classification approach to detect therapeutic and recreational misuse of opioids on Twitter


Abstract

Background Opioid misuse (OM) is a major health problem in the United States and can lead to addiction and fatal overdose. We sought to employ natural language processing (NLP) and machine learning to categorize Twitter chatter based on the motive of OM. Materials and Methods We collected data from Twitter using opioid-related keywords, and manually annotated 6,988 tweets into three classes—No-OM, Pain-related-OM, and Recreational-OM—with the No-OM class representing tweets indicating no use/misuse, and the other two classes representing misuse for pain or for recreation/addiction. We trained and evaluated multi-class classifiers, and performed term-level k-means clustering to assess whether there were terms closely associated with the three classes. Results On a held-out test set of 1,677 tweets, a transformer-based classifier (XLNet) achieved the best performance, with F1-scores of 0.71 for the Pain-related-OM class and 0.79 for the Recreational-OM class. Macro- and micro-averaged F1-scores over all classes were 0.82 and 0.92, respectively. Content analysis using clustering revealed distinct clusters of terms associated with each class. Discussion While some past studies have attempted to automatically detect opioid misuse, none have further characterized the motive for misuse. Our multi-class classification approach using XLNet showed promising performance, including in detecting the subtle differences between pain-related and recreation-related misuse. The distinct clustering of class-specific keywords may help conduct targeted data collection, overcoming the under-representation of minority classes. Conclusion Machine learning can help identify pain-related and recreation-related OM content on Twitter, potentially enabling the study of the characteristics of individuals exhibiting such behavior.
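The gap between the macro-averaged (0.82) and micro-averaged (0.92) F1-scores reported above is typical of imbalanced data: micro-averaging weights every tweet equally, so the large No-OM class dominates, while macro-averaging weights every class equally and so exposes weaker minority-class performance. A minimal sketch of the two averages, with invented toy labels standing in for the real annotations:

```python
def f1_scores(gold, pred, labels):
    """Per-class F1 plus macro and micro averages (single-label, multi-class)."""
    per_class = {}
    tp_all = fp_all = fn_all = 0
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class.values()) / len(labels)
    mp = tp_all / (tp_all + fp_all)
    mr = tp_all / (tp_all + fn_all)
    micro = 2 * mp * mr / (mp + mr)
    return per_class, macro, micro

# Invented toy labels: one majority class, two rare ones, one minority error.
gold = ["no"] * 4 + ["pain", "rec"]
pred = ["no"] * 4 + ["rec", "rec"]
per_class, macro, micro = f1_scores(gold, pred, ["no", "pain", "rec"])
```

Even a single minority-class error drags the macro average well below the micro average, mirroring the pattern in the abstract.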

... While only a small subset of users opt to do this (under 2% (36)), the fact that this information is present on the platform in some regard means that it can be harnessed by researchers. Many groups have leveraged explicit tweet geotagging for opioid-related research (37-43) and other research areas (44,45). These studies typically only analyze tweets with geotags. ...
... Many research groups have also created models to automatically identify posts on X and Reddit with discussion related to opioids (42,43,109,124-127). Others have used these platforms to characterize factors that influence opioid use, recovery, and the opioid epidemic generally (6,8,128-132), with particular interest shown to the impact of the COVID-19 pandemic (103,120,133-139) and co-use between opioids and other drugs (140,141). ...
... Individuals who initiate opioid use at a younger age are more susceptible to substance use disorders (168); monitoring discussions around opioid use in these age groups could help inform preventative programs that target young individuals. By contrast, Facebook skews slightly older, with 75% of 30-49 year olds reporting that they use the platform, as opposed to 67% of 18-29 year olds (169). Platform demographics also vary by gender. ...
Preprint
Social media can provide real-time insight into trends in substance use, addiction, and recovery. Prior studies have used platforms such as Reddit and X (formerly Twitter), but evolving policies around data access have threatened these platforms’ usability in research. We evaluate the potential of a broad set of platforms to detect emerging trends in the opioid epidemic. From these, we created a shortlist of 11 platforms, for which we documented official policies regulating drug-related discussion, data accessibility, geolocatability, and prior use in opioid-related studies. We quantified their volumes of opioid discussion, capturing informal language by including slang generated using a large language model. Beyond the most commonly used Reddit and X, the platforms with high potential for use in opioid-related surveillance are TikTok, YouTube, and Facebook. Leveraging many different social platforms, instead of a single platform, safeguards against sudden changes to data access and may better capture all populations that use opioids than any single platform. Teaser TikTok, Facebook, and YouTube may complement Reddit and X as text sources to monitor trends in the opioid epidemic.
... Machine learning techniques have demonstrated high effectiveness in distinguishing between drug- and non-drug-related content. Several methods have been widely used for this purpose, including random forest (RF) [22], support vector machines (SVM) [23], and long short-term memory (LSTM) networks [22]. These methods have proven to be moderately successful, achieving 85% accuracy in identifying drug-related tweets; however, their margin of error still limits their practical usefulness. ...
... SVM, LSTM, and RF are the most commonly employed algorithms. Studies such as [23] and [28] have reported accuracy rates of over 90% with SVM, although SVM was outperformed by RF on the F-measure criterion [22,29]. There have also been promising results from deep learning techniques, particularly CNNs, which were part of the method utilized in [27], and recurrent neural networks (RNNs), which were proposed in [26]. ...
Article
Full-text available
There is a growing trend for groups associated with drug use to exploit social media platforms to propagate content that poses a risk to the population, especially those susceptible to drug use and addiction. Detecting drug-related social media content has become important for governments, technology companies, and those responsible for enforcing laws against proscribed drugs. Their efforts have led to the development of various techniques for identifying and efficiently removing drug-related content, as well as for blocking network access for those who create it. This study introduces a manually annotated Twitter dataset consisting of 112,057 tweets from 2008 to 2022, compiled for use in detecting associations connected with drug use. Working in groups, expert annotators classified tweets as either related or unrelated to drug use. The dataset was subjected to exploratory data analysis to identify its defining features. Several classification algorithms, including support vector machines, XGBoost, random forest, Naive Bayes, LSTM, and BERT, were used in experiments with this dataset. Among the baseline models, BERT with textual features achieved the highest F1-score, at 0.9044. However, this performance was surpassed when the BERT base model and its textual features were concatenated with a deep neural network model, incorporating numerical and categorical features in the ensemble method, achieving an F1-score of 0.9112. The Twitter dataset used in this study was made publicly available to promote further research and enhance the accuracy of the online classification of English-language drug-related content.
... Fodeh et al. (2021) [27] proposed a multi-class classification approach to categorize Twitter chatter based on the motive of opioid misuse. They collected nearly 7,000 tweets and manually annotated them into three classes. ...
... For example, [18] concluded that between hierarchical clustering, DBSCAN clustering, and K-means, the last provided the most precise results with the most distinct and relevant clusters and was selected as their main algorithm. The K-means algorithm has been widely used in the field of NLP, mainly to analyze short texts like tweets [27,41], web-short snippets [42], questions [42], news [21,43], and biomedical texts [42]. Thus, K-means clustering is an adequate and widely used tool for analyzing text data. ...
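The term-level K-means setup these excerpts describe can be sketched without any dependencies: TF-IDF vectors over tokenized short texts, cosine similarity, and K-means with fixed seed documents. The four toy "tweets", the seed choices, and the tiny vocabulary are all invented for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vecs, seeds, iters=10):
    """K-means with cosine similarity; seeds are indices of initial centroids."""
    centroids = [dict(vecs[i]) for i in seeds]
    assign = [0] * len(vecs)
    for _ in range(iters):
        for j, v in enumerate(vecs):
            assign[j] = max(range(len(centroids)),
                            key=lambda i: cosine(v, centroids[i]))
        for i in range(len(centroids)):
            members = [vecs[j] for j, a in enumerate(assign) if a == i]
            if members:  # centroid = term-wise mean of member vectors
                c = Counter()
                for v in members:
                    for t, w in v.items():
                        c[t] += w / len(members)
                centroids[i] = dict(c)
    return assign

# Invented toy tweets: two pain-related, two recreational.
tweets = [["oxycodone", "for", "back", "pain"],
          ["tramadol", "helps", "my", "pain"],
          ["high", "on", "percs", "party"],
          ["popping", "percs", "to", "get", "high"]]
assign = kmeans(tfidf(tweets), seeds=[0, 2])
```

Shared class-specific terms ("pain" vs. "percs"/"high") pull each pair into the same cluster, which is the intuition behind the class-specific keyword clusters reported in the main article.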
Preprint
Full-text available
The large quantity of information retrieved from communities, public data repositories, web pages, or data mining can be sparse and poorly classified. This work shows how to employ unsupervised classification algorithms such as K-means to classify user reviews into their closest category, forming a balanced data set. Moreover, we found that the text vectorization technique significantly impacts cluster formation, comparing TF-IDF and Word2Vec. A cluster could be mapped to a movie genre in 81.34% ± 20.48 of cases when TF-IDF was applied, whereas Word2Vec only yielded 53.51% ± 24.1. In addition, we highlight the impact of stop-word removal. We detected that pre-compiled lists are not the best method for removing stop-words before clustering because there is much ambiguity, centroids are poorly separated, and only 57% of clusters could match a movie genre; our proposed approach instead achieved 94% accuracy. After analyzing the classifiers' results, we observed a similar effect when results were divided by stop-word removal method. Statistically significant changes were observed, especially in the precision metric and Jaccard scores in both classifiers, when using custom-generated stop lists rather than pre-compiled ones. Reclassifying sparse data using custom-generated stop lists is strongly recommended.
... Notably, Support Vector Machine (SVM), Random Forest (RF), and Long Short-Term Memory (LSTM) are the most commonly implemented algorithms. SVM achieved accuracy rates exceeding 90% in studies such as [25] [26], while RF outperformed SVM in terms of F-measure in [27] [28]. Deep learning techniques, specifically Convolutional Neural Networks (CNN) partially utilized in [24], and Recurrent Neural Networks (RNN) proposed in [23], have also demonstrated promising results. ...
... Deep learning techniques, specifically Convolutional Neural Networks (CNN) partially utilized in [24], and Recurrent Neural Networks (RNN) proposed in [23], have also demonstrated promising results. LSTM networks have been utilized to develop systems for detecting drug use content on social media, alongside techniques such as SVM, RF, and BERT [28], achieving a precision rate of 85%. ...
Preprint
Full-text available
Social media platforms are increasingly enabling the propagation of content from groups related to drug use, thus posing risks for the wider population and, in particular, individuals who are amenable to drug use and drug addiction. The detection of drug use content on social media platforms is a priority for governments, technology companies, and drug law enforcement organizations. To counter this issue, various techniques have been developed to identify and promptly remove drug use content, while also blocking its creators from network access. In this paper, we introduce a manually annotated Twitter dataset, comprising 156,521 tweets published between 2008 and 2022, specifically compiled for the purpose of drug use detection. The dataset underwent annotation by several groups of expert annotators, who classified the tweets as either drug use or non-drug use. Exploratory data analysis was conducted to comprehend the dataset's characteristics. Various classification algorithms, including SVM, XGBoost, RF, NB, LSTM, and BERT, were employed using the dataset. Among the traditional machine learning models, SVM utilizing term frequency-inverse document frequency features achieved the highest F1-score (0.9017). However, BERT with textual features concatenated with numerical and categorical features in an ensemble method surpassed the performance of traditional models, attaining an F1-score of 0.9112. To facilitate future research and enhance English online drug use classification accuracy, the dataset will be made publicly available.
... We conducted several studies in which we tried to further filter our data to remove noise. For example, we developed methods for detecting and removing bots from our cohort, 13 comparing therapeutic and recreational use of opioids from Twitter data by employing a multi-class classification strategy, 14 and automating the detection of illicit opioid use. 15 For some of our targeted analysis of Twitter data, we used only postlevel data samples (i.e., only the posts that contained the medication names rather than longitudinal data from the cohort members). ...
Preprint
Full-text available
Substance use, substance use disorder, and overdoses related to substance use are major public health problems globally and in the United States. A key aspect of addressing these problems from a public health standpoint is improved surveillance. Traditional surveillance systems are laggy, and social media are potentially useful sources of timely data. However, mining knowledge from social media is a challenging task and requires the development of advanced artificial intelligence, specifically natural language processing and machine learning methods. Funded by the National Institute on Drug Abuse, we developed a sophisticated end-to-end pipeline for mining information about nonmedical prescription medication use from social media, namely Twitter and Reddit. In this paper, we describe the progress we have made over four years, including our automated data mining infrastructure, existing challenges in social media mining for toxicovigilance, and possible future research directions.
... Additionally, pre-existing imaging data can be used to identify neural features, such as reduction in subcortical integrity, associated with OUD [77]. Patterns and motives of use can even be extracted from social media [78-80]. One study identified comments and conversations on suicide among opioid users on Instagram, representing a key health risk factor for OUD on a platform heavily utilized by young adults [80]. ...
Article
Intentional overdose (OD) of over-the-counter (OTC) and prescription drugs is becoming a significant social issue all over the world. While previous research has focused on drug misuse, there has been limited analysis using social networking service data. This study aims to analyze posts related to a drug overdose on Twitter® (X®) to understand the characteristics and trends of drug misuse, and to examine the applicability of social media in understanding the current situation of OD through natural language processing techniques. We collected posts in Japanese containing the term “OD” from January 10 to February 8, 2023, and analyzed 30203 posts. Using a pre-trained, fine-tuned bidirectional encoder representations from transformers (BERT) model, we classified the posts into categories, including direct mentions of OD. We examined the content for drug types and emotional context. Among the 5283 posts categorized as “Posts describing ODing,” about one-third included specific drug names or related terms. The most frequently mentioned OTC drugs included active ingredients such as codeine, dextromethorphan, ephedrine, and diphenhydramine. Prescription drugs, particularly benzodiazepines and pregabalin, were also common. Tweets peaked at midnight, suggesting a link between negative emotions and potential OD incidents. Our classifier showed high accuracy in distinguishing OD-related posts. Analyzing Twitter® posts provides valuable insights into the patterns and emotional contexts of drug misuse. Monitoring social networking services for OD-related content could help identify high-risk individuals and inform prevention strategies. Enhanced monitoring and public awareness are crucial to reducing the risks associated with both OTC and prescription drug misuse.
Article
Full-text available
Reclassification of massive datasets acquired through different approaches, such as web scraping, is a big challenge to demonstrate the effectiveness of a machine learning model. Notably, there is a strong influence of the quality of the dataset used for training those models. Thus, we propose a threshold algorithm as an efficient method to remove stopwords. This method employs an unsupervised classification technique, such as K-means, to accurately categorize user reviews from the IMDb dataset into their most suitable categories, generating a well-balanced dataset. Analysis of the performance of the algorithm revealed a notable influence of the text vectorization method used concerning the generation of clusters when assessing various preprocessing approaches. Moreover, the algorithm demonstrated that the word embedding technique and the removal of stopwords to retrieve the clustered text significantly impacted the categorization. The proposed method involves confirming the presence of a suggested stopword within each review across various genres. Upon satisfying this condition, the method assesses if the word’s frequency exceeds a predefined threshold. The threshold algorithm yielded a mapping genre success above 80% compared to precompiled lists and a Zipf’s law-based method. In addition, we employed the mini-batch K-means method for the clustering formation of each differently preprocessed dataset. This approach enabled us to reclassify reviews more coherently. Summing up, our methodology categorizes sparsely labeled data into meaningful clusters, in particular, by using a combination of the proposed stopword removal method and TF-IDF. The reclassified and balanced datasets showed a significant improvement, achieving 94% accuracy compared to the original dataset.
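One way to read the threshold method described in this abstract: a token counts as a stopword only if it occurs in every genre and its per-genre document frequency exceeds a cutoff. A simplified, dependency-free sketch; the genre names, mini-corpus, and 0.5 cutoff are invented for illustration, not taken from the paper:

```python
from collections import Counter

def threshold_stopwords(docs_by_genre, threshold=0.5):
    """Flag tokens whose document frequency exceeds the threshold in every
    genre; such ubiquitous tokens carry no genre signal and act as stopwords."""
    stop = None
    for docs in docs_by_genre.values():
        n = len(docs)
        df = Counter(t for doc in docs for t in set(doc))
        frequent = {t for t, c in df.items() if c / n > threshold}
        stop = frequent if stop is None else stop & frequent
    return stop or set()

# Invented mini-corpus: tokenized reviews grouped by genre.
reviews = {
    "action": [["the", "car", "explodes"], ["the", "hero", "fights"]],
    "drama": [["the", "family", "cries"], ["a", "quiet", "the", "story"]],
}
stops = threshold_stopwords(reviews)
```

Genre-specific words ("car", "family") fail the every-genre condition and survive, while a ubiquitous function word is removed, which is the property a custom stop list needs for clustering.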
Article
Full-text available
Background The global or multinational scientific evidence on the distribution of opioid fatality is unknown. Hence, the current study collects epidemiological characteristics to shed light on the ongoing global or multinational opioid crisis and to promote the development of public health prevention/management strategies. Method All documents on PRISMA standards were retrieved via electronic databases. Results Among the 47 articles relevant to our studies, which depict a total population size of 10,191 individuals, the prevalence of opioid fatal overdose was 15,022 (14.74%). Among the 47 articles, 14 of them reported the gender of the participants, with 22,125 (15.79%) male individuals and 7,235 (5.17%) female individuals, and the age distribution of the participants that was most affected by the overdose was as follows: 29,272 (31.13%) belonged to the 18-34-year-old age group and 25,316 (26.92%) belonged to the less than 18-year-old age group. Eighteen studies qualified for the meta-analysis of the multinational prevalence of fatal opioid overdose, depicting an overall pooled prevalence estimate of 19.66%, with 95% CIs (0.13–0.29), I² = 99.76% determined using the random-effects model, and Q statistic of 7198.77 (p < 0.0001). The Egger test models of publication bias revealed an insubstantial level of bias (p = 0.015). The subgroup analysis of the study design (cohort or other) revealed that others have the highest prevalence estimate of 34.37, 95% CIs (0.1600–0.5901), I² = 97.04%, and a sample size of less than 1,000 shows the highest prevalence of 34.66, 95% CIs (0.2039–0.5234), I² = 97.82%, compared to that of more than 1,000 with a prevalence of 12.28, 95% CIs (0.0675–0.2131), I² = 99.85%. The meta-regression analysis revealed that sample size (less-than or greater-than 1,000), (p = 0.0098; R² = 3.83%) is significantly associated with the observed heterogeneity. 
Conclusion Research-based findings of fatal opioid overdose are grossly lacking in middle- and low-income nations. We established that there is a need for opioid fatality surveillance systems in developing nations.
Article
Background Opioids are strong pain medications that can be essential for acute pain. However, opioids are also commonly used for chronic conditions and illicitly where there are well-recognised concerns about the balance of their benefits and harms. Technologies using artificial intelligence (AI) are being developed to examine and optimise the use of opioids. Yet, this research has not been synthesised to determine the types of AI models being developed and the application of these models. Methods We aimed to synthesise studies exploring the use of AI in people taking opioids. We searched three databases: the Cochrane Database of Systematic Reviews, Embase and Medline on 4 January 2021. Studies were included if they were published after 2010, conducted in a real-life community setting involving humans and used AI to understand opioid use. Data on the types and applications of AI models were extracted and descriptively analysed. Results Eighty-one articles were included in our review, representing over 5.3 million participants and 14.6 million social media posts. Most (93%) studies were conducted in the USA. The types of AI technologies included natural language processing (46%) and a range of machine learning algorithms, the most common being random forest algorithms (36%). AI was predominately applied for the surveillance and monitoring of opioids (46%), followed by risk prediction (42%), pain management (10%) and patient support (2%). Few of the AI models were ready for adoption, with most (62%) being in preliminary stages. Conclusions Many AI models are being developed and applied to understand opioid use. However, there is a need for these AI technologies to be externally validated and robustly evaluated to determine whether they can improve the use and safety of opioids.
Article
Full-text available
This study compares self-disclosure on Facebook and Twitter through the lens of demographic and psychological traits. Predictive evaluation reveals that language models trained on Facebook posts are more accurate at predicting age, gender, stress, and empathy than those trained on Twitter posts. Qualitative analyses of the underlying linguistic and demographic differences reveal that users are significantly more likely to disclose information about their family, personal concerns, and emotions and provide a more "honest" self-representation on Facebook. On the other hand, the same users significantly preferred to disclose their needs, drives, and ambitions on Twitter. The higher predictive performance of Facebook is also partly due to the greater volume of language on Facebook than Twitter; Facebook and Twitter are equally good at predicting user traits when same-sized language samples are used to train language models. We explore the implications of these differences in cross-platform user trait prediction.
Article
Full-text available
Importance Automatic curation of consumer-generated, opioid-related social media big data may enable real-time monitoring of the opioid epidemic in the United States. Objective To develop and validate an automatic text-processing pipeline for geospatial and temporal analysis of opioid-mentioning social media chatter. Design, Setting, and Participants This cross-sectional, population-based study was conducted from December 1, 2017, to August 31, 2019, and used more than 3 years of publicly available social media posts on Twitter, dated from January 1, 2012, to October 31, 2015, that were geolocated in Pennsylvania. Opioid-mentioning tweets were extracted using prescription and illicit opioid names, including street names and misspellings. Social media posts (tweets) (n = 9006) were manually categorized into 4 classes, and training and evaluation of several machine learning algorithms were performed. Temporal and geospatial patterns were analyzed with the best-performing classifier on unlabeled data. Main Outcomes and Measures Pearson and Spearman correlations of county- and substate-level abuse-indicating tweet rates with opioid overdose death rates from the Centers for Disease Control and Prevention WONDER database and with 4 metrics from the National Survey on Drug Use and Health for 3 years were calculated. Classifier performances were measured through microaveraged F1 scores (harmonic mean of precision and recall) or accuracies and 95% CIs. Results A total of 9006 social media posts were annotated, of which 1748 (19.4%) were related to abuse, 2001 (22.2%) were related to information, 4830 (53.6%) were unrelated, and 427 (4.7%) were not in the English language. Yearly rates of abuse-indicating social media posts showed statistically significant correlation with county-level opioid-related overdose death rates (n = 75) for 3 years (Pearson r = 0.451, P < .001; Spearman r = 0.331, P = .004).
Abuse-indicating tweet rates showed consistent correlations with 4 NSDUH metrics (n = 13) associated with nonmedical prescription opioid use (Pearson r = 0.683, P = .01; Spearman r = 0.346, P = .25), illicit drug use (Pearson r = 0.850, P < .001; Spearman r = 0.341, P = .25), illicit drug dependence (Pearson r = 0.937, P < .001; Spearman r = 0.495, P = .09), and illicit drug dependence or abuse (Pearson r = 0.935, P < .001; Spearman r = 0.401, P = .17) over the same 3-year period, although the tests lacked power to demonstrate statistical significance. A classification approach involving an ensemble of classifiers produced the best performance in accuracy or microaveraged F1 score (0.726; 95% CI, 0.708-0.743). Conclusions and Relevance The correlations obtained in this study suggest that a social media–based approach reliant on supervised machine learning may be suitable for geolocation-centric monitoring of the US opioid epidemic in near real time.
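The two correlation statistics this study pairs throughout differ only in that Spearman's coefficient is Pearson's computed on ranks, which is why a strong monotonic but non-linear relation can score higher on one than the other. A minimal stdlib sketch (no tie handling); the rate data are made up, not taken from the study:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman_r(xs, ys):
    """Spearman correlation: Pearson on ranks (assumes no tied values)."""
    rank = lambda vs: [sorted(vs).index(v) + 1 for v in vs]
    return pearson_r(rank(xs), rank(ys))

# Invented county-level tweet rates vs. overdose death rates,
# monotonically but non-linearly related.
tweet_rates = [10, 20, 30, 40]
death_rates = [1, 4, 9, 16]
```

Here Spearman's coefficient is exactly 1 (the ordering agrees perfectly) while Pearson's falls just below 1 (the relation is not linear).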
Article
Full-text available
Objective: Prescription medication (PM) misuse and abuse is a major health problem globally, and a number of recent studies have focused on exploring social media as a resource for monitoring nonmedical PM use. Our objectives are to present a methodological review of social media-based PM abuse or misuse monitoring studies, and to propose a potential generalizable, data-centric processing pipeline for the curation of data from this resource. Materials and methods: We identified studies involving social media, PMs, and misuse or abuse (inclusion criteria) from Medline, Embase, Scopus, Web of Science, and Google Scholar. We categorized studies based on multiple characteristics including but not limited to data size; social media source(s); medications studied; and primary objectives, methods, and findings. Results: A total of 39 studies met our inclusion criteria, with 31 (∼79.5%) published since 2015. Twitter has been the most popular resource, with Reddit and Instagram gaining popularity recently. Early studies focused mostly on manual, qualitative analyses, with a growing trend toward the use of data-centric methods involving natural language processing and machine learning. Discussion: There is a paucity of standardized, data-centric frameworks for curating social media data for task-specific analyses and near real-time surveillance of nonmedical PM use. Many existing studies do not quantify human agreements for manual annotation tasks or take into account the presence of noise in data. Conclusion: The development of reproducible and standardized data-centric frameworks that build on the current state-of-the-art methods in data and text mining may enable effective utilization of social media data for understanding and monitoring nonmedical PM use.
Article
Full-text available
Background: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources. Materials and methods: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system. Results: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with best F1-score of 0.69 and F14-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Discussion: Our proposed spelling variant generator has several advantages over the existing spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.
Conclusion: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
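The lexical-similarity filtering stage this abstract describes can be approximated with the standard library's difflib. In the paper's pipeline the candidates come from embedding neighbourhoods of the seed term; here the candidate list is hand-made and the 0.7 threshold is an arbitrary illustration, not the paper's setting:

```python
import difflib

def lexical_filter(seed, candidates, threshold=0.7):
    """Keep candidate terms that are lexically close to the seed, dropping
    semantically related but differently spelled terms (e.g. other drugs)."""
    return [c for c in candidates
            if c != seed
            and difflib.SequenceMatcher(None, seed, c).ratio() >= threshold]

# Hand-made stand-ins for embedding neighbours of a seed drug name.
variants = lexical_filter("oxycodone",
                          ["oxycodon", "oxicodone", "percocet", "tramadol"])
```

Misspellings of the seed survive the character-similarity cutoff, while names of related but distinct medications are filtered out, which is the behaviour the paper lists as advantage (i).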
Article
Full-text available
Introduction Prescription medication overdose is the fastest growing drug-related problem in the USA. The growing nature of this problem necessitates the implementation of improved monitoring strategies for investigating the prevalence and patterns of abuse of specific medications. Objectives Our primary aims were to assess the possibility of utilizing social media as a resource for automatic monitoring of prescription medication abuse and to devise an automatic classification technique that can identify potentially abuse-indicating user posts. Methods We collected Twitter user posts (tweets) associated with three commonly abused medications (Adderall®, oxycodone, and quetiapine). We manually annotated 6400 tweets mentioning these three medications and a control medication (metformin) that is not the subject of abuse due to its mechanism of action. We performed quantitative and qualitative analyses of the annotated data to determine whether posts on Twitter contain signals of prescription medication abuse. Finally, we designed an automatic supervised classification technique to distinguish posts containing signals of medication abuse from those that do not and assessed the utility of Twitter in investigating patterns of abuse over time. Results Our analyses show that clear signals of medication abuse can be drawn from Twitter posts and the percentage of tweets containing abuse signals are significantly higher for the three case medications (Adderall®: 23 %, quetiapine: 5.0 %, oxycodone: 12 %) than the proportion for the control medication (metformin: 0.3 %). Our automatic classification approach achieves 82 % accuracy overall (medication abuse class recall: 0.51, precision: 0.41, F measure: 0.46). To illustrate the utility of automatic classification, we show how the classification data can be used to analyze abuse patterns over time. 
Conclusion Our study indicates that social media can be a crucial resource for obtaining abuse-related information for medications, and that automatic approaches involving supervised classification and natural language processing hold promise for essential future monitoring and intervention tasks.
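As a quick sanity check on the figures quoted above, the F measure is the harmonic mean of precision and recall; recomputing it from the reported abuse-class precision and recall:

```python
# F measure as the harmonic mean of precision and recall,
# using the abuse-class figures reported in the abstract above.
precision, recall = 0.41, 0.51
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))  # 0.45; the published 0.46 reflects rounding of the inputs
```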
Article
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
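Two of the ingredients described above can be illustrated in a few lines of pure Python: the frequent-word subsampling heuristic (keep probability sqrt(t / f_w) for a word with corpus frequency f_w) and the generation of (center, context) training pairs. This is a minimal sketch of those two steps only; the negative-sampling objective and vector training themselves are omitted.

```python
import math
from collections import Counter

def subsample_keep_prob(tokens, t=1e-3):
    """Per-word keep probability sqrt(t / f_w), following the paper's
    frequent-word subsampling heuristic (clipped to 1 for rare words)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: min(1.0, math.sqrt(t / (c / total))) for w, c in counts.items()}

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs forming the Skip-gram training examples."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the"] * 90 + ["opioid"] * 10
keep = subsample_keep_prob(tokens)
print(keep["the"] < keep["opioid"])  # True: frequent words are kept less often
```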
Article
Food crises imply responses that are not what people and organisations would normally do, if one or more threats (health, economic, etc.) were not present. At an individual level, this motivates individuals to implement coping strategies aimed at adaptation to the threat that has been presented, as well as the reduction of stressful experiences. In this regard, microblogging channels such as Twitter emerge as a valuable resource to access individuals' expressions of coping. Accordingly, Twitter expressions are generally more natural, spontaneous and heterogeneous (in cognitive, affective and behavioural dimensions) than expressions found on other types of social media (e.g. blogs). Moreover, as a social media channel, it provides access not only to an individual but also to a social level of analysis, i.e. a psychosocial media analysis. To show the potential in this regard, our study analysed Twitter messages produced by individuals during the 2011 EHEC/E. coli bacteria outbreak in Europe, due to contaminated food products. This involved more than 3,100 cases of bloody diarrhoea, 850 cases of haemolytic uremic syndrome (HUS), and 53 confirmed deaths across the EU. Based on data collected in Spain, the country initially thought to be the source of the outbreak, an initial quantitative analysis considered 11,411 tweets, of which 2,099 were further analysed through a qualitative content analysis. This aimed at identifying: 1) the ways of coping expressed during the crisis; and 2) how uncertainty about the contaminated product, expressed through hazard notifications, influenced the former. Results revealed coping expressions as being dynamic, flexible and social, with a predominance of accommodation, information seeking and opposition (e.g. anger) strategies. The latter were more likely during a period of uncertainty, with the opposite being true for strategies relying on the identification of the contaminated product (e.g. avoid consumption/purchase).
Implications for food crisis communication and monitoring systems are discussed.
Article
The tragic death of 18-year-old Ryan Haight highlighted the ethical, public health, and youth patient safety concerns posed by illicit online nonmedical use of prescription drugs (NUPM) sourcing, leading to a federal law in an effort to address this concern. Yet despite the tragedy and resulting law, the NUPM epidemic in the United States has continued to escalate and represents a dangerous and growing trend among youth and adolescents. A critical point of access associated with youth NUPM is the Internet. Internet use among this vulnerable patient group is ubiquitous and includes new, emerging, and rapidly developing technologies, particularly social media networking (eg, Facebook and Twitter). These unregulated technologies may pose a potential risk for enabling youth NUPM behavior. In order to address limitations of current regulations and promote online safety, we advocate for legislative reform to specifically address NUPM promotion via social media and other new online platforms. Using more comprehensive and modernized federal legislation that anticipates future online developments is critical in substantively addressing youth NUPM behavior occurring through the Internet.
Conference Paper
Online social networking sites like MySpace, Facebook, and Flickr have become a popular way to share and disseminate content. Their massive popularity has led to viral marketing techniques that attempt to spread content, products, and ideas on these sites. However, there is little data publicly available on viral propagation in the real world and few studies have characterized how information spreads over current online social networks. In this paper, we collect and analyze large-scale traces of information dissemination in the Flickr social network. Our analysis, based on crawls of the favorite markings of 2.5 million users on 11 million photos, aims at answering three key questions: (a) how widely does information propagate in the social network? (b) how quickly does information propagate? and (c) what is the role of word-of-mouth exchanges between friends in the overall propagation of information in the network? Contrary to viral marketing "intuition," we find that (a) even popular photos do not spread widely throughout the network, (b) even popular photos spread slowly through the network, and (c) information exchanged between friends is likely to account for over 50% of all favorite-markings, but with a significant delay at each hop.
Article
Surveys are popular methods to measure public perceptions in emergencies but can be costly and time consuming. We suggest and evaluate a complementary "infoveillance" approach using Twitter during the 2009 H1N1 pandemic. Our study aimed to: 1) monitor the use of the terms "H1N1" versus "swine flu" over time; 2) conduct a content analysis of "tweets"; and 3) validate Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Between May 1 and December 31, 2009, we archived over 2 million Twitter posts containing keywords "swine flu," "swineflu," and/or "H1N1," using Infovigil, an infoveillance system. Tweets using "H1N1" increased from 8.8% to 40.5% (R(2) = .788; p<.001), indicating a gradual adoption of World Health Organization-recommended terminology. 5,395 tweets were randomly selected from 9 days, 4 weeks apart and coded using a tri-axial coding scheme. To track tweet content and to test the feasibility of automated coding, we created database queries for keywords and correlated these results with manual coding. Content analysis indicated resource-related posts were most commonly shared (52.6%). 4.5% of cases were identified as misinformation. News websites were the most popular sources (23.2%), while government and health agencies were linked only 1.5% of the time. 7/10 automated queries correlated with manual coding. Several Twitter activity peaks coincided with major news stories. Our results correlated well with H1N1 incidence data. This study illustrates the potential of using social media to conduct "infodemiology" studies for public health. 2009 H1N1-related tweets were primarily used to disseminate information from credible sources, but were also a source of opinions and experiences. Tweets can be used for real-time content analysis and knowledge translation research, allowing health authorities to respond to public concerns.
Article
Nationally endorsed, clinical performance measures are available that allow for quality reporting using electronic health records (EHRs). To our knowledge, how well they reflect actual quality of care has not been studied. We sought to evaluate the validity of performance measures for coronary artery disease (CAD) using an ambulatory EHR. We performed a retrospective electronic medical chart review comparing automated measurement with a 2-step process of automated measurement supplemented by review of free-text notes for apparent quality failures for all patients with CAD from a large internal medicine practice using a commercial EHR. The 7 performance measures included the following: antiplatelet drug, lipid-lowering drug, beta-blocker following myocardial infarction, blood pressure measurement, lipid measurement, low-density lipoprotein cholesterol control, and angiotensin-converting enzyme inhibitor or angiotensin receptor blocker for patients with diabetes mellitus or left ventricular systolic dysfunction. Performance varied from 81.6% for lipid measurement to 97.6% for blood pressure measurement based on automated measurement. A review of free-text notes for cases failing an automated measure revealed that misclassification was common and that 15% to 81% of apparent quality failures either satisfied the performance measure or met valid exclusion criteria. After including free-text data, the adherence rate ranged from 87.5% for lipid measurement and low-density lipoprotein cholesterol control to 99.2% for blood pressure measurement. Profiling the quality of outpatient CAD care using data from an EHR has significant limitations. Changes in how data are routinely recorded in an EHR are needed to improve the accuracy of this type of quality measurement. Validity testing in different settings is required.
Article
While many studies have explored the use of social media and behavioral changes of individuals, few have examined the utility of using social media for suicide detection and prevention. The study by Jashinsky et al. identified specific language patterns associated with a set of 12 suicide risk factors. The authors extended these methods to assess the significance of the language used on Twitter for suicide detection. This article quantifies the use of Twitter to express suicide-related language, and its potential to detect users at high risk of suicide. The authors searched Twitter for tweets indicative of 12 suicide risk factors. They divided Twitter users into two groups, “high risk” and “at risk,” based on two of the risk factors (“self-harm” and “prior suicide attempts”) and examined language patterns by computing co-occurrences of terms in tweets, which helped identify relationships between suicide risk factors in both groups.
Article
Dimensionality reduction methods are usually applied to molecular dynamics simulations of macromolecules for analysis and visualization purposes. Ideally, a suitable dimensionality reduction method should clearly distinguish functionally important states with different conformations for the systems of interest. However, common dimensionality reduction methods for macromolecular simulations, including pre-defined order parameters and collective variables (CVs), principal component analysis (PCA), and time-structure based independent component analysis (t-ICA), have only limited success due to significant loss of key structural information. Here, we introduced the t-distributed stochastic neighbor embedding (t-SNE) method, a dimensionality reduction method with minimal structural information loss that is widely used in bioinformatics, for the analysis of macromolecular simulations, especially those of biomacromolecules. It is demonstrated that both one-dimensional (1D) and two-dimensional (2D) t-SNE models are superior at distinguishing important functional states of a model allosteric protein system for free energy and mechanistic analysis. Projections of the model protein simulations onto 1D and 2D t-SNE surfaces provide both clear visual cues and quantitative information, not readily available from other methods, regarding the transition mechanism between two important functional states of this protein.
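A 2D t-SNE projection of the kind described is a one-liner with scikit-learn (assumed available here); the sketch below substitutes two synthetic Gaussian "conformational states" for real per-frame simulation features:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for per-frame simulation features: two "conformational
# states" as well-separated Gaussian blobs in a 10-dimensional space.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.5, size=(25, 10)),
                    rng.normal(4.0, 0.5, size=(25, 10))])

# 2D t-SNE embedding; perplexity must stay below the number of samples.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(frames)
print(embedding.shape)  # (50, 2)
```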
Article
Background: This paper goes beyond detecting specific themes within Zika-related chatter on Twitter, to identify the key actors who influence the diffusive process through which some themes become more amplified than others. Methods: We collected all Zika-related tweets during the 3 months immediately after the first U.S. case of Zika. After the tweets were categorized into 12 themes, a cross-section was grouped into weekly datasets, to capture 12 amplifier/user groups, and analyzed by 4 amplification modes: mentions, retweets, talkers, and Twitter-wide amplifiers. Results: We analyzed 3,057,130 tweets in the United States and categorized 4997 users. The most talked about theme was Zika transmission (~58%). News media, public health institutions, and grassroots users were the most visible and frequent sources and disseminators of Zika-related Twitter content. Grassroots users were the primary sources and disseminators of conspiracy theories. Conclusions: Social media analytics enable public health institutions to quickly learn what information is being disseminated, and by whom, regarding infectious diseases. Such information can help public health institutions identify and engage with news media and other active information providers. It also provides insights into media and public concerns, accuracy of information on Twitter, and information gaps. The study identifies implications for pandemic preparedness and response in the digital era and presents the agenda for future research and practice.
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Article
The widely known vocabulary gap between health consumers and healthcare professionals hinders information seeking and health dialogue of consumers on end-user health applications. The Open Access and Collaborative Consumer Health Vocabulary (OAC CHV), which contains health-related terms used by lay consumers, has been created to bridge such a gap. Specifically, the OAC CHV facilitates consumers' health information retrieval by enabling consumer-facing health applications to translate between professional language and consumer friendly language. To keep up with the constantly evolving medical knowledge and language use, new terms need to be identified and added to the OAC CHV. User-generated content on social media, including social question and answer (social Q&A) sites, afford us an enormous opportunity in mining consumer health terms. Existing methods of identifying new consumer terms from text typically use ad-hoc lexical syntactic patterns and human review. Our study extends an existing method by extracting n-grams from a social Q&A textual corpus and representing them with a rich set of contextual and syntactic features. Using K-means clustering, our method, simiTerm, was able to identify terms that are both contextually and syntactically similar to the existing OAC CHV terms. We tested our method on social Q&A corpora on two disease domains: diabetes and cancer. Our method outperformed three baseline ranking methods. A post-hoc qualitative evaluation by human experts further validated that our method can effectively identify meaningful new consumer terms on social Q&A.
Article
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is particularly designed for analyzing very high-dimensional, multi-class data, of which text corpora are a well-known example. A novel feature weighting method and a tree selection method are developed and combined to make the random forest framework well suited to categorizing text documents with dozens of topics. With the new feature weighting method for subspace sampling and the tree selection method, subspace size can be effectively reduced and classification performance improved without increasing the error bound. The proposed method is applied to six text data sets with diverse characteristics. The results demonstrate that the improved random forest outperforms popular text classification methods in terms of classification performance.
Article
In order to improve the efficiency of multi-class classifiers based on support vector machines (SVMs), the multi-sphere method was introduced into supervised learning. By training a one-class SVM (1-SVM) on the samples of each class, a classifier composed of multiple spheres was obtained. To remove the redundant regions of the spheres, a compacted one-vs-rest classifier was used to separate the mixed samples. These two complementary classifiers were combined into a weighted classifier of one-vs-rest and multi-spheres, with the weight factor and other parameters tuned via cross-validation. Simulations showed that the novel classifier achieves higher accuracy with less training time than the one-vs-rest classifier, and makes decisions faster than the one-vs-one classifier. Consequently, the novel classifier is helpful for solving multi-class problems in large-scale systems.
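A stripped-down sketch of the sphere-per-class idea with scikit-learn: fit one `OneClassSVM` per class and assign a point to the class whose model scores it highest. This omits the paper's compacted one-vs-rest stage and weighted combination, and the RBF `OneClassSVM` learns an enclosing region rather than a strict sphere, so treat it as an illustration only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Two well-separated toy classes in 2D.
class_samples = [rng.normal(0.0, 0.3, size=(40, 2)),
                 rng.normal(3.0, 0.3, size=(40, 2))]

# One 1-SVM per class: each model learns a region enclosing its class.
models = [OneClassSVM(gamma="scale", nu=0.1).fit(X) for X in class_samples]

def predict(x):
    # Assign the class whose learned region scores the point highest.
    scores = [m.decision_function(np.asarray(x).reshape(1, -1))[0] for m in models]
    return int(np.argmax(scores))

print(predict([0.1, -0.1]), predict([2.9, 3.1]))
```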
Article
A longitudinal analysis of panel data from users of a popular online social network site, Facebook, investigated the relationship between intensity of Facebook use, measures of psychological well-being, and bridging social capital. Two surveys conducted a year apart at a large U.S. university, complemented with in-depth interviews with 18 Facebook users, provide the study data. Intensity of Facebook use in year one strongly predicted bridging social capital outcomes in year two, even after controlling for measures of self-esteem and satisfaction with life. These latter psychological variables were also strongly associated with social capital outcomes. Self-esteem served to moderate the relationship between Facebook usage intensity and bridging social capital: those with lower self-esteem gained more from their use of Facebook in terms of bridging social capital than higher self-esteem participants. We suggest that Facebook affordances help reduce barriers that lower self-esteem students might experience in forming the kinds of large, heterogeneous networks that are sources of bridging social capital.
Article
Electronic medical records (EMR) provide a unique opportunity for efficient, large-scale clinical investigation in psychiatry. However, such studies will require development of tools to define treatment outcome. Natural language processing (NLP) was applied to classify notes from 127,504 patients with a billing diagnosis of major depressive disorder, drawn from out-patient psychiatry practices affiliated with multiple, large New England hospitals. Classifications were compared with results using billing data (ICD-9 codes) alone and to a clinical gold standard based on chart review by a panel of senior clinicians. These cross-sectional classifications were then used to define longitudinal treatment outcomes, which were compared with a clinician-rated gold standard. Models incorporating NLP were superior to those relying on billing data alone for classifying current mood state (area under receiver operating characteristic curve of 0.85-0.88 v. 0.54-0.55). When these cross-sectional visits were integrated to define longitudinal outcomes and incorporate treatment data, 15% of the cohort remitted with a single antidepressant treatment, while 13% were identified as failing to remit despite at least two antidepressant trials. Non-remitting patients were more likely to be non-Caucasian (p<0.001). The application of bioinformatics tools such as NLP should enable accurate and efficient determination of longitudinal outcomes, enabling existing EMR data to be applied to clinical research, including biomarker investigations. Continued development will be required to better address moderators of outcome such as adherence and co-morbidity.
Article
We study online social networks in which relationships can be either positive (indicating relations such as friendship) or negative (indicating relations such as opposition or antagonism). Such a mix of positive and negative links arises in a variety of online settings; we study datasets from Epinions, Slashdot and Wikipedia. We find that the signs of links in the underlying social networks can be predicted with high accuracy, using models that generalize across this diverse range of sites. These models provide insight into some of the fundamental principles that drive the formation of signed links in networks, shedding light on theories of balance and status from social psychology; they also suggest social computing applications by which the attitude of one user toward another can be estimated from evidence provided by their relationships with other members of the surrounding social network.
Article
Only a few formal assessments of websites with drug-related content have been carried out. We aimed here to foster the collection and analysis of data from web pages related to information on consumption, manufacture and sales of psychoactive substances. Methods: An 8-language, two-engine assessment of the information available in a purposeful sample of 1633 unique websites was carried out. A pro-drug and a harm reduction approach were evident, respectively, in 18% and 10% of websites accessed. About 1 in 10 websites offered either psychoactive compounds for sale or detailed data on drug synthesis/extraction procedures. Information on a number of psychoactive substances, and on unusual drug combinations not found in Medline, was elicited. This represents the first comprehensive and multilingual review of the information available online on psychoactive compounds. Health professionals may need to be aware of the web being a new drug resource for information and possibly purchase.
Article
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore, they are fully automatic, eliminating the need for manual parameter tuning.
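The standard text-classification setup described above (sparse term features feeding a linear SVM) fits in a few lines with scikit-learn. The tweets, labels, and class names below are invented for illustration and are not from any of the cited datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini training set; real applications need far more data.
texts = ["need oxycodone for my chronic pain",
         "back pain relief after surgery",
         "popping pills to get high tonight",
         "so faded off these percs"]
labels = ["pain", "pain", "recreational", "recreational"]

# TF-IDF features feeding a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["meds for chronic back pain"])[0])
```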
Predicting the future with social media
  • S Asur
  • B A Huberman
S. Asur, B.A. Huberman, Predicting the future with social media, IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, 2010, pp. 492-499.
Facebook versus twitter: differences in self-disclosure and trait prediction
  • K Jaidka
  • S C Guntuku
  • L H Ungar
K. Jaidka, S.C. Guntuku, L.H. Ungar, Facebook versus twitter: differences in self-disclosure and trait prediction. Twelfth International AAAI Conference on Web and Social Media, 2018.
  • F Schifano
  • P Deluca
  • A Baldacchino
  • T Peltoniemi
  • N Scherbaum
  • M Torrens
F. Schifano, P. Deluca, A. Baldacchino, T. Peltoniemi, N. Scherbaum, M. Torrens, et al., Drugs on the web; the Psychonaut 2002 EU project, in: Progress in Neuro-Psychopharmacology and Biological Psychiatry, vol. 30, 2006, pp. 640-646.
Leveraging Twitter to better identify suicide risk
  • S Fodeh
  • J Goulet
  • C Brandt
  • A.-T Hamada
S. Fodeh, J. Goulet, C. Brandt, A.-T. Hamada, Leveraging Twitter to better identify suicide risk. Medical Informatics and Healthcare, 2017, pp. 1-7.
Utilizing social media to combat opioid addiction epidemic: automatic detection of opioid users from twitter
  • Y Zhang
  • Y Fan
  • Y Ye
  • X Li
  • E L Winstanley
Y. Zhang, Y. Fan, Y. Ye, X. Li, E.L. Winstanley, Utilizing social media to combat opioid addiction epidemic: automatic detection of opioid users from twitter. Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling
  • P Zhou
  • Z Qi
  • S Zheng
  • J Xu
  • H Bao
  • B Xu
P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, B. Xu, Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling, arXiv preprint arXiv:1611.06639, 2016.
Bert: pre-training of deep bidirectional transformers for language understanding
  • J Devlin
  • M.-W Chang
  • K Lee
  • K Toutanova
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
Xlnet: generalized autoregressive pretraining for language understanding
  • Z Yang
  • Z Dai
  • Y Yang
  • J Carbonell
  • R R Salakhutdinov
  • Q V Le
Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, Xlnet: generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 2019, pp. 5754-5764.