Article

Abstract

While social media has evolved into a useful resource for studying medication-related information, observational studies of medications have continued to rely on other sources of data. Towards advancing the use of social media data for medication-related observational studies, we analyze an annotated corpus of 27,941 tweets designed for training machine learning algorithms to automatically detect users' medication intake. In particular, we assess how a baseline classifier trained on the general corpus (that is, on various types of medication) performs for specific types. For most types, the classifier performs significantly better than it does overall; however, for nervous system medications, it performs significantly worse. These results suggest that, while the general corpus may have utility for observational studies focusing on most types of medication, studying nervous system medications may benefit from training a classifier exclusively for this type. We will explore this data-level approach in future work.
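The per-type comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not name the evaluation metric, so the F1-score for the positive class is an assumption here, and the label names (`intake`, `no_intake`), the medication-type strings, and the toy records are all hypothetical. The sketch does not cover the significance testing the abstract refers to.

```python
from collections import defaultdict

def f1_positive(gold, pred, positive="intake"):
    """F1-score for one class, from parallel lists of gold and predicted labels."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_by_type(records, positive="intake"):
    """records: list of (medication_type, gold_label, predicted_label) tuples.

    Returns one score per medication type plus an "overall" score
    computed on all records pooled together.
    """
    groups = defaultdict(lambda: ([], []))
    for med_type, gold, pred in records:
        groups[med_type][0].append(gold)
        groups[med_type][1].append(pred)
    scores = {t: f1_positive(g, p, positive) for t, (g, p) in groups.items()}
    all_gold = [g for _, g, _ in records]
    all_pred = [p for _, _, p in records]
    scores["overall"] = f1_positive(all_gold, all_pred, positive)
    return scores

# Toy records invented for illustration; not data from the corpus.
toy = [
    ("nervous system", "intake", "no_intake"),
    ("nervous system", "intake", "intake"),
    ("nervous system", "no_intake", "intake"),
    ("cardiovascular", "intake", "intake"),
    ("cardiovascular", "no_intake", "no_intake"),
]
scores = f1_by_type(toy)
```

On real data, each per-type score would be compared against `scores["overall"]` to see which medication types fall below the general baseline.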

Article
Background: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources. Materials and methods: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system. Results: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Discussion: Our proposed spelling variant generator has several advantages over the existing spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.
Conclusion: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
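The recursive procedure described in the methods (semantic neighbors, then a lexical filter, repeated until no new terms appear) can be sketched roughly as follows. This is not the authors' released code. The `neighbors` callable is a hypothetical stand-in for a lookup in a dense vector model, `difflib.SequenceMatcher` is swapped in for the paper's weighted character-sequence similarity, and the threshold of 0.8 and the toy neighbor table are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def lexical_similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in for the paper's measure)."""
    return SequenceMatcher(None, a, b).ratio()

def generate_variants(seed, neighbors, threshold=0.8):
    """Expand `seed` recursively until no new lexically similar terms appear.

    neighbors: callable returning semantically close terms for a word
               (hypothetical stand-in for a dense vector model lookup).
    """
    variants = set()
    frontier = [seed]
    while frontier:
        term = frontier.pop()
        for candidate in neighbors(term):
            if candidate == seed or candidate in variants:
                continue
            # Keep only candidates that are lexically close to the seed.
            if lexical_similarity(seed, candidate) >= threshold:
                variants.add(candidate)
                frontier.append(candidate)
    return variants

# Toy neighbor table invented for illustration only.
TOY_NEIGHBORS = {
    "quetiapine": ["quetiapin", "seroquel", "quitiapine"],
    "quetiapin": ["quetapine", "olanzapine"],
    "quitiapine": ["quetiapine"],
    "quetapine": [],
}

result = generate_variants("quetiapine", lambda w: TOY_NEIGHBORS.get(w, []))
```

In this toy run, semantically related but lexically distant terms such as `seroquel` and `olanzapine` are filtered out, and the loop stops once the accepted variants yield no further candidates.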
Article
Introduction: Adverse effects of medications taken during pregnancy are traditionally studied through post-marketing pregnancy registries, which have limitations. Social media data may be an alternative data source for pregnancy surveillance studies. Objective: The objective of this study was to assess the feasibility of using social media data as an alternative source for pregnancy surveillance for regulatory decision making. Methods: We created an automated method to identify Twitter accounts of pregnant women. We identified 196 pregnant women with a mention of a birth defect in relation to their baby and 196 without a mention of a birth defect in relation to their baby. We extracted information on pregnancy and maternal demographics, medication intake and timing, and birth defects. Results: Although often incomplete, we extracted data for the majority of the pregnancies. Among women who reported birth defects, 35% reported taking one or more medications during pregnancy compared with 17% of controls. After accounting for age, race, and place of residence, a higher medication intake was observed in women who reported birth defects. The rate of birth defects in the pregnancy cohort was lower (0.44%) compared with the rate in the general population (3%). Conclusions: Twitter data capture information on medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. Development of improved methods to automatically extract and annotate social media data may increase their value to support regulatory decision making regarding pregnancy outcomes in women using medications during their pregnancies.
Article
Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).
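For readers unfamiliar with the subtask-2 metric, a micro-averaged F1-score over a subset of classes pools true positives, false positives, and false negatives across those classes before computing a single precision and recall. The sketch below shows the general calculation only; the class labels `A`, `B`, `C` and the toy predictions are hypothetical and are not taken from the shared-task data or evaluation scripts.

```python
def micro_f1(gold, pred, classes):
    """Micro-averaged F1: pool TP/FP/FN over `classes`, then compute one F1."""
    tp = fp = fn = 0
    for c in classes:
        tp += sum(g == c and p == c for g, p in zip(gold, pred))
        fp += sum(g != c and p == c for g, p in zip(gold, pred))
        fn += sum(g == c and p != c for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Three-class toy example, scored over classes A and B only.
gold = ["A", "A", "B", "C", "C", "B"]
pred = ["A", "B", "B", "C", "A", "C"]
score = micro_f1(gold, pred, classes=["A", "B"])
```

Errors involving the excluded class still count: an item of class `C` predicted as `A` is a false positive for `A`, which lowers the pooled precision.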
Article
Background: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women in a voluntary fashion early on in pregnancy and follow them until the end of pregnancy or longer to systematically collect information regarding specific pregnancy outcomes. Although the model of pregnancy registries has distinct advantages over other study designs, they are faced with numerous challenges and limitations such as low enrollment rate, high cost, and selection bias. Objective: The primary objectives of this study were to systematically assess whether social media (Twitter) can be used to discover cohorts of pregnant women and to develop and deploy a natural language processing and machine learning pipeline for the automatic collection of cohort information. In addition, we also attempted to ascertain, in a preliminary fashion, what types of longitudinal information may potentially be mined from the collected cohort information. Methods: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined. Results: Our rule-based PIT detection approach retrieved over 200,000 posts over a period of 18 months. 
Manual annotation agreement for three annotators was very high at kappa (κ)=.79. On a blind test set, the implemented classifier obtained an overall F1 score of 0.84 (0.88 for the pregnancy class and 0.68 for the nonpregnancy class). Precision for the pregnancy class was 0.93, and recall was 0.84. Feature analysis showed that the combination of dense and sparse vectors for classification achieved optimal performance. Employing the trained classifier resulted in the identification of 71,954 users from the collected posts. Over 250 million posts were retrieved for these users, which provided a multitude of longitudinal information about them. Conclusions: Social media sources such as Twitter can be used to identify large cohorts of pregnant women and to gather longitudinal information via automated processing of their postings. Considering the many drawbacks and limitations of pregnancy registries, social media mining may provide beneficial complementary information. Although the cohort sizes identified over social media are large, future research will have to assess the completeness of the information available through them.
Article
Introduction: Prescription medication overdose is the fastest growing drug-related problem in the USA. The growing nature of this problem necessitates the implementation of improved monitoring strategies for investigating the prevalence and patterns of abuse of specific medications. Objectives: Our primary aims were to assess the possibility of utilizing social media as a resource for automatic monitoring of prescription medication abuse and to devise an automatic classification technique that can identify potentially abuse-indicating user posts. Methods: We collected Twitter user posts (tweets) associated with three commonly abused medications (Adderall®, oxycodone, and quetiapine). We manually annotated 6400 tweets mentioning these three medications and a control medication (metformin) that is not the subject of abuse due to its mechanism of action. We performed quantitative and qualitative analyses of the annotated data to determine whether posts on Twitter contain signals of prescription medication abuse. Finally, we designed an automatic supervised classification technique to distinguish posts containing signals of medication abuse from those that do not and assessed the utility of Twitter in investigating patterns of abuse over time. Results: Our analyses show that clear signals of medication abuse can be drawn from Twitter posts and that the percentage of tweets containing abuse signals is significantly higher for the three case medications (Adderall®: 23%, quetiapine: 5.0%, oxycodone: 12%) than for the control medication (metformin: 0.3%). Our automatic classification approach achieves 82% accuracy overall (medication abuse class recall: 0.51, precision: 0.41, F measure: 0.46). To illustrate the utility of automatic classification, we show how the classification data can be used to analyze abuse patterns over time.
Conclusion: Our study indicates that social media can be a crucial resource for obtaining abuse-related information for medications, and that automatic approaches involving supervised classification and natural language processing hold promise for essential future monitoring and intervention tasks.
Article
The extent of preventable medication-related hospital admissions and medication-related issues in primary care is significant enough to justify developing decision support systems for medication safety surveillance. The prerequisite for such systems is defining a relevant set of medication safety-related indicators and understanding the influence of both patient and general practice characteristics on medication prescribing and monitoring. The aim of the study was to investigate the feasibility of linked primary and secondary care electronic health record data for surveillance of medication safety, examining not only prescribing but also monitoring, and associations with patient- and general practice-level characteristics. A cross-sectional study was conducted using linked records of patients served by one hospital and over 50 general practices in Salford, UK. Statistical analysis consisted of mixed-effects logistic models, relating prescribing safety indicators to potential determinants. The overall prevalence (proportion of patients with at least one medication safety hazard) was 5.45 % for prescribing indicators and 7.65 % for monitoring indicators. Older patients and those on multiple medications were at higher risk of prescribing hazards, but at lower risk of missed monitoring. The odds of missed monitoring among all patients were 25 % less for males, 50 % less for patients in practices that provide general practitioner training, and threefold higher in practices serving the most deprived compared with the least deprived areas. Practices with more prescribing hazards did not tend to show more monitoring issues. Systematic collection, collation, and analysis of linked primary and secondary care records produce plausible and useful information about medication safety for a health system. 
Medication safety surveillance systems should pay close attention to patient age and polypharmacy with respect to both prescribing and monitoring failures; treat prescribing and monitoring as different statistical processes, rather than a combined measure of prescribing safety; and audit the socio-economic equity of missed monitoring.
Article
Social media postings are rich in information that often remains hidden and inaccessible for automatic extraction due to inherent limitations of the site's APIs, which mostly limit access via specific keyword-based searches (and limit both the number of keywords and the number of postings that are returned). When mining social media for drug mentions, one of the first problems to solve is how to derive a list of variants of the drug name (common misspellings) that can capture a sufficient number of postings. We present here an approach that filters the potential variants based on the intuition that, faced with the task of writing an unfamiliar, complex word (the drug name), users will tend to revert to phonetic spelling, and we thus give preference to variants that reflect the phonemes of the correct spelling. The algorithm allowed us to capture 50.4-56.0% of the user comments using only about 18% of the variants.
Article
Automatic monitoring of Adverse Drug Reactions (ADRs), defined as adverse patient outcomes caused by medications, is a challenging research problem that is currently receiving significant attention from the medical informatics community. In recent years, user-posted data on social media, primarily due to its sheer volume, has become a useful resource for ADR monitoring. Research using social media data has progressed using various data sources and techniques, making it difficult to compare distinct systems and their performances. In this paper, we perform a methodical review to characterize the different approaches to ADR detection/extraction from social media, and their applicability to pharmacovigilance. In addition, we present a potential systematic pathway to ADR monitoring from social media.
Article
Prescription drug abuse has become a major public health problem. Relationships and social context are important contributing factors. Social media provides online channels for people to build relationships that may influence attitudes and behaviors. The objective was to determine whether people who show signs of prescription drug abuse connect online with others who reinforce this behavior, and to observe the conversation and engagement of these networks with regard to prescription drug abuse. Twitter statuses mentioning prescription drugs were collected from November 2011 to November 2012. From this set, 25 Twitter users were selected who discussed topics indicative of prescription drug abuse. Social circles of 100 people were discovered around each of these Twitter users; the tweets of the Twitter users in these networks were collected and analyzed according to prescription drug abuse discussion and interaction with other users about the topic. From November 2011 to November 2012, 3,389,771 mentions of prescription drug terms were observed. For the 25 social circles (n=100 for each circle), on average 53.96% (SD 24.3) of the Twitter users used prescription drug terms at least once in their posts, and 37.76% (SD 20.8) mentioned another Twitter user by name in a post with a prescription drug term. Strong correlation was found between the kinds of drugs mentioned by the index user and his or her network (mean r=0.73), and between the amount of interaction about prescription drugs and a level of abusiveness shown by the network (r=0.85, P<.001). Twitter users who discuss prescription drug abuse online are surrounded by others who also discuss it, potentially reinforcing a negative behavior and social norm.
Article
Adderall is the most commonly abused prescription stimulant among college students. Social media provides a real-time avenue for monitoring public health, specifically for this population. This study explores discussion of Adderall on Twitter to identify variations in volume around college exam periods, differences across sets of colleges and universities, and commonly mentioned side effects and co-ingested substances. Public-facing Twitter status messages containing the term "Adderall" were monitored from November 2011 to May 2012. Tweets were examined for mention of side effects and other commonly abused substances. Tweets from likely students containing GPS data were identified with clusters of nearby colleges and universities for regional comparison. 213,633 tweets from 132,099 unique user accounts mentioned "Adderall." The number of Adderall tweets peaked during traditional college and university final exam periods. Rates of Adderall tweeters were highest among college and university clusters in the northeast and south regions of the United States. 27,473 (12.9%) mentioned an alternative motive (eg, study aid) in the same tweet. The most common substances mentioned with Adderall were alcohol (4.8%) and stimulants (4.7%), and the most common side effects were sleep deprivation (5.0%) and loss of appetite (2.6%). Twitter posts confirm the use of Adderall as a study aid among college students. Adderall discussions through social media such as Twitter may contribute to normative behavior regarding its abuse.
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
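The idea that suffix removal depends on a measure of the remaining stem can be made concrete with a small sketch. It implements the measure m from Porter's algorithm (the number of vowel-consonant sequences in a stem of the form [C](VC)^m[V]) and a single conditional rule. It is not the full stemmer: the helper names are illustrative, input is assumed to be lowercase, and only two well-known rules are exercised.

```python
def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def apply_rule(word, suffix, replacement, min_measure):
    """Replace `suffix` only if the remaining stem has measure > min_measure."""
    if word.endswith(suffix):
        stem = word[: len(word) - len(suffix)]
        if measure(stem) > min_measure:
            return stem + replacement
    return word

# (m > 0) ATIONAL -> ATE
relate = apply_rule("relational", "ational", "ate", 0)
rational = apply_rule("rational", "ational", "ate", 0)
# (m > 1) EMENT -> (empty)
replac = apply_rule("replacement", "ement", "", 1)
cement = apply_rule("cement", "ement", "", 1)
```

The condition on the stem is what keeps short words intact: `rational` and `cement` end in the same suffixes as `relational` and `replacement`, but their remaining stems (`r`, `c`) have measure 0, so no rule fires.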
Article
Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance. The kappa statistic (or kappa coefficient) is the most commonly used statistic for this purpose. A kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A limitation of kappa is that it is affected by the prevalence of the finding under observation. Methods to overcome this limitation have been described.
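As a worked illustration, Cohen's kappa for two observers is (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each observer's label frequencies. The sketch below assumes exactly two raters and categorical labels; the ratings are invented to show the prevalence effect mentioned in the abstract.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when chance agreement is 1, i.e. when
    both raters assign a single identical label to every item.
    """
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Balanced prevalence: 8 of 10 items agree.
a1 = ["pos"] * 5 + ["neg"] * 5
b1 = ["pos"] * 4 + ["neg"] * 5 + ["pos"]
kappa_balanced = cohen_kappa(a1, b1)

# Skewed prevalence: still 8 of 10 items agree.
a2 = ["pos"] * 9 + ["neg"]
b2 = ["pos"] * 8 + ["neg"] + ["pos"]
kappa_skewed = cohen_kappa(a2, b2)
```

Both pairs of raters agree on 80% of items, yet kappa is 0.6 in the balanced case and slightly negative (about -0.11) in the skewed case, because chance agreement is much higher when one label dominates.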
Golder S, Chiuve S, Weissenbacher D, Klein A, O'Connor K, Bland M, Malin M, Bhattacharya M, Scarazinni LJ, Gonzalez-Hernandez G. Pharmacoepidemiologic evaluation of birth defects from health-related postings in social media during pregnancy. Drug Saf. 2018.