Correspondence

Comment on: "Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts"

Arjun Magge,1 Abeed Sarker,2 Azadeh Nikfarjam,2 and Graciela Gonzalez-Hernandez2

1College of Health Solutions, Arizona State University, Scottsdale, Arizona, USA and 2Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA

Journal of the American Medical Informatics Association, 26(6), 2019, 577-579. doi: 10.1093/jamia/ocz013. Advance Access Publication Date: 11 April 2019.

Received 28 November 2018; Editorial Decision 20 December 2018; Accepted 21 January 2019
Dear Editor,
We read with great interest the article by Cocos et al.1 In it, the authors use one of the datasets made public by our lab in parallel with a publication in the Journal of the American Medical Informatics Association2 (henceforth the ADRMine Dataset). Cocos et al use state-of-the-art recurrent neural network (RNN) models for extracting adverse drug reactions (ADRs) from tweets. We commend the authors for their clear description of the workings of neural models and for their experiments on the use of fixed versus trainable embeddings, which can be very valuable to the natural language processing (NLP) research community. We believe that using deep learning models offers greater opportunities for mining ADR posts on social media.
However, there are key choices made by the authors that require clarification to avoid a misunderstanding of the impact of their findings. In a nutshell, because the authors did not use the ADRMine Dataset in its entirety, discarding upfront all tweets with no human annotations (ie, those that do not contain any ADRs), the resulting train and test sets are biased toward the positive class. Thus, the performance measures reported for the task in Cocos et al are not comparable to those reported in Nikfarjam et al,2 contrary to what the manuscript reports.
After discarding tweets with no human annotations from the ADRMine Dataset, Cocos et al retrieved the tweets still available on Twitter, and added a small set (203 tweets) to form the dataset used in their experiments. While retrieval entails an almost unavoidable reduction in the dataset size (as not all tweets are available as time goes by), it would not generally affect the class balance. The elimination of the tweets with no human annotations from the ADRMine Dataset, however, is a choice that is not discussed by Cocos et al, even though it severely impacts the positive-to-negative class balance of the dataset, leaving it at the 95 to 5 that they report, and, as our experiments show, has a significant impact on the reported performance. Our comparisons of ADRMine with the system proposed by Cocos et al reveal that, actually, when the two systems are employed on the dataset with the original balance, ADRMine2 performs significantly better than their proposed approach (last two rows of Table 1). Thus, the claim in the Results and Conclusion sections of Cocos et al that their model "represents new state-of-the-art performance" and that "RNN models ... establish new state-of-the-art performance by achieving statistically significant superior F-measure performance compared to the CRF-based model" is premature. We expand on these points next.
To give some context to the ADRMine Dataset: it contains a set of tweets collected using medication names as keywords. Retweets were removed, and tweets with a URL were omitted. To balance the data in a way that reflected what was automatically possible at the time, a binary classifier with precision around 0.4-0.5 was assumed. Thus, negative (non-ADR) instances were kept at around 50%, down from the approximately 89% non-ADR tweets that come naturally when collecting on medication name as a keyword,2 a balance one would expect for this task utilizing state-of-the-art automatic methods for classification before attempting extraction. It is thus a realistic, justified balance.
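The arithmetic behind that expectation can be sketched as follows. This is our own illustration, not code from ADRMine; the recall value is an assumption, and the key point is that the positive share of a filtered set equals the filter's precision by definition:

```python
# Why a realistic pre-extraction filter yields a roughly 50-50 balance:
# among tweets that a binary classifier flags as ADR-positive, the truly
# positive fraction is, by definition, the classifier's precision.
NATURAL_POSITIVE_RATE = 0.11  # ~89% of keyword-collected tweets are non-ADR

def balance_after_prefilter(n_tweets, precision=0.5, recall=0.8):
    """Return the positive-class share of the tweets a simulated
    binary ADR classifier passes on to the extraction step."""
    n_pos = round(n_tweets * NATURAL_POSITIVE_RATE)
    tp = round(n_pos * recall)                    # true positives kept
    fp = round(tp * (1 - precision) / precision)  # negatives let through
    return tp / (tp + fp)

share = balance_after_prefilter(100_000)
print(f"positive share after filtering: {share:.2f}")  # ~0.50, ie, a 50-50 split
```

With a precision of 0.4-0.5, the filtered set is 40-50% positive, which is why the ADRMine Dataset's roughly 50-50 balance reflects a realistic deployment.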
Regarding the Cocos et al approach, although controlled experiments training with different ratios of class examples are not unusual in machine learning, results for different positive-to-negative ratios are usually reported and are noted upfront. Cocos et al use a 95-to-5 positive-to-negative split, and only report on the performance on this altered dataset, making no mention of the alteration or class imbalance in the abstract. The statement in the abstract summarizes their results as follows: "Our best-performing RNN
model ... achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random fields model." Although further in the manuscript Cocos et al refer to having implemented a CRF model "as described for previous state-of-the-art results," citing Nikfarjam et al,2 the statement in the abstract could be misconstrued as directly comparing it to Nikfarjam et al, which is the state-of-the-art conditional random fields (CRF) model. In reality, the results are not comparable, given the changes to the dataset. Their implementation of a CRF model must have been significantly different from ADRMine as described in Nikfarjam et al,2 given that the reported performance in Cocos et al for a CRF model (0.65) is much lower than when both systems are used on the unaltered ADRMine Dataset, as our experiments show (last two rows of Table 1). Please note that Cocos et al did not make available their CRF model implementation, so any differences to the ADRMine model could not be verified directly, only inferred from the reported results. The binaries of ADRMine were available at the time of publication, and we have since made available the full code to facilitate reproducibility.a
In machine learning research, authors decide how the model is trained and how the data are algorithmically filtered before training, applying accepted practices for balancing the data or including additional weakly supervised examples.3 However, such methods are applied to the training data only, leaving the evaluation data intact in order to be able to compare approaches. By excluding tweets that are negative for the presence of ADRs and other entities from their training, the authors built a model that is biased toward the positive class. This might not be immediately obvious in Cocos et al, as the model is evaluated against a similarly biased test set. However, when run against the balanced test set, the problem becomes evident. The authors do note this, stating that "including a significant number of posts without ADRs in the training data produced relatively poor results for both the RNN and baseline models," but they did not include a report of these results or alter their experimental approach to make this more evident.
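The accepted practice described above can be sketched minimally as follows; the tweet representation and function name here are our own, hypothetical choices, not code from either system:

```python
import random

def split_and_balance(tweets, test_frac=0.25, seed=42):
    """Split tweets into train/test, then downsample the majority class
    in the TRAINING portion only. Each tweet is a dict like
    {"text": ..., "has_adr": bool}. The test split keeps its natural
    class balance so reported scores stay comparable across systems."""
    rng = random.Random(seed)
    pool = list(tweets)
    rng.shuffle(pool)
    n_test = int(len(pool) * test_frac)
    test, train = pool[:n_test], pool[n_test:]

    pos = [t for t in train if t["has_adr"]]
    neg = [t for t in train if not t["has_adr"]]
    k = min(len(pos), len(neg))  # size of the training minority class
    balanced_train = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(balanced_train)
    return balanced_train, test
```

The crucial design choice is that the downsampling happens strictly after the split: the evaluation data never passes through the balancing step.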
To illustrate the impact of the dataset modifications on the overall results, we ran the training and evaluation experiments on the ADRMine Dataset for tweets available as of October 2018 using the authors' publicly available implementationb and summarize the results in Table 1. Under the same settings as Cocos et al (eliminating virtually all tweets in the negative class), the performance reported (row 1) and our replication (row 2) can be considered a match, with a slight drop that could be attributed to fewer tweets being available as of October 2018 compared with when they ran it. However, evaluating the Cocos et al model on the balanced test set (row 3) shows a drop of 10 percentage points compared with evaluating against the mostly positive set (row 2). Training on all available positive and negative tweets from the October 2018 set (row 4) leads to an improved model but continues to show significantly lower performance (0.64) with respect to when the same model is trained and tested on the biased set (0.73 in row 2). Additionally, to be able to do a direct comparison, we trained and tested the Cocos et al system as provided by them (except for the download script) on the original, balanced ADRMine Dataset containing 1784 tweets. We found a mean performance of 0.67 over 10 runs (row 5), 5 points lower than the 0.72 F1-score reported in Nikfarjam et al2 on the same dataset (row 6).
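Table 1 summarizes each configuration as a mean with a 95% confidence interval over 10 rounds. The letter does not state how the interval was computed, so the following is only a normal-approximation sketch, with made-up F1 scores:

```python
import statistics

def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval
    across repeated evaluation runs."""
    m = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return m, (m - half, m + half)

# Hypothetical F1 scores from 10 train/evaluate rounds
f1_runs = [0.66, 0.69, 0.67, 0.68, 0.66, 0.68, 0.67, 0.66, 0.69, 0.67]
m, (lo, hi) = mean_ci95(f1_runs)
print(f"F1 = {m:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```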
Furthermore, referring to the ADRMine Dataset,2 Cocos et al report, "Of the 957 identifiers in the original dataset...," which is incorrect. The original dataset, publicly available and unchanged since its first publication in 2015, contains a total of 1784 tweets (1340 in the training set and 444 in the evaluation or test set). As of October 2018, 1012 of the 1784 original tweets were still available on Twitter (including 267 of the 444 original evaluation tweets). Cocos et al do not mention the additional 827 tweets that were in the ADRMine Dataset, even though many of them were still available at the time of their publication. They used only 149 tweets from the 444 in the evaluation set. From our analysis, the 957 mentioned in Cocos et al correspond to the number of tweets in the ADRMine Dataset that are manually annotated for the presence of ADRs and other entities, such as indications, drug, and other (miscellaneous) entities. The rest (827 tweets), mentioning medications but with no other entities present, are discarded upfront, as can be seen in the download script. Although the Cocos et al code points researchers to the original dataset, when researchers use the said script with that data, they lose all the unannotated negative tweets. The authors do not discuss the rationale as to why the dataset was modified in such a manner. From the time that Cocos et al was published, subsequent papers have also used the 95-to-5 positive-to-negative split, presumably because they reuse the python script.4-7 We have made available with this letter a modification to the script that retains all available tweets.c
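The effect of such a loading step can be illustrated schematically; the function and field names below are hypothetical, not the actual download script:

```python
def build_dataset(tweet_texts, span_annotations):
    """Contrast two loading strategies. `tweet_texts` maps tweet_id -> text;
    `span_annotations` maps tweet_id -> annotated spans, with entries only
    for tweets that contain ADRs or other entities."""
    # Biased loading: keep only tweets that have span annotations,
    # silently discarding every unannotated (negative) tweet.
    biased = {tid: (text, span_annotations[tid])
              for tid, text in tweet_texts.items() if tid in span_annotations}
    # Faithful loading: keep all tweets; negatives get an empty span list.
    full = {tid: (text, span_annotations.get(tid, []))
            for tid, text in tweet_texts.items()}
    return biased, full

texts = {1: "drug X gave me a headache", 2: "taking drug X today", 3: "drug X again"}
spans = {1: [("headache", "ADR")]}
biased, full = build_dataset(texts, spans)
print(len(biased), len(full))  # 1 3
```

Any downstream paper that reuses only the first strategy inherits the 95-to-5 skew without ever seeing the negative tweets.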
Table 1. Performance comparison of NERs under different training and testing modes

| Mode | Dataset size | Precision | Recall | F1-score |
|---|---|---|---|---|
| Cocos et al on MostlyPos dataset, as published | 844 tweets | 0.70 (0.66-0.74) | 0.82 (0.76-0.89) | 0.75 (0.74-0.76) |
| October 2018: train MostlyPos and test MostlyPos | 526 tweets | 0.76 (0.70-0.82) | 0.72 (0.63-0.81) | 0.73 (0.70-0.76) |
| October 2018: train MostlyPos and test Standard | 644 tweets | 0.60 (0.54-0.65) | 0.70 (0.62-0.77) | 0.63 (0.60-0.66) |
| October 2018: train Standard and test Standard | 1012 tweets | 0.73 (0.66-0.79) | 0.60 (0.52-0.68) | 0.64 (0.62-0.66) |
| Cocos et al on ADRMine Dataset | 1784 tweets | 0.68 (0.62-0.73) | 0.69 (0.62-0.75) | 0.67 (0.66-0.69) |
| ADRMine as reported in Nikfarjam et al2 | 1784 tweets | 0.76 | 0.68 | 0.72 |

Values are mean (95% confidence interval). Scores were achieved by each model over 10 training and evaluation rounds. MostlyPos refers to how the dataset is used by Cocos et al (ie, removing tweets without span annotations), hence leaving mostly positive tweets. Standard refers to the dataset including a roughly 50-50 balance of positive to negative tweets, as in Nikfarjam et al2 and the balance of the ADRMine Dataset.

a https://github.com/azinik/ADRMine. Accessed November 21, 2018.
b Accessed November 21, 2018.
c Accessed November 21, 2018.

In conclusion, the performance reported for the RNN model in Cocos et al is not comparable to any prior published approach and, in effect, when trained and tested with the full dataset, its performance (0.64) is significantly lower than the state of the art for the task (0.72).2 ADR mentions are very rare events on social media, as shown by the Social Media Mining for Health (SMM4H) shared tasks on automatic classification of ADR mentions in social media. Even after three years, the best classifier reaches only a precision of 0.44 and recall of 0.63, for an F-measure of 0.52.8 The upfront stripping of negative examples, whereby 95% of the dataset contains at least 1 ADR or indication mention, as done in Cocos et al, results in an extremely biased dataset, which in turn results in a model biased to the positive class that does not reflect any realistic deployment of a solution to the original problem.
FUNDING
This work was supported by National Institutes of Health National Library
of Medicine grant number 5R01LM011176. The content is solely the respon-
sibility of the authors and does not necessarily represent the official views of
the National Library of Medicine or National Institutes of Health.
AUTHOR CONTRIBUTORS
AM first noted the data use problem, ran the experiments, and wrote the initial draft of the manuscript. AS and AN contributed to some sections and made edits to the manuscript. GG designed the experiments and wrote the final version of the manuscript.
Conflict of interest statement: None declared.
REFERENCES
1. Cocos A, Fiks AG, Masino AJ. Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts. J Am Med Inform Assoc 2017; 24 (4): 813-21.
2. Nikfarjam A, Sarker A, O'Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 2015; 22 (3): 671-81.
3. Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C 2012; 42 (4): 463-84.
4. Gupta S, Gupta M, Varma V, Pawar S, Ramrakhiyani N, Palshikar GK. Multi-task learning for extraction of adverse drug reaction from tweets. In: European Conference on Information Retrieval, 2018; 59-71.
5. Gupta S, Pawar S, Ramrakhiyani N, Palshikar GK, Varma V. Semi-supervised recurrent neural network for adverse drug reaction mention extraction. BMC Bioinformatics 2018; 19 (Suppl 8): 212.
6. Gupta S, Gupta M, Varma V, Pawar S, Ramrakhiyani N, Palshikar GK. Co-training for extraction of adverse drug reaction mentions from tweets. In: European Conference on Information Retrieval, 2018; 556-62.
7. Chowdhury S, Zhang C, Yu PS. Multi-task pharmacovigilance mining from social media posts. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018; 117-26.
8. Weissenbacher D, Sarker A, Paul MJ, Gonzalez-Hernandez G. Overview of the third Social Media Mining for Health (SMM4H) shared tasks at EMNLP 2018. In: Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop and Shared Task, 2018; 13-16.