Automatic Breast Cancer Cohort Detection
from Social Media for Studying Factors
Affecting Patient Centered Outcomes
Mohammed Ali Al-Garadi1, Yuan-Chi Yang1, Sahithi Lakamana1, Jie Lin3,
Sabrina Li3, Angel Xie3, Whitney Hogg-Bremer1, Mylin Torres2, Imon
Banerjee1,4, and Abeed Sarker1
1Department of Biomedical Informatics, School of Medicine, Emory University,
Atlanta GA 30322, USA
{m.a.al-garadi,yuan-chi.yang,slakama,whitney.hogg,
imon.banerjee,abeed.sarker}@emory.edu
2Department of Radiation Oncology, School of Medicine, Emory University, Atlanta
GA 30322, USA
matorre@emory.edu
3Department of Computer Science, College of Arts and Sciences
{linyi.li,jie.lin,angel.xie}@emory.edu
4Department of Radiology, School of Medicine, Emory University, Atlanta GA
30322, USA
Abstract. Breast cancer patients often discontinue their long-term treat-
ments, such as hormone therapy, increasing the risk of cancer recurrence.
These discontinuations may be caused by adverse patient-centered out-
comes (PCOs) due to hormonal drug side effects or other factors. PCOs
are not detectable through laboratory tests, and are sparsely documented
in electronic health records. Thus, there is a need to explore comple-
mentary sources of information for PCOs associated with breast cancer
treatments. Social media is a promising resource, but extracting true
PCOs from it first requires the accurate detection of breast cancer pa-
tients. We describe a natural language processing (NLP) architecture for
automatically detecting breast cancer patients from Twitter based on
their self-reports. The architecture employs breast cancer related key-
words to collect streaming data from Twitter, applies NLP patterns to
pre-filter noisy posts, and then employs a machine learning classifier
trained using manually-annotated data (n=5019) for distinguishing first-
hand self-reports of breast cancer from other tweets. A classifier based on
bidirectional encoder representations from transformers (BERT) showed
human-like performance and achieved an F1-score of 0.857 (inter-annotator
agreement: 0.845; Cohen’s kappa) for the positive class, considerably
outperforming the next best classifier, a deep neural network (F1-score:
0.665). Qualitative analyses of posts from automatically-detected users
revealed discussions about side effects, non-adherence and mental health
conditions, illustrating the feasibility of our social media-based approach
for studying breast cancer related PCOs from a large population.
Keywords: breast cancer · social media · natural language processing.
medRxiv preprint doi: https://doi.org/10.1101/2020.05.17.20104778; this version posted May 21, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
1 Introduction
1.1 Background
Women with breast cancer comprise the largest group of cancer survivors in
high-income countries such as the United States, particularly due to the avail-
ability of advanced treatments (e.g., hormone therapy) that have significantly
reduced mortality rates. Due to the treatment-driven increased life expectancy of
breast cancer survivors, their physical and psychological well-being are regarded
as important patient-centered outcomes (PCOs), specifically among younger pa-
tients. Breast cancer patients often suffer from various treatment-related side
effects and other negative outcomes, which range from short-term pain, nausea
and fatigue, to lingering psychological dysfunctions such as depression, anxiety,
and suicidal tendency. Consequently, one-third to half of young breast cancer
patients discontinue their treatments, such as endocrine therapy, thus increasing
the risk of cancer recurrence and therefore of death [7, 8]. In addition, non-
adherence to prescribed therapy is associated with poor quality of life, more
physician visits and hospitalizations, and longer hospital stays [9].
PCOs, including treatment-related side-effects, are not captured in labo-
ratory or diagnostic tests, but are gathered through patient communications.
Sometimes these outcomes are captured as free text in clinical narratives writ-
ten by caregivers. PCOs documented in this manner, however, are often subject
to biases and incompleteness of data in the Electronic Health Records (EHR).
In many cases PCOs are not documented at all. We demonstrated the under-
documentation of PCOs of oncology patients in EHRs in a recent study [2].
Specifically, with the approval of the Stanford Institutional Review Board (IRB),
we deployed a simple rule-based NLP pipeline for breast cancer, which searched
for documentation of physical and mental PCOs affecting patient well-being in
EHRs. Physical PCOs (type 1 PCOs) consisted of pain, nausea, hot flush, and fatigue,
while mental PCOs (type 2) included anxiety, depression, and suicidal tendency.
On 100 randomly selected clinical notes of breast cancer patients, the model
achieved an F1-score of 0.9 when validated against manually-labeled ground truth. We
applied the validated model on the Stanford breast cancer dataset (Oncoshare),
which contains an assortment of clinical notes (e.g., progress notes, oncology
notes, discharge summaries, nursing notes) associated with 8,956 women di-
agnosed with breast cancer from 2008 to 2018. As depicted in Table 1, only
8% of clinical notes and 12% of progress notes contained any documentation
(affirmed or negated) of PCOs. Importantly, for as many as 30% of breast cancer
patients, there were no documented PCOs at any time point at all.
Note: We use the terms ‘survivor’ and ‘patient’ interchangeably in this paper.

Table 1. Results of patient-centered outcome extraction from clinic notes of Stanford
Breast Cancer Cohort (2008 - 2018).
Data                              Total counts   Documentation of type 1 PCO      Documentation of type 2 PCO
                                                 (lymphedema, nausea, fatigue)    (anxiety, depression, suicidal)
SHC (2008 - 2018) breast cancer patients
Number of patients                9755           6970                             6726
Number of clinical notes          1003210        85039 (8.47%)                    82466 (8.22%)
Outpatient progress notes         240486         30219 (12.56%)                   29701 (12.35%)
Inpatient progress notes          153915         18714 (12.16%)                   15754 (10.23%)
History and Physical              21475          3216 (14.97%)                    2531 (11.78%)
Consultation note                 25557          3824 (14.96%)                    3979 (15.57%)
Nursing note                      58859          3690 (6.26%)                     2404 (4.08%)
Discharge summary                 10334          1126 (10.89%)                    2404 (23.26%)
Other notes (ED, letters etc.)    492584         24250 (4.92%)                    25693 (5.21%)

The under-documentation of PCOs acts as a limiting factor to study the
long-term treatment outcomes of young breast cancer patients. Most of the past
studies focusing on PCOs have either relied on only small populations of
clinical trial patients or analyzed short-term side effects collected during frequent
clinic visit periods. Another important limiting factor to understanding the
outcomes that matter to patients is that studies focusing on EHRs only capture
clinical information, not other relevant factors and patient characteristics that
influence their long- and short-term outcomes. Some studies have investigated
the feasibility of monitoring patient-reported outcomes (PROs) among oncology
patients using sources other than EHRs, such as web portals, mobile applica-
tions and automated telephone calls, and their findings suggest that monitoring
PROs outside of clinic visits may be more effective and reduce adverse outcomes.
However, engaging oncology patients in such routine monitoring activities is ex-
tremely resource intensive (expensive) and they only enable the collection of lim-
ited information from homogeneous cohorts. Given the under-documentation in
EHRs and the laborious process of conducting patient surveys, there is a need to
identify complementary sources of information for PCOs associated with breast
cancer patients/survivors, and to develop new strategies for capturing diverse
patient-level and population-level health-related outcomes.
One promising, albeit challenging, source of information for population-level
breast cancer PCOs/PROs is social media. Several studies, including our own,
have utilized social media to identify large cohorts of users with common health-
related conditions, and then mine relevant longitudinal information about the
cohorts using NLP methods. For example, in our past research, we showed that
carefully-designed NLP pipelines can be used to discover cohorts of pregnant
women [11] or patients suffering from opioid use disorder [6] from social media,
and then mine important information from their social media posts (e.g., medi-
cation usage and recovery strategies). For cancer, studies have investigated the
role of social media platforms for tasks such as spreading breast cancer aware-
ness, health promotion, and cancer prevention [1,3]. However, to the best of our
knowledge, no past research has attempted to accurately detect cancer cohorts
from social media to study long-term cohort-specific information at scale.
1.2 Objectives
We had the following 3 specific objectives for this study, each dependent on the
previous one:
(a) Assess if breast cancer patients discuss personal health-related information
on Twitter, including the self-reporting of their positive breast cancer diag-
nosis/status.
(b) Develop a social media mining pipeline for detecting self-reports of breast
cancer using NLP and machine learning methods from Twitter (the primary
aim of the paper).
(c) Gather longitudinal information from the profiles of the automatically-detected
users, and qualitatively analyze the information to ascertain if long-term re-
search can be conducted on this cohort.
2 Materials and Methods
2.1 Data and Annotation
We collected data from Twitter using keywords and hashtags via the public
streaming application programming interface (API). We used four keywords:
(i) cancer, (ii) breast cancer, (iii) tamoxifen, (iv) survivor, and their hashtag
equivalents. An inspection of Twitter data retrieved by these keywords showed
that while there are many health-related posts from real breast cancer patients,
they were hidden within large amounts of noise. Table 2 shows examples of
tweets mentioning these keywords, including breast cancer self-reports (category:
S), and tweets that were not relevant (category: NR). We filtered out most of
the irrelevant tweets by employing several simple rule- and pattern-matching
methods, only keeping tweets that matched the patterns, which were as follows:
Tweet contains [#]breast & [#]cancer & [#]survivor; OR
Tweet contains [#]breastcancer & #survivor; OR
Tweet contains [#]tamoxifen AND ([#]cancer OR [#]survivor); OR
Tweet contains a personal pronoun (e.g., ‘my’, ‘I’, ‘me’, ‘us’) AND [#]breast & [#]cancer
These patterns were developed via a brief manual analysis of Twitter chatter
using the Twitter website (i.e., its search option). From Table 2, we see that the
pattern-based filter does not remove all irrelevant tweets. To fully automate the detection
and collection of a Twitter breast cancer cohort, it is necessary to detect self-
reports with higher accuracy. Therefore, we employed supervised classification,
similar to our past research focusing on Twitter and a pregnancy cohort [11].
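The four filtering patterns listed above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name `passes_prefilter`, the tokenization regular expression, and the restriction of the pronoun list to the four examples given in the text are all assumptions.

```python
import re

# Illustrative pronoun list: only the four examples given in the text.
PRONOUNS = {"i", "my", "me", "us"}


def passes_prefilter(tweet: str) -> bool:
    """Return True if the tweet matches at least one of the four patterns."""
    tokens = re.findall(r"#?\w+", tweet.lower())
    hashtags = {t for t in tokens if t.startswith("#")}
    plain = {t for t in tokens if not t.startswith("#")}
    # "[#]word" in the patterns means the word may appear with or without '#'.
    words = plain | {t[1:] for t in hashtags}

    rule1 = {"breast", "cancer", "survivor"} <= words
    # The second pattern is written with a mandatory '#' on survivor; kept literally.
    rule2 = "breastcancer" in words and "#survivor" in hashtags
    rule3 = "tamoxifen" in words and bool({"cancer", "survivor"} & words)
    rule4 = bool(PRONOUNS & plain) and {"breast", "cancer"} <= words
    return rule1 or rule2 or rule3 or rule4
```

Under these assumptions, a tweet that mentions only "cancer" is discarded, while a tweet containing "breast", "cancer" and "survivor" is kept even when it is not a self-report, which is why a supervised classifier is still needed after this step.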
We chose a random sample of the pre-filtered tweets for manual annotations.
We excluded duplicate tweets, retweets and tweets shorter than 50 characters.
Four annotators performed the annotation of tweets, with a random number
of overlapping tweets between each pair of annotators. Each tweet was labeled
as one of three classes–(i) self-report of breast cancer (S), (ii) report of breast
cancer of a family member or friend (F), or (iii) not relevant (NR). We computed
pair-wise inter-annotator agreements using Cohen’s kappa [4]. Since we were only
interested in first person self-reports of breast cancer for this study, we combined
classes F and NR for the supervised machine learning experiments. We intend to
use information from tweets labeled as F in our future studies.
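For readers unfamiliar with Cohen's kappa, the sketch below computes it for one pair of annotators as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function name and the toy labels are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label proportions,
    # summed over all labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (observed - expected) / (1 - expected)


# Toy example with the three annotation classes (S, F, NR).
annotator_1 = ["S", "S", "NR", "NR", "F", "NR"]
annotator_2 = ["S", "NR", "NR", "NR", "F", "NR"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The sketch does not handle the degenerate case in which chance agreement equals 1 (both annotators use a single identical label throughout).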
Table 2. Sample tweets from keyword-based retrieval of data from Twitter. Tweets
have been modified to preserve anonymity. ‘*’ - tweet filtered by pattern-matching; ‘**’
- tweet not filtered by pattern-matching (requiring supervised classification).

Tweet: Iam blessed. I know this. As one of the lucky ones, my breast cancer was caught early on. Almost five years ago. @USERNAME URL #survivor #amwriting #writingcommunity #writerlift screenwriters
Pattern/Keyword Match: breast & cancer & survivor
Category: S

Tweet: It’s damn hard to fight cancer when you cold, hungry & live with constant financial stress.
Pattern/Keyword Match: cancer*
Category: NR

Tweet: Check out Shelby J’s latest single regarding her recent struggle with breast cancer and what sustained her throughout. #Survivor #EarlyDetectionSavesLives #MusicMonday
Pattern/Keyword Match: breast & cancer & survivor**
Category: NR

Tweet: Im officially a 16 year breast cancer survivor , mammogram came back all clear no evidence of recurring disease. So grateful
Pattern/Keyword Match: breast & cancer & survivor
Category: S

2.2 Supervised Classification
We experimented with multiple supervised classification approaches and
compared their performances on the same dataset. These approaches were naïve
Bayes (NB), random forest (RF), support vector machine (SVM), deep neural
network (NN), and a classifier based on bidirectional encoder representations
from transformers (BERT). For the NB, RF, and SVM classifiers, we
pre-processed the text by lowercasing, stemming, and removing URLs, usernames,
and non-English characters. Following the pre-processing, we converted the text
into features: n-grams (contiguous sequences of n words, with n ranging from 1
to 3) and word clusters (generalized representations of words learned from
medication-related chatter collected from Twitter) [12]. For these classifiers, we used count vector
representations—each tweet is represented as a sparse vector whose length is
the size of the entire feature-set/vocabulary and each vector position represents
the number of times a specific feature (e.g., a word or bi-gram) appears in the
tweet. In addition to being sparse (i.e., most of the vector numbers are 0), these
count-based representations do not capture word meanings or their similarities.
For instance, the terms ‘bad’ and ‘worst’ will be represented by orthogonal vec-
tors. Word embedding based representations such as GloVe [10] capture word
meanings and we used them for the NN classifier. However, such representations
do not capture contextual differences in the meanings of words.
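The count-based representation and its limitation can be illustrated with a small sketch. It builds 1- to 3-gram count vectors in plain Python and shows that ‘bad’ and ‘worst’ receive orthogonal vectors (dot product of zero). The helper names and the two-sentence toy corpus are assumptions made for illustration; stemming and word clusters are omitted.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """All contiguous word n-grams for n = 1..n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def count_vector(text, vocabulary):
    """Sparse-style count vector: one position per vocabulary feature."""
    counts = Counter(ngrams(text.lower().split()))
    return [counts[feature] for feature in vocabulary]


# Toy corpus; the vocabulary is the sorted set of all n-grams seen in it.
corpus = ["bad day", "worst day"]
vocabulary = sorted({g for doc in corpus for g in ngrams(doc.split())})

bad = count_vector("bad", vocabulary)
worst = count_vector("worst", vocabulary)
# The two words share no feature, so their vectors are orthogonal.
dot_product = sum(x * y for x, y in zip(bad, worst))
```

In practice the vocabulary is far larger than in this toy case, so almost every position of a tweet's vector is zero.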
Transformer-based approaches, such as BERT, encode contextual semantics
at the sentence or word-sequence level, and have vastly improved the
state-of-the-art in many NLP tasks [5]. BERT-based classifiers had not been
previously used for health cohort detection from Twitter, and in this study, we
used the BERT large model [5], which consists of 24 layers (transformer blocks),
a hidden size of 1024, and 16 attention heads, with a total of 340M parameters.
The tweets are converted into vectors by the BERT model, which captures
contextual meanings of character sequences. Following vectorization, a neural
network (dense layer) with a softmax activation is used to predict whether each
tweet belongs to class NR or S.
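The final classification step, a dense layer followed by a softmax over the two classes, can be sketched without running BERT itself. In the sketch below the input vector stands in for the sentence representation produced by the encoder; its two dimensions, the weights, and the biases are made-up toy values (the real hidden size is 1024), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vector, weights, biases, labels=("NR", "S")):
    """Dense layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weights, biases)
    ]
    probabilities = softmax(logits)
    return labels[probabilities.index(max(probabilities))], probabilities


# Toy 2-dimensional "sentence vector" and toy parameters.
label, probs = classify([1.0, 2.0], [[0.5, -0.5], [-0.5, 0.5]], [0.0, 0.0])
```

During fine-tuning, the parameters of this layer are learned jointly with the encoder weights from the annotated tweets.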
2.3 Post-classification Analyses
Following the classification experiments, we conducted manual analyses to (i)
study causes of classification errors, (ii) analyze the association between training
set size and classification performance for all classifiers, and (iii) verify if the users
detected by the classification approach discussed factors that influenced PCOs
on Twitter. For (i) we manually reviewed a sample of the misclassified tweets
to identify potential patterns. For (ii), our objective was to assess if the number
of tweets required to obtain acceptable classification performance was practical
and feasible. We drew stratified samples of the training set consisting of 20%,
40%, 60% and 80% of the set, and computed the F1-scores over the same test set.
For (iii), we collected, via the API, the past posts of a subset of automatically-
detected breast cancer positive users, and then qualitatively analyzed them. We
used simple string-matching to identify potentially relevant tweets.
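The stratified sampling used for the training-size analysis can be sketched as follows. This is one plausible implementation under stated assumptions: examples are (text, label) pairs, each class is sampled separately so that the class proportions of the full training set are preserved, and the function name and fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, fraction, seed=0):
    """Sample `fraction` of (text, label) pairs from each class separately."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        items = by_label[label]
        k = round(len(items) * fraction)
        sample.extend(rng.sample(items, k))
    rng.shuffle(sample)
    return sample


# Toy training set with an imbalanced class distribution.
toy_train = [(f"nr-{i}", "NR") for i in range(100)] + [(f"s-{i}", "S") for i in range(40)]
subset_20 = stratified_subsample(toy_train, 0.2)
```

Each subset (20%, 40%, 60%, 80%) would then be used to train a classifier that is evaluated on the same fixed test set.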
3 Results
3.1 Annotation and Supervised Classification Results
We annotated a total of 5,019 unique tweets (training: 3513; validation: 302;
test: 1204). 3736 (74%) tweets belonged to the NR class (training: 2615;
validation: 225; test: 896) and 1283 (26%) belonged to the S class (training: 898;
validation: 77; test: 308). The micro-average of the pair-wise agreements among all
annotators was 0.845 (Cohen’s κ) [4], which represents significant agreement
[13]. Table 3 presents the inter-annotator agreement (IAA) for each pair of annotators.
Table 3. Pair-wise IAAs, numbers of overlapping tweets, and overall micro average.
Annotator pair Overlap N Inter-annotator agreement
A1 & A2 86 0.898
A1 & A3 145 0.830
A1 & A4 185 0.907
A2 & A3 221 0.806
A2 & A4 168 0.836
A3 & A4 212 0.828
Micro average 1017 0.845
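The overall figure in Table 3 can be reproduced from the pair-wise values if the micro average is read as the mean of the pair-wise kappas weighted by the number of overlapping tweets. That reading is an assumption; it is consistent with the reported numbers, whereas an unweighted mean of the six kappas would give about 0.851.

```python
# (overlap N, Cohen's kappa) for each annotator pair, copied from Table 3.
pairwise_iaa = {
    ("A1", "A2"): (86, 0.898),
    ("A1", "A3"): (145, 0.830),
    ("A1", "A4"): (185, 0.907),
    ("A2", "A3"): (221, 0.806),
    ("A2", "A4"): (168, 0.836),
    ("A3", "A4"): (212, 0.828),
}

total_overlap = sum(n for n, _ in pairwise_iaa.values())
# Overlap-weighted mean of the pair-wise kappas.
micro_average = sum(n * kappa for n, kappa in pairwise_iaa.values()) / total_overlap
```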
Table 4 shows the performances of the learning algorithms on the held-out
test set. The BERT-based classifier yields the highest F1-score for class S (0.857),
significantly outperforming the other classifiers.
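The class-specific F1-scores in Table 4 are the harmonic mean of precision and recall, and the overall accuracy follows from the per-class recalls and the test-set class sizes (896 NR and 308 S tweets). The sketch below recomputes two F1-scores and the BERT accuracy from the reported values; because the table entries are rounded, recomputed values can differ from the reported ones in the last digit for some rows. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def accuracy_from_recalls(recall_nr, recall_s, n_nr=896, n_s=308):
    """Overall accuracy from per-class recall and test-set class sizes."""
    return (recall_nr * n_nr + recall_s * n_s) / (n_nr + n_s)


# Values taken from Table 4.
bert_f1_s = f1_score(precision=0.877, recall=0.837)
nn_f1_s = f1_score(precision=0.701, recall=0.633)
bert_accuracy = accuracy_from_recalls(recall_nr=0.959, recall_s=0.837)
```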
Table 4. Performances of learning models in terms of class-specific recall, precision,
F1-scores, and overall accuracy. The best F1-score on the S class is 0.857 (BERT large).
Classifier    Precision(NR)  Precision(S)  Recall(NR)  Recall(S)  F1-score(NR)  F1-score(S)  Accuracy
SVM           0.861          0.767         0.941       0.55       0.899         0.646        0.843
RF            0.826          0.849         0.975       0.402      0.894         0.546        0.828
NN            0.877          0.701         0.907       0.633      0.892         0.665        0.837
NB            0.953          0.361         0.430       0.938      0.593         0.522        0.560
BERT large    0.945          0.877         0.959       0.837      0.952         0.857        0.928

Fig. 1. Classifier performances at different training set sizes.

3.2 Post Classification Analyses Results
Classification error analyses: As per our analysis, the possible reasons for
misclassification could be attributed to factors that are common with social
media data, primarily the lack of context, ambiguous references, and the use of
colloquial language. The following tweets were classified by the annotators
as S, but BERT misclassified them:
Tweet-1: “we are sisters in this breast cancer club we never wanted to
join. bless you my friend. you are an inspiration to all of us.”
Tweet-2: “when the breast cancer center calls and asks you to donate for
the patients’ medication and you’re just like ‘i can barely afford my own’”
Learning curve at different training data sizes: Figure 1 shows the classi-
fier performances at different training data sizes with increments of 20% of the
full training set. From the figure, we see that the BERT-based classifier shows re-
markable performance even at small training set sizes. However, the performance
of this classifier does not improve further as more training data is added.
Content exploration: We found many informative tweets that covered a wide
variety of health-related, and potentially cancer-related, information. Table 5
presents some examples of tweets that were potentially relevant to the users’
PCOs. A number of users reported that they suffered from anxiety/depression,
although it was not immediately clear how their mental health conditions were
related to their cancer diagnoses and treatments. Similarly, users reported
experiencing or worrying about the side effects of prescribed medications, including
Tamoxifen, and their intentions to not adhere to the treatment. These tweets
could provide crucial information about how these survivors cope with their
treatment and medications, complementing their EHRs.
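The simple string-matching used to surface potentially relevant timeline posts can be sketched as below. The term lists are hypothetical: they are drawn from outcomes mentioned in this paper (e.g., anxiety, depression, hot flush, joint pain) and from phrasing seen in the sample posts, and are not the lists used in the study.

```python
# Hypothetical term lists, grouped by the categories discussed in the text.
PCO_TERMS = {
    "mental health": ["anxiety", "depression", "mentalhealth"],
    "side effects": ["side effect", "nausea", "fatigue", "hot flush", "joint pain"],
    "non-adherence": ["stop taking", "stopped taking", "stopped them"],
}


def tag_tweet(text):
    """Return the sorted list of categories whose terms occur in the tweet."""
    lowered = text.lower()
    return sorted(
        category
        for category, terms in PCO_TERMS.items()
        if any(term in lowered for term in terms)
    )
```

Substring matching of this kind only flags candidate posts; as noted above, a flagged post still needs manual review to decide whether it is actually related to the user's cancer diagnosis or treatment.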
4 Discussion
The capability to detect self-reports of breast cancer very accurately is a neces-
sary condition for utilizing Twitter to study PCOs associated with treatment,
and our approach has produced promising results. The transformer-based
classifier (BERT) far outperforms the traditional approaches we evaluated.
Thus, our study demonstrates that it is indeed possible to
build a large breast cancer cohort from Twitter via an automatic NLP pipeline.
Manual annotation of data is a very time-consuming task and the need to
annotate large numbers of samples for supervised classification often acts as a
barrier to practical deployment. Our experiments show that the BERT-based
model overcomes this obstacle, making full automation feasible. However, we
also discovered that it is difficult to raise the performance of this classifier simply
by annotating more data. Despite the context-incorporating sentence vectors
that are used for BERT, the model still lacks the ability to infer meanings that
are typically evident to humans. Also, our annotators benefited from implicit
knowledge of the topic and additional contextual cues, which the transformer-
based model is not able to capture. In the future, it will be important to study
how such implicit information may be encoded in numeric vectors.
5 Conclusion
We investigated the potential of using Twitter as a resource for studying PCOs
associated with breast cancer treatment by studying information posted directly
by patients. We particularly focused on (i) assessing if breast cancer patients dis-
cuss health-related information on Twitter, including the self-reporting of their
positive breast cancer status; (ii) developing an NLP-based social media mining
pipeline for detecting self-reports via supervised classification; and (iii) analyz-
ing health-related longitudinal information of automatically-detected users. We
showed that using NLP patterns and a supervised classifier, we are able to detect
breast cancer patients with high accuracy. The BERT-based classifier achieves
human-like performance with an F1-score of 0.857 over the positive class. Quali-
tative analyses of the tweets retrieved from the users’ profiles revealed that they
contain information relevant to PCOs, such as mental health issues, side effects
of medications, and medication adherence. These findings verify the potential
value of social media for studying PCOs that are rarely captured in EHRs. Our
future work will focus on collecting large samples of breast cancer patients from
Twitter using the methods described, and then implementing further NLP-based
methods for studying breast cancer related PCOs from a large cohort.
Table 5. Sample posts that are relevant to the users’ health conditions, collected from
the timelines of automatically-detected users. The posts were manually curated and
categorized. URLs and emojis have been removed; usernames have been anonymized.

# 1
Tweet: Sooooo..... the doc put me on an anxiety/anti-depression med the other day (cuz cancer is still a b*tch). She told me to take in the morning. Uh no. I’ve been asleep for 2 days almost. Taking that joint at night.
Comment: Mental health issues

# 2
Tweet: my #mentalhealth suffered unnecessarily and drastically due to #thyroid medications that didn’t work for me, for my body. even when #hypothyroid (on paper) is treated it can make you feel even more unwell. keep asking for help from new medical professionals until one listens.
Comment: Mental health issues

# 3
Tweet: Here we are at my Oncology follow up appointment. I didnt really get on with the tablets prescribed for hot flushes. They made me so sleepy I felt like a zombie and a lower mood than usual so I stopped them. Hopefully get echocardiogram results today too
Comment: Side effects, nonadherence intention

# 4
Tweet: I’m learning something new every day about my #breastcancer. While seeing the oncologist yesterday, I said I know if I stay on my 5 year hormone therapy plan, there is a 9% chance of recurrence. So I asked what if I stop taking the medicine so I no longer have joint pain...
Comment: Side effects

# 5
Tweet: New drug today Docetaxel. Not got my usual anti sickness prescribed so I’m feeling quite nervous about how it’s going to take me I was vomiting on the EC treatment. But on the positive this is number 5 of 8. #breastcancer #chemotherapy
Comment: Side effects

# 6
Tweet: And Im having a mentally poor day. For all its benefits in preventing #breastcancer recurrence, I think I am going to have to stop taking #Tamoxifen I have a review at the hospital shortly to discuss. Yes, I am grateful that this drug is available but the quality of life is poor
Comment: Side effects, nonadherence intention

# 7
Tweet: Another night, another with lack of sleep. How Im supposed to continue getting by on 3-4hrs sleep every night is beyond me and definitely contributing to my emotional state of mind. I havent had one night since pre #breastcancer where Ive slept all night #mentalhealth #tamoxifen
Comment: Mental health issues, side effects

# 8
Tweet: The prize for finishing chemo is taking a drug that can cause uterine cancer. #oneroundleft #breastcancer #tamoxifen
Comment: Side effects

References
1. Attai, D.J., Cowher, M.S., Al-Hamadani, M., Schoger, J.M., Staley, A.C., Lan-
dercasper, J.: Twitter social media is an effective tool for breast cancer patient
education and support: patient-reported outcomes by survey. Journal of medical
Internet research 17(7), e188 (2015)
2. Banerjee, I., Bozkurt, S., Caswell-Jin, J.L., Kurian, A.W., Rubin, D.L.: Natural
language processing approaches to detect the timeline of metastatic recurrence of
breast cancer. JCO clinical cancer informatics 3, 1–12 (2019)
3. Bottorff, J.L., Struik, L.L., Bissell, L.J., Graham, R., Stevens, J., Richardson, C.G.:
A social media approach to inform youth about breast cancer and smoking: An
exploratory descriptive study. Collegian 21(2), 159–168 (2014)
4. Cohen, J.: A coefficient of agreement for nominal scales. Educational and psycho-
logical measurement 20(1), 37–46 (1960)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
6. Graves, R.L., Sarker, A., Al-Garadi, M.A., Yang, Y.c., Love, J.S., O’Connor, K.,
Gonzalez-Hernandez, G., Perrone, J.: Effective buprenorphine use and tapering
strategies: Endorsements and insights by people in recovery from opioid use disor-
der on a reddit forum. bioRxiv p. 871608 (2019)
7. van Herk-Sukel, M.P.P., van de Poll-Franse, L.V., Voogd, A.C., Nieuwenhuijzen,
G.A.P., Coebergh, J.W.W., Herings, R.M.C.: Half of breast cancer patients discon-
tinue tamoxifen and any endocrine treatment before the end of the recommended
treatment period of 5 years: a population-based analysis. Breast Cancer Research
and Treatment 122(3), 843–851 (2010). https://doi.org/10.1007/s10549-009-0724-3
8. McCowan, C., Shearer, J., Donnan, P.T., Dewar, J.A., Crilly, M., Thompson, A.M.,
Fahey, T.P.: Cohort study examining tamoxifen adherence and its relationship to
mortality in women with breast cancer. British journal of cancer 99(11), 1763–1768
(dec 2008). https://doi.org/10.1038/sj.bjc.6604758
9. Milata, J.L., Otte, J.L., Carpenter, J.S.: Oral Endocrine Therapy Non-
adherence, Adverse Effects, Decisional Support, and Decisional Needs
in Women With Breast Cancer. Cancer nursing 41(1), E9–E18 (2018).
https://doi.org/10.1097/NCC.0000000000000430
10. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word
representation. In: Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP). pp. 1532–1543. Association for
Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1162,
https://www.aclweb.org/anthology/D14-1162
11. Sarker, A., Chandrashekar, P., Magge, A., Cai, H., Klein, A., Gonzalez, G.: Dis-
covering cohorts of pregnant women from social media for safety surveillance and
analysis. Journal of medical Internet research 19(10), e361 (2017)
12. Sarker, A., Gonzalez, G.: A corpus for mining drug-related knowledge from Twitter
chatter: Language models and their utilities. Data in brief 10, 122–131 (2017)
13. Viera, A.J., Garrett, J.M., et al.: Understanding interobserver agreement: the
kappa statistic. Fam Med 37(5), 360–363 (2005)
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2020.05.17.20104778; this version posted May 21, 2020.