Application Note
Data and Text Mining
LexExp: A system for automatically expanding
concept lexicons for noisy biomedical texts
Abeed Sarker1,*
1Department of Biomedical Informatics, School of Medicine, Emory University, 101 Woodruff Circle,
Atlanta, GA 30322, USA
*To whom correspondence should be addressed.
Abstract
Summary: LexExp is an open-source, data-centric lexicon expansion system that generates spelling
variants of lexical expressions in a lexicon using a phrase embedding model, lexical similarity-based
natural language processing methods, and a set of tunable threshold decay functions. The system is
customizable, can be optimized for recall or precision, and can generate variants for multi-word
expressions.
Availability and implementation: Code available at: https://bitbucket.org/asarker/lexexp; Data and
resources available at: https://sarkerlab.org/lexexp.
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Lexicon- or dictionary-based biomedical concept detection approaches
require the manual curation of relevant lexical expressions, generally by
domain experts (Rebholz-Schuhmann et al., 2013; Shivade et al., 2014;
Demner-Fushman and Elhadad, 2016; Ghiassi and Lee, 2018). To aid the
tedious process of lexicon creation, automated lexicon expansion methods
have received considerable research attention, leading to resources such
as the UMLS SPECIALIST system (McCray et al., 1993), which is
utilized in MetaMap (Aronson and Lang, 2010) and cTAKES (Savova et
al., 2010). Such systems perform well for formal biomedical texts, but not for
noisy texts from sources such as social media and electronic health
records. Due to the presence of non-standard expressions, misspellings,
and abbreviations, it is not possible to capture all possible concept variants
in noisy texts using manual or traditional lexicon expansion approaches.
The number of lexical variants of a given concept that are used, although
finite, cannot be predetermined. Biomedical concepts, such as symptoms
and medications, are particularly likely to be misspelled compared with non-
biomedical concepts (Soualmia et al., 2012; Zhou et al., 2015). Despite
advances in machine learning sequence labeling approaches, which
typically outperform lexicon-based approaches and can detect inexact
concept expressions, lexicon-based approaches are frequently used in
biomedical research. This is partly because machine learning methods require manually annotated datasets, which can be time-consuming and expensive to create, and partly because training and executing state-of-the-art machine learning approaches may require technical expertise and high-performance computing resources that are not always available. In this article,
we describe an unsupervised lexicon expansion system (LexExp), which
automatically generates many lexical variants of expressions encoded in a
lexicon.
2 Materials and methods
LexExp builds on recent studies, including our own, which utilize the
semantic similarities captured by word2vec-type dense-vector models
(Mikolov et al., 2013) to automatically identify similar terms and variants
(Percha et al., 2018; Sarker and Gonzalez-Hernandez, 2018; Viani et al.,
2019). LexExp employs customizable threshold-decay functions
(constant, linear, cosine, exponential), combined with dense-vector and
lexical similarities, to generate many variants of lexicon entries. Similarity
thresholding for determining lexical matches is popular (Fischer, 1982);
most approaches apply static thresholding while some recent studies have
attempted to employ dynamic thresholding for misspelling correction or
generation (Savary, 2002; Sarker and Gonzalez-Hernandez, 2018).
However, there is no existing tool that enables the use of customizable
thresholding options for these tasks. Note that the objective of LexExp is
to generate lexical variants of multi-word expressions, not lexically
dissimilar semantic variants (e.g., synonyms).
Given a lexicon entry, LexExp first generates word n-grams (n = 1 and 2) from the entry; for one-word expressions, only unigrams are generated. For each n-gram within the entry, a dense embedding model is used to retrieve the n most semantically similar words/phrases using cosine similarity, if the n-gram is present in the model. Next, all the
words/phrases whose semantic similarities with the n-grams are higher
than a threshold are included as candidate variants. For each candidate, its
Levenshtein ratio is computed against the original n-gram and a separate
threshold for lexical similarity (t) is applied. All candidates below the
threshold are removed from the list of possible variants. The same process
is applied recursively on each remaining candidate until no new variants
with similarity above t are found. While we used our own embedding
model (Sarker and Gonzalez, 2017), any compatible embedding model can be used for identifying
semantically similar terms in LexExp.
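For illustration, a minimal Python sketch of the recursive candidate-generation loop described above is shown below. The model interface (a gensim-style most_similar lookup and membership test), the use of difflib as a stand-in for the Levenshtein ratio, and the default parameter values are assumptions for illustration rather than LexExp's actual implementation.

from difflib import SequenceMatcher

def lexical_similarity(a, b):
    # Approximation of the Levenshtein ratio using difflib's SequenceMatcher.
    return SequenceMatcher(None, a, b).ratio()

def expand_term(term, model, t, sem_threshold=0.5, topn=20):
    # Recursively collect spelling variants of `term`.
    # `model` is assumed to expose a gensim-style most_similar() lookup and
    # membership test over words/phrases; `t` is the lexical-similarity threshold.
    variants, queue = {term}, [term]
    while queue:
        current = queue.pop()
        if current not in model:
            continue  # skip n-grams absent from the embedding model
        for candidate, sem_sim in model.most_similar(current, topn=topn):
            if sem_sim < sem_threshold:
                continue  # semantically too distant
            if lexical_similarity(candidate, term) < t:
                continue  # lexically too distant from the seed entry
            if candidate not in variants:
                variants.add(candidate)
                queue.append(candidate)  # recurse on the new variant
    return variants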
2.1 Lexical similarity thresholding functions
LexExp provides the user with four functions that can be used to vary t
based on the character lengths of the input n-grams. Typically, longer
terms/phrases have true variants that are lexically more distant from the
original entry. So, adjusting t based on the length of an expression may lead to better precision and/or recall (note that, in this context, recall and precision are both ill-defined: recall, because there is no known bound on the total number of variants, and precision, because the set of true variants depends on the research task). LexExp provides four possible functions that may be used to vary t:
Static: t = t_i (the threshold does not change with expression length)
Linear decay: t decreases linearly with the character length of the expression, from the initial threshold t_i toward the lower bound t_l
Cosine decay: t decreases from t_i toward t_l following a cosine curve over the expression length
Exponential decay: t decreases from t_i toward t_l exponentially with the expression length
In each case, the constants m and n control the onset and rate of the decay, t_i is the initial threshold, and t_l is the lower bound for t. Figure 1 illustrates how these thresholding methods vary for expressions of length 1 to 30 characters. These thresholding functions are designed to give the user the flexibility to adjust the threshold to the needs of a task.
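Since the exact functional forms are configurable, the Python sketch below shows one plausible parameterization of the four decay options; the formulas and the default values for t_i, t_l, m, and n are illustrative assumptions, not the published LexExp equations.

import math

def decayed_threshold(length, mode="linear", t_i=0.85, t_l=0.55, m=5, n=30):
    # Illustrative threshold-decay functions; forms and defaults are assumed.
    if mode == "static":
        return t_i
    if mode == "linear":
        t = t_i - max(length - m, 0) / n          # linear drop after m characters
    elif mode == "cosine":
        x = min(max(length - m, 0), n) / n        # progress through the decay window
        t = t_l + (t_i - t_l) * (1 + math.cos(math.pi * x)) / 2
    elif mode == "exponential":
        t = t_l + (t_i - t_l) * math.exp(-max(length - m, 0) / n)
    else:
        raise ValueError("unknown mode: " + mode)
    return max(t_l, min(t_i, t))                  # clamp to [t_l, t_i]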
2.2 Multi-word variants
A key functionality of LexExp is its ability to generate variants for multi-
word expressions. Capturing variants of multi-word expressions
comprehensively is particularly challenging via manual annotation since
the number of possible word combinations can be very high. Also, phrase
embedding models cannot capture the semantics of long multi-word
expressions due to the sparsity of their occurrences.
LexExp uses two functions for generating multi-word variants. The first
is a unigram variant generation function that generates variants for each
word based on a specific value of t, and then generates all combinations
of the original expression based on the variants identified, keeping the
ordering of the variants unchanged. Examples of variants generated by this
function are shown below:
Original expression: eyes were excruciatingly sensitive and sore
Sample variants:
1: eyes were excrusiatingly sensistive and sore
2: eyes were excruciatingly sensitve and sore
3: eyes were excrusiatingly sensitive and sore
4: eyes were excrusiatingly sensetive and soer
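As a minimal sketch of this combination step (assuming the per-word variant sets have already been produced by the unigram expansion function), the ordered combinations can be generated with a Cartesian product; the function name and the example variant sets are illustrative assumptions.

from itertools import product

def combine_variants(expression, word_variants):
    # Generate ordered combinations of per-word variants. `word_variants`
    # maps each word to a set of its variants (hypothetical helper output);
    # word order in the original expression is preserved.
    slots = [sorted({word} | word_variants.get(word, set()))
             for word in expression.split()]
    return [" ".join(combo) for combo in product(*slots)]

# For example (variant sets shown are illustrative):
# combine_variants("sensitive and sore",
#                  {"sensitive": {"sensistive", "sensitve"}, "sore": {"soer"}})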
The second is a bigram generation function, which first tokenizes the expression into bigrams and then generates variants of the bigrams. These variants may be unigrams or multi-grams (e.g., stomach ache: stomachache, mild stomach ache). After the variants are generated, they are tokenized into unigrams, and all combinations of the unigrams are generated as described before. Recombining the bigrams following the generation of the variants can be complicated in some cases, as a term and its partial variant may both be present in a combination. For example, for an expression containing 'heart burning after', the unigram variant 'heartburn' (derived from 'heart burning') may be followed by 'burn after' (a variant of 'burning after') when the initial combination is generated. LexExp attempts to resolve such cases using a simple forward and backward pass through the list of words, removing all words that are identical to, or substrings of, the next/previous word, as sketched below.
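A minimal sketch of this overlap-resolution pass is given below; the function name and the two-pass formulation are illustrative assumptions about how such a forward and backward pass could be coded, not the tool's actual implementation.

def resolve_overlaps(words):
    # Forward pass: drop a word that is identical to, or a substring of, the
    # next word. Backward pass: drop a word that is identical to, or a
    # substring of, the previous word. Illustrative sketch only.
    forward = [w for i, w in enumerate(words)
               if i == len(words) - 1 or w not in words[i + 1]]
    resolved = [w for i, w in enumerate(forward)
                if i == 0 or w not in forward[i - 1]]
    return resolved

# e.g. resolve_overlaps(["heartburn", "burn", "after"]) -> ["heartburn", "after"]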
3 Conclusion
We ran LexExp on multiple lexicons, including a COVID-19 symptoms Twitter lexicon (Sarker et al., 2020), adverse drug reactions (Sarker and Gonzalez, 2015), a subset of the consumer health vocabulary (Zeng and Tse, 2006), and psychosis symptoms from electronic health records (Viani et
al., 2019). Our initial intent was to rapidly expand a lexicon of COVID-19 symptoms, but the developed system is useful for tasks beyond this initial intent. We also compared tweet retrieval numbers for COVID-19 symptom-mentioning tweets using the abovementioned lexicon with and without variants, and observed an increase of 16.6%. Further details about these experiments are provided in the supplementary material. As a
lexicon expansion system, the purpose of LexExp is not to obtain perfect
accuracy; in fact, accuracy is not well-defined for this generation task.
The objective, instead, is to automatically generate large sets of possible
variants that can be readily used by human experts for information
retrieval and extraction.
Funding
Research reported in this publication was supported by NIDA of the NIH under
award number R01DA046619. The content is solely the responsibility of the
authors and does not necessarily represent the official views of the NIH.
Conflict of Interest: none declared.
References
Aronson, A. R. and Lang, F.-M. (2010) ‘An overview of MetaMap:
historical perspective and recent advances.’, Journal of the American
Medical Informatics Association : JAMIA. American Medical
Informatics Association, 17(3), pp. 229–236. doi:
10.1136/jamia.2009.002733.
Demner-Fushman, D. and Elhadad, N. (2016) ‘Aspiring to Unintended
Consequences of Natural Language Processing: A Review of Recent
Developments in Clinical and Consumer-Generated Text Processing’,
IMIA Yearbook, (1), pp. 224–233. doi: 10.15265/IY-2016-017.
Fischer, R.-J. (1982) ‘A Threshold Method of Approximate String
Matching’, in. Springer, Berlin, Heidelberg, pp. 843–849. doi:
10.1007/978-3-642-93201-4_150.
Ghiassi, M. and Lee, S. (2018) ‘A domain transferable lexicon set for
Twitter sentiment analysis using a supervised machine learning
approach’, Expert Systems with Applications. Elsevier Ltd, 106, pp. 197–216. doi: 10.1016/j.eswa.2018.04.006.
McCray, A. T. et al. (1993) ‘UMLS® knowledge for biomedical
language processing’, Bulletin of the Medical Library Association. Bull
Med Libr Assoc, 81(2), pp. 184–194.
Mikolov, T. et al. (2013) ‘Distributed Representations of Words and
Phrases and their Compositionality’, NIPS, pp. 1–9. doi:
10.1162/jmlr.2003.3.4-5.951.
Percha, B. et al. (2018) ‘Expanding a radiology lexicon using contextual
patterns in radiology reports’, Journal of the American Medical
Informatics Association. doi: 10.1093/jamia/ocx152.
Rebholz-Schuhmann, D. et al. (2013) ‘Evaluating gold standard corpora
against gene/protein tagging solutions and lexical resources’, Journal of
Biomedical Semantics, 4(1), p. 28. doi: 10.1186/2041-1480-4-28.
Sarker, A. et al. (2020) ‘Self-reported COVID-19 symptoms on Twitter:
An analysis and a research resource’, medRxiv. Cold Spring Harbor
Laboratory Press, p. 2020.04.16.20067421. doi:
10.1101/2020.04.16.20067421.
Sarker, A. and Gonzalez-Hernandez, G. (2018) ‘An unsupervised and
customizable misspelling generator for mining noisy health-related text
sources’, Journal of Biomedical Informatics, 88. doi:
10.1016/j.jbi.2018.11.007.
Sarker, A. and Gonzalez, G. (2015) ‘Portable automatic text
classification for adverse drug reaction detection via multi-corpus
training’, Journal of Biomedical Informatics, 53. doi:
10.1016/j.jbi.2014.11.002.
Sarker, A. and Gonzalez, G. (2017) ‘A corpus for mining drug-related
knowledge from Twitter chatter: Language models and their utilities’,
Data in Brief. doi: 10.1016/j.dib.2016.11.056.
Savary, A. (2002) ‘Typographical nearest-neighbor search in a finite-
state lexicon and its application to spelling correction’, in Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics). Springer Verlag, pp. 251–260. doi: 10.1007/3-540-36390-4_21.
Savova, G. K. et al. (2010) ‘Mayo clinical Text Analysis and Knowledge
Extraction System (cTAKES): architecture, component evaluation and
applications’, Journal of the American Medical Informatics Association:
JAMIA, 17(5), pp. 507–513. doi: 10.1136/jamia.2009.001560.
Shivade, C. et al. (2014) ‘A review of approaches to identifying patient
phenotype cohorts using electronic health records’, Journal of the
American Medical Informatics Association. American Medical
Informatics Association, 21(2), pp. 221–230. doi: 10.1136/amiajnl-2013-001935.
Soualmia, L. F. et al. (2012) ‘Matching health information seekers’
queries to medical terms’, BMC Bioinformatics. BioMed Central,
13(SUPPL 1), p. S11. doi: 10.1186/1471-2105-13-S14-S11.
Viani, N. et al. (2019) ‘Generating positive psychosis symptom
keywords from electronic health records’, in Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics). Springer Verlag, pp. 298–303. doi:
10.1007/978-3-030-21642-9_38.
Zeng, Q. T. and Tse, T. (2006) ‘Exploring and developing consumer
health vocabularies’, Journal of the American Medical Informatics
Association. doi: 10.1197/jamia.M1761.
Zhou, X. et al. (2015) ‘Context-Sensitive Spelling Correction of
Consumer-Generated Content on Health Care’, JMIR Medical
Informatics. JMIR Publications Inc., 3(3), p. e27. doi:
10.2196/medinform.4211.