Bioinformatics, YYYY, 0–0
Advance Access Publication Date: DD Month YYYY
Data and Text Mining
LexExp: A system for automatically expanding
concept lexicons for noisy biomedical texts
Department of Biomedical Informatics, School of Medicine, Emory University, 101 Woodruff Circle,
Atlanta, GA 30322, USA
*To whom correspondence should be addressed.
Summary: LexExp is an open-source, data-centric lexicon expansion system that generates spelling
variants of lexical expressions in a lexicon using a phrase embedding model, lexical similarity-based
natural language processing methods, and a set of tunable threshold decay functions. The system is
customizable, can be optimized for recall or precision, and can generate variants for multi-word
expressions.
Availability and implementation: Code available at: https://bitbucket.org/asarker/lexexp; Data and
resources available at: https://sarkerlab.org/lexexp.
Supplementary information: Supplementary data are available at Bioinformatics online.
Lexicon- or dictionary-based biomedical concept detection approaches
require the manual curation of relevant lexical expressions, generally by
domain experts (Rebholz-Schuhmann et al., 2013; Shivade et al., 2014;
Demner-Fushman and Elhadad, 2016; Ghiassi and Lee, 2018). To aid the
tedious process of lexicon creation, automated lexicon expansion methods
have received considerable research attention, leading to resources such
as the UMLS SPECIALIST system (McCray et al., 1993), which is
utilized in MetaMap (Aronson and Lang, 2010) and cTAKES (Savova et
al., 2010). Such systems perform well for formal biomedical texts, but not
for noisy texts from sources such as social media and electronic health
records. Due to the presence of non-standard expressions, misspellings,
and abbreviations, it is not possible to capture all possible concept variants
in noisy texts using manual or traditional lexicon expansion approaches.
The number of lexical variants of a given concept that are used, although
finite, cannot be predetermined. Biomedical concepts, such as symptoms
and medications, are specifically likely to be misspelled compared to non-
biomedical concepts (Soualmia et al., 2012; Zhou et al., 2015). Despite
advances in machine learning sequence labeling approaches, which
typically outperform lexicon-based approaches and can detect inexact
concept expressions, lexicon-based approaches are frequently used in
biomedical research. This is partly because machine learning
methods require manually annotated datasets, which may be time-
consuming and expensive to create, and training and executing state-of-
the-art machine learning approaches may require technical expertise and
high-performance computers, which may not be available. In this article,
we describe an unsupervised lexicon expansion system (LexExp), which
automatically generates many lexical variants of expressions encoded in a
lexicon.
2 Materials and methods
LexExp builds on recent studies, including our own, which utilize the
semantic similarities captured by word2vec-type dense-vector models
(Mikolov et al., 2013) to automatically identify similar terms and variants
(Percha et al., 2018; Sarker and Gonzalez-Hernandez, 2018; Viani et al.,
2019). LexExp employs customizable threshold-decay functions
(constant, linear, cosine, exponential), combined with dense-vector and
lexical similarities, to generate many variants of lexicon entries. Similarity
thresholding for determining lexical matches is popular (Fischer, 1982);
most approaches apply static thresholding while some recent studies have
attempted to employ dynamic thresholding for misspelling correction or
generation (Savary, 2002; Sarker and Gonzalez-Hernandez, 2018).
However, there is no existing tool that enables the use of customizable
thresholding options for these tasks. Note that the objective of LexExp is
to generate lexical variants of multi-word expressions, not lexically
dissimilar semantic variants (e.g., synonyms).
Given a lexicon entry, LexExp first generates word n-grams (n = 1 and 2)
from the entry. For each n-gram within the entry, a dense embedding
model is used to retrieve the n most semantically similar words/phrases using
cosine similarity, if the n-gram is present in the model. Next, all the
words/phrases whose semantic similarities with the n-grams are higher
than a threshold are included as candidate variants. For each candidate, its
Levenshtein ratio is computed against the original n-gram and a separate
threshold for lexical similarity (t) is applied. All candidates below the
threshold are removed from the list of possible variants. The same process
is applied recursively on each remaining candidate until no new variants
with similarity above t are found. While we used our own embedding
model (Sarker and Gonzalez, 2017), any embedding model can be used for
identifying semantically similar terms in LexExp.
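The candidate generation steps described above can be sketched as follows. This is a minimal illustration, not the LexExp implementation: the function names, the toy neighbour table standing in for an embedding model, the particular normalisation used for the Levenshtein ratio, and the choice to compare every recursive candidate against the original n-gram are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

Neighbours = Callable[[str], List[Tuple[str, float]]]


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]. This normalisation is an assumption for the sketch."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein_distance(a, b)) / total


def word_ngrams(entry: str) -> List[str]:
    """Unigrams followed by bigrams of a whitespace-tokenised lexicon entry."""
    tokens = entry.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def expand(ngram: str, neighbours: Neighbours,
           sem_threshold: float, t: float) -> Set[str]:
    """Collect variants of `ngram`, re-querying each accepted candidate."""
    variants: Set[str] = set()
    frontier = [ngram]
    while frontier:
        term = frontier.pop()
        for candidate, similarity in neighbours(term):
            if similarity <= sem_threshold:
                continue  # not semantically close enough
            if candidate == ngram or candidate in variants:
                continue  # already known
            # Assumption: lexical similarity is measured against the original n-gram.
            if levenshtein_ratio(ngram, candidate) >= t:
                variants.add(candidate)
                frontier.append(candidate)
    return variants


# Hypothetical neighbour table standing in for an embedding model lookup.
TOY_NEIGHBOURS: Dict[str, List[Tuple[str, float]]] = {
    "sensitive": [("sensitve", 0.81), ("sensetive", 0.78), ("tender", 0.74)],
    "sensitve": [("sensistive", 0.72), ("sensitive", 0.81)],
}


def toy_neighbours(term: str) -> List[Tuple[str, float]]:
    return TOY_NEIGHBOURS.get(term, [])


found = expand("sensitive", toy_neighbours, sem_threshold=0.7, t=0.8)
```

In this toy run, 'tender' passes the semantic threshold but is rejected by the lexical threshold, which mirrors the stated goal of keeping spelling variants rather than synonyms. The variant 'sensistive' is only reached through the recursive step on 'sensitve'.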
2.1 Lexical similarity thresholding functions
LexExp provides the user with four functions that can be used to vary t
based on the character lengths of the input n-grams. Typically, longer
terms/phrases have true variants that are lexically more distant from the
original entry. So, adjusting t based on the length of an expression may
lead to better precision and/or recall.
LexExp provides four possible
functions that may be used to vary t:
Static: t = ti
where m and n are constants, ti is the initial threshold, and tl is the lower
bound for t. Figure 1 illustrates how these thresholding methods vary for
expressions of length 1–30 characters. These thresholding functions are
carefully designed to provide the user with flexibility to vary them as per
the needs of a task.
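To make the idea of length-dependent thresholds concrete, the sketch below defines four decay functions of the kinds named in the text (static, linear, cosine, exponential). Only the static form t = ti is taken from the text; the other three formulas, and the way the constants m and n enter them, are illustrative assumptions and should not be read as the equations used by LexExp.

```python
import math


def static_threshold(length: int, t_i: float) -> float:
    """Static: t = t_i regardless of expression length (as stated in the text)."""
    return t_i


def linear_threshold(length: int, t_i: float, t_l: float, m: float) -> float:
    """Assumed form: decrease by m per character, never going below t_l."""
    return max(t_l, t_i - m * length)


def cosine_threshold(length: int, t_i: float, t_l: float, n: int) -> float:
    """Assumed form: half-cosine decay from t_i to t_l over n characters."""
    phase = min(length, n) / n
    return t_l + (t_i - t_l) * 0.5 * (1.0 + math.cos(math.pi * phase))


def exponential_threshold(length: int, t_i: float, t_l: float, m: float) -> float:
    """Assumed form: exponential decay from t_i towards t_l at rate m."""
    return t_l + (t_i - t_l) * math.exp(-m * length)


# Hypothetical parameter values, chosen only to show the shapes.
T_I, T_L = 0.9, 0.6
lengths = range(1, 31)
linear = [linear_threshold(k, T_I, T_L, m=0.01) for k in lengths]
cosine = [cosine_threshold(k, T_I, T_L, n=30) for k in lengths]
exponential = [exponential_threshold(k, T_I, T_L, m=0.1) for k in lengths]
```

Under these assumed forms, every function starts at or near ti for short expressions and relaxes towards tl as expressions get longer, which is the behaviour the text motivates: longer expressions tolerate variants that are lexically more distant.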
2.2 Multi-word variants
A key functionality of LexExp is its ability to generate variants for multi-
word expressions. Capturing variants of multi-word expressions
comprehensively is particularly challenging via manual annotation since
the number of possible word combinations can be very high. Also, phrase
embedding models cannot capture the semantics of long multi-word
expressions due to the sparsity of their occurrences.
For one-word expressions, only unigrams are generated.
Note that in this context, recall and precision are both ill-defined.
Recall—because there is no known bound for the total number of
variants; Precision—because the set of true variants depends on the
LexExp uses two functions for generating multi-word variants. The first
is a unigram variant generation function that generates variants for each
word based on a specific value of t, and then generates all combinations
of the original expression based on the variants identified, keeping the
ordering of the variants unchanged. Examples of variants generated by this
function are shown below:
Original expression: eyes were excruciatingly sensitive and sore
1: eyes were excrusiatingly sensistive and sore
2: eyes were excruciatingly sensitve and sore
3: eyes were excrusiatingly sensitive and sore
4: eyes were excrusiatingly sensetive and soer
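A minimal sketch of the combination step, assuming the per-word variants have already been generated: each word is replaced by a list of options (the word itself plus its variants), and the Cartesian product of those lists yields every combination with word order preserved. The function name and the variant table are hypothetical and only reuse the misspellings from the example above.

```python
from itertools import product
from typing import Dict, List


def combine_unigram_variants(expression: str,
                             variants: Dict[str, List[str]]) -> List[str]:
    """Return all order-preserving combinations of word variants,
    excluding the original expression itself."""
    options = []
    for token in expression.split():
        # Original word first, then its variants, de-duplicated in order.
        choices = list(dict.fromkeys([token] + variants.get(token, [])))
        options.append(choices)
    combinations = (" ".join(words) for words in product(*options))
    return [c for c in combinations if c != expression]


# Hypothetical variant table built from the example misspellings.
example_variants = {
    "excruciatingly": ["excrusiatingly"],
    "sensitive": ["sensistive", "sensitve", "sensetive"],
    "sore": ["soer"],
}

results = combine_unigram_variants(
    "eyes were excruciatingly sensitive and sore", example_variants)
```

The number of combinations is the product of the option counts per word, so it grows quickly with expression length; this is one reason the threshold t matters for controlling output size.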
The second is a bigram generation function, which first tokenizes the
expressions into bigrams, then generates variants of the bigrams. These
variants may be unigrams or multi-grams (e.g., stomach ache: stomachache, mild
stomach ache). After the variants are generated, they are tokenized to
unigrams and then all combinations of all unigrams are generated as
described before. Recombining the bigrams following the generation of
the variants can be complicated in some cases, as a term and its partial
variant may both be present in a combination. For example,
‘heartburn’, a unigram variant of ‘heart burning’, may be followed by ‘burn
after’ (a variant of ‘burning after’) when the initial combination is
generated. LexExp attempts to resolve these using a simple forward and
backward pass through the list of words, removing all words identical to
or substrings of the next/previous one.
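One possible reading of this clean-up step is sketched below. The function names are hypothetical, and the exact comparison rules used by LexExp are not given in the text, so this should be treated as an assumption-laden illustration: the forward pass drops a word that is identical to, or a substring of, the word that follows it, and the backward pass does the same with respect to the preceding word.

```python
from typing import List


def forward_pass(words: List[str]) -> List[str]:
    """Drop each word that is identical to, or a substring of, the next word."""
    kept = []
    for i, word in enumerate(words):
        if i + 1 < len(words) and word in words[i + 1]:
            continue
        kept.append(word)
    return kept


def backward_pass(words: List[str]) -> List[str]:
    """Walk from the end; drop each word contained in the word before it."""
    kept = []
    for i in range(len(words) - 1, -1, -1):
        if i > 0 and words[i] in words[i - 1]:
            continue
        kept.append(words[i])
    kept.reverse()
    return kept


def drop_redundant_words(words: List[str]) -> List[str]:
    """Apply the forward pass and then the backward pass."""
    return backward_pass(forward_pass(words))


# 'heart burning after eating' -> bigram variants 'heartburn' + 'burn after' ...
cleaned = drop_redundant_words(["heartburn", "burn", "after", "eating"])
```

Because the check is purely string-based, short words that happen to be substrings of a neighbour (for example 'in' next to 'pain') would also be removed under this reading, which is consistent with the text describing the approach as simple and as an attempt rather than a guarantee.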
We ran LexExp on multiple lexicons, including lexicons of COVID-19 symptoms
(Sarker et al., 2020) and adverse drug reactions (Sarker and
Gonzalez, 2015), a subset of the consumer health vocabulary (Zeng and Tse,
2006), and psychosis symptoms from electronic health records (Viani et
al., 2019). We also compared tweet retrieval numbers for COVID-19
symptom-mentioning tweets using the COVID-19 symptom lexicon with and
without variants, and observed an increase of 16.6% when the variants
were included. Further details about
these experiments are provided in the supplementary material. As a
lexicon expansion system, the purpose of LexExp is not to obtain perfect
accuracy—in fact, accuracy is not well-defined for this generation task.
The objective, instead, is to automatically generate large sets of possible
variants that can be readily used by human experts for information
retrieval and extraction.
Research reported in this publication was supported by NIDA of the NIH under
award number R01DA046619. The content is solely the responsibility of the
authors and does not necessarily represent the official views of the NIH.
Conflict of Interest: none declared.
Our initial intent was to rapidly expand a lexicon of COVID-19
symptoms, but the developed system is useful for tasks beyond this initial
Aronson, A. R. and Lang, F.-M. (2010) ‘An overview of MetaMap:
historical perspective and recent advances’, Journal of the American
Medical Informatics Association: JAMIA. American Medical
Informatics Association, 17(3), pp. 229–36. doi:
Demner-Fushman, D. and Elhadad, N. (2016) ‘Aspiring to Unintended
Consequences of Natural Language Processing: A Review of Recent
Developments in Clinical and Consumer-Generated Text Processing’,
IMIA Yearbook, (1), pp. 224–233. doi: 10.15265/IY-2016-017.
Fischer, R.-J. (1982) ‘A Threshold Method of Approximate String
Matching’, in. Springer, Berlin, Heidelberg, pp. 843–849. doi:
Ghiassi, M. and Lee, S. (2018) ‘A domain transferable lexicon set for
Twitter sentiment analysis using a supervised machine learning
approach’, Expert Systems with Applications. Elsevier Ltd, 106, pp. 197–
216. doi: 10.1016/j.eswa.2018.04.006.
McCray, A. T. et al. (1993) ‘UMLS® knowledge for biomedical
language processing’, Bulletin of the Medical Library Association. Bull
Med Libr Assoc, 81(2), pp. 184–194.
Mikolov, T. et al. (2013) ‘Distributed Representations of Words and
Phrases and their Compositionality’, NIPS, pp. 1–9. doi:
Percha, B. et al. (2018) ‘Expanding a radiology lexicon using contextual
patterns in radiology reports’, Journal of the American Medical
Informatics Association. doi: 10.1093/jamia/ocx152.
Rebholz-Schuhmann, D. et al. (2013) ‘Evaluating gold standard corpora
against gene/protein tagging solutions and lexical resources’, Journal of
Biomedical Semantics, 4(1), p. 28. doi: 10.1186/2041-1480-4-28.
Sarker, A. et al. (2020) ‘Self-reported COVID-19 symptoms on Twitter:
An analysis and a research resource’, medRxiv. Cold Spring Harbor
Laboratory Press, p. 2020.04.16.20067421. doi:
Sarker, A. and Gonzalez-Hernandez, G. (2018) ‘An unsupervised and
customizable misspelling generator for mining noisy health-related text
sources’, Journal of Biomedical Informatics, 88. doi:
Sarker, A. and Gonzalez, G. (2015) ‘Portable automatic text
classification for adverse drug reaction detection via multi-corpus
training’, Journal of Biomedical Informatics, 53. doi:
Sarker, A. and Gonzalez, G. (2017) ‘A corpus for mining drug-related
knowledge from Twitter chatter: Language models and their utilities’,
Data in Brief. doi: 10.1016/j.dib.2016.11.056.
Savary, A. (2002) ‘Typographical nearest-neighbor search in a finite-
state lexicon and its application to spelling correction’, in Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics). Springer Verlag, pp.
251–260. doi: 10.1007/3-540-36390-4_21.
Savova, G. K. et al. (2010) ‘Mayo clinical Text Analysis and Knowledge
Extraction System (cTAKES): architecture, component evaluation and
applications’, Journal of the American Medical Informatics Association:
JAMIA, 17(5), pp. 507–513. doi: 10.1136/jamia.2009.001560.
Shivade, C. et al. (2014) ‘A review of approaches to identifying patient
phenotype cohorts using electronic health records’, Journal of the
American Medical Informatics Association. American Medical
Informatics Association, 21(2), pp. 221–230. doi: 10.1136/amiajnl-2013-
Soualmia, L. F. et al. (2012) ‘Matching health information seekers’
queries to medical terms’, BMC Bioinformatics. BioMed Central,
13(Suppl 14), p. S11. doi: 10.1186/1471-2105-13-S14-S11.
Viani, N. et al. (2019) ‘Generating positive psychosis symptom
keywords from electronic health records’, in Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics). Springer Verlag, pp. 298–303. doi:
Zeng, Q. T. and Tse, T. (2006) ‘Exploring and developing consumer
health vocabularies’, Journal of the American Medical Informatics
Association. doi: 10.1197/jamia.M1761.
Zhou, X. et al. (2015) ‘Context-Sensitive Spelling Correction of
Consumer-Generated Content on Health Care’, JMIR Medical
Informatics. JMIR Publications Inc., 3(3), p. e27. doi: