BeagleTM: An adaptable text mining method
for relationship discovery in literature
Oliver Bonham-Carter
Allegheny College,
Department of Computer Science,
Meadville, PA 16335, USA,
obonhamcarter@allegheny.edu
www.cs.allegheny.edu/sites/obonhamcarter/
Abstract. Investigators in bioinformatics are often confronted with the
difficult task of connecting ideas, which are found scattered around the
literature, using robust keyword searches. It is customary to identify
only a few keywords in a research article to facilitate search algorithms,
and this is usually done in the absence of a general approach that
would index all possible keywords describing an article’s characteristic
attributes. Based on only a handful of keywords, articles are therefore
prioritized by search algorithms that point investigators to what are merely
subsets of the available knowledge. In addition, many articles escape
search strategies because their keywords were vague or have
become unfashionable terms. In this case, the article, as well as the
knowledge it contains, may be lost to the community.
Owing to the growing size of the literature, we introduce a text mining
method and tool, BeagleTM, for knowledge harvesting from papers in
a literature corpus without the use of article meta-data. Unlike other text
mining tools that only highlight found keywords in articles, our method
allows users to visually ascertain which keywords have been featured in
studies together with others in peer-reviewed work. Drawing from an
arbitrarily-sized corpus, BeagleTM creates visual networks describing
interrelationships between user-defined terms to facilitate the discovery
of connected or parallel studies.
We report the effectiveness of BeagleTM by illustrating its ability to
connect keywords for types of PTMs (post-translational modifications),
stress-factors, and disorders according to their relationships. These
relationships facilitate the discovery of connected studies, which are
often challenging to find because the keywords tied to relevant articles
containing this type of information are frequently unrelated.
Keywords: text mining, literature analysis, relationship networks, re-
lationship models
1 Introduction
When performing a literature review using searching algorithms, locating articles
is difficult due to the obscure nature of the keywords necessary. For any project,
2 Bonham-Carter
one must supply the literature’s search engines with keywords that are closely
associated with one’s research. Keywords are carefully selected to isolate
relevant works; however, three general problems are inherent to locating
knowledge through the non-uniform links provided by the diverse
authors of the articles.
The first major problem is that the terms for the desired information
may not follow a popular keyword-naming convention. Some authors appear
to choose words that are no longer current in their fields.
Their articles are hence found with other non-contemporary research due to
an antiquated use of language. In other cases, authors of seminal articles may
invent their own terms to describe the details of their work. This implies that
one must know an exact term or a particular usage of word(s), according to spe-
cific authors, to locate their articles. Simultaneously, other research teams may
be working along similar research themes, yet use entirely different keywords to
put their work in a scientific context. Therefore, when locating articles across
different researchers, multiple sets of specific keywords must be applied to search
engines to retrieve articles from a particular area of research.
The second general problem concerns the growth that many research areas
enjoy as a result of their popularity. As an area of research evolves, some of
its terminology, including keywords, may gradually be replaced by others as a
consequence. The natural evolution of a research field may cause disconnections
between current and former work; to continue to locate new developments,
one must know the modernized keywords in addition to the former
ones. Within a single field of research, a widening gap already emerges
between different generations of knowledge as the field evolves: if
only former keywords are submitted to search engines, then one may find only
the former research while the latest literature remains undiscovered. In Figure 1, we use
terms from network research to illustrate the phenomenon caused by keyword
obsolescence, as a result of field evolution.
The third general problem of searching for articles by keywords is that,
although particular knowledge is likely to exist in the literature, specific insights
may reside in articles tagged with entirely different keywords. For instance,
a particular fact or detail may be briefly mentioned in one or several articles
whose associated keywords are irrelevant to one’s research interests. The
discovery of knowledge in seemingly unrelated articles is occasional and
unpredictable. Often, the researcher must read numerous articles with diverse
keywords to discover pieces of valuable knowledge. One becomes familiar with the
articles of many alternative areas of research, from which one derives header and
footnote knowledge to weave into parts of one’s informed literature review.
In bioinformatics, the keywords of articles often do not describe the total
wealth of information contained within the article. For instance, a gene or a
protein may often be discussed in many types of articles without any formal
keyword(s) declaring its inclusion in a particular article. This
lack of information requires one to be familiar with many types of articles,
in addition to those which directly concern one’s field of research. For instance,
BeagleTM: Relationship Discovery 3
Fig. 1. Forgotten networks; The drifting and replacement of keywords in research
fields may leave articles “locked” outside of the reach of the general research
community. Often, articles of non-contemporary research are still relevant to the
field, yet are associated with keywords which no longer carry the same meaning for
contemporary investigators. For example, shown are the keywords concerning a large,
popular computer network whose terms have changed several times during the evolution
of its own study. We note that “the Information Superhighway” (popular during the
1990s), “the Cloud” (popular during the 2010s) and the “Internet of Things” (popular
as of this writing) still share much of the same foundational knowledge but may be
mistaken for different areas of research.
an investigator in bioinformatics would have to include the entire corpus of the
supporting disciplines (e.g., biology, computer science, mathematics and others)
in order to have a chance of discovering a discussion of an interesting gene
or protein. Unfortunately, finding relevant knowledge in the articles is frustrated
by the fact that the name of the gene or protein may never appear in the article’s
keyword rubric or metadata.
Ontologies [1] have gained much popularity as they are able to bridge gaps
between alternative concepts. However, convenient as they may be for connecting
articles in the literature, the search for knowledge is still likely to depend on
author-selected keywords. These keywords may be poorly selected and/or may have
fallen into disuse in their fields, resulting in the loss of relevant articles in
literature reviews. During the evolution of research in bioinformatics, as well as
other disciplines, new keywords are constantly being created, which causes further
searching difficulties.
The published work in biomedical research, which makes up seemingly
all biomedical knowledge, is written by people who apply diverse styles and
language; it is highly unstructured and is expanding at an astounding rate. During
this expansion, these differing styles introduce noise into the
literature. In the health sciences, biomedical knowledge may be located in noisy sets
of data, and it is only by the use of computer-driven technology that meaningful
information may be harvested.
To help them locate and access the knowledge hidden in a corpus, researchers
turn to refining and exploring text mining algorithms [2], techniques,
and tools [3], [4]. Text mining algorithms and tools have been developed for
specific types of research in bioinformatics, as discussed in [5].
In this article, we present a text mining framework and method to assist
investigators in locating articles while countering the three
general problems discussed above, which would otherwise continue to inhibit
knowledge discovery. We have applied this method to create a tool called
BeagleTM, which performs text mining for researchers without being limited to
the sometimes cryptic keywords of article meta-data. Our method processes the
abstract sections of articles, provided by PubMed, to locate information
that may be aggregated with that of other articles to infer relationships, which
are output as visual networks. In this article, we also provide the details of
our method and discuss how selected keywords are used to drive the network
creation system. Finally, we explain how to read the resulting relationship networks
to obtain knowledge of the connections from the processed articles.
1.1 Text Processing
The text of an article’s abstract relates the goals and context of the information
in an article. Since there is limited space in abstract sections, this text is often
written exactly and unambiguously, and it is likely to be a better source of
information than what is inferred using keywords alone. Our approach is to
process the abstracts of the corpus articles to connect the words and concepts
to those of abstracts in other articles. The user inputs chosen terms into our
algorithm, which are then used to create focus points during the process to
create customized networks that describe relationships.
When multiple keywords are found together in the same abstract, one may
assume that the keywords have some common thread that runs between them to
connect them. In this article, we propose a text mining method called BeagleTM
which permits the investigator to locate the common threads of multiple
keywords found in article abstracts of the PubMed literature. Furthermore, our method
creates relationship networks (i.e., graphical visualizations) of keywords to
visually describe how they are associated with each other according to peer-reviewed
articles of the literature. In Section 1.2 we discuss our approach to text mining
and our method’s ability to link terms.
Text mining tools have been successfully applied to extract information for
convenient use (text summarization, document retrieval), assess document
similarity (document clustering, key-phrase identification), extract structured
information (entity extraction, information extraction) [6], and extract
information from social media [7]. Additionally, text mining tools exist as plug-ins
or libraries for programming languages, such as TM [8] and Rattle [9], and as
on-line tools such as [10].
While text mining tools fulfill important needs for the bioinformatics
community, they are generally hosted by web sites, and their automation in pipelines
may be problematic. Furthermore, many of these tools find specific details from
particular articles and do not infer associations between search terms. Tools such
as PubTator [11], PIE The Search [12] and Meshable [13] are useful for bringing
articles to the attention of the investigator where keywords have been highlighted;
however, in this task, automation is bottlenecked because the researcher must
manually process the results.
1.2 A Text Mining Approach By BeagleTM
It is already clear that a major challenge in bioinformatics includes the
management of large volumes of data [14]. Text mining methods that simply highlight
keywords are, therefore, not likely to be fully beneficial. The tool implementing our
method was written in Python and is able to handle a corpus containing an arbitrary
number of articles since BeagleTM processes each article separately. BeagleTM
processes an entire corpus of articles available for download from PubMed [15]
(maintained by the National Library of Medicine) to track and group articles
in which specific user terms of interest occur. Although our method and tool provide
graphical representations of the keyword relationships, we stress that the details
regarding any causality behind the relationships shown must be further explored by the
researcher.
Our method centers around the notion that connected terms likely indicate
some form of shared context. Shown in Figure 2 is a link between two terms
(in a model) indicating that a peer-reviewed study exists to connect them. In
this example, both keywords were found in the same abstract to suggest that
they share some common context. Using networks to describe relationships, re-
searchers may determine connections between keywords across disjoint articles
to gain knowledge of context.
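The co-occurrence idea described above can be illustrated with a minimal Python sketch. This is not the BeagleTM source code; the function names, the whole-word matching rule, and the sample abstract and keyword list are all assumptions made for illustration. Every pair of keywords found in the same abstract yields one candidate edge.

```python
import itertools
import re


def keywords_in_abstract(abstract, keywords):
    """Return the sorted keywords that occur as whole words in the abstract."""
    found = set()
    for kw in keywords:
        # Case-insensitive, whole-word match; re.escape protects special characters.
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, abstract, flags=re.IGNORECASE):
            found.add(kw)
    return sorted(found)


def cooccurrence_edges(abstract, keywords):
    """Return one (keyword, keyword) edge per pair found in the same abstract."""
    found = keywords_in_abstract(abstract, keywords)
    return list(itertools.combinations(found, 2))


# Hypothetical abstract and keyword list, for illustration only.
abstract = "Oxidative stress alters SOD1 acetylation in neurons."
keywords = ["SOD1", "acetylation", "oxidative stress", "hypoxia"]
edges = cooccurrence_edges(abstract, keywords)
```

In this sketch an edge only records that two terms share an abstract; as stressed above, the nature of the relationship still has to be read from the article itself.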
In Figure 3, we illustrate that two keywords (nodes) are related (described
by an edge), according to the literature. The determination of how keywords
are related to others is shown visually in a plot that we call a relationship
network. In this example, three types of keywords are illustrated: PTMs
(post-translational modifications), stresses, and the types of proteins that are
featured in the same studies. We chose these terms for this
discussion since they are often featured in articles, yet their discussions and details
are elusive since these terms are not generally listed as keywords in the articles
where they play prominent roles.
Fig. 2. Existing knowledge in the peer-reviewed literature; Two terms which
share an edge signify that there exists a peer-reviewed article to corroborate their
association by some scientific pursuit. Here, the edge symbolizes that both keywords
were found in the same abstract to suggest that they likely share some common context.
Fig. 3. Building networks; A summary of how BeagleTM builds relationship
networks from text mining articles across the literature. The terms PTM, stress and
(associated) proteins are often of interest in research but their presence is seldom
announced by the keywords of the articles in which they are featured.
1.3 Relationship Models
A relationship model in our work is an overarching summary of an article’s con-
tents using visual cues to show how the keywords (according to one’s selection)
are connected to others (also user-selected) across the articles of the literature.
Although these models could be used to describe the associations between any
types of user-selected keywords, we give an example in Figure 3 of the rela-
tionships between proteins, stresses and PTMs, according to scientific citations
featured by NCBI’s PubMed server (https://www.ncbi.nlm.nih.gov/pubmed/).
The plots resemble network-models where their information has been taken from
the abstracts of articles and show that particular proteins have been connected
to types of stresses and PTMs. In other words, the occurrences of keywords in
the articles, as well as their relationships to other user-specified terms (shown
by edges) are displayed using a network model. These graphics illustrate links
between keywords to provide investigators with a way to quickly determine how
elements of their projects are related in a guilt-by-association manner.
Furthermore, we note that an edge connecting one keyword to another
symbolizes a study in which both keywords have been found. Any edges out to
other keywords suggest that the studies may themselves be related with regard to
the context of their work.
Links Between Alzheimer’s Disease, Tau Proteins and PTMs. To
describe our method, we used keywords from the literature to create relationship
networks manually. In the work of Marcelli et al. [16], the authors set the stage for a
discussion of age-related neurodegenerative disorders (i.e., Alzheimer’s disease)
where the primary actors are a set of proteins (i.e., APP, Aβ, Tau and BACE1)
which are involved with the ailment. In their article, the interactions between the
proteins, a set of PTMs (i.e., ubiquitination, phosphorylation, SUMOylation,
acetylation and nitrosylation) and stresses are explored. In particular, we are
Fig. 4. Manually-produced relationship network; We describe a guilt-by-
association scenario between proteins which are linked to neurodegenerative disorders
such as Alzheimer’s disease, and PTMs. By “Functional Groups” we imply that these
proteins are likely involved with the disorder. This model was created using the key-
words inherent to the article and work of Marcelli et al. [16]. The actual details to
explain the relationship between terms are not contained in this model and must be
obtained from the original articles.
interested to know which PTMs are linked to proteins and stresses, according
to Marcelli et al. The manually produced relationship model that follows from
the authors’ work is shown in Figure 4. The models created by BeagleTM
describe the connections between the predefined terms automatically, and the
exact causality behind this relationship must be explored in the article by
Marcelli et al. Observing the created plot may help researchers to determine the
relevance of an article at a glance.
2 Methods
Once the keywords have been defined for a text mining operation, BeagleTM
scans all abstracts of the corpus to find their occurrence across articles as shown
in Figure 5. A relationship implies that these keywords are relevant to a study
where they play a role. Several keywords found together in an article likely
signify a central theme that binds them together. In bioinformatics, for example,
learning that a particular protein and a stress have been found in the same article
is likely to suggest that the protein has been studied in some context of the
stress. If a type of PTM is also mentioned in the text, then there is reason to
suggest that it may be a part of the stress response for the protein.
If several distinct articles are uncovered where these same keywords are found
Fig. 5. Flowchart; The abstracts of each article are individually parsed for pre-selected
keywords. The results of this parsing are organized in specific database tables.
together, or are linked by edges in relationship networks, then this, again, may
point to a deeper relation or, perhaps, a common mechanism.
In general, by studying keywords and how they are found in the literature, we
may have strong evidence to suggest that they share some form of a relationship.
Although further exploration is necessary to determine the exact details of the
discovered relationship, this is not a limitation because all relationships, no mat-
ter their strengths, may be important parts of a rigorous review. Furthermore,
many discoveries have been suggested by simple guilt-by-association scenarios.
BeagleTM is built from open source software (Python, https://www.python.org/,
and SQLite3, https://www.sqlite.org/). Our method provides convenient
customization and its output has been specifically formatted to create input files
for populating a SQLite3 database. In time, we plan to release the tool’s source
code to the community.
The corpus data was obtained from PubMed [15], maintained by the National
Library of Medicine (ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/), and is their most
recent compilation at the time of this writing. Uncompressed, there are about
134 GB of articles to process (over 1.7 million articles) that are saved in an
nxml format. One of the hardships of text mining is that each keyword must
be queried in the text of each article of the corpus. Approaches similar to that of
the TM package for R [8], where the entire corpus is loaded into memory for an
operation, are undesirable for working with PubMed due to its size and growth.
Our method avoids the problem of expanding data sets and memory
limitations by loading each individual file to parse for keywords, then adding any
results to a central database. This operation is repeated each time the list of
keywords changes. We note that working with such an enormous number of files
may introduce bottlenecks; however, our analysis was completed on solid-state
drives, which enabled high performance without compromising the execution
time. In addition, any time lost during text processing is recovered when
the database programming (discussed in Section 2.1) is applied.
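The file-at-a-time strategy described above might look like the following sketch, which is not the BeagleTM implementation. It assumes that each nxml file carries its PMID in an article-id element with pub-id-type="pmid" and its abstract in an abstract element; the function names and the sample file are hypothetical. Because the generator yields one article at a time, memory use does not grow with the size of the corpus.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def read_article(path):
    """Parse one nxml file; return (pmid, abstract_text) or None if either is missing."""
    root = ET.parse(str(path)).getroot()
    pmid_node = root.find(".//article-id[@pub-id-type='pmid']")
    abstract_node = root.find(".//abstract")
    if pmid_node is None or pmid_node.text is None or abstract_node is None:
        return None
    # Flatten nested markup (e.g. italics) and normalize whitespace.
    text = " ".join("".join(abstract_node.itertext()).split())
    return pmid_node.text.strip(), text


def iter_corpus(corpus_dir):
    """Yield (pmid, abstract_text) for each nxml file, one file at a time."""
    for path in sorted(Path(corpus_dir).rglob("*.nxml")):
        record = read_article(path)
        if record is not None:
            yield record


# A tiny made-up article written to a temporary directory for demonstration.
SAMPLE = """<article><front><article-meta>
<article-id pub-id-type="pmid">00000001</article-id>
<abstract><p>SOD1 responds to <italic>oxidative stress</italic>.</p></abstract>
</article-meta></front></article>"""

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.nxml").write_text(SAMPLE, encoding="utf-8")
    records = list(iter_corpus(tmp))
```

Files that lack a PMID or an abstract are simply skipped in this sketch; a production tool would probably log them instead.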
Since each article from NCBI has a unique PMID number (identification
reference for PubMed citations) that acts as a primary key for the database
programming element of the method (discussed in Section 2.1), all encountered
keywords are recorded with the PMID number. The associations between keywords
are made by connecting their sources by these PMID numbers. Our method
connects these keywords to each other by finding the intersections according to
PMIDs and uses databases to manage this task.
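The intersection by PMID can be sketched in a few lines of Python. The tool itself delegates this task to a database, so the in-memory index below is only an illustration of the underlying set logic; the function names and the (pmid, keyword) pairs are invented for the example.

```python
from collections import defaultdict


def pmids_by_keyword(hits):
    """Map each keyword to the set of PMIDs whose abstracts mention it."""
    index = defaultdict(set)
    for pmid, keyword in hits:
        index[keyword].add(pmid)
    return index


def shared_pmids(index, first, second):
    """Return the PMIDs in which both keywords were found (set intersection)."""
    return index.get(first, set()) & index.get(second, set())


# Hypothetical (pmid, keyword) pairs, as they might be recorded during parsing.
hits = [
    ("1001", "SOD1"),
    ("1001", "oxidative stress"),
    ("1002", "SOD1"),
    ("1002", "acetylation"),
    ("1003", "oxidative stress"),
]
index = pmids_by_keyword(hits)
common = shared_pmids(index, "SOD1", "oxidative stress")
```

A non-empty intersection corresponds to an edge between the two keywords, and its members are the articles that support that edge.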
2.1 Database Support
Each article of the PubMed corpus is in an nxml format, as shown in Figure 6.
From this meta-data, our method is able to determine particular types of data
to be included in the database to discover connections by automatic queries.
For this task, we used SQLite3, which was chosen for its simplicity, power, open
source nature, and for the fact that an entire database may be stored as a single
file with which our tool works. SQLite3 also provides a convenient way to set up
and import data from BeagleTM processing by simple scripts to build the tool’s
database on seemingly any hardware. Our tool extracts information about each
occurrence of a keyword as it is encountered in an article (i.e., PMID number,
article references, the occurrence number, and its associated blurb of text) to be
inserted into the database as shown in Figure 7.
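As a rough illustration of the per-keyword details listed above (the occurrence number and the associated blurb of text), the following sketch counts the matches of a keyword in an abstract and keeps a short window of text around each match. The window size, the matching rule, and the function name are assumptions; the paper does not specify how BeagleTM delimits a blurb.

```python
import re


def keyword_occurrences(abstract, keyword, context=30):
    """Return (count, blurbs): the number of matches and a text window per match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    blurbs = []
    for match in pattern.finditer(abstract):
        # Keep up to `context` characters on each side of the match.
        start = max(0, match.start() - context)
        end = min(len(abstract), match.end() + context)
        blurbs.append(abstract[start:end].strip())
    return len(blurbs), blurbs


# Hypothetical abstract text, for illustration only.
abstract = "Hypoxia induces acetylation of SIRT3. Loss of acetylation was reversible."
count, blurbs = keyword_occurrences(abstract, "acetylation")
```

Each (PMID, keyword, count, blurb) tuple produced this way could then be written to an import file for the database.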
The database is used to perform stringent queries to locate keywords and
associated data according to matching PMID numbers. We have customized
our tool and database to perform analyses of articles for PTMs, stresses and
protein keywords, as discussed above, to help explain its function. Our method
has six main tables, including: Functional (containing functional origins of types of
proteins), MitoSymbols, PTMGeneral (general PTM names such as acetylation or
phosphorylation) and PTMSpecific (containing specific types of PTMs which
are subsets of more general types of PTMs; for example, phosphotyrosine
is a type of phosphorylation). The Stress table concerns
the factors to which the proteins had been exposed, according to the articles.
In Table 1, we provide the details of two of these tables, and note that all tables
have a similar construction.
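To show how tables of this kind can be queried by matching PMID numbers, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. It is modeled on the Functional and Stress tables of Table 1 but is not the tool's actual database code. One deliberate assumption differs from Table 1: the sketch uses a composite key (PMID plus keyword) so that a single article may contribute several keywords to the same table. All rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Functional (
    pmid varchar NOT NULL,
    funct varchar NOT NULL,
    count integer NOT NULL,
    blurb text NOT NULL,
    journal text NOT NULL,
    PRIMARY KEY (pmid, funct)   -- assumption: composite key, unlike Table 1
);
CREATE TABLE Stress (
    pmid varchar NOT NULL,
    stress varchar NOT NULL,
    count integer NOT NULL,
    blurb text NOT NULL,
    journal text NOT NULL,
    PRIMARY KEY (pmid, stress)  -- assumption: composite key, unlike Table 1
);
""")

# Invented rows: (pmid, keyword, count, blurb, journal).
conn.executemany(
    "INSERT INTO Functional VALUES (?, ?, ?, ?, ?)",
    [("1001", "ageing", 2, "...ageing cells...", "J. Example"),
     ("1002", "Alzheimer's", 1, "...Alzheimer's...", "J. Example")],
)
conn.executemany(
    "INSERT INTO Stress VALUES (?, ?, ?, ?, ?)",
    [("1001", "ROS", 1, "...ROS levels...", "J. Example"),
     ("1001", "hypoxia", 1, "...under hypoxia...", "J. Example"),
     ("1003", "toxin", 1, "...toxin exposure...", "J. Example")],
)

# Keywords from the two tables are paired whenever they share a PMID.
pairs = conn.execute("""
    SELECT f.pmid, f.funct, s.stress
    FROM Functional AS f
    JOIN Stress AS s ON f.pmid = s.pmid
    ORDER BY f.pmid, s.stress
""").fetchall()
conn.close()
```

Only article 1001 appears in both tables here, so the join returns its functional keyword paired with each of its stresses.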
Fig. 6. Article meta-data; Each article is downloaded from NCBI as an nxml
formatted file. We note that BeagleTM parses each file for specific types of information
to be stored in its internal SQLite database. This information is shown by arrows (i.e.,
PMID, the title of the journal, the title of the article, and the abstract block). The text
of this abstract [17] is parsed for information relevant to the keywords.
Fig. 7. Extracted data; For each article where relevant terms are found, referential
details (such as the PMID, article sources, and the blurb of text in which the term
is found) are inserted into the database for further analysis with advanced queries.
2.2 Networks
Across all networks, edges between nodes signify that at least one peer-reviewed
study exists in which the keywords have been mentioned together in the same
study (i.e., the keywords share a common PMID number). Cliques in the
networks indicate that the keywords originated from the same abstract. To find the
associations of keywords bound by cliques, BeagleTM queries all keywords
having the same PMID number. In our tool, this output is then relayed to the
Table 1. Database schemas; Here we provide the SQLite3 code to create two of the
tables in our database. The creation code for the other tables is similar. The integrity
constraint NOT NULL was necessary to ensure that the relation for each article was
complete. PubMed’s PMID, an article reference number, was assigned by the NIH
National Library of Medicine and functions as a primary key.
CREATE TABLE Functional (
    pmid varchar PRIMARY KEY,
    funct varchar NOT NULL,
    count integer NOT NULL,
    blurb text NOT NULL,
    journal text NOT NULL
);
CREATE TABLE Stress (
    pmid varchar PRIMARY KEY,
    stress varchar NOT NULL,
    count integer NOT NULL,
    blurb text NOT NULL,
    journal text NOT NULL
);
NetworkX plotting tool [18] to create the relationship networks. In Section 3, we
will discuss the specific results of the networks that suggest interrelationships
according to our method. We note that sometimes when all these cliques are
shown together it may create some confusion when differentiating a particular
clique from another. To ascertain the members for a specific clique, including the
article PMIDs, one may consult the non-graphical data (i.e., the output provided
by BeagleTM, not shown) from which networks are made.
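The construction of cliques from shared PMID numbers, and the lookup of a clique's members from the non-graphical data, can be sketched with the standard library alone. This is an illustrative reconstruction rather than the tool's code; the function names and the (pmid, keyword) pairs are hypothetical. Each edge keeps the set of PMIDs that support it, so a specific clique can still be traced back to its article when the plot is crowded.

```python
from collections import defaultdict
from itertools import combinations


def build_network(hits):
    """Return {(keyword, keyword): set of PMIDs}, one clique per abstract."""
    keywords_by_pmid = defaultdict(set)
    for pmid, keyword in hits:
        keywords_by_pmid[pmid].add(keyword)
    edges = defaultdict(set)
    for pmid, kws in keywords_by_pmid.items():
        # Every pair of keywords from the same abstract is joined by an edge.
        for a, b in combinations(sorted(kws), 2):
            edges[(a, b)].add(pmid)
    return dict(edges)


def clique_members(hits, pmid):
    """List the keywords that belong to the clique of one article."""
    return sorted({kw for p, kw in hits if p == pmid})


# Hypothetical (pmid, keyword) pairs, for illustration only.
hits = [
    ("1001", "SOD1"), ("1001", "ROS"), ("1001", "acidosis"),
    ("1002", "SOD1"), ("1002", "ROS"),
]
network = build_network(hits)
members = clique_members(hits, "1001")
```

The keys of the resulting dictionary form an edge list that could be handed to a graph plotting library such as NetworkX for drawing.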
3 Results and Discussion
We obtained a listing of keywords taken from [19] and [20] to be used to demon-
strate the functionality of our method. A sample of these keywords is shown
in Table 2. We note that potentially any keywords may be used by researchers
for analysis with BeagleTM. We begin by discussing the relationship networks
that were created from our keywords. Some of the following networks have been
reduced for simplicity of discussion.
In Figure 8, each protein clique contains red circles that represent the article
PMIDs in which the other terms of the clique are found. Blue squares
represent PTMs, yellow pentagons represent ailment types, and green triangles
represent stress-factors. Since the types of keywords change from model
to model, the node assignments may also change. In Figures 9 and 10 (functional
cliques), the red circles, blue squares, yellow pentagons and green triangles
represent PMIDs, gene symbols, stress-types, and related ailments, respectively.
In Figures 11 (acetylation) and 12 (glycosylation), we note the cliques
describing networks where PMIDs, stress-factors, and disorders have been linked
by peer-reviewed literature to a single PTM. Since single PTMs have been
observed to operate in concert with others for the orchestration of diverse chaperone
functions [21], we note that networks concentrating on a single PTM may not be
completely informative. In this case, we suggest that networks describing terms
related to stress-factors, such as that of Figure 13, be studied to potentially
uncover cross-talking PTMs in network cliques.
Table 2. Sample keywords: Below are the four main rubrics for our curated
keywords, from which we built relationship networks. The total number of keywords is
current as of April 2018. We note that the number of terms increases in tandem with
the expansion of the PubMed corpus.
Rubric | Sample | Total keywords
Diseases-specific | acidosis, ageing, Alzheimer’s, apoptosis, arthritis, Crohn’s, diabetes, obesity, Parkinson’s and others | 46
Mt Gene Symbols | oat, pc, opa1, cs, mut, msra, phb, sod1, mtor, aldh2 and others | 619
PTMs (general types) | acetylation, glycosylation, methylation, oxidation, phosphorylation and others | 35
Stresses | hypoxia, oxidation, oxidative stress, ROS (reactive oxygen species), tolerance, toxin, unfolded protein response and others | 47
From the relationship networks, we examine cliques of proteins
that were linked to disorders and stresses according to the literature. The
keywords of a clique are found in the same abstract, denoted by the PMID of the
circle-nodes. In Figure 8, according to the literature, we note that the protein SOD1
has been linked to several diverse disorders such as Alzheimer’s, Parkinson’s, diabetes
and other neurodegenerative ailments, as discussed in [22] (PMID: 22384126).
There are also links to types of stresses, including ROS (reactive oxygen species),
oxidative stress, general stress, toxins and others, which have been introduced
by the articles. From this observation, we note that SOD1 is likely involved
with these disorders and stresses since there is at least one peer-reviewed study
in which these edge-connected nodes have been mentioned in the same
article.
From the network, we note the article by Milani et al. [23] (PMID: 23983902),
in which the authors discuss the involvement of oxidative damage induced by
ROS in Parkinson’s disease and amyotrophic lateral sclerosis. This suggests that
these actors play a role in the disorders. Furthermore, the authors study SOD1 for its
connection to NRF2, a transcription factor and master regulator of the expression
of many antioxidant/detoxification genes. With the exception of the NRF2
protein and the discussion of amyotrophic lateral sclerosis, this relationship to
SOD1 may readily be observed from the network itself. We reserve judgment on
NRF2 (an important neuroprotective protein in neurodegenerative diseases),
which may be deeply connected to the stresses and ailments in the network of
Figure 8.
Also in Figure 8, we explore the article (PMID: 25998424) by
Collins et al. [24] and we note that the keywords of the network, {acidosis,
oxidation, acetylation and SOD1}, share a commonality. According to the article,
our network infers that the actors of the article, acidosis, SOD1 and ROS, share a
relationship. From the simplicity of the network, these relationships may be used
to determine that SOD1 can be related to the stress of ROS. In addition, since
the authors mention ROS specifically, we may infer from the discussion of redox
in the article that other oxidative stresses are likely to play roles. It is interesting to
note here that if, upon consultation of the article, there is no discussion about
oxidation, one may form a hypothesis that such a relationship may
eventually be discovered.
Fig. 8. Protein clique: Relationship model of SOD1 that has been found according
to the literature to share a relationship to Alzheimer’s, Parkinson’s disease, as well as
others. There are three types of nodes featured in this model: the square represents the
single protein to which each other node is related. The circles and pentagons denote
the PMIDs and stresses, the triangles denote the disorders that have documented
relationships to the other nodes. All edges denote that terms are connected by at least
one common article.
In Figure 9, we note that ageing shares a relationship with stresses ROS,
oxidative stress, pollutants and others. A relationship is also shared with PTMs
such as thyroxine (the main hormone secreted into the bloodstream by the thy-
roid gland), methionine sulfoxide, lactic acid, and triiodothyronine – a thyroid
hormone that plays vital roles in the body’s metabolic rate.
More specifically, when exploring the article (PMID: 27199942) by Bastard
et al. [25], we note that the actual clique for this article is composed of {ageing,
stress and tolerance}. The article concerns the ability of the Gram-positive
bacterial species Oenococcus oeni, which is used in the production of wine to reduce
acidity, to tolerate stresses caused by the formation of biofilms or planktonic cells.
Fig. 9. Functional clique: The red circles represent the PMID numbers for PubMed
articles, the blue squares indicate PTMs, the green triangles denote the stress-factors
and the mustard pentagons correspond to the ailment by name, to which all elements
are related by the literature. We note that all these terms are related by peer-reviewed
studies; however, we must return to the PMID of each clique to determine the nature
of the relationship. We show the summary plot of text mining tasks for the keyword
aging.
The article provides examples of relationships where stress and tolerance play
major roles in the study.
In Figure 10, which has been reduced from its full size to facilitate discussion,
we note that Alzheimer’s is related to stress-types such as {heat shock, oxidation,
oxidative stress, reactive oxygen species, (stress) tolerance, and others}.
We have customized our output to show only peer-reviewed journals in which
relationships between Alzheimer’s and the stress-factors are described. For example,
in Millan’s work [26] (PMID: 25364287), core pathophysiological processes
underlying Alzheimer’s have been studied where methylation, oxidative stress
and other factors were intimately involved.
In Figure 11, the relationships between the PTM acetylation, stress-factors and
ailments are described. Amongst the stresses, there are {heat, hyperthermia,
hypoxia, microgravity, oxidative stress, ROS and others} and, in the ailments, we
note {ageing, Alzheimer’s, asthma, bone loss, Crohn’s disease, diabetes, epilepsy,
and others}. In Ansari et al. [27] (PMID: 27686535) from the network, SIRT3, a
member of the sirtuin group, is studied for its role in regulating energy demand
during stress conditions such as fasting and exercise. We note that SIRT3
Fig. 10. Functional clique: The red circles represent the PMID numbers for PubMed
articles, the blue squares indicate stress-factors, the green triangles denote the journal
names, and the mustard pentagon corresponds to the ailment by name. This network
allows us to study which types of journals are featuring unique types of research. We
show the summary plot of text mining tasks for the keyword, Alzheimer’s Disease. This
relationship network is actually a subset of the entire network, which was too densely populated
to be legible.
regulates metabolism through the deacetylation and acetylation of mitochondrial
enzymes and is understood to be able to combat the effects of ROS, and to
prevent cancer by initiating apoptosis. In their article, Ansari et al. review the
molecular functions of SIRT3 and its regulatory abilities.
Before having to consult the article itself, its keywords (in the absence of the
SIRT3 sirtuin group member) were exhibited graphically in the network of Figure 11.
Investigators undertaking literature reviews for articles containing these
keywords may consult these networks to begin some of their work. Because multiple
PTMs are likely to work together in processes underlying disorders such as
Parkinson’s [28], as discussed in [29], [30], [20], investigators may also wish to
consult PTM relationship networks, such as that of Figure 12 (glycosylation), to
gain a fuller understanding of other PTMs that may work in tandem.
Some literature reviews may begin with a study of stress-factors and PTMs to
determine effects and/or potential onsets of disorders. In such a case, the study
of stresses in conjunction with a particular PTM would lead the research team
to articles where potential disorders are explored, where stresses and PTMs
Fig. 11. PTM clique: The red circles represent the PMID numbers for PubMed
articles, the blue square indicates a PTM, the green triangles denote stresses, and
the mustard pentagon corresponds to the ailment by name. This relationship network
is from the study of the keyword acetylation in light of stress-factors and associated
disorders.
are integral components. To determine some of the disorders that may result
during exposure to a particular stress and PTM, we created the relationship
networks of Figure 13 (reduced to facilitate discussion). In
the relationship networks of the figure, we note that oxidative stress has been
linked to {apoptosis, diabetes, heart disease, obesity and others}, in concert with
PTMs such as methionine sulfoxide (oxidation), nitrated tyrosine (nitration),
thyroxine (iodination), and others. More information about the nature of each
PTM of this network is available from UniProt at http://www.uniprot.org/docs/ptmlist.
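The lookup described above, which starts from a stress and a PTM and collects the disorders named in the same articles, can be sketched in a few lines. This is only an illustration under stated assumptions and not BeagleTM's implementation: the record layout (PMID, keyword, keyword type), the helper name disorders_for, and the placeholder PMIDs and keyword groupings are invented for the example and are not results from our corpus.

```python
from collections import defaultdict

# Hypothetical keyword hits: (pmid, keyword, keyword_type).
# PMIDs are placeholders and the groupings are illustrative only.
HITS = [
    ("pmid_A", "oxidative stress", "stress"),
    ("pmid_A", "methionine sulfoxide", "ptm"),
    ("pmid_A", "diabetes", "disorder"),
    ("pmid_B", "oxidative stress", "stress"),
    ("pmid_B", "thyroxine", "ptm"),
    ("pmid_B", "obesity", "disorder"),
    ("pmid_C", "oxidative stress", "stress"),
    ("pmid_C", "methionine sulfoxide", "ptm"),
    ("pmid_C", "heart disease", "disorder"),
    ("pmid_D", "hypoxia", "stress"),
    ("pmid_D", "methionine sulfoxide", "ptm"),
    ("pmid_D", "apoptosis", "disorder"),
]


def disorders_for(hits, stress, ptm):
    """Map each disorder to the PMIDs that also name both `stress` and `ptm`."""
    # Group the typed keywords found in each article.
    by_article = defaultdict(set)
    for pmid, keyword, kind in hits:
        by_article[pmid].add((keyword, kind))

    # Keep only articles that mention the stress and the PTM together.
    result = defaultdict(set)
    for pmid, terms in by_article.items():
        if (stress, "stress") in terms and (ptm, "ptm") in terms:
            for keyword, kind in terms:
                if kind == "disorder":
                    result[keyword].add(pmid)
    return dict(result)


found = disorders_for(HITS, "oxidative stress", "methionine sulfoxide")
```

Each returned disorder carries the PMIDs that support it, so the investigator can return to the articles themselves to determine the nature of the relationship.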
4 Conclusion
Due to the problems associated with attaching keywords to articles, we noted
that text mining may be an appropriate remedy to help researchers find concepts
in the literature, which seemingly has no uniform method for writing keywords.
In addition, even when keywords exist for an article, they may not suggest the
full depth of knowledge that their articles contain. In this study, we discussed examples
Fig. 12. PTM clique: The red circles represent the PMID numbers for PubMed
articles, the blue square indicates a PTM, the green triangles denote stresses, and the
mustard pentagon corresponds to the ailment by name. This relationship network is
from the study of the keyword glycosylation in light of stress-factors and associated
disorders.
where our method and tool, BeagleTM, was used to extract relationships
between PTMs, stress-factors, and proteins which may be involved with disorders.
We described how to read relationship networks, which suggest connections
between keywords and allow researchers to obtain knowledge from the literature.
During the discussion of the technicalities of the method itself, we discussed
how our method is able to process a corpus of arbitrary size since it parses one
article’s abstract at a time. Since abstracts are excellent representations of the
entire work, we used the articles’ abstracts as the inputs to our tool. However, our
method and tool will work similarly on any size of text, including a full article.
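The one-abstract-at-a-time processing described above can be illustrated with a generator. The sketch below is a minimal illustration and not BeagleTM's code: the function name scan_abstracts, the whole-word and case-insensitive matching rule, and the toy abstracts with placeholder PMIDs are all assumptions made for the example.

```python
import re
from typing import Iterable, Iterator, Tuple


def scan_abstracts(
    abstracts: Iterable[Tuple[str, str]], keywords: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (pmid, keyword) for each keyword found in each abstract.

    Only one abstract is held at a time, so `abstracts` may be any
    iterable, including a generator that reads files lazily.
    """
    # Whole-word, case-insensitive matching is an assumption; the
    # matching rule is not specified in the text.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    for pmid, text in abstracts:
        for kw, pattern in patterns.items():
            if pattern.search(text):
                yield pmid, kw


def toy_corpus():
    # Invented abstracts with placeholder PMIDs.
    yield "pmid_A", "Acetylation of SOD1 changes under oxidative stress."
    yield "pmid_B", "Hypoxia alters glycosylation in aged tissue."


hits = list(scan_abstracts(toy_corpus(), ["acetylation", "SOD1", "hypoxia"]))
```

Because the corpus is consumed lazily, memory use depends on the length of a single text rather than on the size of the corpus, which is why the same loop would apply equally to full articles.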
We discussed that the method of determining commonalities across keywords
revolves around the idea that each article in our corpus (supplied by PubMed)
is automatically given a PMID number. When a keyword is located, the
keyword, its reference details, and its PMID number are inserted into the
local SQL database. SQL queries can then be used to retrieve the data
describing the relations that we require to create relationship networks. We
discussed how to read and understand relationship networks where nodes and
edges represent keywords and the existence of articles to support a relationship,
Fig. 13. Stress clique: The red circles represent the PMID numbers for PubMed
articles, the blue square indicates a stress, the green triangles denote ailments, and
the mustard pentagons correspond to the PTMs. This relationship network is from the
study of the keyword oxidative stress, in light of PTMs and associated disorders.
respectively. Finally, we discussed how the use of relationship networks, a visual
representation of the actors in abstracts, may save the investigator time when
sifting through diverse abstracts while searching for specific types of studies.
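The storage and query steps summarized above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the schema or queries used by BeagleTM: the table name hits, its two columns, and the placeholder PMIDs are invented for the example. A self-join on the PMID column yields one edge for each pair of keywords that share at least one article.

```python
import sqlite3
from collections import defaultdict

# In-memory database standing in for the local SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (pmid TEXT, keyword TEXT)")  # assumed schema
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("pmid_A", "SOD1"),
        ("pmid_A", "acetylation"),
        ("pmid_A", "oxidation"),
        ("pmid_B", "SOD1"),
        ("pmid_B", "oxidation"),
    ],
)

# Two distinct keywords located under the same PMID form an edge;
# the count records how many articles support that edge.
rows = conn.execute(
    """
    SELECT a.keyword, b.keyword, COUNT(DISTINCT a.pmid) AS n_articles
    FROM hits AS a
    JOIN hits AS b ON a.pmid = b.pmid AND a.keyword < b.keyword
    GROUP BY a.keyword, b.keyword
    ORDER BY a.keyword, b.keyword
    """
).fetchall()

# Undirected adjacency view: nodes are keywords, edges are shared articles.
adjacency = defaultdict(set)
for left, right, _count in rows:
    adjacency[left].add(right)
    adjacency[right].add(left)
```

The resulting edge list could then be passed to a graph library such as NetworkX [18] for drawing; that step is omitted here to keep the sketch free of external dependencies.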
4.1 Future Work
In the future, we intend to extend our BeagleTM tool with statistical methods,
such as Bayesian inference, to enable the prediction of new
keywords which are likely to be related to a particular type of disorder, protein or
stress-factor. This functionality would enable investigators to obtain meaningful
networks in the absence of a complete knowledge of the necessary keywords for a subject
area. The addition of this analysis would also allow our tool to discern between
strong and weak types of relationships between keywords.
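As one hypothetical illustration of how strong and weak relationships might be told apart, the sketch below scores a keyword pair by the posterior mean of a Beta-Binomial model: the estimated probability that keyword B appears in an article given that keyword A does. This is not a planned or existing BeagleTM feature; the function name, the uniform Beta(1, 1) prior, and the article counts are assumptions chosen only to make the idea concrete.

```python
def relationship_strength(
    n_with_a: int,
    n_with_both: int,
    prior_hits: float = 1.0,
    prior_misses: float = 1.0,
) -> float:
    """Posterior mean of P(B appears | A appears) under a Beta prior.

    n_with_a    -- number of articles that mention keyword A
    n_with_both -- number of those articles that also mention keyword B
    """
    if n_with_both < 0 or n_with_both > n_with_a:
        raise ValueError("n_with_both must lie between 0 and n_with_a")
    return (n_with_both + prior_hits) / (n_with_a + prior_hits + prior_misses)


# Invented counts: B co-occurs with A in 30 of 40 articles versus 1 of 40.
strong = relationship_strength(n_with_a=40, n_with_both=30)
weak = relationship_strength(n_with_a=40, n_with_both=1)
```

The prior keeps estimates for rarely seen keywords away from 0 and 1, so an edge supported by a single article is not scored as strongly as one supported by many.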
We intend to add a network interactivity layer to the tool so that researchers
are able to move and re-position the nodes of the relationship networks to aid in
productivity. Finally, after development, we plan to make our tool open source
and to release it to the bioinformatics community via GitHub or another cloud-based
development platform to allow for community-inspired development.
4.2 Acknowledgment
I would like to thank Janyl Jumadinova for her help in proofing this manuscript.
References
1. A. Splendiani, M. Donato, and S. Drăghici, “Ontologies for bioinformatics,” in
Springer Handbook of Bio-/Neuroinformatics. Springer, 2014, pp. 441–461.
2. K. Schouten, F. Frasincar, R. Dekker, and M. Riezebos, “Heracles: A framework
for developing and evaluating text mining algorithms,” Expert Systems with Ap-
plications, vol. 127, pp. 68–84, 2019.
3. M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and
K. Kochut, “A brief survey of text mining: Classification, clustering and extraction
techniques,” arXiv preprint arXiv:1707.02919, 2017.
4. S. Sharma and S. K. Srivastava, “Review on text mining algorithms,” International
Journal of Computer Applications, vol. 134, no. 8, 2016.
5. A. Lamurias and F. M. Couto, “Text mining for bioinformatics using biomedical
literature,” Encyclopedia of bioinformatics and computational biology, vol. 1, 2019.
6. R. Paynter, L. L. Bañez, E. Berliner, E. Erinoff, J. Lege-Matsuura, S. Potter,
and S. Uhl, “Epc methods: an exploration of the use of text-mining software in
systematic reviews,” 2016.
7. D. Maynard, I. Roberts, M. A. Greenwood, D. Rout, and K. Bontcheva, “A frame-
work for real-time semantic social media analysis,” Web Semantics: Science, Ser-
vices and Agents on the World Wide Web, 2017.
8. I. Feinerer, “Introduction to the tm package text mining in r,” 2017.
9. G. J. Williams et al., “Rattle: a data mining gui for r,” The R Journal, vol. 1,
no. 2, pp. 45–55, 2009.
10. H.-M. Müller, K. M. Van Auken, Y. Li, and P. Sternberg, “Textpresso central: a
customizable platform for searching, text mining, viewing, and curating biomedical
literature,” BMC bioinformatics, vol. 19, no. 1, p. 94, 2018.
11. C.-H. Wei, H.-Y. Kao, and Z. Lu, “Pubtator: a web-based text mining tool for
assisting biocuration,” Nucleic acids research, p. gkt441, 2013.
12. S. Kim, D. Kwon, S.-Y. Shin, and W. J. Wilbur, “Pie the search: searching pubmed
literature for protein interaction information,” Bioinformatics, vol. 28, no. 4, pp.
597–598, 2011.
13. S. Kim, L. Yeganova, and W. J. Wilbur, “Meshable: searching pubmed abstracts
by utilizing mesh and mesh-derived topical terms,” Bioinformatics, vol. 32, no. 19,
pp. 3044–3046, 2016.
14. P. Papadopoulou, M. Lytras, and C. Marouli, “Bioinformatics as applied to
medicine: Challenges faced moving from big data to smart data to wise data,”
in Applying Big Data Analytics in Bioinformatics and Medicine. IGI Global,
2018, pp. 1–25.
15. R. C. NCBI, “Database resources of the national center for biotechnology infor-
mation.” Nucleic acids research, vol. 45, no. D1, p. D12, 2017.
16. S. Marcelli, M. Corbo, F. Iannuzzi, L. Negri, F. Blandini, R. Nisticò, and M. Feligioni,
“The involvement of post-translational modifications in alzheimer’s disease.”
Current Alzheimer research, 2017.
17. J. Hunnicut, Y. Liu, A. Richardson, and A. B. Salmon, “Msra overexpression
targeted to the mitochondria, but not cytosol, preserves insulin sensitivity in diet-
induced obese mice,” PloS one, vol. 10, no. 10, p. e0139844, 2015.
18. D. A. Schult and P. Swart, “Exploring network structure, dynamics, and function
using networkx,” in Proceedings of the 7th Python in Science Conferences (SciPy
2008), vol. 2008, 2008, pp. 11–16.
19. O. Bonham-Carter, J. Pedersen, and D. Bastola, “A content and structural assess-
ment of oxidative motifs across a diverse set of life forms,” Computers in biology
and medicine, vol. 53, pp. 179–189, 2014.
20. O. Bonham-Carter, J. Pedersen, L. Najjar, and D. Bastola, “Modeling the effects
of microgravity on oxidation in mitochondria: A protein damage assessment across
a diverse set of life forms,” in IEEE Data Mining Workshop (ICDMW). IEEE,
2013, pp. 250–257.
21. C. Thygesen, I. Boll, B. Finsen, M. Modzel, and M. R. Larsen, “Characterizing
disease-associated changes in post-translational modifications by mass spectrome-
try,” Expert review of proteomics, vol. 15, no. 3, pp. 245–258, 2018.
22. Y. Li, S. Chigurupati, H. W. Holloway, M. Mughal, D. Tweedie, D. A. Bruestle,
M. P. Mattson, Y. Wang, B. K. Harvey, B. Ray et al., “Exendin-4 ameliorates
motor neuron degeneration in cellular and animal models of amyotrophic lateral
sclerosis,” PLoS One, vol. 7, no. 2, p. e32008, 2012.
23. P. Milani, G. Ambrosi, O. Gammoh, F. Blandini, and C. Cereda, “Sod1 and dj-1
converge at nrf2 pathway: a clue for antioxidant therapeutic potential in neurode-
generation,” Oxidative medicine and cellular longevity, vol. 2013, 2013.
24. J. A. Collins, R. J. Moots, P. D. Clegg, and P. I. Milner, “Resveratrol and n-
acetylcysteine influence redox balance in equine articular chondrocytes under acidic
and very low oxygen conditions,” Free Radical Biology and Medicine, vol. 86, pp.
57–64, 2015.
25. A. Bastard, C. Coelho, R. Briandet, A. Canette, R. Gougeon, H. Alexandre,
J. Guzzo, and S. Weidmann, “Effect of biofilm formation by oenococcus oeni on
malolactic fermentation and the release of aromatic compounds in wine,” Frontiers
in microbiology, vol. 7, p. 613, 2016.
26. M. J. Millan, “The epigenetic dimension of alzheimer’s disease: causal, conse-
quence, or curiosity?” Dialogues in clinical neuroscience, vol. 16, no. 3, p. 373,
2014.
27. A. Ansari, M. Rahman, S. K. Saha, F. K. Saikot, A. Deep, K.-H. Kim et al.,
“Function of the sirt3 mitochondrial deacetylase in cellular physiology, cancer, and
neurodegenerative disease,” Aging cell, vol. 16, no. 1, pp. 4–16, 2017.
28. I. Ferrer, “Early involvement of the cerebral cortex in parkinson’s disease: conver-
gence of multiple metabolic defects,” Progress in neurobiology, vol. 88, no. 2, pp.
89–103, 2009.
29. G. Stetz, A. Tse, and G. M. Verkhivker, “Dissecting structure-encoded determi-
nants of allosteric cross-talk between post-translational modification sites in the
hsp90 chaperones,” Scientific reports, vol. 8, no. 1, p. 6899, 2018.
30. O. Bonham-Carter, I. Thapa, and D. Bastola, “Evidence of post translational
modification bias extracted from the trna and corresponding amino acid interplay
across a set of diverse organisms,” in Proceedings of the 5th ACM Conference on
Bioinformatics, Computational Biology, and Health Informatics. ACM, 2014, pp.
774–781.