Barracchia et al. BMC Bioinformatics (2020) 21:70
https://doi.org/10.1186/s12859-020-3392-2
METHODOLOGY ARTICLE Open Access
Prediction of new associations between
ncRNAs and diseases exploiting multi-type
hierarchical clustering
Emanuele Pio Barracchia1, Gianvito Pio1*, Domenica D'Elia3 and Michelangelo Ceci1,2,4
Abstract
Background: The study of functional associations between ncRNAs and human diseases is a pivotal task of modern
research to develop new and more effective therapeutic approaches. Nevertheless, it is not a trivial task since it
involves entities of different types, such as microRNAs, lncRNAs or target genes whose expression also depends on
endogenous or exogenous factors. Such complexity can be addressed by representing the involved biological entities
and their relationships as a network and by exploiting network-based computational approaches able to identify new
associations. However, existing methods are limited to homogeneous networks (i.e., consisting of only one type of
objects and relationships) or can exploit only a small subset of the features of biological entities, such as the presence
of a particular binding domain, enzymatic properties or their involvement in specific diseases.
Results: To overcome the limitations of existing approaches, we propose the system LP-HCLUS, which exploits a
multi-type hierarchical clustering method to predict possibly unknown ncRNA-disease relationships. In particular,
LP-HCLUS analyzes heterogeneous networks consisting of several types of objects and relationships, each possibly
described by a set of features, and extracts multi-type clusters that are subsequently exploited to predict new
ncRNA-disease associations. The extracted clusters are overlapping, hierarchically organized, involve entities of
different types, and allow LP-HCLUS to capture multiple roles of ncRNAs in diseases at different levels of granularity. Our
experimental evaluation, performed on heterogeneous attributed networks consisting of microRNAs, lncRNAs,
diseases, genes and their known relationships, shows that LP-HCLUS is able to obtain better results with respect to
existing approaches. The biological relevance of the obtained results was evaluated according to both quantitative
(i.e., TPR@k, Areas Under the TPR@k, ROC and Precision-Recall curves) and qualitative (i.e., according to the
consultation of the existing literature) criteria.
Conclusions: The obtained results prove the utility of LP-HCLUS to conduct robust predictive studies on the
biological role of ncRNAs in human diseases. The produced predictions can therefore be reliably considered as new,
previously unknown, relationships among ncRNAs and diseases.
Keywords: Non-coding RNA (ncRNAs), Diseases, Cancer, Heterogeneous network, Clustering, Link prediction
Background
High-throughput sequencing technologies, together with recent, more efficient computational
approaches, have been fundamental for the rapid advances in functional genomics. Among the most
relevant results is the discovery of thousands of non-coding RNAs (ncRNAs) with a regulatory
function on gene expression [1]. In parallel, the number of studies reporting the involvement of
ncRNAs in the development of many different human diseases has grown exponentially [2]. The first
type of ncRNAs to be discovered and extensively studied is that of microRNAs (miRNAs), classified
as small non-coding RNAs, in contrast with the other main category represented by long non-coding
RNAs (lncRNAs), which are ncRNAs longer than 200 nt [3, 4].
*Correspondence: gianvito.pio@uniba.it
1University of Bari Aldo Moro - Department of Computer Science, Via Orabona, 4, 70125 Bari, Italy
Full list of author information is available at the end of the article
© The Author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) [5] are among the largest and most
heterogeneous groups of regulators of major cellular processes. However, unlike miRNAs, which
primarily act as post-transcriptional regulators, lncRNAs have a plethora of regulatory
functions [6]. They are involved in chromatin remodeling and epigenetic modifications, and
organize functionally different nuclear sub-compartments with an impact on the nuclear
architecture [7]. LncRNAs are also involved in the regulation of the expression of transcripts at
the cytoplasmic level through another series of interactions/functions that interfere with the
efficiency of translation of transcripts into their protein products. In particular, they can
directly interfere with miRNA functions by acting as miRNA sponges [8]. Nevertheless, the number
of lncRNAs for which the functional and molecular mechanisms are completely elucidated is still
quite small. This is due to two main reasons: their recent discovery as master regulators,
compared with miRNAs, and some particular features, such as low cross-species conservation, low
expression levels and high tissue specificity, that make their characterization or any type of
generalization still very difficult [9]. Therefore, assessing the role and the molecular
mechanisms underlying the involvement of lncRNAs in human diseases is not a trivial task, and
experimental investigations are still too expensive to be carried out without any computational
pre-analysis.
In the last few years, there have been several attempts to computationally predict the
relationships among biological entities, such as genes, miRNAs, lncRNAs and diseases [10–19].
Such methods are mainly based on a network representation of the entities under study and on the
identification of new links among nodes in the network.
However, most of the existing approaches are able to work
only on homogeneous networks (where nodes and links
are of one single type) [20], are strongly limited by the
number of different node types or are constrained by a
pre-defined network structure. To overcome these limita-
tions we propose the method LP-HCLUS (Link Prediction
through Hierarchical CLUStering), which can discover
previously unknown ncRNA-disease relationships work-
ing on heterogeneous attributed networks (that is, net-
works composed of different biological entities related by
different types of relationships) with arbitrary structure.
This ability allows LP-HCLUS to investigate how different
types of entities interact with each other, possibly lead-
ing to increased prediction accuracy. LP-HCLUS exploits
a combined approach based on hierarchical, multi-type
clustering and link prediction. As we will describe in
detail in the next section, a multi-type cluster is actually
a heterogeneous sub-network. Therefore, the adoption of
a clustering-based approach allows LP-HCLUS to base
its predictions on relevant, highly-cohesive heterogeneous
sub-networks. Moreover, the hierarchical organization of
clusters allows it to perform predictions at different levels
of granularity, taking into account either local/specific or
global/general relationships.
Methodologically, LP-HCLUS estimates an initial score
for each possible relationship involving entities belonging
to the types of interest (in our case, ncRNAs and dis-
eases), by exploiting the whole network. Such scores are
then used to identify a hierarchy of overlapping multi-
type clusters, i.e., groups of objects of different types.
Finally, the identified clusters are exploited to predict new
relationships, each of which is associated with a score
representing its degree of certainty. Therefore, according to the classification provided in [21]
(see Additional file 1), LP-HCLUS simultaneously falls into two categories: i) algorithmic
methods, since it strongly relies on a clustering approach to predict new relationships and to
associate them with a score in [0, 1], and ii) similarity-based approaches, since the first phase
(see “Estimation of the strength of the relationship between ncRNAs and diseases” section)
exploits the computation of similarities between target nodes, taking into account the paths in
the network and the attributes of the nodes.
The rest of the paper is organized as follows: in the next section, we describe our method for
the identification of new ncRNA-disease relationships; in the “Results” section we describe our
experimental evaluation and in the “Discussion” section we discuss the obtained results,
including a qualitative analysis of the obtained predictions; finally, we conclude the paper and
outline some future work. Moreover, in Additional file 1, we discuss the works related to the
present paper; in Additional file 2 we report an analysis of the computational complexity of the
proposed method; finally, in Additional files 3, 4 and 5 we report some detailed results obtained
during the experiments.
Methods
The algorithmic approach followed by LP-HCLUS mainly
relies on the predictive clustering framework [22–24].
The motivation behind the adoption of such a framework
comes from its recognized ability of handling data affected
by different forms of autocorrelation, i.e., when close
objects (spatially, temporally, or in a network as in this
work) appear to be more similar than distant objects. This
peculiarity allows LP-HCLUS to catch multiple depen-
dencies among the involved entities, which can represent
relevant cooperative/interfering activities.
Specifically, LP-HCLUS identifies hierarchically orga-
nized, possibly overlapping multi-type clusters from a
heterogeneous network and exploits them for predic-
tive purposes, i.e., to predict the existence of previously
unknown links. The extraction of a hierarchical structure,
rather than a flat structure, allows the biologists to focus
on either more general or more specific interaction activ-
ities. Finally, the possible overlaps among the identified
clusters allow LP-HCLUS to consider multiple roles of
the same disease or ncRNA, which may be involved in
multiple interaction networks.
It is noteworthy that, even if the analyzed network may
consist of an arbitrary number of types of nodes and
edges, the prediction of new associations will focus on
edges involving ncRNAs and diseases, called target types.
On the contrary, node types that are only used during the
analysis will be called task-relevant node types.
Intuitively, the approach followed by LP-HCLUS con-
sists of three main steps:
1. estimation of the strength of relationships for all the
possible pairs of ncRNAs and diseases, according to
the paths connecting such nodes in the network and
to the features of nodes involved in such paths;
2. construction of a hierarchy of overlapping multi-type
clusters, on the basis of the strength of relationships
computed in the previous step;
3. identification of predictive functions to predict new
ncRNA-disease relationships on the basis of the
clusters identified at different levels of the hierarchy.
It is noteworthy that the clustering step could be applied directly to the set of known
interactions, without performing the first step. However, such an approach would discard several
potential indirect relationships that can be captured only through a deep analysis of the
network, which is indeed the main purpose of the first step. A naïve solution for the prediction
task would be to use the output of the first step as the final score, ignoring steps 2 and 3.
However, this would disregard a more abstract perspective of the interactions which, instead, can
be captured by the clustering-based approach. It would also disregard the network homophily
phenomenon and fail to capture possible relationships between ncRNAs and between diseases based
on the nodes they are connected with. On the contrary, the exploitation of such relationships is
in line with the guilt-by-association (GBA) principle, which states that entities with similar
functions tend to share interactions with other entities. This principle has been recently
applied to and investigated for ncRNAs [25].
Each step is described in detail in the next subsections. Before that, we formally define the
heterogeneous attributed network analyzed by LP-HCLUS, as well as the task to be solved.
Definition 1 (Heterogeneous attributed network) A heterogeneous attributed network is a network
$G = (V, E)$, where $V$ denotes the set of nodes and $E$ denotes the set of edges, and both nodes
and edges can be of different types (see Fig. 1). Moreover:
- $T = T_t \cup T_{tr}$ is the set of node types, where $T_t$ is the set of target types and
  $T_{tr}$ is the set of task-relevant types;
- each node type $T_v \in T$ defines a subset of nodes in the network, that is, $V_v \subseteq V$;
- each node type $T_v \in T$ is associated with a set of attributes
  $A_v = \{A_{v,1}, A_{v,2}, \ldots, A_{v,m_v}\}$, i.e., all the nodes of a given type $T_v$ are
  described according to the attributes $A_v$;
- $R$ is the set of all the possible edge types;
- each edge type $R_l \in R$ defines a subset of edges $E_l \subseteq E$.
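As an illustration of Definition 1, the following minimal Python sketch shows one possible in-memory representation of a heterogeneous attributed network. It is not the data structure used by LP-HCLUS: the class name `HeteroNetwork`, its methods, the node identifiers (`n1`, `g1`, `d1`), the attribute names and the edge-type labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroNetwork:
    """Toy container for a heterogeneous attributed network G = (V, E)."""
    target_types: set                 # T_t, e.g. {"ncRNA", "disease"}
    task_relevant_types: set          # T_tr, e.g. {"gene"}
    nodes: dict = field(default_factory=dict)  # node id -> (node type, attribute dict)
    edges: dict = field(default_factory=dict)  # edge type -> set of (source id, target id)

    def add_node(self, node_id, node_type, **attributes):
        # Every node must belong to a declared node type (T = T_t U T_tr).
        if node_type not in self.target_types | self.task_relevant_types:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = (node_type, attributes)

    def add_edge(self, edge_type, source, target):
        # Edges are grouped by edge type, i.e. each R_l defines a subset E_l of E.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already be nodes")
        self.edges.setdefault(edge_type, set()).add((source, target))

    def nodes_of_type(self, node_type):
        """V_v: the subset of nodes having the given type."""
        return {n for n, (t, _) in self.nodes.items() if t == node_type}


# Made-up toy network: one ncRNA, one gene, one disease.
net = HeteroNetwork(target_types={"ncRNA", "disease"}, task_relevant_types={"gene"})
net.add_node("n1", "ncRNA", category="lncRNA")
net.add_node("g1", "gene", chromosome="11")
net.add_node("d1", "disease")
net.add_edge("ncRNA-gene", "n1", "g1")
net.add_edge("gene-disease", "g1", "d1")
```

Grouping edges by type keeps each subset $E_l$ directly accessible, which is convenient when enumerating paths that follow a fixed sequence of edge types.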
Definition 2 (Overlapping Multi-type cluster) Given a heterogeneous attributed network
$G = (V, E)$, an overlapping multi-type cluster is defined as $G' = (V', E')$, where:
- $V' \subseteq V$;
- $\forall v' \in V'$, $v'$ is a node of a target type;
- $\forall v' \in V'$, $v'$ may also belong to other clusters besides $G'$;
- $E' \subseteq (E \cup \hat{E})$ is a set of relationships among the nodes in $V'$, belonging
  either to the set of known relationships $E$ or to a set of extracted relationships $\hat{E}$,
  which are identified by the clustering method.
Fig. 1 An example of a heterogeneous attributed network. On the left, a general overview of the network, where shapes represent different node
types and colors represent different edge types. On the right, a zoom on a small portion of the network, where we can observe node attributes
associated with squares ($A_{s,*}$), triangles ($A_{t,*}$) and circles ($A_{c,*}$)
The details about the strategy adopted to identify $\hat{E}$ will be discussed in the “Estimation
of the strength of the relationship between ncRNAs and diseases” section.
Definition 3 (Hierarchical multi-type clustering) A hierarchy of multi-type clusters is defined
as a list of hierarchy levels $[L_1, L_2, \ldots, L_k]$, where each $L_i$ consists of a set of
overlapping multi-type clusters. For each level $L_i$, $i = 2, 3, \ldots, k$, we have that
$\forall G' \in L_i \; \exists G'' \in L_{i-1}$, such that $G''$ is a subnetwork of $G'$ (see
Fig. 2).
On the basis of these definitions, we formally define the
task considered in this work.
Definition 4 (Predictive hierarchical clustering for link prediction) Given a heterogeneous
attributed network $G = (V, E)$ and the set of target types $T_t$, the goal is to find:
- A hierarchy of overlapping multi-type clusters $[L_1, L_2, \ldots, L_k]$.
- A function $\psi^{(w)}: V_{i_1} \times V_{i_2} \rightarrow [0, 1]$ for each hierarchical level
  $L_w$ ($w \in \{1, 2, \ldots, k\}$), where nodes in $V_{i_1}$ are of type $T_{i_1} \in T_t$ and
  nodes in $V_{i_2}$ are of type $T_{i_2} \in T_t$. Intuitively, each function $\psi^{(w)}$ maps
  each possible pair of nodes (of types $T_{i_1}$ and $T_{i_2}$, respectively) to a score that
  represents the degree of certainty of their relationship.
The learning setting considered in this paper is trans-
ductive. In particular, only the links involving nodes
already known and exploited during the training phase are
considered for link prediction. In other terms, we do not
learn a model from a network and apply this model to a
completely different network (classical inductive learning
setting).
The method proposed in this paper (see Fig. 3 for the general workflow) aims at solving the task
formalized in Definition 4, by considering ncRNAs and diseases as target types (Fig. 4). Hence,
we determine two distinct sets of nodes, denoted by $T_n$ and $T_d$, representing the set of
ncRNAs and the set of diseases, respectively.
Estimation of the strength of the relationship between ncRNAs and diseases
In the first phase, we estimate the strength of the relationship among all the possible
ncRNA-disease pairs in the network $G$. In particular, we aim to compute a score $s(n_i, d_j)$
for each possible pair $(n_i, d_j)$, by exploiting the concept of meta-path. According to [26], a
meta-path is a set of sequences of nodes which follow the same sequence of edge types, and can be
used to fruitfully represent conceptual (possibly indirect) relationships between two entities in
a heterogeneous network (see Fig. 5). Given the ncRNA $n_i$ and the disease $d_j$, for each
meta-path $P$, we compute a score $pathscore(P, n_i, d_j)$, which represents the strength of
their relationship on the basis of the meta-path $P$.
In order to combine multiple contributions provided by different meta-paths, we adopt a strategy
that follows the classical formulation of fuzzy sets [27]. In particular, a relationship between
a ncRNA $n_i$ and a disease $d_j$ can be considered “certain” if there is at least one meta-path
which confirms its certainty. Therefore, by assimilating the score associated with an interaction
to its degree of certainty, we compute $s(n_i, d_j)$ as the maximum value observed over all the
possible meta-paths between $n_i$ and $d_j$. Formally:
$$s(n_i, d_j) = \max_{P \in metapaths(n_i, d_j)} pathscore(P, n_i, d_j) \tag{1}$$
where $metapaths(n_i, d_j)$ is the set of meta-paths connecting $n_i$ and $d_j$, and
$pathscore(P, n_i, d_j)$ is the degree of certainty of the relationship between $n_i$ and $d_j$
according to the meta-path $P$.
Fig. 2 A hierarchy of overlapping multi-type clusters: (a) emphasizes the overlapping among multi-type clusters; (b) shows their hierarchical
organization
Fig. 3 Workflow of the method LP-HCLUS
Fig. 4 An example of a ncRNA-disease heterogeneous network. In this example, ncRNAs are represented as triangles, while diseases are represented
as squares. Other (task-relevant) nodes (e.g., target genes, proteins, etc.) are represented as gray circles
Fig. 5 Diagram showing three different meta-paths between a disease and a ncRNA. The first meta-path connects diseases and ncRNAs via genes,
the second connects diseases and ncRNAs directly and the third connects diseases and ncRNAs via proteins
As introduced before, each meta-path $P$ represents a finite set of sequences of nodes, where:
- the $i$-th node of each sequence in the meta-path $P$ is of the same type;
- the first node is a ncRNA and the last node is a disease;
- if two nodes are consecutive in the sequence, then there is an edge between them in $E$.
According to this definition, if there is a path $P$ directly connecting a ncRNA $n_i$ to a
disease $d_j$, then $pathscore(P, n_i, d_j) = 1$, therefore $s(n_i, d_j) = 1$. Otherwise, when
there is no direct connection between $n_i$ and $d_j$, $pathscore(P, n_i, d_j)$ is computed as
the maximum similarity between the sequences that start with $n_i$ and those that end with $d_j$.
Formally:
$$pathscore(P, n_i, d_j) = \max_{\substack{seq', seq'' \in P, \\ seq'.first = n_i, \; seq''.last = d_j}} similarity(seq', seq'') \tag{2}$$
The intuition behind this formula is that if $n_i$ and $d_j$ are not directly connected, their
score represents the similarity of the nodes and edges they are connected to. In other words,
this is a way to analyze the similarity between the neighborhood of $n_i$ and the neighborhood of
$d_j$ in terms of the (similarity of the) paths they are involved in.
It is noteworthy that, in order to make the neighbors comparable, we exploit the concept of
meta-path, which includes sequences that involve the same types of nodes. In fact, in Formula
(2), the similarity between two sequences $seq'$ and $seq''$ is computed as follows:
$$similarity(seq', seq'') = \frac{\sum_{x \in A(P)} s_x(seq', seq'')}{|A(P)|} \tag{3}$$
where:
- $A(P)$ is the set of attributes of the nodes involved in the path $P$;
- $s_x(seq', seq'')$ is the similarity between $val_x(seq')$, that is, the value of the attribute
  $x$ in the sequence $seq'$, and $val_x(seq'')$, that is, the value of the attribute $x$ in the
  sequence $seq''$.
Following [28], we compute $s_x(seq', seq'')$ as follows:
- if $x$ is numeric, then
  $s_x(seq', seq'') = 1 - \frac{|val_x(seq') - val_x(seq'')|}{max_x - min_x}$, where $min_x$
  (resp. $max_x$) is the minimum (resp. maximum) value for the attribute $x$;
- if $x$ is not a numeric attribute, then $s_x(seq', seq'') = 1$ if
  $val_x(seq') = val_x(seq'')$, 0 otherwise.
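Equations 1 to 3 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the authors' implementation: the `Sequence` class, the function names, the attribute names and the toy data are hypothetical; a "direct connection" is read here as the existence of a sequence in $P$ that starts with $n_i$ and ends with $d_j$; a score of 0 is returned when no sequence is available; and a numeric attribute with $max_x = min_x$ is treated as perfectly similar.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    first: str    # identifier of the first node (a ncRNA)
    last: str     # identifier of the last node (a disease)
    values: dict  # attribute name -> value, for all attributes in A(P)


def attribute_similarity(x, seq_a, seq_b, ranges):
    """s_x: range-normalized distance for numeric attributes, exact match otherwise."""
    a, b = seq_a.values[x], seq_b.values[x]
    if x in ranges:                      # numeric attribute with known (min_x, max_x)
        lo, hi = ranges[x]
        if hi == lo:                     # assumption: constant attribute counts as identical
            return 1.0
        return 1.0 - abs(a - b) / (hi - lo)
    return 1.0 if a == b else 0.0


def similarity(seq_a, seq_b, attributes, ranges):
    """Eq. 3: mean per-attribute similarity over A(P)."""
    total = sum(attribute_similarity(x, seq_a, seq_b, ranges) for x in attributes)
    return total / len(attributes)


def pathscore(sequences, n, d, attributes, ranges):
    """Eq. 2, preceded by the direct-connection rule (score 1)."""
    if any(s.first == n and s.last == d for s in sequences):
        return 1.0
    starts = [s for s in sequences if s.first == n]
    ends = [s for s in sequences if s.last == d]
    pairs = (similarity(a, b, attributes, ranges) for a in starts for b in ends)
    return max(pairs, default=0.0)       # assumption: 0 when nothing can be compared


def score(metapaths, n, d):
    """Eq. 1: maximum pathscore over all meta-paths."""
    return max((pathscore(seqs, n, d, attrs, rng) for seqs, attrs, rng in metapaths),
               default=0.0)


# Made-up toy meta-path with one categorical and one numeric attribute.
P = [
    Sequence("n1", "d1", {"gene_family": "A", "expression": 2.0}),
    Sequence("n2", "d2", {"gene_family": "A", "expression": 4.0}),
    Sequence("n2", "d3", {"gene_family": "B", "expression": 6.0}),
]
attrs = ["gene_family", "expression"]
ranges = {"expression": (2.0, 6.0)}

print(score([(P, attrs, ranges)], "n1", "d1"))  # directly connected
print(score([(P, attrs, ranges)], "n1", "d2"))  # compared through sequence attributes
```

In the toy data, `n1` and `d2` share no sequence, so their score comes from comparing the sequence starting with `n1` against the one ending with `d2`: the categorical attribute matches (similarity 1) and the numeric one differs by half of its range (similarity 0.5).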
An example of the computation of the similarity among
sequences is reported in Fig. 6. In this example, we com-
pute the score between the ncRNA h19 and the disease
asthma. First, we identify the sequences starting with h19
(i.e., 1 and 9, emphasized in yellow) and those ending
with asthma (i.e., 4, 5, 6 and 7, emphasized in blue). Then
we compute the pairwise similarity between sequences
belonging to the two sets and select the maximum value,
according to Eq. 2. The similarity between two sequences
is computed according to Eq. 3.
In this solution there could be some node types that
are not involved in any meta-path. In order to exploit
the information conveyed by these nodes, we add an
aggregation of their attribute values (the arithmetic mean
for numerical attributes, the mode for non-numerical
attributes) to the nodes that are connected to them and
that appear in at least one meta-path. Such an aggrega-
tion is performed up to a predefined depth of analysis
in the network. In this way, we fully exploit the network
autocorrelation phenomena.
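The aggregation described above can be sketched as follows. This is a hedged illustration, not the procedure implemented in LP-HCLUS: the function names, the adjacency dictionary, the attribute names and the choice of traversing only nodes that do not appear in any meta-path are assumptions made for the example.

```python
from statistics import mean, mode


def aggregate(values):
    """Arithmetic mean for numerical values, mode for non-numerical ones."""
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    return mean(values) if numeric else mode(values)


def enrich(node, neighbors, attributes, off_path, depth):
    """Aggregate the attributes of off-path nodes reachable from `node` within `depth` hops.

    neighbors:  node id -> iterable of adjacent node ids
    attributes: node id -> {attribute name: value}
    off_path:   set of node ids that do not appear in any meta-path
    """
    collected, frontier, seen = {}, {node}, {node}
    for _ in range(depth):
        # Assumption: the visit expands only through off-path nodes.
        frontier = {m for n in frontier for m in neighbors.get(n, ())
                    if m in off_path and m not in seen}
        seen |= frontier
        for m in frontier:
            for name, value in attributes.get(m, {}).items():
                collected.setdefault(name, []).append(value)
    return {name: aggregate(vals) for name, vals in collected.items()}


# Made-up example: gene g1 is linked to off-path nodes p1 and p2; p3 is two hops away.
neighbors = {"g1": ["p1", "p2"], "p1": ["p3"]}
attributes = {
    "p1": {"length": 100, "family": "kinase"},
    "p2": {"length": 300, "family": "kinase"},
    "p3": {"length": 200, "family": "ligase"},
}
off_path = {"p1", "p2", "p3"}

print(enrich("g1", neighbors, attributes, off_path, depth=1))
```

The returned dictionary would then be attached as additional attributes to the on-path node (`g1` in the example), so that the information of off-path nodes contributes to the sequence similarities.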
Fig. 6 Analysis of sequences between the ncRNA “h19” and the disease “asthma” according to a meta-path. Sequences emphasized in yellow (1 and
9) are those starting with “h19”, while sequences emphasized in blue (4, 5, 6 and 7) are those ending with “asthma”. White rows, although belonging
to $P$, are not considered during the computation of the similarity in this specific example, since they do not involve “h19” or “asthma”
Construction of a hierarchy of overlapping multi-type clusters
Starting from the set of possible ncRNA-disease pairs, each associated with a score that
represents its degree of certainty, we construct the first level of the hierarchy by identifying
a set of overlapping multi-type clusters in the form of bicliques, that is, multi-type clusters
where all the ncRNA-disease relationships have a score greater than (or equal to) a given
threshold $\beta \in [0, 1]$ (see Fig. 7). More formally, in order to construct the first level
of the hierarchy $L_1$, we perform the following steps:
i) Filtering, which keeps only the ncRNA-disease pairs with a score greater than (or equal to)
$\beta$. The result of this step is the subset $\{(n_i, d_j) \mid s(n_i, d_j) \geq \beta\}$.
ii) Initialization, which builds the initial set of clusters in the form of bicliques, each
consisting of a ncRNA-disease pair in $\{(n_i, d_j) \mid s(n_i, d_j) \geq \beta\}$.
iii) Merging, which iteratively merges two clusters $C'$ and $C''$ into a new cluster $C'''$.
This step regards the initial set of clusters as a list sorted according to an ordering relation
$<_c$ that reflects the quality of the clusters. Each cluster $C'$ is then merged with the first
cluster $C''$ in the list that would lead to a cluster $C'''$ which still satisfies the biclique
constraint. This step is repeated until no additional clusters that satisfy the biclique
constraint can be obtained.
The ordering relation $<_c$ exploited by the merging step implicitly defines a greedy search
strategy that guides the order in which pairs of clusters are analyzed and possibly merged.
$<_c$ is based on the cluster cohesiveness $h(C)$, which corresponds to the average score of the
interactions in the cluster. Formally:
$$h(C) = \frac{1}{|pairs(C)|} \cdot \sum_{(n_i, d_j) \in pairs(C)} s(n_i, d_j) \tag{4}$$
where $pairs(C)$ is the set of all the possible ncRNA-disease pairs that can be constructed from
the set of ncRNAs and diseases in the cluster. Numerically,
$|pairs(C)| = |\{n_i \mid n_i \in C \wedge n_i \in T_n\}| \cdot |\{d_j \mid d_j \in C \wedge d_j \in T_d\}|$.
Accordingly, if $C'$ and $C''$ are two different clusters, the ordering relation $<_c$ is defined
as follows:
$$C' <_c C'' \iff h(C') > h(C'') \tag{5}$$
The approach adopted to build the other hierarchical levels is similar to the merging step
performed to obtain $L_1$. The main difference is that, in this case, we do not obtain bicliques,
but generic multi-type clusters, i.e., the score associated with each interaction does not need
to satisfy the threshold $\beta$. Since the biclique constraint is removed, we need another
stopping criterion for the iterative merging procedure. Coherently with approaches used in
hierarchical co-clustering and following [29], we adopt a user-defined threshold $\alpha$ on the
cohesiveness of the obtained clusters. In particular, two clusters $C'$ and $C''$ can be merged
into a new cluster $C'''$ if $h(C''') > \alpha$, where $h(C)$ is the cluster cohesiveness defined
in Eq. 4. This means that $\alpha$ defines the minimum cluster cohesiveness that must be
satisfied by a cluster obtained after a merging: small values of $\alpha$ increase the number of
merging operations and, therefore, lead to a relatively small number of final clusters containing
a large number of nodes.
For every iteration of the merging procedure, a new
hierarchical level is generated. The iterative process stops
when it is not possible to merge more clusters with a
minimum level of cohesiveness α. The output of such a
process is a hierarchy of overlapping multi-type clusters
$\{L_1, L_2, \ldots, L_k\}$ (see Definition 3).
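The construction described above can be sketched in Python as follows. The sketch only loosely follows the described procedure and is not the authors' implementation: clusters are represented as pairs of frozensets (ncRNAs, diseases), pairs without a score are assumed to contribute 0 to the cohesiveness, ties in the ordering are broken by node names, and all function names, node identifiers and scores are hypothetical.

```python
from itertools import product


def cohesiveness(cluster, s):
    """Eq. 4: average score over all ncRNA-disease pairs (missing pairs count as 0)."""
    ncrnas, diseases = cluster
    total = sum(s.get(pair, 0.0) for pair in product(ncrnas, diseases))
    return total / (len(ncrnas) * len(diseases))


def is_biclique(cluster, s, beta):
    """Biclique constraint: every ncRNA-disease pair has a score >= beta."""
    ncrnas, diseases = cluster
    return all(s.get(pair, 0.0) >= beta for pair in product(ncrnas, diseases))


def merge_pass(clusters, s, condition):
    """One greedy pass: clusters are visited by decreasing cohesiveness (Eq. 5) and each is
    merged with the first following cluster whose union satisfies `condition`."""
    pending = sorted(set(clusters),
                     key=lambda c: (-cohesiveness(c, s), sorted(c[0]), sorted(c[1])))
    result, merged_any = [], False
    while pending:
        current = pending.pop(0)
        for other in pending:
            candidate = (current[0] | other[0], current[1] | other[1])
            if condition(candidate):
                pending.remove(other)
                current, merged_any = candidate, True
                break
        result.append(current)
    return result, merged_any


def build_hierarchy(s, beta, alpha):
    # Filtering and initialization: one cluster per pair with score >= beta.
    level = [(frozenset([n]), frozenset([d])) for (n, d), v in s.items() if v >= beta]
    changed = True
    while changed:   # first level: merge as long as the biclique constraint allows
        level, changed = merge_pass(level, s, lambda c: is_biclique(c, s, beta))
    hierarchy = [level]
    while True:      # further levels: one merging pass per level, threshold on cohesiveness
        level, changed = merge_pass(hierarchy[-1], s, lambda c: cohesiveness(c, s) > alpha)
        if not changed:
            return hierarchy
        hierarchy.append(level)


# Made-up scores s(n_i, d_j), as they could result from the first phase.
scores = {("n1", "d1"): 0.9, ("n1", "d2"): 0.8, ("n2", "d1"): 0.7,
          ("n2", "d2"): 0.3, ("n3", "d3"): 0.9}
hierarchy = build_hierarchy(scores, beta=0.7, alpha=0.5)
for number, clusters in enumerate(hierarchy, start=1):
    print(number, [(sorted(n), sorted(d)) for n, d in clusters])
```

With these toy scores, the first level contains the overlapping bicliques ({n1}, {d1, d2}) and ({n2}, {d1}), which share the disease d1; at the second level they are merged into ({n1, n2}, {d1, d2}), whose cohesiveness (0.675) exceeds $\alpha$ even though the pair (n2, d2) is below $\beta$.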
A pseudocode description of the proposed algorithm for
the construction of the hierarchy of clusters is reported in
Algorithm 1.
Fig. 7 Biclique constraint on two multi-type clusters. (a) An example of a multi-type cluster which satisfies the biclique constraint with $\beta = 0.7$ (i.e., all
the relationships have a score $\geq 0.7$). (b) An example that does not satisfy such a constraint. It is noteworthy that, with $\beta = 0.6$, (b) would also satisfy
the biclique constraint
Algorithm 1 Construction of the hierarchy of overlapping multi-type clusters
Require:
- Initial set of clusters L0, each containing a single ncRNA-disease pair in {(ni, dj) | s(ni, dj) ≥ β};
- An ordering relation <c that reflects the quality of the clusters;
- A threshold α on the quality of the clusters obtained after a merging.
Ensure:
- The hierarchy of overlapping multi-type clusters L1, L2, ..., Lk
k ← 0
repeat
    {Define the merging condition: biclique constraint for the first level; threshold on the cluster cohesiveness h(·) for the
    subsequent levels}
    if k = 0 then
        condition(·) ← isBiclique(·)
    else
        condition(·) ← h(·) > α
    end if
    L ← Lk
    sort L according to the ordering relation <c
    clusters ← []
    mergedClusters ← 0
    {Loop over the sorted list of clusters. This defines a greedy search strategy: clusters with a higher cohesiveness value are
    processed first}
    for i ← 1 to |L| − 1 do
        C' ← L[i]
        {Search for another cluster that can be merged with C' in the ordered list}
        j ← i + 1
        merged ← false
        while j ≤ |L| and not merged do
            C'' ← L[j]
            C''' ← merge(C', C'')
            {If C' and C'' can be merged into C''' according to the merging condition, merge them}
            if condition(C''') then
                add C''' to clusters
                mergedClusters ← mergedClusters + 1
                remove C'' from L
                merged ← true
            end if
            j ← j + 1
        end while
        {If C' cannot be merged with any other cluster, add it to the result as it is}
        if merged = false then
            add C' to clusters
        end if
    end for
    newLevel ← false
    {Check if there was at least one merging}
    if mergedClusters > 0 then
        if k > 0 then
            {If we are not building the first level, define a new hierarchical level}
            k ← k + 1
            newLevel ← true
        end if
        Lk ← clusters
    else
        {End the construction of the first hierarchical level and continue with the others}
        if k = 0 then
            k ← k + 1
            Lk ← clusters
            newLevel ← true
        end if
    end if
until mergedClusters = 0 and newLevel = false
return L1, L2, ..., Lk
Prediction of new ncRNA-disease relationships
In the last phase, we exploit each level of the identified hierarchy of multi-type clusters as a
prediction model. In particular, we compute, for each ncRNA-disease pair, a score representing
its degree of certainty on the basis of the multi-type clusters containing it. Formally, let
$C^w_{ij}$ be a cluster identified in the $w$-th hierarchical level in which the ncRNA $n_i$ and
the disease $d_j$ appear. We compute the degree of certainty of the relationship between $n_i$
and $d_j$ as:
$$\psi^{(w)}(n_i, d_j) = h\left(C^w_{ij}\right), \tag{6}$$
that is, we compute the degree of certainty of the new
interaction as the average degree of certainty of the known
relationships in the cluster. In some cases, the same inter-
action may appear in multiple clusters, since the proposed
algorithm is able to identify overlapping clusters. In this
case, Cw
ij represents the list of multi-type clusters (i.e.,
Cw
ij =[C1,C2,...,Cm]), ordered accordingly to relation
<cdefined in Eq. 5, in which both niand djappear, on
which we apply an aggregation function to obtain a single
degree of certainty. In this work, we propose the adoption
of four different aggregation functions:
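- Maximum: $\psi^{(w)}(n_i, d_j) = \max_{c \in C^{w}_{ij}} h(c)$
- Minimum: $\psi^{(w)}(n_i, d_j) = \min_{c \in C^{w}_{ij}} h(c)$
- Average: $\psi^{(w)}(n_i, d_j) = \frac{1}{|C^{w}_{ij}|} \cdot \sum_{c \in C^{w}_{ij}} h(c)$
- Evidence Combination: $\psi^{(w)}(n_i, d_j) = ec(C_m)$, where:
$$ec(C_m) = \begin{cases} h(C_1) & \text{if } C_m = C_1 \\ ec(C_{m-1}) + [1 - ec(C_{m-1})] \cdot h(C_m) & \text{otherwise} \end{cases} \qquad (7)$$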
It is noteworthy that the Evidence Combination function, already exploited in the literature in the context of expert systems [30], generally rewards the relationships appearing in multiple highly cohesive clusters.
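Unrolling the recursion of the Evidence Combination function makes this behavior explicit. The closed form below is not stated in the text; it follows by induction on $m$ from the recursive definition of $ec$, with base case $1 - ec(C_1) = 1 - h(C_1)$:

$$1 - ec(C_m) = \bigl(1 - ec(C_{m-1})\bigr)\bigl(1 - h(C_m)\bigr) \;\Longrightarrow\; ec(C_m) = 1 - \prod_{k=1}^{m}\bigl(1 - h(C_k)\bigr)$$

Assuming every $h(C_k)$ lies in $[0, 1]$, the result does not depend on the order in which the clusters are combined and can never decrease when a further cluster is added, so each additional cluster containing the pair can only raise the score.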
In the following, we report an example of this prediction step, with the help of Fig. 8. In this example, we have two overlapping multi-type clusters $C_1$ and $C_2$, identified at the $w$-th hierarchical level, that suggest two new potential relationships (dashed lines in the figure), i.e., the pair $(n_2, d_2)$ and the pair $(n_2, d_3)$.
The first relationship only appears in $C_1$, therefore its degree of certainty is computed according to the cohesiveness of $C_1$ (see Eq. 4):
$$\psi^{(w)}(n_2, d_2) = h(C_1) = \frac{1}{2 \cdot 3}(0.7 + 0.8 + 0.9) = 0.4. \qquad (8)$$
On the contrary, the second relationship is suggested by both $C_1$ and $C_2$, i.e., it appears in their overlapped area. Therefore, we aggregate the cohesiveness of $C_1$ and $C_2$ according to one of the functions we described before. In particular, since $h(C_1) = 0.4$ and $h(C_2) = \frac{1}{1 \cdot 2} \cdot 0.6 = 0.3$, we have:
- Maximum: $\psi^{(w)}(n_2, d_3) = \max_{c \in C^{w}_{ij}} h(c) = 0.4$
- Minimum: $\psi^{(w)}(n_2, d_3) = \min_{c \in C^{w}_{ij}} h(c) = 0.3$
- Average: $\psi^{(w)}(n_2, d_3) = \frac{1}{|C^{w}_{ij}|} \cdot \sum_{c \in C^{w}_{ij}} h(c) = \frac{1}{2} \cdot (0.4 + 0.3) = 0.35$
- Evidence Combination: $\psi^{(w)}(n_2, d_3) = h(C_1) + [1 - h(C_1)] \cdot h(C_2) = 0.4 + (1 - 0.4) \cdot 0.3 = 0.58$
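The four aggregation functions can be checked numerically against the worked example. The following is a minimal sketch, assuming the cohesiveness values of the clusters containing the pair are already available as a list ordered by $<_c$; the function names are hypothetical and are not taken from the LP-HCLUS code.

```python
from typing import Sequence


def aggregate_max(h_values: Sequence[float]) -> float:
    """Maximum cohesiveness among the clusters containing the pair."""
    return max(h_values)


def aggregate_min(h_values: Sequence[float]) -> float:
    """Minimum cohesiveness among the clusters containing the pair."""
    return min(h_values)


def aggregate_avg(h_values: Sequence[float]) -> float:
    """Arithmetic mean of the cohesiveness values."""
    return sum(h_values) / len(h_values)


def aggregate_ec(h_values: Sequence[float]) -> float:
    """Evidence Combination: ec_1 = h_1, ec_m = ec_{m-1} + (1 - ec_{m-1}) * h_m."""
    score = h_values[0]
    for h in h_values[1:]:
        score = score + (1.0 - score) * h
    return score


# Values from the example: h(C1) = 0.4 and h(C2) = 0.3 for the pair (n2, d3).
example = [0.4, 0.3]
scores = {
    "max": aggregate_max(example),
    "min": aggregate_min(example),
    "avg": aggregate_avg(example),
    "ec": aggregate_ec(example),
}
```

Up to floating-point rounding, the four entries of scores correspond to the values 0.4, 0.3, 0.35 and 0.58 computed by hand in the example.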
Results
The proposed method was evaluated through several experiments. In this section, we present the main adopted resources, define the experimental setting, introduce the adopted evaluation measures and compare our system with the competitors from a quantitative viewpoint.
Datasets
We performed experiments on two different heterogeneous networks involving ncRNAs and diseases. In the following, we report the details of each dataset, together with UML diagrams that represent their data and structure, i.e., nodes, links and attributes.
Fig. 8 Example of the prediction step. Two clusters identified at a given hierarchical level $w$. Triangles represent ncRNAs, squares represent diseases and the grey shapes are nodes of other types. The clusters suggest two new possible relationships between $n_2$ and $d_2$ and between $n_2$ and $d_3$
Fig. 9 UML diagram of the dataset HMDD v3.0. The attributes in red are the identifiers of the nodes of a given type (i.e., the primary key in a relational database), while attributes in green refer to the identifier of nodes of other types (i.e., foreign keys in a relational database)
HMDD v3 [31]. This dataset stores information about diseases, miRNAs and their known relationships. The network consists of 985 miRNAs, 675 diseases (characterized by 6 attributes) and 20,859 relationships between diseases and miRNAs (characterized by 3 attributes). A diagram of this dataset is depicted in Fig. 9, while the attributes are described in Table 1. The official link of the dataset is: http://www.cuilab.cn/hmdd. In this evaluation, we used two versions of the HMDD v3 dataset: the version released on June 28th, 2018 (v3.0) and the version released on March 27th, 2019 (v3.2). Both versions are available at the following link: http://www.di.uniba.it/~gianvitopio/systems/lphclus/.
Table 1 HMDD v3.0 dataset - Description of the attributes
Type            Feature       Description
Disease         disease       Disease name
Disease         root_name     Category of the disease
Disease         doid          Disease Ontology Identifiers
Disease         icd10cm       ICD-10-CM Code
Disease         mesh          Medical Subject Headings (MeSH) code
Disease         omim          Online Mendelian Inheritance in Man (OMIM) code
Disease         hpo           Human Phenotype Ontology (HPO) code
Disease_miRNA   id            ID of the relationship
Disease_miRNA   category      Category of the relationship
Disease_miRNA   mirna         miRNA involved in the association
Disease_miRNA   disease       Disease involved in the association
Disease_miRNA   pmid          PubMed ID of the publication reporting the association
Disease_miRNA   description   Description of the relationship
miRNA           mirna         miRNA name
Integrated Dataset (ID). This dataset has been built by integrating multiple public datasets in a complex heterogeneous network. The source datasets are:
- lncRNA-disease relationships and lncRNA-gene interactions from [32] (June 2015) (see footnote 1)
- miRNA-lncRNA interactions from [33] (see footnote 2)
- disease-gene relationships from DisGeNET v5 [34] (see footnote 3)
- miRNA-gene and miRNA-disease relationships from miR2Disease [35] (see footnote 4)
From these resources we only kept data related to H. sapiens. The integration led to a network consisting of 1015 ncRNAs (either lncRNAs or miRNAs), 7049 diseases, 70 relationships between lncRNAs and miRNAs, 3830 relationships between diseases and ncRNAs, 90,242 target genes, 26,522 disease-target associations and 1055 ncRNA-target relationships. Most of the considered entities are also characterized by a variable number of attributes, as shown in Fig. 10 and in Table 2. The final dataset is available at the following link: http://www.di.uniba.it/~gianvitopio/systems/lphclus/.
Experimental setting & competitors
LP-HCLUS has been run with different values of its input parameters, namely: α ∈ {0.1, 0.2} (recall that α is the minimum cohesiveness that a cluster must satisfy) and β ∈ {0.3, 0.4} (recall that β represents the minimum score that each ncRNA-disease pair must satisfy to be considered as existing), while depth has been set to 2 in order to consider only nodes that are relatively close to those involved in the meta-paths. We performed a comparative analysis with two competitor systems and a baseline approach that we describe in the following.
1 http://www.cuilab.cn/lncrnadisease
2 Dataset "Data S3" in https://www.sciencedirect.com/science/article/pii/S009286741300439X?via%3Dihub#mmc3
3 http://www.disgenet.org/
4 http://www.mir2disease.org/
Fig. 10 UML diagram of the Integrated Dataset (ID). The attributes in red are the identifiers of the nodes of a given type (i.e., the primary key in a
relational database), while attributes in green refer to the identifier of nodes of other types (i.e., foreign keys in a relational database)
HOCCLUS2 [29] is a biclustering algorithm that, similarly to LP-HCLUS, is able to identify a hierarchy of (possibly overlapping) heterogeneous clusters. HOCCLUS2 was initially developed to study miRNA-mRNA associations, therefore it is inherently limited to two target types. Moreover, besides miRNAs, mRNAs and their associations, it is not able to take into account other entities in the network and, in its original form, cannot predict new relationships. We adapted HOCCLUS2 in order to analyze ncRNA-disease relationships and to be able to predict new associations. In particular, we fed HOCCLUS2 with the dataset produced by the first step of LP-HCLUS (see "Estimation of the strength of the relationship between ncRNAs and diseases" section) and we performed the prediction according to the strategy we proposed for LP-HCLUS (see "Prediction of new ncRNA-disease relationships" section), considering all the aggregation functions proposed in this paper. We emphasize that, since both the initial analysis and the prediction step are performed by LP-HCLUS modules, the comparison with HOCCLUS2 allows us to evaluate the effectiveness of the proposed clustering approach. Since the HOCCLUS2 parameters have a similar meaning with respect to LP-HCLUS parameters, we evaluated its results with the same parameter setting, i.e., α ∈ {0.1, 0.2} and β ∈ {0.3, 0.4}.
ncPred [14] is a system which was specifically designed to predict new associations between ncRNAs and diseases. ncPred analyzes two matrices containing information about ncRNA-gene and gene-disease relationships. Therefore, we transformed the considered heterogeneous networks into matrices and fed ncPred with them. We again emphasize that ncPred is not able to capture information coming from other entities in the network of types different from ncRNAs and diseases, and that it is not able to exploit features associated with nodes and links in the network. We set ncPred parameter values to their default values.
Table 2 ID dataset - Description of the attributes
Type             Feature              Description
Disease          name                 Disease name
Disease          mesh_disease_class   Disease classification by Medical Subject Headings (MeSH)
Disease          umls_semantic_type   Semantic type provided by the Unified Medical Language System
Disease_ncRNA    id                   ID of the relationship
Disease_ncRNA    ncrna                ncRNA involved in the association
Disease_ncRNA    disease              Disease involved in the association
Disease_ncRNA    type                 Type of association
Disease_ncRNA    detection            Method used to detect the relationship
Disease_ncRNA    year                 Year of the detection
Disease_ncRNA    descr                Description of the association
Disease_ncRNA    chromosome           Chromosome
Disease_ncRNA    refseq               RefSeq identifier
Disease_ncRNA    pmid                 PubMed ID of the publication reporting the association
Disease_target   id                   ID of the relationship
Disease_target   target               Target gene involved in the association
Disease_target   disease              Disease involved in the association
Disease_target   score                DisGeNET score for the Gene-Disease association
Disease_target   source               Original source reporting the Gene-Disease association
Disease_target   num_pmid             Total number of publications reporting the association
Disease_target   num_snp              Total number of SNPs associated with the association
Disease_target   cui                  Concept Unique Identifier (CUI)
lncRNA_miRNA     id                   ID of the relationship
lncRNA_miRNA     mirna                miRNA involved in the association
lncRNA_miRNA     lncrna               lncRNA involved in the association
ncRNA            name                 ncRNA name
ncRNA            biotype              Type of ncRNA. The value can be "lncrna" or "mirna"
ncRNA_target     id                   ID of the relationship
ncRNA_target     ncrna                ncRNA involved in the association
ncRNA_target     target               Target genes involved in the association
ncRNA_target     interaction          Elements involved in the associations (e.g. RNA-RNA, RNA-protein)
ncRNA_target     inter_type           Type of interaction (e.g. Regulatory, Binding, etc.)
ncRNA_target     description          Description of the interaction
ncRNA_target     refseq               RefSeq identifier
ncRNA_target     pmid                 PubMed ID of the publication reporting the association
ncRNA_target     pubdate              Date of first publication
ncRNA_target     reference            Textual description of the association
Target           name                 Name of target gene
LP-HCLUS-NoLP corresponds to our system LP-HCLUS without the clustering and the link prediction steps. In particular, we consider the score obtained in the first phase of LP-HCLUS (see "Estimation of the strength of the relationship between ncRNAs and diseases" section) as the final score associated with each interaction. This approach allows us to evaluate the contribution provided by our link prediction approach based on multi-type clustering.
The evaluation was performed through a 10-fold cross-validation. It is noteworthy that the computation of classical measures, such as Precision and Recall, would require the presence of negative examples or some assumptions made on unknown examples. In our case, the datasets contain only positive examples, i.e., we have a set of validated relationships but we do not have negative examples of relationships (relationships whose non-existence has been proven).
Therefore, following the approach adopted in [13], we evaluated the results in terms of True Positive Rate@k (TPR@k), where:
- an association is considered a True Positive (TP) if it is validated in the literature and it is among the top k relationships predicted by the system;
- an association is considered a False Negative (FN) if it is validated in the literature, but it is not among the top k relationships predicted by the system.
Since the optimal value of k cannot be known in advance, we plot the obtained TPR@k by varying the value of k and compute the Area Under the TPR@k curve (AUTPR@k). For a thorough analysis on the most promising (i.e., top-ranked) interactions, we report all the results by varying the value of k within the interval [1, 5000], obtained with the same configuration of the parameters α and β for HOCCLUS2 and LP-HCLUS. Moreover, we also report the results in terms of ROC and Precision-Recall curves, as well as the areas under the respective curves (AUROC and AUPR), by considering the unknown relationships as negative examples. We remark that AUROC and AUPR results can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown relationships.
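The TPR@k and AUTPR@k measures can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the paper: it assumes the predictions are given as a list of (ncRNA, disease) pairs sorted by decreasing score, takes TPR@k as TP / (TP + FN), and approximates the area with the trapezoidal rule divided by (k_max − 1) so that it lies in [0, 1]. The normalization and the function names are assumptions, since the text does not specify them.

```python
from typing import Hashable, List, Sequence, Set


def tpr_at_k(ranked: Sequence[Hashable], validated: Set[Hashable], k: int) -> float:
    """TPR@k = TP / (TP + FN): fraction of validated pairs found in the top k."""
    if not validated:
        raise ValueError("the set of validated pairs must not be empty")
    true_positives = sum(1 for pair in ranked[:k] if pair in validated)
    return true_positives / len(validated)


def autpr_at_k(ranked: Sequence[Hashable], validated: Set[Hashable], k_max: int) -> float:
    """Trapezoidal area under the TPR@k curve for k = 1..k_max.

    The area is divided by (k_max - 1); this normalization is an assumption.
    """
    if k_max < 2:
        raise ValueError("k_max must be at least 2")
    curve: List[float] = [tpr_at_k(ranked, validated, k) for k in range(1, k_max + 1)]
    area = sum((left + right) / 2.0 for left, right in zip(curve, curve[1:]))
    return area / (k_max - 1)


# Toy ranking with hypothetical identifiers, sorted by decreasing predicted score.
toy_ranking = [("n1", "d1"), ("n2", "d3"), ("n1", "d2"), ("n3", "d1")]
toy_validated = {("n1", "d1"), ("n3", "d1")}
toy_curve = [tpr_at_k(toy_ranking, toy_validated, k) for k in range(1, 5)]
toy_area = autpr_at_k(toy_ranking, toy_validated, 4)
```

Because only validated (positive) pairs enter the computation, no assumption about unknown pairs is needed, which is what distinguishes these measures from AUROC and AUPR in this setting.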
In the paper we report the results obtained with the most promising configuration according to some preliminary experiments. The complete results, including those obtained in such preliminary experiments, can be downloaded at: http://www.di.uniba.it/~gianvitopio/systems/lphclus/.
Results - HMDD v3 dataset
In Figures 11, 12 and 13 we show the results obtained on the HMDD dataset in terms of TPR@k, ROC and Precision-Recall curves, while in Table 3 we report the AUTPR@k, AUROC and AUPR values. From Fig. 11, we can observe that the proposed method LP-HCLUS, with the combination strategy based on the maximum, is in general able to obtain the best performances. The competitor system ncPred obtains good results, but it outperforms LP-HCLUS_MAX only for high values of k, and only when focusing on the first level of the hierarchy. However, we stress the fact that it is highly preferable to achieve better performances on the left side of the curve, i.e., with low values of k, since it is the portion of the ranking on which researchers will actually focus their analysis. In such a portion of the curve, LP-HCLUS_MAX dominates over all the competitors for all the hierarchical levels. It is noteworthy that some variants of LP-HCLUS (i.e., MAX and AVG) obtain their best performances at the second level of the hierarchy. This emphasizes that the extraction of a hierarchy of clusters could provide some improvements with respect to a flat clustering. This is not so evident for HOCCLUS2 even if, analogously to LP-HCLUS, it is able to extract a hierarchy. The results in terms of AUTPR@k, AUROC and AUPR (see Table 3) confirm the superiority of LP-HCLUS_MAX over the competitors.
Fig. 11 TPR@k results for the dataset HMDD v3.0, obtained with the best configuration (α = 0.2, β = 0.4) at different levels of the hierarchy
Fig. 12 ROC curves for the dataset HMDD v3.0, obtained with the best configuration (α = 0.2, β = 0.4) at different levels of the hierarchy. These curves can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown relationships
Results - ID dataset
In Figures 14, 15 and 16 we show the results obtained on the Integrated Dataset (ID) in terms of TPR@k, ROC and Precision-Recall curves, while in Table 4 we report the AUTPR@k, AUROC and AUPR values. It is noteworthy that this dataset is much more complex than HMDD, because it consists of several types of nodes, each associated with its attributes. In this case, the system LP-HCLUS can fully exploit information brought by other node types to predict new associations between ncRNAs and diseases.
As can be observed from the figures, thanks to such an ability, LP-HCLUS clearly outperforms all the competitors. It is noteworthy that the simpler version of LP-HCLUS, i.e., LP-HCLUS-NoLP, is also able to outperform the competitors, since it exploits the exploration of the network based on meta-paths. However, when we exploit the full version of LP-HCLUS, which bases its prediction on the clustering results, the improvement over the existing approaches becomes much more evident. These conclusions are also confirmed by the AUTPR@k, AUROC and AUPR values shown in Table 4.
Statistical comparisons
By observing the results reported in Figs. 11, 12, 13, 14, 15 and 16, it is clear that the adoption of the Maximum (MAX) as LP-HCLUS aggregation function leads to the best results. This behavior can be motivated by the fact that such an approach rewards the associations which show at least one piece of strong evidence from the clusters. Although such a behavior should be observed also with the Evidence Combination (EC) function, it is noteworthy that the latter also rewards associations which are confirmed by several clusters, even if they show a weak confidence. In this way, EC is prone to false positives introduced by the combined contribution of several weak relationships.
Fig. 13 Precision-Recall curves for the dataset HMDD v3.0, obtained with the best configuration (α=0.2, β=0.4) at different levels of the
hierarchy. These curves can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the
assumption made on unknown relationships
In order to confirm the superiority of LP-HCLUS_MAX from a statistical viewpoint, we performed a Friedman test with Nemenyi post-hoc test with a significance level of 0.05. This test is applied to the Area Under the TPR@k curve, in order to provide a k-independent evaluation of the results. By observing the results in Fig. 17, it is clear that LP-HCLUS_MAX is the best ranked method among the considered approaches. Since, at a glance, the difference between LP-HCLUS_MAX and ncPred is clear, but does not appear to be statistically significant with a test that evaluates differences across multiple systems, we performed three pairwise Wilcoxon tests (one for each hierarchical level), with the Bonferroni correction. In this way, it is possible to directly compare LP-HCLUS_MAX and ncPred. Looking at the average Area Under the TPR@k and p-values reported in Table 5, it is clear that the difference between LP-HCLUS_MAX and its direct competitor ncPred is large (especially for the ID dataset) and, more importantly, statistically significant for all the hierarchical levels, at a significance level of 0.01.
Discussion
In this section we discuss the results of the comparison of LP-HCLUS with its competitors from a qualitative viewpoint, in order to assess the validity of the proposed system as a useful tool for biologists.
Discussion on the HMDD v3 dataset
We compared the results obtained by LP-HCLUS with the validated interactions reported in the updated version of HMDD (i.e., v3.2 released on March 27th, 2019). A graphical overview of the results of this analysis is provided in Fig. 18, while the detailed results are provided in Additional file 3, where the relationships introduced in the new release of HMDD are highlighted in green. The general conclusion we can draw from Fig. 18 is that several relationships predicted by LP-HCLUS have been introduced in the new HMDD release v3.2.
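The comparison described above amounts to a set operation: keep the predicted pairs that are absent from the earlier release and present in the later one. The following is a minimal sketch, assuming each release is available as a set of (miRNA, disease) pairs and the predictions as a dictionary of scores. The function name and the placeholder identifiers are hypothetical and do not come from the LP-HCLUS distribution or from HMDD.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (miRNA identifier, disease identifier)


def confirmed_predictions(
    predictions: Dict[Pair, float],
    old_release: Set[Pair],
    new_release: Set[Pair],
) -> List[Tuple[Pair, float]]:
    """Return predicted pairs introduced only in the newer release, best score first."""
    newly_added = new_release - old_release  # associations unknown at training time
    hits = [(pair, score) for pair, score in predictions.items() if pair in newly_added]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Placeholder data only: identifiers and scores are invented for illustration.
old_pairs = {("mirna_a", "disease_x")}
new_pairs = {("mirna_a", "disease_x"), ("mirna_b", "disease_x"), ("mirna_c", "disease_y")}
predicted = {
    ("mirna_b", "disease_x"): 0.61,
    ("mirna_c", "disease_y"): 0.82,
    ("mirna_d", "disease_y"): 0.90,  # predicted, but not in the newer release
}
confirmed = confirmed_predictions(predicted, old_pairs, new_pairs)
```

A prediction that is missing from the newer release, such as the last placeholder pair, is not thereby refuted; it simply remains unvalidated.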
In particular, we found 3055 LP-HCLUS predictions confirmed by the new release of HMDD at the hierarchy level 1 (score range 0.97-0.44), 4119 at level 2 (score range 0.93-0.37) and 4797 at level 3 (score range 0.79-0.37). Overall, these results underline the behavior of LP-HCLUS at the different levels of the hierarchy. As expected, the number of predictions grows progressively from the lowest to the highest levels of the hierarchy, due to the less stringent constraints imposed by the algorithm, which allow LP-HCLUS to identify larger clusters at higher levels of the hierarchy. Larger clusters, even if possibly less reliable, in some cases can lead to the identification of less obvious functional associations.
Table 3 AUTPR@k, AUROC and AUPR values for the dataset HMDD, obtained with the best configuration (α = 0.2, β = 0.4) at different levels of the hierarchy
Method          Level     AUTPR@k     AUPR        AUROC
LP-HCLUS-NoLP   -         0.000000    0.000000    0.496169
ncPred          -         0.087370    0.007540    0.584268
LP-HCLUS AVG    Level 1   0.042658    0.005437    0.523872
LP-HCLUS AVG    Level 2   0.056392    0.003140    0.548665
LP-HCLUS AVG    Level 3   0.020129    0.000469    0.515470
LP-HCLUS MAX    Level 1   0.088130    0.010865    0.568056
LP-HCLUS MAX    Level 2   0.109292*   0.013420*   0.585560*
LP-HCLUS MAX    Level 3   0.104244    0.011983    0.580824
LP-HCLUS MIN    Level 1   0.031888    0.001935    0.519936
LP-HCLUS MIN    Level 2   0.032765    0.001232    0.524077
LP-HCLUS MIN    Level 3   0.011012    0.000170    0.505846
LP-HCLUS EC     Level 1   0.005626    0.000035    0.501872
LP-HCLUS EC     Level 2   0.004851    0.000030    0.500943
LP-HCLUS EC     Level 3   0.006493    0.000050    0.502762
HOCCLUS2 AVG    Level 1   0.018339    0.000839    0.511722
HOCCLUS2 AVG    Level 2   0.016484    0.000670    0.509663
HOCCLUS2 AVG    Level 3   0.012082    0.000287    0.507020
HOCCLUS2 MAX    Level 1   0.018332    0.000829    0.510398
HOCCLUS2 MAX    Level 2   0.016065    0.000659    0.508897
HOCCLUS2 MAX    Level 3   0.011150    0.000274    0.506331
HOCCLUS2 MIN    Level 1   0.015922    0.000753    0.509336
HOCCLUS2 MIN    Level 2   0.016401    0.000668    0.509542
HOCCLUS2 MIN    Level 3   0.011647    0.000270    0.506575
HOCCLUS2 EC     Level 1   0.013922    0.000536    0.507314
HOCCLUS2 EC     Level 2   0.013717    0.000352    0.507112
HOCCLUS2 EC     Level 3   0.013065    0.000253    0.507751
The results in terms of AUPR and AUROC values can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown associations, which are considered as negative examples. The best result in each column is marked with an asterisk.
Comparing the diseases at different levels of the hierarchy confirmed in the updated release of HMDD, we found associations involving 276 diseases at level 1, 360 at level 2 and 395 at level 3. Among the diseases involved in new associations predicted at level 3, but not at levels 1 and 2, there is the acquired immunodeficiency syndrome, a chronic, potentially life-threatening condition caused by the human immunodeficiency virus (HIV). The associations predicted by LP-HCLUS for this disease, confirmed in HMDD v3.2, involve hsa-mir-150 (with score 0.68) and hsa-mir-223 (with score 0.63). Such associations have been reported in [36]. The authors show the results of a study where the regulation of cyclin T1 and HIV-1 replication has been evaluated in resting and activated CD4+ T lymphocytes with respect to the expression of endogenous miRNAs. In this study, the authors demonstrated that miR-27b, miR-29b, miR-150, and miR-223 are significantly downregulated upon CD4+ T cell activation, and identified miR-27b as a novel regulator of cyclin T1 protein levels and HIV-1 replication, while miR-29b, miR-223, and miR-150 may regulate cyclin T1 indirectly.
Other validated miRNAs associated with the acquired immunodeficiency syndrome in HMDD v3.2 are hsa-mir-27b, -29b, -29a, -29b-1 and hsa-mir-198. As shown in Fig. 19, these miRNAs, although not directly associated by LP-HCLUS with the acquired immunodeficiency syndrome, have been associated with disease terms strictly related to the immune system, with a score and specificity depending on the hierarchy level. In particular, at level 1, they have been associated with the immune system disease term (DOID_2914, a subclass of disease of anatomical entity) with a score ranging from 0.48 for hsa-mir-29b to a maximum value of 0.67 for hsa-mir-29a. At level 2 of the hierarchy, in addition to the classification in the immune system disease, they have also been associated with the human immunodeficiency virus infection (DOID_526) that is a subclass of viral infectious disease (DOID_934) and the direct parent of the acquired immunodeficiency syndrome (DOID_635). At level 3, all the miRNAs have also been associated with the viral infectious disease term.
In addition to hsa-mir-155 and hsa-mir-223, LP-HCLUS returned many other associations involving acquired immunodeficiency syndrome with a high score. In particular, 59 different miRNAs have been associated at level 2 (score between 0.74 and 0.63), and 191 at level 3 (score between 0.68 and 0.63). Considering such high scores, we searched the literature for some of the associated miRNAs. In particular, we searched for hsa-mir-30a, which was among the miRNAs with the highest association score (0.74 at the 2nd level), and found a work where it has been significantly associated with six other miRNAs (i.e., miR-29a, miR-223, miR-27a, miR-19b, miR-151-3p, miR-28-5p, miR-766) as a biomarker for monitoring the immune status of patients affected by acquired immunodeficiency syndrome [38].
Together with hsa-mir-30a, other miRNAs belonging to the same family (i.e., hsa-mir-30b, -30c and -30e) have also been associated by LP-HCLUS with the same disease. In [39], four miRNA-like sequences (i.e., hsa-mir-30d, hsa-mir-30e, hsa-mir-374a and hsa-mir-424) were identified within the env and the gag-pol encoding regions of several HIV-1 strains. The mapping of their sequences within the HIV-1 genomes localized them to the functionally significant variable regions, designated V1, V2, V4 and V5, of the env glycoprotein gp120. This result was important because the regions V1 to V5 of HIV-1 envelopes contain specific and well-characterized domains that are critical for immune responses, virus neutralization and disease progression. The authors concluded that the newly discovered miRNA-like sequences in the HIV-1 genomes might have evolved to self-regulate the survival of the virus in the host by evading the innate immune responses and therefore influencing persistence, replication or pathogenicity of the virus.
Fig. 14 TPR@k results for the dataset ID, obtained with the best configuration (α = 0.1, β = 0.4) at different levels of the hierarchy
Fig. 15 ROC curves for the dataset ID, obtained with the best configuration (α = 0.1, β = 0.4) at different levels of the hierarchy. These curves can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown relationships
Fig. 16 Precision-Recall curves for the dataset ID, obtained with the best configuration (α = 0.1, β = 0.4) at different levels of the hierarchy. These curves can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown relationships
Other examples of reliable associations of ncRNAs with the acquired immunodeficiency syndrome identified by LP-HCLUS, and not present in HMDD 3.2, are those with hsa-mir-125b, hsa-mir-28 and hsa-mir-382. These associations are confirmed in [40], where the authors provided evidence that these miRNAs can contribute, alongside hsa-mir-155 and hsa-mir-223, to HIV latency. It is noteworthy that these associations appear only at level 3 of the hierarchy but not at levels 2 or 1.
Altogether, these results highlight two interesting features of LP-HCLUS: the ability to discover meaningful functional associations, and the way the hierarchical clustering can help in the identification of hidden information. In principle, none of the hierarchy levels should be ignored. As shown for the case of the acquired immunodeficiency syndrome, the first hierarchical level, although in principle more reliable (since based on more stringent constraints), in some cases is not able to capture less obvious existing associations. On the other hand, results obtained from higher levels of the hierarchy are much more inclusive and can provide pieces of information that, in the lowest levels, are hidden, and that can be pivotal to the specific aims of a research investigation.
Finally, we compared the ranking values assigned by LP-HCLUS, ncPred and HOCCLUS2 to the same associations, that is, those confirmed in the HMDD v3.2 release (see Additional file 5). For this purpose, we computed the AUTPR@k by considering the new interactions introduced in HMDD v3.2 as ground truth. By observing the results reported in Table 6, we can confirm that LP-HCLUS based on the MAX measure outperforms all the competitors in identifying new interactions from the previous version of the dataset (HMDD v3.0) that have been subsequently validated and introduced in the latest version (HMDD v3.2).
Discussion on the integrated dataset
As concerns the ID dataset, we performed a qualitative analysis of the top-ranked relationships predicted by LP-HCLUS, i.e., on those with a score equal to 1.0. For this purpose, we exploited MNDR v2.0 [41], which is a comprehensive resource including more than 260,000 experimental and predicted ncRNA-disease associations for mammalian species, including lncRNA, miRNA, piRNA, snoRNA and more than 1,400 diseases. Data in MNDR come from manual literature curation and other resources, and include a confidence score for each ncRNA-disease association. Experimental evidence is manually classified as strong or weak, while the confidence score is calculated according to the evidence type (s: strong experimental evidence, w: weak experimental evidence, p: prediction) and the number of pieces of evidence.
Table 4 AUTPR@k, AUROC and AUPR values for the dataset ID, obtained with the best configuration (α = 0.1, β = 0.4) at different levels of the hierarchy
Method          Level     AUTPR@k     AUPR        AUROC
LP-HCLUS-NoLP   -         0.024087    0.000150    0.525501
ncPred          -         0.015365    0.000087    0.521975
LP-HCLUS AVG    Level 1   0.024335    0.000198    0.532059
LP-HCLUS AVG    Level 2   0.013660    0.000080    0.519290
LP-HCLUS AVG    Level 3   0.005883    0.000024    0.510396
LP-HCLUS MAX    Level 1   0.070639*   0.001218*   0.567991*
LP-HCLUS MAX    Level 2   0.054821    0.000780    0.554388
LP-HCLUS MAX    Level 3   0.055141    0.000756    0.551873
LP-HCLUS MIN    Level 1   0.005451    0.000010    0.504690
LP-HCLUS MIN    Level 2   0.000474    0.000000    0.500490
LP-HCLUS MIN    Level 3   0.000305    0.000000    0.500154
LP-HCLUS EC     Level 1   0.004609    0.000010    0.506366
LP-HCLUS EC     Level 2   0.005605    0.000013    0.505695
LP-HCLUS EC     Level 3   0.005353    0.000010    0.505862
HOCCLUS2 AVG    Level 1   0.002246    0.000087    0.502169
HOCCLUS2 AVG    Level 2   0.002553    0.000006    0.502843
HOCCLUS2 AVG    Level 3   0.000885    0.000001    0.501328
HOCCLUS2 MAX    Level 1   0.002238    0.000087    0.502169
HOCCLUS2 MAX    Level 2   0.002659    0.000005    0.502676
HOCCLUS2 MAX    Level 3   0.000973    0.000001    0.500826
HOCCLUS2 MIN    Level 1   0.002247    0.000087    0.502169
HOCCLUS2 MIN    Level 2   0.002553    0.000006    0.502843
HOCCLUS2 MIN    Level 3   0.000885    0.000001    0.501328
HOCCLUS2 EC     Level 1   0.002763    0.000015    0.502337
HOCCLUS2 EC     Level 2   0.002320    0.000008    0.501835
HOCCLUS2 EC     Level 3   0.003533    0.000007    0.503683
The results in terms of AUPR and AUROC values can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown associations, which are considered as negative examples. The best result in each column is marked with an asterisk.
The top-ranked relationships returned by LP-HCLUS involve 1,067 different diseases and 814 different ncRNAs, consisting of 488 miRNAs and 326 lncRNAs, among which there are several antisense RNAs and miRNA hosting genes. Table 7 shows some examples of top-ranked interactions predicted by LP-HCLUS and involving 4 ncRNAs, i.e., h19, wrap53, pvt1 and hsa-miR-106b.
h19 is a long intergenic ncRNA (lincRNA) and a developmentally-regulated maternally-imprinted gene that is expressed only from the inherited chromosome 11. A putative function assigned to it is a tumor suppressor activity. GeneCards (GCID:GC11M001995) reports its association with the Wilms Tumor 2 (WT2) and Beckwith-Wiedemann Syndrome, both caused by mutation or deletion of imprinted genes within the chromosome 11p15.5 region. Other sources, such as GenBank [42] and MNDR [41, 43], report the association of h19 with many other human diseases, the majority being different types of tumors.
Fig. 17 Result of the Friedman test with Nemenyi post-hoc test, with a significance level of 0.05, performed on the area under the TPR@k curve
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Barracchia et al. BMC Bioinformatics (2020) 21:70 Page 20 of 24
Table 5 Average Area Under the TPR@k curve (AUTPR@k) and p-values
obtained by the Wilcoxon signed-rank test with the Bonferroni
correction
Method            AUTPR@k              AUTPR@k       p-value
                  (HMDD v3.0 dataset)  (ID dataset)  (LP-HCLUS vs ncPred)
ncPred            0.087370             0.015365      -
LP-HCLUS_MAX_L1   0.088130             0.070639      0.005833 (+)
LP-HCLUS_MAX_L2   0.109292             0.054821      0.000266 (+)
LP-HCLUS_MAX_L3   0.104244             0.055141      0.000266 (+)
The best result for each dataset is emphasized in boldface. (+) indicates that
LP-HCLUS significantly outperforms ncPred (p-value < 0.01)
Searching for h19-disease associations in MNDR, we
obtained 101 results with a confidence score ranging
from 0.9820 to 0.1097. The same search performed on
the output produced by LP-HCLUS (0.1 - 0.4, first level
of the hierarchy) returned 993 associations with a score
ranging from 1.0 to 0.4. A comparative analysis of the
results shows a perfect match of 33 predictions (see
Table 8), many of which also have a similar confidence
score, despite the different approaches adopted to calculate
them.
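The comparison with MNDR described above amounts to matching (ncRNA, disease) pairs between two lists of scored associations. A minimal Python sketch follows, assuming that both sources have been loaded into in-memory dictionaries; the function names, the record layout and the parsing of MNDR score strings such as "s: 0.9820, p: 0.1097" are illustrative assumptions and are not part of LP-HCLUS or MNDR. The sample rows are copied from Tables 7 and 8.

```python
def parse_mndr_scores(text):
    """Parse an MNDR score string such as 's: 0.9820, p: 0.1097'.

    Returns a dict mapping the evidence type ('s', 'w' or 'p') to its score.
    """
    scores = {}
    for part in text.split(","):
        evidence, value = part.split(":")
        scores[evidence.strip()] = float(value)
    return scores


def match_associations(predicted, mndr):
    """Return the associations present in both sources.

    predicted: dict mapping (ncRNA, disease) -> LP-HCLUS score
    mndr:      dict mapping (ncRNA, disease) -> MNDR score string
    """
    matches = {}
    for key in sorted(predicted.keys() & mndr.keys()):
        matches[key] = (predicted[key], parse_mndr_scores(mndr[key]))
    return matches


# Sample rows copied from Tables 7 and 8 (hypothetical in-memory layout).
predicted = {
    ("h19", "osteosarcoma"): 0.7732385,
    ("h19", "thyroid cancer"): 0.7732385,
    ("h19", "liver neoplasms"): 1.0,
    ("h19", "bone diseases, developmental"): 1.0,  # not reported in MNDR
}
mndr = {
    ("h19", "osteosarcoma"): "s: 0.9820",
    ("h19", "thyroid cancer"): "s: 0.8808, p: 0.1097",
    ("h19", "liver neoplasms"): "w: 0.4752",
}

matched = match_associations(predicted, mndr)
# Associations predicted by LP-HCLUS that have no counterpart in MNDR.
novel = sorted(set(predicted) - set(mndr))

for (ncrna, disease), (score, evidence) in matched.items():
    print(ncrna, disease, score, evidence)
print("Predicted only:", novel)
```

In practice, an exact match on names would probably require a prior normalization of disease terms (e.g., through ontology identifiers), which this sketch does not attempt.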
Among the top-ranked associations predicted by LP-HCLUS
involving h19, the association with “bone diseases,
developmental” is not present in the results
obtained from the MNDR database (see Table 7). Bone diseases
can have different origins and can also be related to
hyperfunction or hypofunction of the endocrine glands,
such as the pituitary gland, thyroid gland, parathyroid glands,
adrenal glands, pancreas, gonads, and pineal gland. In
addition to the relationship with osteosarcoma
(LP-HCLUS score 0.7732385; MNDR confidence score
s: 0.9820), the comparative analysis with the data in
MNDR shows associations between h19 and other
diseases that involve endocrine glands, such as: ovarian
neoplasms (LP-HCLUS score 0.7052352; MNDR confidence
score p: 0.1097, s: 0.8589); pancreatic cancer
(LP-HCLUS score 0.8150848; MNDR confidence score s:
0.8808); pancreatic ductal adenocarcinoma (LP-HCLUS
score 0.6575157; MNDR confidence score s: 0.9526) and
thyroid cancer (LP-HCLUS score 0.7732385; MNDR confidence
score s: 0.8808, p: 0.1097) (see Table 8). This
indicates that h19 can have a relationship with endocrine
gland functions and, therefore, can be related to bone
diseases, as predicted by LP-HCLUS.
Conclusions
In this paper, we have tackled the problem of predicting
possibly unknown ncRNA-disease relationships.
The approach we proposed, LP-HCLUS, is able to take
advantage of the possibly heterogeneous nature of the
attributed biological network analyzed. In this way, it is
possible to identify ncRNA-disease relationships by taking
into account the properties of additional biological
entities (e.g., microRNAs, lncRNAs, target genes) they are
connected to.
Fig. 18 A graphical representation of the top-100 relationships predicted by LP-HCLUS from HMDD v3.0. The dark green lines represent the position of the relationships that have been subsequently validated and introduced in HMDD v3.2
Fig. 19 Ontology classification of acquired immunodeficiency syndrome according to EMBL-EBI Ontology Lookup Service [37]
Methodologically, LP-HCLUS is based on the identification
of paths in the heterogeneous attributed biological
network, which potentially confirm the connection
between an ncRNA and a disease, and a clustering phase,
which is preparatory to a link prediction phase. In this
way, it is possible to capture network autocorrelation
phenomena and exploit information implicitly conveyed
by the network structure.
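To give an intuition of what identifying paths between an ncRNA and a disease in a heterogeneous network may look like, the following Python sketch enumerates the simple paths of bounded length between two nodes of a toy network. It is only an illustration under strong assumptions: the node names are placeholders, relationships are treated as undirected and unweighted, node attributes are ignored, and the function find_paths is hypothetical. It does not reproduce the path identification, clustering or link prediction steps actually implemented in LP-HCLUS.

```python
from collections import deque


def find_paths(edges, source, target, max_len=3):
    """Enumerate simple paths from source to target with at most max_len edges.

    edges: iterable of (node_u, node_v) pairs, treated as undirected.
    Each node is a (type, name) tuple, e.g. ("ncRNA", "ncRNA_A").
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue  # path already uses max_len edges, do not extend it
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                queue.append(path + [neighbor])
    return paths


# Toy heterogeneous network with placeholder (hypothetical) entities.
edges = [
    (("ncRNA", "ncRNA_A"), ("gene", "gene_1")),
    (("gene", "gene_1"), ("disease", "disease_X")),
    (("ncRNA", "ncRNA_A"), ("ncRNA", "ncRNA_B")),
    (("ncRNA", "ncRNA_B"), ("disease", "disease_X")),
    (("ncRNA", "ncRNA_B"), ("disease", "disease_Y")),
]

paths = find_paths(edges, ("ncRNA", "ncRNA_A"), ("disease", "disease_X"))
for path in paths:
    # Print the sequence of node types followed by the node names.
    print([node_type for node_type, _ in path], [name for _, name in path])
```

In this toy network, ncRNA_A reaches disease_X both through a shared gene and through another ncRNA; each such path can be read as a piece of indirect evidence for a candidate association.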
Table 6 AUTPR@k computed using the new associations
introduced in the new version of HMDD v3.2 as ground truth
Method          Level    AUC TPR@k
LP-HCLUS-NoLP   -        0.00000
ncPred          -        0.01448
LP-HCLUS AVG    Level 1  0.01754
LP-HCLUS AVG    Level 2  0.02663
LP-HCLUS AVG    Level 3  0.01453
LP-HCLUS MAX    Level 1  0.03247
LP-HCLUS MAX    Level 2  0.03423
LP-HCLUS MAX    Level 3  0.03111
LP-HCLUS MIN    Level 1  0.01846
LP-HCLUS MIN    Level 2  0.02197
LP-HCLUS MIN    Level 3  0.00962
LP-HCLUS EC     Level 1  0.00695
LP-HCLUS EC     Level 2  0.00527
LP-HCLUS EC     Level 3  0.00548
HOCCLUS2 AVG    Level 1  0.01750
HOCCLUS2 AVG    Level 2  0.00627
HOCCLUS2 AVG    Level 3  0.00962
HOCCLUS2 MAX    Level 1  0.01774
HOCCLUS2 MAX    Level 2  0.00763
HOCCLUS2 MAX    Level 3  0.00991
HOCCLUS2 MIN    Level 1  0.01657
HOCCLUS2 MIN    Level 2  0.00627
HOCCLUS2 MIN    Level 3  0.00962
HOCCLUS2 EC     Level 1  0.01689
HOCCLUS2 EC     Level 2  0.01269
HOCCLUS2 EC     Level 3  0.01252
The best result is highlighted in boldface.
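Table 6 reports the area under the TPR@k curve (AUTPR@k). As a rough illustration of how such a measure can be computed from a ranked list of predictions and a set of ground-truth associations, consider the following Python sketch. It assumes that TPR@k is the fraction of ground-truth associations found among the top-k predictions and that the area is normalized by averaging TPR@k over k = 1, ..., K; the exact normalization adopted in the experiments may differ. The function names and the toy association identifiers are hypothetical.

```python
def tpr_at_k(ranked, positives, k):
    """Fraction of ground-truth associations found in the top-k predictions."""
    if not positives:
        raise ValueError("the ground truth must contain at least one association")
    top_k = set(ranked[:k])
    return len(top_k & positives) / len(positives)


def area_under_tpr_at_k(ranked, positives, max_k=None):
    """Mean of TPR@k for k = 1..max_k (assumed normalization, see lead-in)."""
    if max_k is None:
        max_k = len(ranked)
    values = [tpr_at_k(ranked, positives, k) for k in range(1, max_k + 1)]
    return sum(values) / max_k


# Toy ranking of predicted associations, best first (placeholder identifiers).
ranked = [
    ("ncRNA_A", "disease_X"),
    ("ncRNA_B", "disease_X"),
    ("ncRNA_A", "disease_Y"),
    ("ncRNA_C", "disease_Z"),
    ("ncRNA_B", "disease_Y"),
]
# Associations assumed to be validated later (the ground truth).
positives = {("ncRNA_B", "disease_X"), ("ncRNA_B", "disease_Y")}

print([tpr_at_k(ranked, positives, k) for k in range(1, len(ranked) + 1)])
print(area_under_tpr_at_k(ranked, positives))
```

With this definition, a method that places the validated associations near the top of its ranking obtains a larger area than one that retrieves them only for large values of k.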
The results confirm the initial intuitions and show the
competitive performance of LP-HCLUS in terms of accuracy
of the predictions, also when compared, through
a statistical test (at a significance level of 0.01), with
state-of-the-art competitor systems. These results are also
supported by a comparison of LP-HCLUS predictions
with data reported in MNDR and by a qualitative analysis,
which revealed that several ncRNA-disease associations
predicted by LP-HCLUS have been subsequently experimentally
validated and introduced in a more recent release
(v3.2) of HMDD.
Table 7 Examples of top-ranked ncRNA-disease associations
predicted by LP-HCLUS with a score equal to 1.0
ncRNA         Disease
h19           bone diseases, developmental
h19           carcinoma, hepatocellular
h19           colorectal neoplasms
h19           liver neoplasms
h19           parkinson disease, secondary
hsa-miR-106b  aging, premature
hsa-miR-106b  burkitt s lymphomas
hsa-miR-106b  disease progression
pvt1          aging, premature
pvt1          disease progression
wrap53        adrenal gland neoplasms
wrap53        adrenocortical carcinoma
wrap53        emphysema
Table 8 Result of matching between the associations predicted
by LP-HCLUS and those present in MNDR
ncRNA  Disease                             LP-HCLUS   MNDR
h19    adenocarcinoma                      0.7455674  s: 0.7311
h19    adrenocortical carcinoma            0.8150848  s: 0.7311
h19    aortic valve disease                0.6492379  s: 0.7311
h19    astrocytoma                         0.7455674  s: 0.7311
h19    breast adenocarcinoma               0.7005121  s: 0.7311
h19    carcinoma, non-small-cell lung      0.7052352  s: 0.9820, p: 0.1097
h19    chronic myeloid leukemia            0.7005121  s: 0.8808
h19    colon carcinoma                     0.7005121  s: 0.8589
h19    colorectal cancer                   0.8150848  s: 0.9820, p: 0.1097
h19    coronary artery disease             0.6600133  w: 0.4752
h19    embryonal carcinoma                 0.6522726  s: 0.9526
h19    endometriosis                       0.7052352  s: 0.8808
h19    esophageal cancer                   0.8150848  s: 0.8589
h19    gallbladder cancer                  0.6522726  s: 0.8808
h19    heart defects, congenital           0.6703589  s: 0.8589
h19    laryngeal squamous cell carcinoma   0.6522726  s: 0.9526
h19    liver neoplasms                     1.0000000  w: 0.4752
h19    lung adenocarcinoma                 0.6669160  s: 0.8589
h19    lymphoma                            0.6962170  p: 0.1321
h19    osteoarthritis                      0.6749659  w: 0.4752
h19    osteosarcoma                        0.7732385  s: 0.9820
h19    ovarian neoplasms                   0.7052352  s: 0.8589, p: 0.1097
h19    pancreatic cancer                   0.8150848  s: 0.8808
h19    pancreatic ductal adenocarcinoma    0.6575157  s: 0.9526
h19    polycythemia vera                   0.7005121  s: 0.7311
h19    prostatic neoplasms                 0.7052352  s: 0.7311, p: 0.1097
h19    rheumatoid arthritis                0.6703589  s: 0.9526
h19    schizophrenia                       0.7052352  p: 0.1097
h19    squamous cell carcinoma             0.6826756  w: 0.4752
h19    thyroid cancer                      0.7732385  s: 0.8808, p: 0.1097
h19    urinary bladder neoplasms           0.6962170  p: 0.1097
h19    uterine cervical neoplasms          0.7455674  s: 0.7311, p: 0.1097
MNDR scores are associated with an evidence type: s: strong experimental
evidence, w: weak experimental evidence, p: prediction
Finally, the association between the long-intergenic
ncRNA h19 and bone diseases, predicted by LP-HCLUS,
suggests an important functional role of h19 in the regulation
of endocrine gland functions. This further confirms
the potential of LP-HCLUS as a prediction tool for the
formulation of new biological hypotheses and experimental
validations for the characterization of the roles of ncRNAs
in biological processes.
For future work, we plan to extend our approach in
order to predict the direction of the relationships, and
not only their presence. This would require identifying
and dealing with cause/effect phenomena. Depending on
the availability of data, it would also be very interesting
to evaluate the results of LP-HCLUS analysis on tissue-specific
datasets or on datasets related to specific physiological or
pathological conditions.
Supplementary information
Supplementary information accompanies this paper at
https://doi.org/10.1186/s12859-020-3392-2.
Additional file 1: Discussion of related work.
Additional file 2: Analysis of the time complexity of LP-HCLUS.
Additional file 3: Complete results of the comparative analysis between
the predictions returned by LP-HCLUS from HMDD v3.0 and the new
validated relationships in HMDD v3.2.
Additional file 4: Detailed list of associations involving the acquired
immunodeficiency syndrome and similar disease terms in three hierarchical
levels extracted by LP-HCLUS.
Additional file 5: Comparative analysis of the ranking produced by
LP-HCLUS and its competitors with respect to the new validated
relationships in HMDD v3.2.
Abbreviations
AUPR: Area under the Precision-Recall curve; AUROC: Area under the ROC
curve; AUTPR@k: Area under the TPR@k curve; AVG: Average; CUI: Concept
Unique Identifier; DOID: Human Disease Ontology ID; EC: Evidence
Combination; EMBL-EBI: European Molecular Biology Laboratory - European
Bioinformatics Institute; GBA: Guilt-By-Association principle; GCID: GeneCards
ID; HOCCLUS2: Hierarchical Overlapping Co-CLUStering2; HPO: Human
Phenotype Ontology; lncRNA: long non-coding RNA; LP-HCLUS: Link Prediction
through Hierarchical CLUStering; MAX: Maximum; MeSH: Medical Subject
Headings; MIN: Minimum; miRNA: microRNA; ncRNA: non-coding RNA; OMIM:
Online Mendelian Inheritance in Man; RefSeq: NCBI’s Reference Sequences
database; RNA: RiboNucleic Acid; ROC: Receiver Operating Characteristic; SNP:
Single-Nucleotide Polymorphism; TPR@k: True Positive Rate at k; UML: Unified
Modeling Language; UMLS: Unified Medical Language System
Acknowledgements
Not applicable
Authors’ contributions
MC and GP conceived the task and designed the solution from a
methodological point of view. EB and GP implemented the system. EB ran the
experiments and collected the results. MC and GP discussed the results from a
quantitative viewpoint. DD contributed to the conception of the biological
investigation, collaborated on the review and selection of bioinformatics
resources and analyzed the results from a qualitative viewpoint. All the authors
contributed to the manuscript drafting and approved the final version of the
manuscript.
Funding
We would like to acknowledge the financial support of the European
Commission through the project MAESTRA - Learning from Massive,
Incompletely annotated, and Structured Data (Grant Number
ICT-2013-612944). We also acknowledge the financial support of Ministry of
Education, Universities and Research (MIUR) through the PON projects “Big
Data Analytics” (AIM1852414 - Activity 1, Line 1) and TALIsMAn - Tecnologie di
Assistenza personALizzata per il Miglioramento della quAlità della vitA (Grant
N. ARS01_0111), and of Italian National Research Council (CNR) through the
InterOmics Flagship project.
Availability of data and materials
The system LP-HCLUS, the adopted datasets and all the results are available at:
http://www.di.uniba.it/~gianvitopio/systems/lphclus/
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Author details
1University of Bari Aldo Moro - Department of Computer Science, Via Orabona,
4, 70125 Bari, Italy. 2Big Data Laboratory, National Interuniversity Consortium
for Informatics (CINI), 00185 Rome, Italy. 3CNR, Institute for Biomedical
Technologies, 70126 Bari, Italy. 4Department of Knowledge Technologies,
Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia.
Received: 30 August 2019 Accepted: 29 January 2020
References
1. Cech TR, Steitz JA. The Noncoding RNA Revolution—Trashing Old Rules
to Forge New Ones. Cell. 2014;157(1):77–94. https://doi.org/10.1016/j.cell.
2014.03.008.
2. Lekka E, Hall J. Noncoding RNAs in disease. FEBS Lett. 2018;592(17):
2884–900. https://doi.org/10.1002/1873-3468.13182.
3. Bernstein B, Birney E, Dunham I, Green E, Gunter C, Snyder M, Abyzov
A, Aken B, Barrell D, Barson G, Berry A, Bignell A, Boychenko V, Bussotti
G, Chrast J, Davidson C, Derrien T, Despacio-Reyes G, Diekhans M,
Hubbard T. An integrated encyclopedia of DNA elements in the human
genome. Nature. 2012;489:57–74.
4. Davis C, Hitz B, Sloan C, Chan E, Davidson J, Gabdank I, Hilton J, Jain K,
Baymuradov U, Narayanan A, Onate K, Graham K, Miyasato S, Dreszer T,
Strattan J, Jolanki O, Tanaka F, Cherry J. The Encyclopedia of DNA
elements (ENCODE): data portal update. Nucleic Acids Res. 2017;46:.
https://doi.org/10.1093/nar/gkx1081.
5. Hayes J, Peruzzi PP, Lawler S. MicroRNAs in cancer: biomarkers, functions
and therapy. Trends Mol Med. 2014;20(8):460–9. https://doi.org/10.1016/j.
molmed.2014.06.005.
6. Melissari M-T, Grote P. Roles for long non-coding RNAs in physiology and
disease. Arch Eur J Physiol. 2016;468(6):945–58.
https://doi.org/10.1007/s00424-016-1804-y.
7. Akhade VS, Pal D, Kanduri C. Long Noncoding RNA: Genome
Organization and Mechanism of Action. Adv Exp Med Biol. 2017;1008:
47–74. https://doi.org/10.1007/978-981-10-5203-3_2.
8. Bak RO, Mikkelsen JG. miRNA sponges: soaking up miRNAs for regulation
of gene expression. Wiley Interdiscip Rev RNA. 2014;5(3):317–33. https://
doi.org/10.1002/wrna.1213.
9. Yoon J-H, Abdelmohsen K, Gorospe M. Functional interactions among
microRNAs and long noncoding RNAs. Semin Cell Dev Biol. 2014;34:9–14.
https://doi.org/10.1016/j.semcdb.2014.05.015.
10. Yang X, Gao L, Guo X, Shi X, Wu H, Song F, Wang B. A Network Based
Method for Analysis of lncRNA-Disease Associations and Prediction of
lncRNAs Implicated in Diseases. PLoS ONE. 2014;9(1):87797. https://doi.
org/10.1371/journal.pone.0087797.
11. Wang P, Guo Q, Gao Y, Zhi H, Zhang Y, Liu Y, Zhang J, Yue M, Guo M,
Ning S, Zhang G, Li X. Improved method for prioritization of disease
associated lncRNAs based on ceRNA theory and functional genomics
data. Oncotarget. 2016;8(3):4642–55. https://doi.org/10.18632/
oncotarget.13964.
12. Ceci M, Pio G, Kuzmanovski V, Džeroski S. Semi-supervised multi-view
learning for gene network reconstruction. PLoS ONE. 2015;10(12):1–27.
https://doi.org/10.1371/journal.pone.0144031.
13. Pio G, Ceci M, Malerba D, D’Elia D. ComiRNet: a web-based system for
the analysis of miRNA-gene regulatory networks. BMC Bioinformatics.
2015;16(Suppl 9):7. https://doi.org/10.1186/1471-2105-16-S9-S7.
14. Alaimo S, Giugno R, Pulvirenti A. ncPred: ncRNA-Disease Association
Prediction through Tripartite Network-Based Inference. Front Bioeng
Biotechnol. 2014;2:. https://doi.org/10.3389/fbioe.2014.00071.
15. Bonnici V, Caro GD, Constantino G, Liuni S, D’Elia D, Bombieri N,
Licciulli F, Giugno R. Arena-Idb: a platform to build human non-coding
RNA interaction networks. BMC Bioinformatics. 2018;19(Suppl 10):.
https://doi.org/10.1186/s12859-018-2298-8.
16. Pio G, Ceci M, Prisciandaro F, Malerba D. LOCANDA: Exploiting Causality
in the Reconstruction of Gene Regulatory Networks. In: Yamamoto A,
Kida T, Uno T, Kuboyama T, editors. Discovery Science. Cham: Springer;
2017. p. 283–97.
17. Pio G, Ceci M, Prisciandaro F, Malerba D. Exploiting causality in gene
network reconstruction based on graph embedding. Mach Learn. 2019.
https://doi.org/10.1007/s10994-019-05861-8.
18. Pio G, Malerba D, D’Elia D, Ceci M. Integrating microRNA target
predictions for the discovery of gene regulatory networks: a
semi-supervised ensemble learning approach. BMC Bioinformatics.
2014;15(Suppl 1):4. https://doi.org/10.1186/1471-2105-15-S1-S4.
19. Mignone P, Pio G, D’Elia D, Ceci M. Exploiting transfer learning for the
reconstruction of the human gene regulatory network. Bioinformatics.
2019. https://doi.org/10.1093/bioinformatics/btz781.
20. Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q. Constructing lncRNA
functional similarity network based on lncRNA-disease associations and
disease semantic similarity. Sci Rep. 2015;5:. https://doi.org/10.1038/
srep11338.
21. Martínez V, Berzal F, Cubero J-C. A Survey of Link Prediction in Complex
Networks. ACM Comput Surv. 2016;49(4):69:1–69:33.
https://doi.org/10.1145/3012704.
22. Blockeel H, Raedt LD, Ramon J. Top-down induction of clustering trees.
In: Shavlik JW, editor. Proc. of ICML 1998. Madison: Morgan Kaufmann;
1998. p. 55–63.
23. Dincer NG, Akkuş Ö. A new fuzzy time series model based on robust
clustering for forecasting of air pollution. Ecol Inform. 2018;43:157–64.
24. Stojanova D, Ceci M, Appice A, Dzeroski S. Network regression with
predictive clustering trees. Data Min Knowl Disc. 2012;25(2):378–413.
25. Lefever S, Anckaert J, Volders P-J, Luypaert M, Vandesompele J,
Mestdagh P. decodeRNA— predicting non-coding RNA functions using
guilt-by-association. Database: J Biol Databases Curation. 2017;2017:.
https://doi.org/10.1093/database/bax042.
26. Pio G, Serafino F, Malerba D, Ceci M. Multi-type clustering and
classification from heterogeneous networks. Inf Sci. 2018;425:107–26.
https://doi.org/10.1016/j.ins.2017.10.021.
27. Zadeh LA. Fuzzy sets. Inf Control. 1965;8(3):338–53.
https://doi.org/10.1016/S0019-9958(65)90241-X.
28. Han J, Kamber M. Data Mining: Concepts and Techniques. Amsterdam:
Elsevier/Morgan Kaufmann; 2006.
29. Pio G, Ceci M, D’Elia D, Loglisci C, Malerba D. A Novel Biclustering
Algorithm for the Discovery of Meaningful Biological Correlations
between microRNAs and their Target Genes. BMC Bioinformatics.
2013;14(Suppl 7):8. https://doi.org/10.1186/1471-2105-14-S7-S8.
30. Lesmo L, Saitta L, Torasso P. Evidence combination in expert systems. Int
J Man-Mach Stud. 1985;22(3):307–26.
https://doi.org/10.1016/S0020-7373(85)80006-7.
31. Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, Zhou Y, Cui Q. HMDD v3.0: a
database for experimentally supported human microRNA-disease
associations. Nucleic Acids Res. 2019;47(D1):1013–7. https://doi.org/10.
1093/nar/gky1010.
32. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui
Q. LncRNADisease: a database for long-non-coding RNA-associated
diseases. Nucleic Acids Res. 2013;41(Database issue):983–6. https://doi.
org/10.1093/nar/gks1099.
33. Helwak A, Kudla G, Dudnakova T, Tollervey D. Mapping the human
miRNA interactome by CLASH reveals frequent noncanonical binding.
Cell. 2013;153(3):654–65. https://doi.org/10.1016/j.cell.2013.03.043.
34. Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI. DisGeNET: a
Cytoscape plugin to visualize, integrate, search and analyze gene-disease
networks. Bioinforma (Oxf Engl). 2010;26(22):2924–6. https://doi.org/10.
1093/bioinformatics/btq538.
35. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y.
miR2disease: a manually curated database for microRNA deregulation in
human disease. Nucleic Acids Res. 2009;37(Database issue):98–104.
https://doi.org/10.1093/nar/gkn714.
36. Chiang K, Sung T-L, Rice AP. Regulation of Cyclin T1 and HIV-1
Replication by MicroRNAs in Resting CD4+ T Lymphocytes. J Virol.
2012;86(6):3244–52. https://doi.org/10.1128/JVI.05065-11.
https://jvi.asm.org/content/86/6/3244.full.pdf.
37. Jupp S, et al. A new Ontology Lookup Service at EMBL-EBI. In: Malone J, et
al., editors. Proceedings of SWAT4LS International Conference 2015; 2015.
38. Qi Y, Hu H, Guo H, Xu P, Shi Z, Huan X, Zhu Z, Zhou M, Cui L.
MicroRNA profiling in plasma of HIV-1 infected patients: potential markers
of infection and immune status. J Publ Health Emerg. 2017;1(7):. https://
doi.org/10.21037/jphe.2017.05.11.
39. Holland B, Wong J, Li M, Rasheed S. Identification of Human
MicroRNA-Like Sequences Embedded within the Protein-Encoding Genes
of the Human Immunodeficiency Virus. PLoS ONE. 2013;8(3):1–10. https://
doi.org/10.1371/journal.pone.0058586.
40. Huang J, Wang F, Argyris E, Chen K, Liang Z, Tian H, Huang W, Squires
K, Verlinghieri G, Zhang H. Cellular micrornas contribute to hiv-1 latency
in resting primary cd4+ t lymphocytes. Nat Med. 2007;13(10):1241–7.
https://doi.org/10.1038/nm1639.
41. Cui T, Zhang L, Huang Y, Yi Y, Tan P, Zhao Y, Hu Y, Xu L, Li E, Wang D.
MNDR v2.0: an updated resource of ncRNA–disease associations in
mammals. Nucleic Acids Res. 2018;46(Database issue):371–4. https://doi.
org/10.1093/nar/gkx1025.
42. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell
J, Sayers EW. GenBank. Nucleic Acids Res. 2013;41(Database issue):36–42.
https://doi.org/10.1093/nar/gks1195.
43. Wang Y, Chen L, Chen B, Li X, Kang J, Fan K, Hu Y, Xu J, Yi L, Yang J,
Huang Y, Cheng L, Li Y, Wang C, Li K, Li X, Xu J, Wang D. Mammalian
ncRNA-disease repository: a global view of ncRNA-mediated disease
network. Cell Death Dis. 2013;4(8):765. https://doi.org/10.1038/cddis.
2013.292.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
... However, elucidating the role of miRNAs in specific diseases remains a significant challenge, worsened by the resource-intensive nature of experimental investigations. In this context, many computational methods have been proposed [1,3,7], providing cost-effective approaches to formulate biological hypotheses that can be subsequently validated through in-vitro experiments. ...
... As a result, constructing association networks from heterogeneous biological data has emerged as a primary goal in overcoming this obstacle [3]. LP-HCLUS [1] is an example of an approach that aims to reach this goal by solving a link prediction task on heterogeneous graphs to unveil previously unknown RNA-disease associations. It exploits validated relationships to predict novel associations between non-coding RNAs and diseases, demonstrating its potential in elucidating the functional role of miRNAs in disease onset or progression. ...
... ncRNAs represent approximately 60% of the transcriptional production of the human genome [15,16], and there is a close association between many diseases and ncRNA mutations or abnormal expression [17,18]. ncRNAs can be divided into two categories based on their length: small ncRNAs and lncRNAs [19][20][21]. ...
... There are increasing studies that have identified lncRNAs as a new class of regulatory molecules with the functions of scaffold, signal, and guide, and they are also involved in transcriptional interference [40][41][42]. man genome [15,16], and there is a close association between many diseases and ncRNA mutations or abnormal expression [17,18]. ncRNAs can be divided into two categorie based on their length: small ncRNAs and lncRNAs [19][20][21]. ...
Article
Full-text available
Non-coding RNAs (ncRNAs) are transcribed from the genome and do not encode proteins. In recent years, ncRNAs have attracted increasing attention as critical participants in gene regulation and disease pathogenesis. Different categories of ncRNAs, which mainly include microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and circular RNAs (circRNAs), are involved in the progression of pregnancy, while abnormal expression of placental ncRNAs impacts the onset and development of adverse pregnancy outcomes (APOs). Therefore, we reviewed the current status of research on placental ncRNAs and APOs to further understand the regulatory mechanisms of placental ncRNAs, which provides a new perspective for treating and preventing related diseases.
... Some studies have shown that pregnant women suffer worse than non-pregnant women (Fan et al. 2020). For predicting possibly unknown ncRNAdisease relationships used multi-type hierarchical clustering (Barracchia et al. 2020). Some papers aim to find possible treatments or drug/gene-disease association clustering ( Loucera et al. 2020). ...
Article
PurposeBased on medical reports, it is hard to find levels of different hospitalized symptomatic COVID-19 patients according to their features in a short time. Besides, there are common and special features for COVID-19 patients at different levels based on physicians’ knowledge that make diagnosis difficult. For this purpose, a hierarchical model is proposed in this paper based on experts’ knowledge, fuzzy C-mean (FCM) clustering, and adaptive neuro-fuzzy inference system (ANFIS) classifier.Methods Experts considered a special set of features for different groups of COVID-19 patients to find their treatment plans. Accordingly, the structure of the proposed hierarchical model is designed based on experts’ knowledge. In the proposed model, we applied clustering methods to patients’ data to determine some clusters. Then, we learn classifiers for each cluster in a hierarchical model. Regarding different common and special features of patients, FCM is considered for the clustering method. Besides, ANFIS had better performances than other classification methods. Therefore, FCM and ANFIS were considered to design the proposed hierarchical model. FCM finds the membership degree of each patient’s data based on common and special features of different clusters to reinforce the ANFIS classifier. Next, ANFIS identifies the need of hospitalized symptomatic COVID-19 patients to ICU and to find whether or not they are in the end-stage (mortality target class). Two real datasets about COVID-19 patients are analyzed in this paper using the proposed model. One of these datasets had only clinical features and another dataset had both clinical and image features. Therefore, some appropriate features are extracted using some image processing and deep learning methods.ResultsAccording to the results and statistical test, the proposed model has the best performance among other utilized classifiers. 
Its accuracies based on clinical features of the first and second datasets are 92% and 90% to find the ICU target class. Extracted features of image data increase the accuracy by 94%.Conclusion The accuracy of this model is even better for detecting the mortality target class among different classifiers in this paper and the literature review. Besides, this model is compatible with utilized datasets about COVID-19 patients based on clinical data and both clinical and image data, as well.Highlights• A new hierarchical model is proposed using ANFIS classifiers and FCM clustering method in this paper. Its structure is designed based on experts’ knowledge and real medical process. FCM reinforces the ANFIS classification learning phase based on the features of COVID-19 patients.• Two real datasets about COVID-19 patients are studied in this paper. One of these datasets has both clinical and image data. Therefore, appropriate features are extracted based on its image data and considered with available meaningful clinical data. Different levels of hospitalized symptomatic COVID-19 patients are considered in this paper including the need of patients to ICU and whether or not they are in end-stage.• Well-known classification methods including case-based reasoning (CBR), decision tree, convolutional neural networks (CNN), K-nearest neighbors (KNN), learning vector quantization (LVQ), multi-layer perceptron (MLP), Naive Bayes (NB), radial basis function network (RBF), support vector machine (SVM), recurrent neural networks (RNN), fuzzy type-I inference system (FIS), and adaptive neuro-fuzzy inference system (ANFIS) are designed for these datasets and their results are analyzed for different random groups of the train and test data;• According to unbalanced utilized datasets, different performances of classifiers including accuracy, sensitivity, specificity, precision, F-score, and G-mean are compared to find the best classifier. 
ANFIS classifiers have the best results for both datasets.• To reduce the computational time, the effects of the Principal Component Analysis (PCA) feature reduction method are studied on the performances of the proposed model and classifiers. According to the results and statistical test, the proposed hierarchical model has the best performances among other utilized classifiers.Graphical Abstract
... Overlapping clustering approaches have also been applied to graph clustering. LP-HCLUS is one such method, generating hierarchical clusters with potential for overlap, allowing diseases and ncRNAs in a heterogeneous graph to be involved in multiple interaction subnetworks, better reflecting their true function (Barracchia et al., 2020). ...
Article
Chronic obstructive pulmonary disease (COPD) is one of the leading causes of death in the United States. COPD represents one of many areas of research where identifying complex pathways and networks of interacting biomarkers is an important avenue toward studying disease progression and potentially discovering cures. Recently, sparse multiple canonical correlation network analysis (SmCCNet) was developed to identify complex relationships between omics associated with a disease phenotype, such as lung function. SmCCNet uses two omics datasets and an associated output phenotype to generate a multi-omics graph, which can then be used to explore relationships between omics in the context of a disease. Detecting significant subgraphs within this multi-omics network, i.e., subgraphs which exhibit high correlation to a disease phenotype and high inter-connectivity, can help clinicians identify complex biological relationships involved in disease progression. The current approach to identifying significant subgraphs relies on hierarchical clustering, which can be used to inform clinicians about important pathways involved in the disease or phenotype of interest. The reliance on a hierarchical clustering approach can hinder subgraph quality by biasing toward finding more compact subgraphs and removing larger significant subgraphs. This study aims to introduce new significant subgraph detection techniques. In particular, we introduce two subgraph detection methods, dubbed Correlated PageRank and Correlated Louvain, by extending the Personalized PageRank Clustering and Louvain algorithms, as well as a hybrid approach combining the two proposed methods, and compare them to the hierarchical method currently in use. The proposed methods show significant improvement in the quality of the subgraphs produced when compared to the current state of the art.
Article
Circular RNA (circRNA) is a type of non-coding RNA in which both ends are covalently linked. Researchers have demonstrated that many circRNAs can act as biomarkers of diseases. However, traditional experimental methods for identifying circRNA-disease associations are labor-intensive. In this work, we propose a novel method, termed HMCDA, based on a heterogeneous graph neural network and metapaths for circRNA-disease association prediction. First, a heterogeneous graph consisting of circRNA-disease associations, circRNA-miRNA associations, miRNA-disease associations and disease-disease associations is constructed. Then, six metapaths are defined and generated according to the biomedical pathways. Afterwards, the entity content transformation, intra-metapath and inter-metapath aggregation are implemented to learn the embeddings of circRNA and disease entities. Finally, the learned embeddings are used to predict novel circRNA-disease associations. In particular, the results of extensive experiments demonstrate that HMCDA outperforms four state-of-the-art models in fivefold cross validation. In addition, our case study indicates that HMCDA has the ability to identify novel circRNA-disease associations. Supplementary Information. The online version contains supplementary material available at 10.1186/s12859-023-05441-7.
Article
Background. Clinical studies have shown that miRNAs are closely related to human health. The study of potential associations between miRNAs and diseases will contribute to a profound understanding of the mechanism of disease development, as well as human disease prevention and treatment. MiRNA–disease associations predicted by computational methods are the best complement to biological experiments. Results. In this research, a federated computational model KATZNCP was proposed on the basis of the KATZ algorithm and network consistency projection to infer the potential miRNA–disease associations. In KATZNCP, a heterogeneous network was initially constructed by integrating the known miRNA–disease associations, integrated miRNA similarities, and integrated disease similarities; then, the KATZ algorithm was implemented in the heterogeneous network to obtain the estimated miRNA–disease prediction scores. Finally, the precise scores were obtained by the network consistency projection method as the final prediction results. KATZNCP achieved reliable predictive performance in leave-one-out cross-validation (LOOCV) with an AUC value of 0.9325, which was better than the state-of-the-art comparable algorithms. Furthermore, case studies of lung neoplasms and esophageal neoplasms demonstrated the excellent predictive performance of KATZNCP. Conclusion. A new computational model KATZNCP was proposed for predicting potential miRNA–disease associations based on KATZ and network consistency projections, which can effectively predict the potential miRNA–disease interactions. Therefore, KATZNCP can be used to provide guidance for future experiments.
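As a rough illustration of the KATZ scoring step described in the abstract above, the following minimal Python sketch computes a truncated Katz measure, the sum of beta^l * A^l over walk lengths l, on a block matrix that combines miRNA similarities, known associations, and disease similarities. The function names, the toy matrices, the damping factor `beta`, and the walk-length cutoff `max_len` are all assumptions made for illustration. The network consistency projection step of KATZNCP is not shown, and this is not the authors' implementation.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def truncated_katz(adj, beta, max_len):
    """Return sum over l = 1..max_len of beta**l * adj**l for a square matrix."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds adj**l, starting with l = 1
    for length in range(1, max_len + 1):
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = matmul(power, adj)
    return scores


def hetero_block(mirna_sim, assoc, disease_sim):
    """Build the block matrix [[mirna_sim, assoc], [assoc^T, disease_sim]]."""
    assoc_t = [list(col) for col in zip(*assoc)]
    top = [ms + a for ms, a in zip(mirna_sim, assoc)]
    bottom = [at + ds for at, ds in zip(assoc_t, disease_sim)]
    return top + bottom


def katz_mirna_disease(mirna_sim, assoc, disease_sim, beta=0.1, max_len=3):
    """Return the miRNA-by-disease block of the truncated Katz scores."""
    n_mirna = len(mirna_sim)
    full = truncated_katz(hetero_block(mirna_sim, assoc, disease_sim), beta, max_len)
    return [row[n_mirna:] for row in full[:n_mirna]]


# Hypothetical toy data: 2 miRNAs, 2 diseases, one known association.
toy_mirna_sim = [[1.0, 0.8], [0.8, 1.0]]
toy_assoc = [[1, 0], [0, 0]]
toy_disease_sim = [[1.0, 0.5], [0.5, 1.0]]
toy_scores = katz_mirna_disease(toy_mirna_sim, toy_assoc, toy_disease_sim)
```

In this toy setting the second miRNA has no known association, yet it receives a non-zero score for the first disease because walks of length two or more reach that disease through the similar first miRNA.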
Article
Background. With the development of biotechnology and the accumulation of theories, many studies have found that microRNAs (miRNAs) play an important role in various diseases. Uncovering the potential associations between miRNAs and diseases is helpful to better understand the pathogenesis of complex diseases. However, traditional biological experiments are expensive and time-consuming. Therefore, it is necessary to develop more efficient computational methods for exploring underlying disease-related miRNAs. Results. In this paper, we present a new computational method, called PATMDA, based on positive point-wise mutual information (PPMI) and an attention network to predict miRNA-disease associations (MDAs). Firstly, we construct the heterogeneous MDA network and multiple similarity networks of miRNAs and diseases. Secondly, we perform random walk with restart and PPMI on each similarity network view to get multi-order proximity features, and then obtain high-order proximity representations of miRNAs and diseases by applying a convolutional neural network to fuse the learned proximity features. Then, we design an attention network with neural aggregation to integrate the representations of a node and its heterogeneous neighbor nodes according to the MDA network. Finally, an inner product decoder is adopted to calculate the relationship scores between miRNAs and diseases. Conclusions. PATMDA achieves superior performance over six state-of-the-art methods, with an area under the receiver operating characteristic curve of 0.933 and 0.946 on the HMDD v2.0 and HMDD v3.2 datasets, respectively. The case studies further demonstrate the validity of PATMDA for discovering novel disease-associated miRNAs.
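The two proximity building blocks named in the abstract above, random walk with restart (RWR) and positive point-wise mutual information (PPMI), can be sketched in a few lines of plain Python. The function names, the restart probability, the convergence tolerance, and the toy path graph are assumptions for illustration; how PATMDA actually combines the two steps (and the subsequent CNN and attention layers) is not specified here and is not reproduced.

```python
import math


def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities of a walker that,
    at each step, jumps back to `seed` with probability `restart`."""
    n = len(adj)
    # Row-normalise the adjacency matrix into transition probabilities.
    trans = []
    for row in adj:
        s = sum(row)
        trans.append([x / s if s else 0.0 for x in row])
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = [
            restart * start[j]
            + (1 - restart) * sum(p[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


def ppmi(counts):
    """Return the PPMI matrix max(0, log(c_ij * total / (row_i * col_j)))."""
    total = sum(map(sum, counts))
    row_sums = [sum(r) for r in counts]
    col_sums = [sum(c) for c in zip(*counts)]
    out = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c > 0:
                pmi = math.log(c * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(0.0, pmi))
            else:
                out_row.append(0.0)  # zero co-occurrence contributes nothing
        out.append(out_row)
    return out


# Hypothetical toy similarity network: a path graph 0 - 1 - 2.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# One RWR profile per seed node, then PPMI over the stacked profiles.
toy_profiles = [random_walk_with_restart(toy_adj, seed) for seed in range(3)]
toy_features = ppmi(toy_profiles)
```

Stacking one RWR profile per node and applying PPMI to the result is one common way to turn a similarity network into proximity features; it is shown only as a plausible reading of the description, not as the published pipeline.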
Article
More and more evidence indicates that the dysregulation of microRNAs (miRNAs) leads to diseases through various kinds of underlying mechanisms. Identifying the multiple types of disease-related miRNAs plays an important role in studying the molecular mechanism of miRNAs in diseases. Moreover, compared with traditional biological experiments, computational models save time and cost. However, most tensor-based computational models still face three main challenges: (i) a tendency to fall into bad local minima; (ii) preservation of high-order relations; (iii) false-negative samples. To this end, we propose a novel tensor completion framework integrating self-paced learning, hypergraph regularization and an adaptive weight tensor into nonnegative tensor factorization, called SPLDHyperAWNTF, for the discovery of potential multiple types of miRNA-disease associations. We first combine self-paced learning with nonnegative tensor factorization to effectively keep the model from falling into bad local minima. Then, hypergraphs for miRNAs and diseases are constructed, and hypergraph regularization is used to preserve the high-order complex relations of these hypergraphs. Finally, we introduce an adaptive weight tensor, which can effectively alleviate the impact of false-negative samples on the prediction performance. The average results of 5-fold and 10-fold cross-validation on four datasets show that SPLDHyperAWNTF can achieve better prediction performance than baseline models in terms of Top-1 precision, Top-1 recall and Top-1 F1. Furthermore, we implement case studies to further evaluate the accuracy of SPLDHyperAWNTF. As a result, 98 (MDAv2.0) and 98 (MDAv2.0-2) of the top-100 associations are confirmed by the HMDDv3.2 dataset. Moreover, the results of enrichment analysis illustrate that unconfirmed potential associations have biological significance.
Article
Gene network reconstruction is a bioinformatics task that aims at modelling the complex regulatory activities that may occur among genes. This task is typically solved by means of link prediction methods that analyze gene expression data. However, the reconstructed networks often suffer from a high amount of false positive edges, which are actually the result of indirect regulation activities due to the presence of common cause and common effect phenomena or, in other terms, due to the fact that the adopted inductive methods do not take into account possible causality phenomena. This issue is accentuated even more by the inherent presence of a high amount of noise in gene expression data. Existing methods for the identification of a transitive reduction of a network or for the removal of (possibly) redundant edges suffer from limitations in the structure of the network or in the nature/length of the indirect regulation, and often require additional pre-processing steps to handle specific peculiarities of the networks (e.g., cycles). Moreover, they are not able to consider possible community structures and possible similar roles of the genes in the network (e.g. hub nodes), which may change the tendency of nodes to be highly connected (and with which nodes) in the network. In this paper, we propose the method INLOCANDA, which learns an inductive predictive model for gene network reconstruction and overcomes all the mentioned limitations. In particular, INLOCANDA is able to (i) identify and exploit indirect relationships of arbitrary length to remove edges due to common cause and common effect phenomena; (ii) take into account possible community structures and possible similar roles by means of graph embedding. Experiments performed along multiple dimensions of analysis on benchmark, real networks of two organisms (E. coli and S. cerevisiae) show a higher accuracy with respect to the competitors, as well as a higher robustness to the presence of noise in the data, also when a huge amount of (possibly false positive) interactions is removed. Availability: http://www.di.uniba.it/~gianvitopio/systems/inlocanda/
Article
Lezak B, Varacallo M. Anatomy, Bony Pelvis and Lower Limb, Foot Veins. [Updated 2019 Jun 6]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2019 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK542295/
Article
Comprehensive databases of microRNA-disease associations are continuously demanded in biomedical research. The recently launched version 3.0 of the Human MicroRNA Disease Database (HMDD v3.0) manually collects a significant number of miRNA-disease association entries from the literature. Compared with HMDD v2.0, this new version contains 2-fold more entries. Besides, the associations have been more accurately classified based on literature-derived evidence codes, which results in six generalized categories (genetics, epigenetics, target, circulation, tissue and other) covering 20 types of detailed evidence code. Furthermore, we added new functionalities like network visualization on the web interface. To exemplify the utility of the database, we compared the disease spectrum width of miRNAs (DSW) and the miRNA spectrum width of human diseases (MSW) between versions 3.0 and 2.0 of HMDD. HMDD is freely accessible at http://www.cuilab.cn/hmdd. With accumulating evidence of miRNA-disease associations, the HMDD database will keep growing in the future.
Article
Background. High-throughput technologies have provided the scientific community with an unprecedented opportunity for large-scale analysis of genomes. Non-coding RNAs (ncRNAs), for a long time believed to be non-functional, are emerging as one of the most important and largest families of gene regulators and key elements for genome maintenance. Functional studies have been able to assign to ncRNAs a wide spectrum of functions in primary biological processes, and for this reason they are assuming a growing importance as a potential new family of cancer therapeutic targets. Nevertheless, the number of functionally characterized ncRNAs is still too small compared with the number of newly discovered ncRNAs. Thus, platforms able to merge information from available resources, addressing data integration issues, are necessary and still insufficient to elucidate the biological roles of ncRNAs. Results. In this paper, we describe a platform called Arena-Idb for the retrieval of comprehensive and non-redundant annotated ncRNA interactions. Arena-Idb provides a framework for network reconstruction of ncRNA heterogeneous interactions (i.e., with other types of molecules) and relationships with human diseases, which guides the integration of data, extracted from different sources, via mapping of entities and minimization of ambiguity. Conclusions. Arena-Idb provides a schema and a visualization system to integrate ncRNA interactions that assists in discovering ncRNA functions through the extraction of heterogeneous interaction networks. Arena-Idb is available at http://arenaidb.ba.itb.cnr.it
Article
Accumulating evidence suggests that diverse non-coding RNAs (ncRNAs) are involved in the progression of a wide variety of diseases. In recent years, abundant ncRNA-disease associations have been found and predicted according to experiments and prediction algorithms. Diverse ncRNA-disease associations are scattered over many resources and mammals, whereas a global view of diverse ncRNA-disease associations is not available for any mammal. Hence, we have updated the MNDR v2.0 database (www.rna-society.org/mndr/) by integrating experimental and prediction associations from manual literature curation and other resources under one common framework. The new developments in MNDR v2.0 include (i) an over 220-fold increase in ncRNA-disease associations compared with the previous version (including lncRNA, miRNA, piRNA, snoRNA and more than 1400 diseases); (ii) integrating experimental and prediction evidence from 14 resources and prediction algorithms for each ncRNA-disease association; (iii) mapping disease names to the Disease Ontology and Medical Subject Headings (MeSH); (iv) providing a confidence score for each ncRNA-disease association and (v) an increase of species coverage to six mammals. Finally, MNDR v2.0 intends to provide the scientific community with a resource for efficient browsing and extraction of the associations between diverse ncRNAs and diseases, including >260 000 ncRNA-disease associations.
Article
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly-processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to meta(data) from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.
Article
Heterogeneous information networks consist of different types of objects and links. They can be found in several social, economic and scientific fields, ranging from the Internet to social sciences, including biology, epidemiology, geography, finance and many others. In the literature, several clustering and classification algorithms have been proposed which work on network data, but they are usually tailored for homogeneous networks, they make strong assumptions on the network structure (e.g. bi-typed networks or star-structured networks), or they assume that data are independently and identically distributed (i.i.d.). However, in real-world networks, objects can be of multiple types and several kinds of relationship can be identified among them. Moreover, objects and links in the network can be organized in an arbitrary structure where connected objects share some characteristics. This violates the i.i.d. assumption and possibly introduces autocorrelation. To overcome the limitations of existing works, in this paper we propose the algorithm HENPC, which is able to work on heterogeneous networks with an arbitrary structure. In particular, it extracts possibly overlapping and hierarchically-organized heterogeneous clusters and exploits them for predictive purposes. The different levels of the hierarchy which are discovered in the clustering step give us the opportunity to choose either more globally-based or more locally-based predictions, as well as to take into account autocorrelation phenomena at different levels of granularity. Experiments on real data show that HENPC is able to significantly outperform competitor approaches, both in terms of clustering quality and in terms of classification accuracy.
Article
Motivation: The reconstruction of Gene Regulatory Networks (GRNs) from gene expression data has received increasing attention in recent years, due to its usefulness in the understanding of regulatory mechanisms involved in human diseases. Most of the existing methods reconstruct the network through machine learning approaches, by analyzing known examples of interactions. However, i) they often produce poor results when the amount of labeled examples is limited, or when no negative example is available and ii) they are not able to exploit information extracted from GRNs of other (better studied) related organisms, when this information is available. Results: In this paper we propose a novel machine learning method which overcomes these limitations, by exploiting the knowledge about the GRN of a source organism for the reconstruction of the GRN of the target organism, by means of a novel transfer learning technique. Moreover, the proposed method is natively able to work in the Positive-Unlabeled setting, where no negative example is available, by fruitfully exploiting a (possibly large) set of unlabeled examples. In our experiments we reconstructed the human GRN, by exploiting the knowledge of the GRN of M. musculus. Results showed that the proposed method outperforms state-of-the-art approaches and identifies previously unknown functional relationships among the analyzed genes. Availability: http://www.di.uniba.it/~mignone/systems/biosfer/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Non‐coding RNAs (ncRNAs) are emerging as potent and multifunctional regulators in all biological processes. In parallel, a rapidly‐growing number of studies has unravelled associations between aberrant non‐coding RNA expression and human diseases. These associations have been extensively reviewed, often with the focus on a particular miRNA (family) or a selected disease/pathology. In this Mini‐Review, we highlight a selection of studies in order to demonstrate the widescale involvement of microRNAs (miRNAs) and long non‐coding RNAs (lncRNAs) in the pathophysiology of three types of diseases: cancer, cardiovascular and neurological disorders. This research is opening new avenues to novel therapeutic approaches.
Article
In this study, a new Fuzzy Time Series (FTS) model based on the Fuzzy K-Medoid (FKM) clustering algorithm is proposed in order to forecast air pollution. FTS models generally have some advantages over other techniques used in forecasting air pollution: they do not require any statistical assumptions on time series data, and they provide successful forecasting results even when the number of observations is small and the data sets include uncertainty, while still allowing for generalization. However, existing FTS models based on fuzzy clustering fail to model data sets that include outliers, such as air pollution data. The potential advantage of the proposed model is its robustness to outliers and abnormal observations. In order to show the performance of the proposed method in forecasting air pollution, a time series consisting of SO2 concentrations measured at 65 monitoring stations in Turkey is used. According to the results of the analyses, the proposed method provides successful forecasting results, especially for time series which include numerous outliers.