Homomorphisms of Multisource Trees into Networks with Applications to
Metabolic Pathways
Qiong Cheng, Robert Harrison, Alexander Zelikovsky∗
Department of Computer Science, Georgia State University, Atlanta, Georgia 30303
Email: cscqxcx, rharrison, alexz@cs.gsu.edu
Abstract
Network mapping is a convenient tool for comparing and exploring biological networks; it can be used for predicting unknown pathways, fast and meaningful searching of databases, and potentially establishing evolutionary relations. Unfortunately, existing tools for mapping paths into general networks (PathBlast) or trees into tree networks allowing gaps (MetaPathwayHunter) cannot handle large query pathways or complex networks.
In this paper we consider homomorphisms, i.e., mappings that allow different enzymes from the query pathway to be mapped into the same enzyme from the network. Homomorphisms are more general than homeomorphisms (allowing gaps) and easier to handle algorithmically. Our dynamic programming algorithm efficiently finds the minimum cost homomorphism from a multi-source tree to directed acyclic graphs as well as general networks.
We have performed pairwise mapping of all pathways for four organisms (E. coli, S. cerevisiae, B. subtilis and T. thermophilus species) and found a reasonably large set of statistically significant pathway similarities. Further analysis of our mappings identifies conserved pathways across examined species and indicates potential pathway holes in existing pathway descriptions.
Availability: The software is available from the authors
on request.
∗Corresponding author.
1. Introduction
The explosive growth of cellular network databases requires novel analytical methods constituting a new interdisciplinary area of computational systems biology. The main problems in this area are finding conserved subnetworks, integrating interacting gene networks, protein networks and biochemical reactions, discovering critical elements or modules, and finding homologous pathways. With the immense increase of good-quality data from high-throughput genomic and proteomic technologies, studies of these questions are more and more challenging from analytical and computational perspectives.
Network mapping is a convenient tool for comparing and exploring biological networks. When mapping metabolic pathways by matching similar enzymes and chemical reaction chains we can match homologous pathways. Network mapping can be used for predicting unknown pathways, fast and meaningful searching of databases, and potentially establishing evolutionary relations.
Let the pattern be a pathway for which we are searching for homologous pathways in the text, i.e., the known metabolic network of a different species (see Figure 1). This problem includes the Isomorphic Embedding problem, therefore it is NP-hard (see [8]). Given a linear length-$\ell$ pathway as the pattern and a graph as the text, PathBlast (see [9, 8, 14]) finds the image of the pattern in the text such that no consecutive mismatches or gaps on the pattern and the text are allowed. The path-to-path mapping algorithm builds a global alignment graph and decomposes it into linear pathway mappings.
A single enzyme in one pathway may replace a few sequential enzymes in a homologous pathway and vice versa. MetaPathwayHunter [12, 13] finds the optimal homeomorphic tree-to-tree mapping allowing an arbitrary number of gaps. In contrast to the previous approaches, we allow for the mapping of different enzymes from the pattern into the same enzyme from the text while keeping the freedom to map a single edge from the pattern to a path in the text. Such mappings (homomorphisms) are more general than homeomorphisms and easier to handle algorithmically.
Our contributions include: (1) an efficient dynamic-programming-based algorithm and its implementation for finding the minimum cost homomorphism from multi-source trees into arbitrary networks, (2) a new protein similarity score scheme based on the 4-digit EC enzyme hierarchy, (3) an experimental pairwise comparison of all pathways in four different organisms (E. coli, S. cerevisiae, B. subtilis and T. thermophilus) resulting in a reasonably large set of statistically significant pathway similarities, (4) identification of pathways conserved across examined species, potential holes in existing pathway descriptions, and an estimation of the evolutionary relationship between examined species.
Figure 1. An example of network mapping to find an image of the pattern in the text. Pattern: formaldehyde oxidation V pathway in B. subtilis. Text: formylTHF biosynthesis pathway in E. coli.
The remainder of the paper is organized as follows. The next section describes previous work. Section 3 presents the proposed models for mapping with the definition of the protein similarity score scheme. Section 4 introduces necessary definitions and the graph-theoretical problem formulation. Section 5 presents our dynamic programming algorithm and analyzes its runtime. Section 6 describes our computational study of metabolic pathways of four organisms. The analysis and validation of the experimental study are given in Section 7. Finally, we draw conclusions in Section 8.
2. Previous Work
The earlier papers comparing pathways did not take into account their topology and instead focused only on similarity between proteins or genes. Biochemical pathway similarity was defined in terms of sequence similarity of involved genes (see [5]); a multiple pairwise comparison algorithm utilizing the 4-digit EC enzyme hierarchy was proposed in [15]. The alignment of linear pathways was reduced to the sequence alignment problem in [2].
A series of papers [9, 8, 14] has taken into account the nonlinearity of protein network topology and formulated the mapping problem as follows. Given a linear length-$\ell$ pathway pattern $T = (V_T, E_T)$ and text graph $G = (V_G, E_G)$, find an image of the pattern in the text without consecutive gaps and minimizing mismatches between proteins. A global alignment graph in [9] was built in which each vertex represents a pair of proteins and each edge represents a conserved interaction, gap, or mismatch; their objective is to find the k-highest-scoring path with limited length $\ell$ and no consecutive gaps or mismatches based on the built global graph. The approach takes $O(|V_T|^{\ell+2} |V_G|^2)$ time.
PathBlast with the same problem formulation as [9] was presented in [8]. However, PathBlast's solution is to randomly decompose the text graph into linear pathways which are then aligned against the pattern and then to obtain an optimal mapping based on standard sequence alignment algorithms. The algorithm requires $O(\ell!)$ random decompositions to ensure that no significant alignment is missed, effectively limiting the size of the query to about six vertices.
In [13], metabolic pathways were modeled as outgoing trees; the problem was reduced to the approximately labeled tree homeomorphism problem. The proposed bottom-up dynamic programming algorithm has runtime $O(m^2 n / \log m + m n \log n)$, where $m$ and $n$ are the number of vertices in the pattern and the text, respectively.
In [11] the problem was formulated as an integer quadratic problem to obtain the global similarity score based on mapping as many similar node pairs and similar edge pairs as possible. An exhaustive search approach was employed in [17] to find the vertex-to-vertex and path-to-path mappings with the maximal mapping score under the condition of limited length of gaps or mismatches. The algorithm has worst-case time complexity $O(2^m \times m^2)$. A label-diversity backtrack algorithm was proposed in [16] to align two networks with cycles based on mapping as many similar path pairs as possible.
Finally, a more general approach to aligning metabolic pathways, homomorphisms allowing edge-to-path mapping, was first proposed in [3].
3. Modeling Metabolic Pathway Mappings
A metabolic pathway is a series of chemical reactions
catalyzed by enzymes that occur within a cell. Metabolic
pathways are represented by directed networks in which
vertices correspond to enzymes and there is a directed edge
from one enzyme to another if the product of the reaction
catalyzed by the first enzyme is a substrate of the reaction
catalyzed by the second.
Mapping metabolic pathways should capture the similarities of enzymes represented by proteins as well as topological properties that cannot always be reduced to sequential reactions represented by paths. Below we first describe our approach to measuring enzyme similarity and then discuss advantages and drawbacks of the homomorphism mappings of metabolic networks.
Our implementation provides two alternative enzyme similarity scores. One approach is to employ the lowest common upper class distribution proposed in [15] and discussed in [13]. The corresponding penalty score for a gap is 2.0.
Our new approach makes full use of EC encoding and the tight reaction property classified by EC. The EC number is expressed with a 4-level hierarchical scheme. The 4-digit EC number $d_1.d_2.d_3.d_4$ represents a sub-sub-subclass indication of a biochemical reaction. If $d_1.d_2$ of two enzymes are different, their similarity score is infinite; if $d_3$ of two enzymes are different, their similarity score is 10; if $d_4$ of two enzymes are different, their similarity score is 1; otherwise the similarity score is 0. The corresponding penalty score for a gap is 0.5. Our experimental study indicates that the proposed similarity score scheme results in biochemically more relevant pathway matches.
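The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `ec_similarity` and the constant `GAP_PENALTY` are hypothetical, and the sketch assumes fully specified EC numbers compared level by level as plain strings (how unresolved levels are scored is not stated in the text).

```python
import math

GAP_PENALTY = 0.5  # gap penalty quoted in the text for the EC-based scheme


def ec_similarity(ec_a: str, ec_b: str) -> float:
    """Penalty for mapping enzyme ec_a to enzyme ec_b (0 means identical)."""
    a = ec_a.split(".")
    b = ec_b.split(".")
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected a 4-level EC number such as '1.5.1.5'")
    if a[:2] != b[:2]:   # d1.d2 differ: mapping is forbidden
        return math.inf
    if a[2] != b[2]:     # d3 differs
        return 10.0
    if a[3] != b[3]:     # only d4 differs
        return 1.0
    return 0.0
```

For instance, under these assumptions `ec_similarity("1.5.1.5", "1.5.1.20")` gives 1.0, while two enzymes from different top-level classes receive an infinite penalty and can never be matched.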
The topology of most metabolic pathways is a simple path, but frequently pathways may branch or have several incoming arcs. All such topologies are instances of a multi-source tree, i.e., a directed graph which becomes an undirected tree when edge directions are disregarded. The query pathways are usually simple and can be represented as a multi-source tree, but in some cases they can have a cycle or alternative ways to reach the same vertex. In that case we suggest following the standard practice of breaking such cycles or paths by removing edges.
The obvious way to preserve the pathway topology is to use isomorphic embedding, i.e., a one-to-one correspondence between vertices and edges of the pattern and its image in the text. The requirement on edges can be relaxed: an edge in the pattern can be mapped to a path in the text ([12, 13]), and the corresponding mapping is called a homeomorphism. The computational drawback of isomorphic embedding and homeomorphism is that the problem of finding the optimal mapping is NP-complete and, therefore, requires severe constraints on the topology of the text to become efficient. In [12, 13], the text is supposed to be a tree and the pattern should be a directed tree, while allowing a multi-source tree pattern complicates the algorithm. Their algorithm is complex and slow because it repeatedly finds minimum weight perfect matchings.
In this paper we propose to additionally relax the one-to-one correspondence between vertices: instead we allow different pattern vertices to be mapped to a single text vertex. The corresponding mapping is called a homomorphism. Such relaxation may sometimes cause confusion, since a path can be mapped to a cycle. For instance, if two enzymes with similar functions belong to the same path in the pattern and a cycle containing a similar enzyme belongs to the text, then the path can be mapped into a cycle (see Figure 2). However, if the text graph is acyclic this cannot happen. Even if there are cycles in the text, one can still expect that functionally similar enzymes are very rare in the same path.
Computing minimum cost homomorphisms is much simpler and faster than homeomorphisms. We will show that a fast dynamic programming algorithm can find the minimum cost homomorphism that allows edge-to-path mapping for the multi-source tree pattern and an arbitrary text graph.
Figure 2. Homomorphism of a path (A, B, ..., C, D) in the pattern onto a cycle (A', B', ..., C', D' = A') in the text, with A' = f(A) = f(D), B' = f(B), C' = f(C). Legend: path edge; vertex-to-vertex mapping.
4. Graph-Theoretical Problem Formulation
We first give notations and definitions and then formulate the corresponding graph-theoretical problem.
A pattern $T = (V_T, E_T)$ is a directed graph with vertex set $V_T$ and edge set $E_T$. We only consider the case of $T$ being a multi-source tree. Following [13], a multi-source tree is a directed graph whose underlying undirected graph is a tree. It is not necessarily a directed tree: each node can have several incoming as well as several outgoing edges.
A text $G = (V_G, E_G)$ is a directed graph with vertex set $V_G$ and edge set $E_G$. We further distinguish the case when $G$ is a general network and the case when $G$ is a directed acyclic graph.
A mapping $f : T \to G$ from pattern $T = (V_T, E_T)$ to text $G = (V_G, E_G)$ is called a homomorphism if
(1) every vertex in $V_T$ is mapped to a vertex in $V_G$;
(2) every edge $e = (u, v) \in E_T$ is mapped to a directed path $f(e) = (u_0 = f(u), u_1, u_2, \ldots, u_k = f(v))$ in $G$.
We will now introduce the cost of a homomorphism. Let $\Delta(u, v)$, $u \in V_T$, $v \in V_G$, be the cost of mapping an enzyme corresponding to the pattern vertex $u$ into an enzyme corresponding to the text vertex $v$.
Following [13], the rule (2) allows edge-to-path mapping, but edge-to-edge mapping is still preferable. Therefore, the homomorphism cost should increase proportionally to the number of extra hops in the images of edges, i.e.,
$$\sum_{e \in E_T} (|f(e)| - 1),$$
where $|f(e)| = k$ is the number of hops in the path $f(e) = (u_0 = f(u), u_1, u_2, \ldots, u_k = f(v))$.
Following [13], the cost of a homomorphism $f : T \to G$ takes into account the cost of enzyme mapping and edge-to-path mapping as follows:
$$\mathrm{cost}(f) = \sum_{v \in V_T} \Delta(v, f(v)) + \lambda \sum_{e \in E_T} (|f(e)| - 1),$$
where $\lambda$ is the cost of a single extra hop in an edge-to-path mapping. This parameter balances enzyme mapping and edge-to-path costs. If $\lambda = 0$, then only the enzyme mapping cost is taken into account. If $\lambda$ is very large, then the enzyme mapping cost contribution is negligible. In our computational experiments we use $\lambda = 0.5$.
Finally, the graph-theoretical problem formulation is as follows.
Minimum Cost Homomorphism Problem. Given a multi-source pattern tree $T$ and a text graph $G$, find the minimum cost homomorphism $f : T \to G$.
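To make the objective concrete, the following sketch evaluates the cost of one given mapping. It is illustrative only: the function name `homomorphism_cost` and its argument layout are assumptions, each pattern edge is assumed to be routed along a shortest path in the text, and `hops` is assumed to hold the hop count for every ordered pair of text vertices that is used.

```python
def homomorphism_cost(pattern_edges, mapping, delta, hops, lam=0.5):
    """Cost of a fixed vertex mapping f.

    pattern_edges: list of directed pattern edges (u, v)
    mapping:       dict pattern vertex -> text vertex (the mapping f)
    delta:         function (pattern vertex, text vertex) -> enzyme mapping cost
    hops:          dict (text vertex, text vertex) -> hops on the chosen path
    lam:           cost of a single extra hop (0.5 in the experiments)
    """
    # First term: sum of enzyme mapping costs over all pattern vertices.
    vertex_cost = sum(delta(v, image) for v, image in mapping.items())
    # Second term: number of extra hops over all pattern edges.
    extra_hops = sum(hops[(mapping[u], mapping[v])] - 1 for u, v in pattern_edges)
    return vertex_cost + lam * extra_hops
```

With a single pattern edge whose image is a two-hop path and a total enzyme mapping cost of 1.0, the sketch returns 1.0 + 0.5 · (2 − 1) = 1.5.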
5. Dynamic Programming Algorithm
We will first describe preprocessing of the text graph $G$ and ordering of vertices of the pattern graph $T$. Then we define the dynamic programming table and show how to fill that table in a bottom-up manner. We conclude with the runtime analysis of the entire algorithm.
Text Graph Preprocessing. In order to compute the cost of a homomorphism it is necessary to know the number of hops for any shortest path in the text graph $G$. Although finding single-source shortest paths in general graphs is slow, in our case it is sufficient to run breadth-first search with runtime $O(|E_G| + |V_G|)$. Assuming that $G$ is connected, i.e., $|E_G| \ge |V_G|$, we conclude that the total runtime of finding all shortest paths is $O(|V_G| |E_G|)$. In the resulting transitive closure $G' = (V_G, E_{G'})$ of the graph $G$, each edge $e \in E_{G'}$ is supplied with the number of hops $h(e)$ in the shortest path connecting its ends.
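The preprocessing step can be sketched as one breadth-first search per text vertex. The function name `hop_counts` and the adjacency-dictionary input are assumptions made for this example; the sketch also assumes that every vertex appears as a key of `adj` and it does not record pairs of the form (s, s).

```python
from collections import deque


def hop_counts(adj):
    """All-pairs hop counts of a directed graph via repeated BFS.

    adj: dict vertex -> iterable of successor vertices.
    Returns a dict (s, w) -> number of hops on a shortest s-to-w path,
    defined only for w reachable from s with w != s.
    """
    h = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w not in dist:          # first visit gives the shortest hop count
                    dist[w] = dist[u] + 1
                    h[(s, w)] = dist[w]
                    queue.append(w)
    return h
```

The keys of the returned dictionary play the role of the edge set of the transitive closure, and the values play the role of h(e).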
Pattern Graph Ordering. We will further need a certain fixed order of vertices in $V_T$ as follows. Let $T' = (V_T, E_{T'})$ be the undirected tree obtained from $T$ by disregarding edge directions. Let us choose an arbitrary vertex $r \in V_T$ as a root and run depth-first search (DFS) in $T'$ from $r$. Let $\{r = v_1, \ldots, v_{|V_T|}\}$ be the order of the DFS traversal of $V_T$ and let $e'_i = (v_i, v) \in E_{T'}$ (corresponding to directed edge $e_i \in E_T$) be the unique edge connecting $v_i$ to the set $\{v_1, \ldots, v_{i-1}\}$. The vertex $v \in \{v_1, \ldots, v_{i-1}\}$ is called a parent of $v_i$ and $v_i$ is called a child of $v$.
DP Table. Now we will describe our dynamic programming table $DT[1, \ldots, |V_T|][1, \ldots, |V_G|]$. Each row and column of this table corresponds to a vertex of $T$ and $G$, respectively. While the columns $u_1, \ldots, u_{|V_G|}$ of $DT$ are in no particular order, the rows $\{r = v_1, \ldots, v_{|V_T|}\}$ of $DT$ are sorted according to the DFS traversal of $T'$. Each element $DT[i, j]$ is equal to the best cost of a homomorphism from the subgraph of $T$ induced¹ by vertices $\{v_{|V_T|}, v_{|V_T|-1}, \ldots, v_i\}$ into $G'$ which maps $v_i$ into $u_j$.
Filling DP Table. The table $DT$ is filled bottom-up for $i = |V_T|, |V_T| - 1, \ldots, 1$ as follows. If $v_i$ is not a parent of any vertex in $T$, then $v_i$ is a leaf and
$$DT[i, j] = \Delta(v_i, u_j).$$
In general, let $v_i$ be a parent of the vertices $v_{i_1}, \ldots, v_{i_k}$. In order to compute $DT[i, j]$, we should find the cheapest mapping of each of the children $v_{i_1}, \ldots, v_{i_k}$ subject to $v_i$ being mapped to $u_j$. The mappings of the children do not depend on each other since the only connection between them in the tree $T$ is through $v_i$. Therefore, each child $v_{i_l}$, $l = 1, \ldots, k$, should be mapped into $u_{j_l}$ minimizing the contribution of $v_{i_l}$ to the total cost
$$C[i_l, j_l] = DT[i_l, j_l] + \lambda (h(j, j_l) - 1),$$
where $h(j, j_l)$ depends on the direction of $e_{i_l}$, i.e., $h(j, j_l) = h(u_j, u_{j_l})$ if $e_{i_l} = (v_i, v_{i_l})$ and $h(j, j_l) = h(u_{j_l}, u_j)$ if $e_{i_l} = (v_{i_l}, v_i)$. Finally,
$$DT[i, j] = \Delta(v_i, u_j) + \sum_{l=1}^{k} \min_{j' = 1, \ldots, |V_G|} C[i_l, j'].$$
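The recurrence can be turned into a compact sketch. It is an illustration under stated assumptions rather than the authors' code: the name `min_cost_homomorphism`, the dictionary-based table, and the traceback via `choice` are all hypothetical; the pattern is assumed to be a connected multi-source tree, and `hops` is assumed to map ordered pairs of text vertices to shortest-path hop counts (pairs that are missing are treated as unreachable). For simplicity the inner loop scans all text vertices instead of only the neighbors in the transitive closure.

```python
import math


def min_cost_homomorphism(pattern_vertices, pattern_edges, text_vertices,
                          hops, delta, lam=0.5):
    """Return (cost, mapping) of a cheapest homomorphism, or (inf, None)."""
    directed = set(pattern_edges)
    # Undirected adjacency of the pattern tree T'.
    nbrs = {v: [] for v in pattern_vertices}
    for u, v in pattern_edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    # Traverse T' from an arbitrary root; each vertex is listed after its parent.
    root = pattern_vertices[0]
    parent, order, stack = {root: None}, [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in nbrs[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)
    children = {v: [] for v in pattern_vertices}
    for v in order[1:]:
        children[parent[v]].append(v)

    table = {}    # table[v][x]: best cost for v and its descendants when v maps to x
    choice = {}   # choice[(c, x)]: best image of child c when its parent maps to x
    for v in reversed(order):          # children are filled before their parents
        table[v] = {}
        for x in text_vertices:
            total = delta(v, x)
            for c in children[v]:
                best, best_y = math.inf, None
                for y in text_vertices:
                    # Respect the direction of the pattern edge between v and c.
                    key = (x, y) if (v, c) in directed else (y, x)
                    if key in hops:
                        cand = table[c][y] + lam * (hops[key] - 1)
                        if cand < best:
                            best, best_y = cand, y
                total += best
                choice[(c, x)] = best_y
            table[v][x] = total

    best_root = min(text_vertices, key=lambda x: table[root][x])
    cost = table[root][best_root]
    if math.isinf(cost):
        return cost, None
    mapping = {root: best_root}
    for v in order[1:]:                # parents are mapped before their children
        mapping[v] = choice[(v, mapping[parent[v]])]
    return cost, mapping
```

Because a child's best image depends only on the image of its parent, storing one `choice` entry per (child, parent image) pair is enough to recover the optimal mapping after the table is filled.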
Runtime Analysis. As we mentioned earlier, the runtime for constructing the transitive closure $G' = (V_G, E_{G'})$ is $O(|V_G| |E_G|)$. The runtime to fill a cell $DT[i, j]$ is proportional to
$$t_{ij} = \deg_T(v_i) \deg_{G'}(u_j),$$
where $\deg_T(v_i)$ and $\deg_{G'}(u_j)$ are the degrees of $v_i$ and $u_j$ in graphs $T$ and $G'$, respectively. Indeed, the number of children of $v_i$ is $\deg_T(v_i) - 1$ and for each child $v_{i_l}$ of $v_i$ there are at most $\deg_{G'}(u_j)$ feasible positions in $G'$ since $f(v_i)$ and $f(v_{i_l})$ should be adjacent. The runtime to fill the entire table $DT$ is proportional to
$$\sum_{j=1}^{|V_G|} \sum_{i=1}^{|V_T|} t_{ij} = \sum_{j=1}^{|V_G|} \deg_{G'}(u_j) \sum_{i=1}^{|V_T|} \deg_T(v_i) = 2 |E_{G'}| |E_T|.$$
Thus the total runtime is $O(|V_G| |E_G| + |E_{G'}| |V_T|)$. Even though $G$ is sparse, $|E_{G'}|$ may be as large as $O(|V_G|^2)$, i.e., the runtime is $O(|V_G| (|E_G| + |V_G| |V_T|))$.
¹ A subgraph induced by the subset of vertices S includes only edges that have both ends in S.
6. Mapping Metabolic Pathways
In this section we first describe the metabolic pathway data, then explain how we measure statistical significance of homomorphisms and report the results of pairwise mappings between four species.
Pattern network   Metric                          Text network (number of pathways)
(tree pathways)                                   B. subtilis (226)  T. thermophilus (208)  E. coli (113)  S. cerevisiae (151)
T. thermophilus   # of mapping pairs              21                 38                     18             18
                  # of mapped pattern pathways    14                 28                     12             13
                  # of mapped text pathways       20                 35                     18             17
B. subtilis       # of mapping pairs              217                162                    121            58
                  # of mapped pattern pathways    143                80                     85             39
                  # of mapped text pathways       153                106                    92             40
E. coli           # of mapping pairs              5                  9                      38             3
                  # of mapped pattern pathways    3                  2                      3              2
                  # of mapped text pathways       5                  9                      14             5
S. cerevisiae     # of mapping pairs              12                 24                     12             14
                  # of mapped pattern pathways    7                  9                      6              13
                  # of mapped text pathways       12                 21                     12             14
Table 1. Pairwise statistical homomorphisms among T. thermophilus, B. subtilis, E. coli and S. cerevisiae.
Data. The genome-scale metabolic network data in our studies were drawn from BioCyc [1, 7, 10], the collection of 260 Pathway/Genome Databases, each of which describes metabolic pathways and enzymes of a single organism. We have chosen metabolic networks of E. coli, the yeast S. cerevisiae, the eubacterium B. subtilis and the archaebacterium T. thermophilus so that they cover major lineages Archaea, Eukaryotes, and Eubacteria. The bacterium E. coli with 113 pathways is the most extensively studied prokaryotic organism. T. thermophilus with 208 pathways belongs to Archaea. B. subtilis with 226 pathways is one of the best understood Eubacteria in terms of molecular biology and cell biology. S. cerevisiae with 151 pathways is the most thoroughly researched eukaryotic microorganism.
Statistical Significance of Mapping. Although the cost of a homomorphism reflects the similarity of pathways, it alone cannot assure us that such a cost is not obtained by chance. Only statistically significant cost values can be taken into account. Statistical significance is measured by the p-value, i.e., the probability that a cost value at least this good is obtained by pure chance. Following a standard randomization procedure, we randomly permute pairs of edges $(u, v)$ and $(u', v')$, if no other edges exist between these 4 vertices $u, u', v, v'$ in the text graph, by reconnecting them as $(u, v')$ and $(u', v)$. This allows us to keep the incoming and outgoing degree of each vertex intact. We find the minimum cost homomorphism from the pattern graph into the fully randomized text graph and check if its cost is at least as big as the minimum cost before randomization of the text graph. We say that the homomorphism is statistically significant with $p < 0.01$ if we found at most 9 better costs in 1000 randomizations of the text graph.
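The randomization procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the names `swap_edges` and `empirical_p_value` are hypothetical, the number of swap attempts needed for a "fully randomized" graph is not given in the text and is left as a parameter, and the input is assumed to contain at least two edges with sortable vertex labels.

```python
import random


def swap_edges(edges, num_attempts, rng):
    """Degree-preserving randomization of a directed graph by edge swaps."""
    edges = set(edges)
    for _ in range(num_attempts):
        (u, v), (u2, v2) = rng.sample(sorted(edges), 2)
        four = {u, v, u2, v2}
        if len(four) < 4:
            continue  # the swap needs four distinct vertices
        # Skip the swap if any other edge connects two of the four vertices.
        if any(a in four and b in four and (a, b) not in ((u, v), (u2, v2))
               for a, b in edges):
            continue
        edges -= {(u, v), (u2, v2)}
        edges |= {(u, v2), (u2, v)}   # in- and out-degrees are unchanged
    return edges


def empirical_p_value(observed_cost, randomized_costs):
    """Fraction of randomized text graphs whose best cost beats the observed one."""
    better = sum(1 for c in randomized_costs if c < observed_cost)
    return better / len(randomized_costs)
```

With 1000 randomizations, at most 9 better costs gives an empirical value of at most 0.009, which matches the significance threshold p < 0.01 stated above.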
Experiments. For each pair of the four species (B. subtilis, E. coli, T. thermophilus and S. cerevisiae), using our algorithm we find the best homomorphism from each pathway of one species to each pathway of the other and check if this homomorphism is statistically significant, i.e., if $p < 0.01$. We have run our experiments on a Pentium 4 processor, 2.99 GHz clock with 1.00 GB RAM. The total runtime was 1.5 h for the input/output of pathways and computing the optimal pattern-to-text mapping and its p-value for every pair of pathways (there are in total 516052 pattern-text pathway pairs).
Results. The results of our experiments are reported in Table 1. The first column contains the name of the species from whose metabolic network the pattern pathways have been chosen. Note that if a pathway is not a multi-source tree or is degenerate (i.e., has fewer than 3 nodes), then it is omitted. We did not omit any pathway from the text species since our algorithm supports any network as a text. For every species-to-species mapping, we compute the number of mapped pairs with $p < 0.01$, the number of the pattern pathways that have at least one statistically significant homomorphic image and the number of the text pathways that have at least one statistically significant homomorphic preimage.
For example, for homomorphism from T. thermophilus to B. subtilis, there are in total 21 statistically significant mapped pairs, 14 non-degenerate tree T. thermophilus pathways have statistically significant homomorphic images in B. subtilis and 20 out of 226 B. subtilis pathways have statistically significant homomorphic preimages.
7. Implications of Pathway Mappings
In this section we identify pathways conserved across multiple species, show how one can resolve enzyme ambiguity and identify potential holes in pathways, and phylogenetically validate our pathway mappings.
Identifying Conserved Pathways. We first have identified the pathways that are conserved across all 4 species under consideration. Table 2 contains a list of all 20 pathways in B. subtilis that have statistically significant homomorphic images simultaneously in all species. The lower part of Table 2 contains 4 more pathways with different names in E. coli, T. thermophilus and S. cerevisiae, which have simultaneous statistically significant images in all species.
Besides 24 pathways conserved across all 4 species we have also found 18 pathways only common for triples of these species. Table 3 gives the pathway names for each possible triple of species (the triple E. coli, T. thermophilus and S. cerevisiae does not have extra conserved pathways).

Pathway name
  alanine biosynthesis I
  biotin biosynthesis I
  coenzyme A biosynthesis
  fatty acid beta
  fatty acid elongation saturated
  formaldehyde oxidation V (tetrahydrofolate pathway)
  glyceraldehyde 3-phosphate degradation
  histidine biosynthesis I
  homoserine biosynthesis
  lysine biosynthesis I
  ornithine biosynthesis
  phenylalanine biosynthesis I
  phenylalanine biosynthesis II
  polyisoprenoid biosynthesis
  proline biosynthesis I
  quinate degradation
  serine biosynthesis
  superpathway of gluconate degradation
  tyrosine biosynthesis I
  UDP-galactose biosynthesis
(lower part)
  alanine biosynthesis
  biotin biosynthesis
  fatty acid oxidation pathway
  fructoselysine and psicoselysine degradation
Table 2. The list of all 20 pathways in B. subtilis that have statistically significant homomorphic images simultaneously in all 3 other species E. coli, T. thermophilus and S. cerevisiae. The lower part contains 4 more different pathways with statistically significant images in all 4 species.

Pathway name
triple: B. subtilis, E. coli, and T. thermophilus
  4-aminobutyrate degradation I
  de novo biosynthesis of pyrimidine deoxyribonucleotides
  de novo biosynthesis of pyrimidine ribonucleotides
  enterobacterial common antigen biosynthesis
  phospholipid biosynthesis I
  PRPP biosynthesis II
  salvage pathways of pyrimidine deoxyribonucleotides
  ubiquinone biosynthesis
  flavin biosynthesis
  glycogen biosynthesis I (from ADP-D-Glucose)
  L-idonate degradation
  lipoate biosynthesis and incorporation I
  menaquinone biosynthesis
  NAD biosynthesis I (from aspartate)
triple: B. subtilis, E. coli, and S. cerevisiae
  oxidative branch of the pentose phosphate pathway
  S-adenosylmethionine biosynthesis
triple: B. subtilis, T. thermophilus, and S. cerevisiae
  tyrosine biosynthesis I
  fatty acid elongation unsaturated I
Table 3. The list of 14 pathways conserved across B. subtilis, E. coli, and T. thermophilus; 2 more pathways conserved across B. subtilis, E. coli, and S. cerevisiae; 2 more pathways conserved across B. subtilis, T. thermophilus, and S. cerevisiae.
Figure 3. Mapping of glutamate degradation VII pathways from B. subtilis to T. thermophilus (p < 0.01). Each node with an upper and a lower part represents a vertex-to-vertex mapping: the upper part is the query enzyme and the lower part is the text enzyme. A shaded node reflects enzyme homology.
Figure 4. Mapping of the interconversion of arginine, ornithine and proline pathway from T. thermophilus to B. subtilis (p < 0.01). Each node with an upper and a lower part represents a vertex-to-vertex mapping: the upper part is the query enzyme and the lower part is the text enzyme. A shaded node reflects enzyme homology.
Resolving Ambiguity. Currently, multiple pathways contain unresolved enzymes. Completely unresolved enzymes have EC notation 0.0.0.0/-.-.-.-, and partially unresolved enzymes have fewer "-"s, e.g., EC notation 1.2.4.-. We can use our mapping tool to suggest possible resolutions of these ambiguities as follows.
Let us consider two examples of homomorphism: the mapping of the glutamate degradation VII pathway in B. subtilis to the glutamate degradation VII pathway in T. thermophilus (shown in Figure 3), and the mapping of the interconversion of arginine, ornithine and proline pathway in T. thermophilus to the interconversion of arginine, ornithine and proline pathway in B. subtilis (shown in Figure 4). When an enzyme label in a pathway ends with ".-", the exact reaction it catalyzes is not specified. The mapping results indicate that an enzyme with a function similar to that of the unclear enzyme may be found in another species.
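The resolution idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: the function names (`ec_fields`, `is_unresolved`, `compatible`, `suggest_resolutions`) and the example mapping are hypothetical, and the sketch assumes that a mapping is available as query enzyme to text enzyme pairs and that a suggestion is acceptable whenever the resolved EC fields of the query agree with the text enzyme.

```python
def ec_fields(ec):
    """Split an EC number such as '1.2.4.-' into its four fields."""
    fields = ec.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec!r}")
    return fields


def is_unresolved(ec):
    """True for the placeholder 0.0.0.0 or any label containing a '-' field."""
    return ec == "0.0.0.0" or "-" in ec_fields(ec)


def compatible(query, text):
    """True if every resolved field of `query` agrees with `text`."""
    if query == "0.0.0.0":  # completely unresolved: nothing to contradict
        return True
    return all(q == "-" or q == t
               for q, t in zip(ec_fields(query), ec_fields(text)))


def suggest_resolutions(mapping):
    """Propose the text EC number for each unresolved query enzyme that is
    mapped to a fully resolved, compatible text enzyme (assumed criterion)."""
    suggestions = {}
    for query, text in mapping.items():
        if is_unresolved(query) and not is_unresolved(text) and compatible(query, text):
            suggestions[query] = text
    return suggestions


# Hypothetical vertex-to-vertex mapping (query enzyme -> text enzyme),
# made up for illustration; it is not read off Figure 3 or Figure 4.
mapping = {
    "2.6.1.1": "2.6.1.1",  # already resolved, nothing to suggest
    "1.2.4.-": "1.2.4.2",  # partially unresolved, prefix agrees
    "0.0.0.0": "5.1.1.4",  # completely unresolved
    "1.5.1.-": "2.1.3.3",  # resolved prefix disagrees, no suggestion
}
print(suggest_resolutions(mapping))
```

A dictionary keyed by EC label is the simplest representation, but it assumes each query label occurs once; a pathway with repeated labels would need vertex identifiers as keys.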
Figure 5. Mapping of formaldehyde oxidation V pathway in B. subtilis to formylTHF biosynthesis pathway in E. coli (p < 0.01). A node with an upper part and a lower part represents a vertex-to-vertex mapping. The upper part represents the query enzyme and the lower part represents the text enzyme. A node with a dashed box represents a gap.
Holes in Pathways. Pathway holes occur when a genome appears to lack the enzymes needed to catalyze reactions in a pathway [6]. We can use our mapping tool to identify potential pathway holes, as shown in the following example (see Figure 5).
There is a statistically significant mapping from formaldehyde oxidation V (tetrahydrofolate pathway) in B. subtilis to formylTHF biosynthesis in E. coli. The correspondence between enzymes shows a gap: enzyme 3.5.1.10 is missing. We found that the enzyme 3.5.1.10 exists in B. subtilis according to the Enzyme and Swiss-Prot databases.
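The hole-finding step described above can be illustrated with a small Python sketch. It is a simplification under explicit assumptions, not the paper's algorithm: the helper names (`shortest_path`, `hole_candidates`), the abstract enzyme labels and the toy graphs are hypothetical, and each pattern edge is assumed to be realized by a shortest text path between the images of its endpoints. Text enzymes that lie strictly inside such a path are reported as gaps, and a gap counts as a confirmed hole candidate when the enzyme is known to exist in the pattern organism (as 3.5.1.10 does in B. subtilis in the example above).

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search in a directed graph given as adjacency lists.
    Returns the list of vertices from source to target, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in graph.get(v, ()):
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None


def hole_candidates(pattern_edges, mapping, text_graph, known_enzymes):
    """Collect text enzymes skipped by mapped pattern edges, split by
    whether the enzyme is known in the pattern organism."""
    confirmed, unconfirmed = set(), set()
    for u, v in pattern_edges:
        path = shortest_path(text_graph, mapping[u], mapping[v])
        if path is None:
            continue  # this edge has no image path in the text network
        for enzyme in path[1:-1]:  # interior vertices are the gaps
            (confirmed if enzyme in known_enzymes else unconfirmed).add(enzyme)
    return confirmed, unconfirmed


# Toy data with abstract labels (hypothetical, not the real pathways).
pattern_edges = [("E1", "E2"), ("E2", "E3")]
mapping = {"E1": "E1", "E2": "E2", "E3": "E3"}
text_graph = {"E1": ["E2"], "E2": ["GAP"], "GAP": ["E3"]}
known_enzymes = {"E1", "E2", "E3", "GAP"}  # e.g. from an enzyme database

print(hole_candidates(pattern_edges, mapping, text_graph, known_enzymes))
```

On this toy input the pattern edge from E2 to E3 is mapped to the text path E2, GAP, E3, so GAP is reported as a confirmed candidate.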
Phylogenetic Validation. One can measure similarity between species based on the number of conserved pathways. The largest number of conserved pathways is found between B. subtilis and T. thermophilus: the two species-to-species mappings have in total 183 statistically significant pairs of pathways. The next closest two species are E. coli and B. subtilis, which have 126 statistically significant pairs of pathways. This agrees with the fact that B. subtilis, T. thermophilus, and E. coli are prokaryotes and S. cerevisiae is a eukaryote.
8 Conclusions
In this paper we have introduced a new method of mapping metabolic pathways based on homomorphism. The proposed mapping approach allows different enzymes of the pattern pathway to be mapped into a single enzyme of a text network. We have also defined a novel scoring scheme for computing similarity between enzymes based on their EC notation.
We have formulated the graph-theoretical problem corresponding to finding an optimal homomorphism from the pattern network to a text network. We give an efficient dynamic-programming method for exactly solving this problem when the pattern is a multisource tree and the text is an arbitrary network.
We have applied our mapping tool to pairwise mapping of all pathways for four organisms (E. coli, S. cerevisiae, B. subtilis and T. thermophilus) representing the main different lineages and found a reasonably large set of statistically significant pathway similarities.
We report 24 pathways that are conserved across all 4 species as well as 18 more pathways that are conserved across at least three of these species. We show that our mapping tool can be used for identifying potential pathway holes as well as for resolving enzyme notation ambiguities in existing pathway descriptions.
Acknowledgments
We thank Professor Pinter and Oleg Rokhlenko for giving us access to their software package MetaPathwayHunter, and Professor Karp and Dr. Kaipa for providing the pathway databases for T. thermophilus, B. subtilis, E. coli and S. cerevisiae. We also thank Amit Sabnis, Dipendra Kaur, and Kelly Westbrooks for helpful discussions.
References
[1] http://www.biocyc.org/.
[2] M. Chen and R. Hofestädt. An algorithm for linear metabolic pathway alignment. In Silico Biology (In Silico Biol.), ISSN 1386-6338: 111-128, 2005.
[3] Q. Cheng and A. Zelikovsky. Optimal mapping of multi-source trees into DAG in biological network. ISBRA'07 Poster, May 2007.
[4] T. Dandekar, S. Schuster, B. Snel, M. Huynen, and P. Bork. Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochem. J., 1: 115-124, 1999.
[5] C. V. Forst and K. Schulten. Evolution of metabolism: a new method for the comparison of metabolic pathways using genomics information. J. Comput. Biol., 6: 343-360, 1999.
[6] M. L. Green and P. D. Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, Sep. 2004.
[7] I. M. Keeler, V. J. Collard, C. S. Gama, J. Ingrafts, S. Palely, I. T. Paulson, M. Peralta-Gil, and P. D. Karp. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Research, 33(1): D334-337, 2006.
[8] B. P. Kelly, R. Sharan, R. M. Karp, T. Sittler, D. E. Root, and B. R. Stockwell. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Research, Vol. 32: W83-W88, 2004.
[9] B. P. Kelly, R. Sharan, R. M. Karp, T. Sittler, D. E. Root, and B. R. Stockwell. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 11394-11399, Sep. 30 2003.
[10] C. J. Krieger, P. Zhang, L. A. Mueller, A. Wang, S. Paley, M. Arnaud, J. Pick, S. Rheme, and P. Karp. MetaCyc: a microorganism database of metabolic pathways and enzymes. Nucleic Acids Research, 32(1): D438-42, 2006.
[11] Z. Li, Y. Wang, S. Zhang, X.-S. Zhang, and L. Chen. Alignment of protein interaction networks by integer quadratic programming. EMBS '06. 28th Annual International Conference of the IEEE, 5527-5530, Aug. 2006.
[12] R. Pinter, O. Rokhlenko, D. Tsur, and M. Ziv-Ukelson. Approximate labeled subtree homeomorphism. In Proceedings of 15th Annual Symposium of Combinatorial Pattern Matching.
[13] R. Y. Pinter, O. Rokhlenko, E. Yeger-Lotem, and M. Ziv-Ukelson. Alignment of metabolic pathways. Bioinformatics.
[14] R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R. M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. PNAS, Vol. 102: 1974-1979, 2005.
[15] Y. Tohsato, H. Matsuda, and A. Hashimoto. A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy. Proc. 8th International Conference on Intelligent Systems for Molecular Biology, 376-383, ISMB 2000.
[16] S. Wernicke. Combinatorial algorithms to cope with the complexity of biological networks. Dissertation, December 2006.
[17] Q. Yang and S.-H. Sze. Path matching and graph matching in biological networks. Journal of Computational Biology, Vol. 14, No. 1: 56-67, 2007.