Global Alignment of Protein-Protein Interaction Networks

Misael Mongiovì, Roded Sharan

Blavatnik School of Computer Science, Tel Aviv University,

Tel Aviv, 69978, Israel.

Abstract

Sequence-based comparisons have been the workhorse of Bioinformatics for the past four

decades, furthering our understanding of gene function and evolution. Over the last decade

a plethora of technologies have matured for measuring protein-protein interactions (PPIs)

at large scale, yielding comprehensive PPI networks for over 10 species. In this chapter we

review methods for harnessing PPI networks to improve the detection of orthologous proteins

across species. In particular we focus on pairwise global network alignment methods that

aim to find a mapping between the networks of two species that maximizes the sequence and

interaction similarities between matched nodes. We further suggest a novel evolutionary-

based global alignment algorithm. We then compare the different methods on a yeast-fly-

worm benchmark, discuss their performance differences and conclude with open directions

for future research.

1 Introduction

Over the last decade, high-throughput techniques such as yeast two-hybrid assays [1] and

co-immunoprecipitation experiments [2] have allowed the construction of large-scale networks

of protein-protein interactions (PPIs) for multiple species. Comparative analyses of these

networks have greatly enhanced our understanding of protein function and evolution.

Analogously to the sequence comparison domain, two main concepts have been introduced

in the network comparison context: Local network alignment and global network alignment.

The first considers local regions of the network, aiming to identify small subnetworks that

are conserved across two or more species (where conservation is measured in terms of both

sequence and interaction patterns). Local alignment algorithms have been utilized to detect

protein pathways [3] and complexes that are conserved across multiple species [4, 5, 6], to

predict protein function and to infer novel PPIs [4].

In global network alignment (GNA) the goal is to associate proteins from two or more

species in a global manner so as to maximize the rate of sequence and interaction conservation

across the aligned networks. In its simplest form, the problem calls for identifying a 1-1

mapping between the proteins of two species so as to optimize some conservation criterion.

Extensions of the problem consider multiple networks and many-to-many (rather than 1-1)

mappings. Such analyses assist in identifying (functional) orthologous proteins and orthology

families [7] with applications to predicting protein function and interaction. They aim to

improve upon sequence-only methods that partition proteins into orthologous groups based

on sequence similarity computations [8, 9, 10].

GNA methods can be classified into two main categories. The first category contains

matching methods that explicitly search for a one-to-one mapping that maximizes a suitable

scoring function. The scoring function favors mappings that conserve sequence and interac-

tion. Methods in this category include the integer linear programming method of [11] and

a greedy Gradient Ascent method of [12]. The second category includes ranking methods

that consider all possible pairs of interspecies-proteins that are sufficiently sequence-similar,

and rank them according to their sequence and topological similarity. These ranks are then

used to derive a 1-1 mapping. Methods in this category include a Markov Random Field

approach [13], the IsoRank method that is based on Google’s PageRank [7] and a diffusion-based

method, Hybrid RankProp [14]. In addition, there are several very recent ranking

approaches that do not use sequence similarity information at all [15, 16].

Here we aim to propose a third, evolutionary perspective on global alignment by designing

a GNA algorithm that is based on a probabilistic model of network evolution. The evolution

of a network is described in terms of four basic events: gene duplication, gene loss, edge

attachment and edge detachment. This model allows the computation of the probability of

observing extant networks given the ancestral network they originated from; by maximizing

this probability one obtains the most likely ancestor-descendant relations, which naturally

translate into a network alignment.

The chapter is organized as follows: Section 2 introduces notation and defines the problem.

Section 3 reviews GNA methods that are based on graph matching. Section 4 presents the

ranking-based methods. Section 5 describes in

detail the probabilistic model of evolution and the proposed alignment method. The different

approaches are compared in Section 6. Finally, Section 7 gives a brief summary and discusses

future research directions.

2 Preliminaries and Problem Definition

We focus the presentation on methods for pairwise global alignment, where the input consists

of two networks and possibly sequence-similarity information between their nodes, and the

output is a correspondence, commonly one-to-one, between the nodes of the two networks.

A protein network G = (V,E) has a set V of nodes, corresponding to proteins, and a

set $E$ of edges, corresponding to protein-protein interactions (PPIs). For a node $i \in V$, we denote its set of (direct) neighbors by $N(i)$. Let $G_1 = (V_1,E_1)$ and $G_2 = (V_2,E_2)$ be the two networks to be aligned. Let $R \subseteq V_1 \times V_2$ be a compatibility relation between proteins of the two networks, representing pairs of proteins that are sufficiently sequence-similar. A many-to-many correspondence that is consistent with $R$ is any subset $R^* \subseteq R$. Under such a correspondence, we say that an edge $(u,v)$ in one of the networks is conserved if there exists an edge $(u',v')$ in the other network such that $(u,u'),(v,v') \in R^*$ or $(u',u),(v',v) \in R^*$. We let $T(G_1,G_2) = \{(u,u',v,v') : (u,v),(u',v') \in R,\ (u,u') \in E_1,\ (v,v') \in E_2\}$ denote the set of all quadruples of nodes that induce a conserved interaction (note that in this set $u,u' \in V_1$ and $v,v' \in V_2$).

In its simplest formulation, the alignment problem is defined as the problem of finding

an injective function (one-to-one mapping) $\varphi : V_1 \to V_2$ such that: (i) it is consistent with

R; and (ii) it maximizes the number of conserved interactions. More elaborate formulations

of the problem can relax the 1-1 mapping to a many-to-many mapping and possibly define

an alignment score to be optimized that combines the amount of interaction conservation

and the sequence similarity of the matched nodes. The definition of a conserved interaction

can also be made more elaborate by taking into account the reliability of the pertaining

interactions and by allowing “gapped” interactions, i.e., a direct interaction in one network

is matched to two nodes that are of distance 2 in the other network. We defer the discussion

of these extensions and the specific scoring functions used to the next sections, where the

different GNA approaches are described.
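As a small illustration of the simplest formulation, the following sketch checks that a candidate mapping is injective and consistent with the compatibility relation, and counts the interactions it conserves. The function names, the edge-list representation and the toy networks are assumptions made for this example only.

```python
def is_consistent(mapping, compat):
    """True if the mapping is injective and every matched pair belongs to R."""
    images = list(mapping.values())
    injective = len(set(images)) == len(images)
    return injective and all((u, v) in compat for u, v in mapping.items())


def count_conserved(edges1, edges2, mapping):
    """Count edges (u, v) of network 1 whose images form an edge of network 2."""
    e2 = {frozenset(e) for e in edges2}  # undirected edges
    conserved = 0
    for u, v in edges1:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in e2:
                conserved += 1
    return conserved


# Toy instance (hypothetical): two paths of three proteins each.
edges1 = [("a", "b"), ("b", "c")]
edges2 = [("x", "y"), ("y", "z")]
compat = {("a", "x"), ("b", "y"), ("c", "z"), ("a", "z")}
phi = {"a": "x", "b": "y", "c": "z"}
print(is_consistent(phi, compat), count_conserved(edges1, edges2, phi))
```

A partial mapping such as `{"a": "z", "b": "y"}` is also consistent with the relation but conserves only one interaction, which is why the optimization must consider the mapping as a whole.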

The problem of finding the optimal one-to-one alignment between two networks, as defined above, can be shown to be NP-hard by reduction from Maximum Common Subgraph [11]. Consequently, no polynomial-time algorithm is expected to exist for the general case (unless P = NP). However, under certain relaxations the problem can be solved optimally on current data sets in acceptable time.

3 Graph Matching Methods

In this section we describe GNA methods that look for an explicit 1-1 correspondence between

the two compared networks. The first method, by Klau, is based on reformulating the

alignment problem as an Integer Linear Program (ILP) [11]. The variables of the program

represent the 1-1 mapping sought. Specifically, for each pair $(u,v) \in R$, the author defines a binary variable $x_{u,v}$ denoting whether $u$ and $v$ are matched ($x_{u,v} = 1$) in the alignment or not ($x_{u,v} = 0$). The ILP formulation is as follows:

\[
\begin{aligned}
\max \quad & \sum_{(u,u',v,v') \in T(G_1,G_2)} x_{u,v} \cdot x_{u',v'} + \sum_{(u,v) \in R} \sigma(u,v) \cdot x_{u,v} \\
\text{s.t.} \quad & \sum_{u \in V_1} x_{u,v} \le 1 \quad \forall v \in V_2 \\
& \sum_{v \in V_2} x_{u,v} \le 1 \quad \forall u \in V_1
\end{aligned}
\]

where $\sigma(u,v)$ denotes the sequence similarity of $u$ and $v$. The objective function can be linearized in an obvious way by introducing binary variables $t_{u,u',v,v'} = x_{u,v} \cdot x_{u',v'}$ (for $(u,u',v,v') \in T(G_1,G_2)$) with appropriate constraints.
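The text does not spell out the "appropriate constraints". One standard choice, assumed here for illustration, is $t \le x_{u,v}$, $t \le x_{u',v'}$ and $t \ge x_{u,v} + x_{u',v'} - 1$. The sketch below enumerates all binary assignments and confirms that these three inequalities leave the product as the only feasible value of $t$.

```python
from itertools import product


def feasible_t(x_uv, x_u2v2):
    """Values of t in {0, 1} allowed by the three linear constraints (assumed form)."""
    return [t for t in (0, 1)
            if t <= x_uv and t <= x_u2v2 and t >= x_uv + x_u2v2 - 1]


def linearization_is_exact():
    """Check that the constraints force t = x_uv * x_u2v2 for every binary input."""
    return all(feasible_t(a, b) == [a * b] for a, b in product((0, 1), repeat=2))


print(linearization_is_exact())
```

Because the objective is maximized and each product term has a positive coefficient, the two upper bounds alone would already suffice in this particular program; the lower bound is kept so that the equality holds regardless of the objective.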

While the author uses optimization techniques, such as Lagrangian decomposition and

Lagrangian relaxation, to solve this problem, an optimum solution for restricted instances

can be found in reasonable time as we report in Section 6. We note that if $V_1 \cup V_2$ is

first partitioned into sufficiently small orthology clusters (using, e.g., the Inparanoid algorithm [8])

and if the graph of potential conserved interactions across clusters has no loops,

then the optimum alignment can be found in polynomial time via a dynamic programming

algorithm [12].

In the general case, the computation of optimal solutions is too costly, hence the use of

heuristics is necessary. Vert et al. [12] suggested a gradient ascent approach. It starts from a

feasible solution and computes a sequence of moves in the direction of the objective’s gradient

until converging to a local maximum. Denoting the adjacency matrices of the two graphs by $A_1$ and $A_2$, respectively, and assuming that $|V_1| = |V_2| = n$ (otherwise, add dummy vertices), the goal of the optimization is to find a permutation matrix $P$ that maximizes a weighted sum of the number $J(P)$ of conserved interactions and a sequence similarity term $S(P)$. In matrix notation, $J(P) = \frac{1}{2}\mathrm{tr}(A_1^T P A_2 P^T)$ and its gradient is $A_1^T P A_2$; $S(P) = \mathrm{tr}(PC)$, where $C$, the matrix of sequence-similarity scores, is its gradient.

The initial solution $P_0$ is given by sequence similarity alone, using a maximum matching algorithm. At each step, the algorithm employs a maximum matching computation to update the current permutation in the direction of the gradient:

\[ P_{n+1} = \arg\max_{P} \ \mathrm{tr}\big([\lambda A_1^T P_n A_2 + (1 - \lambda) C]\, P\big) \]

where 0 ≤ λ ≤ 1 is a weighting constant.
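The update can be sketched on a toy instance as follows. This is an illustration under stated assumptions, not the implementation of [12]: permutations are stored as tuples (`perm[u]` is the image of node `u`), the maximum matching step is replaced by brute-force enumeration of permutations (feasible only for very small $n$), and the linear objective is evaluated as the entry-wise inner product between the gradient matrix and the permutation matrix.

```python
from itertools import permutations


def conserved_edges(A1, A2, perm):
    """J(P): number of edges (u, w) of graph 1 mapped onto edges of graph 2."""
    n = len(perm)
    return sum(A1[u][w] * A2[perm[u]][perm[w]]
               for u in range(n) for w in range(u + 1, n))


def best_assignment(G):
    """Brute-force stand-in for maximum matching: maximize sum_u G[u][perm[u]]."""
    n = len(G)
    return max(permutations(range(n)),
               key=lambda p: sum(G[u][p[u]] for u in range(n)))


def gradient_ascent(A1, A2, C, lam, max_iter=50):
    n = len(C)
    perm = best_assignment(C)  # initial solution from sequence similarity alone
    for _ in range(max_iter):
        # G = lam * (A1 P A2) + (1 - lam) * C, with P the current permutation
        G = [[lam * sum(A1[u][w] * A2[perm[w]][v] for w in range(n))
              + (1 - lam) * C[u][v] for v in range(n)] for u in range(n)]
        new_perm = best_assignment(G)
        if new_perm == perm:  # no further change: local maximum reached
            break
        perm = new_perm
    return perm


# Toy instance (hypothetical): node 0 has a clear ortholog, nodes 1 and 2 are ambiguous.
A1 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # single edge 0-1
A2 = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]   # single edge 0-2
C = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.5, 0.5]]
result = gradient_ascent(A1, A2, C, lam=0.5)
print(result, conserved_edges(A1, A2, result))
```

In this example sequence similarity alone cannot decide between the two candidates for node 1, and the topological term resolves the tie so that the single interaction is conserved. The bound `max_iter` is a safeguard, since repeated linearized updates of this kind are not guaranteed to settle on every input.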

4 Methods based on ranking

A second class of methods is based on assigning a score to each pair of compatible nodes

and only at a second step choosing a global pairing of the nodes. The latter pairing is

effectively disambiguating the compatibility relations, pinpointing the “best” 1-1 mapping.

The disambiguation can be achieved by computing a maximum weighted bipartite matching

or via simple greedy strategies. The difference between the various methods lies mainly in

the first, scoring phase.
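A simple greedy disambiguation of the kind mentioned above might look as follows. The function name and the score dictionary are hypothetical; the scores stand for the output of any of the ranking methods described in this section.

```python
def greedy_one_to_one(scores):
    """Pick candidate pairs by decreasing score, skipping nodes already matched.

    scores: dict mapping (u, v) pairs to a numeric score.
    Returns a dict u -> v describing a 1-1 mapping.
    """
    used_u, used_v, mapping = set(), set(), {}
    for (u, v), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if u not in used_u and v not in used_v:
            mapping[u] = v
            used_u.add(u)
            used_v.add(v)
    return mapping


scores = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.7, ("b", "y"): 0.1}
print(greedy_one_to_one(scores))
```

On this toy input the greedy choice pairs a with x and b with y (total score 1.0), whereas a maximum weighted bipartite matching would pair a with y and b with x (total score 1.5). The greedy strategy is cheaper but can miss the best global pairing.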

The first method for global network alignment has been proposed by Bandyopadhyay et

al. [13] and uses a ranking that is based on a Markov random field (MRF) model. It starts

by building an alignment graph, where the nodes represent candidate pairs of (sequence-

similar) proteins and the edges represent potentially conserved interactions. Each node in

the alignment graph is associated with a binary state z indicating if that node represents

a true orthology relation or not. The state values are modeled using a Markov random

field. The MRF model assumes that for each node of the alignment graph j = (u,v), the

probability that $j$ represents a true pair of orthologs ($z_j = 1$) depends only on the states of its neighbors ($N(j)$), and the dependence is through a logistic function:

\[ P(z_j \mid z_{N(j)}) = \frac{1}{1 + e^{-\alpha - \beta \cdot c(j)}} \]

where α and β are parameters and c(j) is the conservation index of j, defined as twice the

number of conserved interactions between j and neighbors of j whose states are pre-assigned

with value 1 (true orthologs), divided by the total number of interactions involving u and

v across the two species. The inference of the states of the nodes is conducted using Gibbs

sampling [17], yielding orthology probabilities for every node. These estimated probabilities

are used to disambiguate the pairing.
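The two quantities that drive each Gibbs sampling update can be sketched as below. The data structures (a neighbor list for the alignment graph and per-network degree tables) and the parameter values are assumptions for illustration, and the conservation index follows one reading of the definition above: twice the number of neighbors of $j$ in the alignment graph whose state is 1, divided by the total number of interactions of $u$ and $v$.

```python
import math


def conservation_index(j, states, align_neighbors, degree1, degree2):
    """c(j) for the alignment-graph node j = (u, v)."""
    u, v = j
    conserved = sum(1 for k in align_neighbors[j] if states.get(k) == 1)
    total = degree1[u] + degree2[v]
    return 2.0 * conserved / total if total else 0.0


def ortholog_probability(c, alpha, beta):
    """Logistic conditional probability that z_j = 1 given the neighbor states."""
    return 1.0 / (1.0 + math.exp(-alpha - beta * c))


# Toy alignment graph (hypothetical): j has two neighbors, one currently labeled as ortholog.
j = ("u", "v")
align_neighbors = {j: [("u1", "v1"), ("u2", "v2")]}
states = {("u1", "v1"): 1, ("u2", "v2"): 0}
degree1, degree2 = {"u": 2}, {"v": 2}

c = conservation_index(j, states, align_neighbors, degree1, degree2)
print(c, ortholog_probability(c, alpha=-1.0, beta=2.0))
```

In a Gibbs sampler, each node's state would be redrawn in turn from this conditional probability, and the fraction of samples in which a node has state 1 would serve as its estimated orthology probability.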

Singh et al. [7] proposed an alignment method (IsoRank) that is based on Google’s

PageRank algorithm. As for MRF, the method first computes a score for each candidate

pair of orthologs, and then uses the scores for disambiguating the pairing. The score $R_{ij}$ of the pair $(i,j) \in V_1 \times V_2$ is a weighted average of the scores of its neighboring pairs (assuming that all node pairings are allowed):

\[ R_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{R_{uv}}{|N(u)|\,|N(v)|} \]

The authors translate the problem of finding R into an eigenvector problem by expressing

it in matrix form as R = AR where A is defined as:

\[
A_{(i,j)(u,v)} =
\begin{cases}
\dfrac{1}{|N(u)|\,|N(v)|} & \text{if } (i,u) \in E_1,\ (j,v) \in E_2 \\
0 & \text{otherwise}
\end{cases}
\]

Under this formulation the problem reduces to finding the dominant eigenvector of $A$, which is efficiently solved using the power method. To account for sequence similarity, the objective is modified as $R = [\alpha A + (1 - \alpha) B \mathbf{1}^T] R$, where $B$ is the vector of normalized bit-scores and $\mathbf{1}^T$ is an all-1 row vector.
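The power method for this fixed point can be sketched in a few lines. The adjacency-list representation, the function name and the toy input are assumptions for illustration; the original IsoRank software may differ in normalization and stopping criteria.

```python
def isorank_scores(adj1, adj2, B, alpha, iterations=200):
    """Power iteration for R = [alpha * A + (1 - alpha) * B 1^T] R.

    adj1, adj2: dicts node -> list of neighbors (undirected networks).
    B: dict (i, j) -> normalized sequence score, summing to 1.
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    R = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iterations):
        total = sum(R.values())  # 1^T R
        new_R = {}
        for i, j in pairs:
            topo = sum(R[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                       for u in adj1[i] for v in adj2[j])
            new_R[(i, j)] = alpha * topo + (1 - alpha) * B[(i, j)] * total
        norm = sum(new_R.values())
        R = {p: s / norm for p, s in new_R.items()}
    return R


# Toy instance (hypothetical): two identical three-node paths, uniform sequence scores.
path = {0: [1], 1: [0, 2], 2: [1]}
B = {(i, j): 1.0 / 9 for i in path for j in path}
R = isorank_scores(path, path, B, alpha=0.8)
print(max(R, key=R.get))
```

With uninformative sequence scores, the pair of the two central nodes receives the highest score, since it is the only pair whose neighbors can all be matched to each other.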

Yosef et al. [14] devised the Hybrid RankProp algorithm. It considers one “query node”

of the first network at a time and ranks the nodes of the second network with respect to

it by using a diffusion procedure. To this end, they constructed a composite network with

two types of edges: PPI and sequence similarity. The query node is assigned a score of 1.0

that is continually pumped to the other nodes through the network’s edges. The scores that

the nodes assume after the diffusion process converges induce a ranked list of candidates for

matching the query node. In detail, at step t + 1, the score of a node i with respect to a

query q is given by:

\[ S_i(t+1) = W_{qi} + \alpha \sum_{j \in N(i) \setminus \{q\}} W_{ji} S_j(t) \]

where α is a parameter controlling the diffusion rate; and W is a weight matrix that represents

the composite network – it is the normalized confidence of an interaction for PPI edges and

a normalized sequence similarity for sequence-similarity edges. Finally, to make the score

symmetric, proteins from both networks are queried and each pair is assigned the average

score of its two associated queries.
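The diffusion update for a single query can be sketched as follows. The nested-dictionary weight matrix, the fixed iteration count and the toy weights are assumptions for illustration; the original method may use a different convergence test.

```python
def rankprop_scores(W, query, alpha, iterations=100):
    """Iterate S_i(t+1) = W_qi + alpha * sum_{j in N(i) minus {q}} W_ji * S_j(t).

    W: dict node -> dict neighbor -> weight, describing the composite network.
    Returns the scores of all nodes other than the query.
    """
    nodes = [i for i in W if i != query]
    S = {i: 0.0 for i in nodes}
    for _ in range(iterations):
        S = {i: W[query].get(i, 0.0)
                + alpha * sum(W[j][i] * S[j] for j in W[i] if j != query)
             for i in nodes}
    return S


# Toy composite network (hypothetical): chain q - a - b with symmetric weights 0.5.
W = {"q": {"a": 0.5}, "a": {"q": 0.5, "b": 0.5}, "b": {"a": 0.5}}
S = rankprop_scores(W, "q", alpha=0.5)
print(S)
```

Here node a, which is directly linked to the query, ends with a score of 8/15, while node b, reachable only through a, ends with 2/15; sorting by these scores gives the ranked candidate list for the query.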

5 Network-evolution based alignment

In this section we present a new alignment method, called PME, that is based on a Proba-

bilistic Model of Evolution. PME aims to reconstruct the most probable ancestral network

that gave rise to the observed extant networks. Such a network induces a many-to-many

alignment in the descending networks by associating groups of proteins in the two input

networks with the corresponding ancestral proteins. The method is based on a probabilistic

model of the evolutionary dynamics of a network, that supports four kinds of evolutionary

events: link attachment, link detachment, gene duplication and gene loss [18].

An alignment between two networks $G_1$ and $G_2$ is defined by an ancestral network $G_0 = (V_0,E_0)$ and two functions $f_1 : V_1 \to V_0$ and $f_2 : V_2 \to V_0$ which map the nodes of $G_1$ and $G_2$ into the nodes of $G_0$ (ancestral proteins). The score of an alignment $A = (G_0,f_1,f_2)$ is the product of the prior probability for $A$ and the likelihood of observing $G_1$ and $G_2$ given $A$. We describe the probability computations in detail below.

The probability $P(A)$ is the product of two terms that consider the prior probability of observing $G_0$ and the probability of the pattern of gene duplications and losses implied by $f_1$ and $f_2$. For the former, we adopt a simple Erdős-Rényi model where edges occur independently with some constant probability $P_E$. For the latter, we focus on gene duplications (as in [18]), assuming that gene duplication events occur independently with some fixed probability $P_d$. For computational efficiency, we disallow gene losses, although those could easily be incorporated into the model in a similar manner. Formally, the two terms are as follows:

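• A-priori ancestral network probability:

\[ \prod_{(u,v) \notin E_0} (1 - P_E) \cdot \prod_{(u,v) \in E_0} P_E \]

• Gene duplication ($i \in \{1,2\}$):

\[ \prod_{\substack{v \in V_0 \\ f_i^{-1}(v) \neq \emptyset}} P_d^{\,|f_i^{-1}(v)| - 1} \cdot \prod_{\substack{v \in V_0 \\ |f_i^{-1}(v)| \le 1}} (1 - P_d) \]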

The probability $P(G_i \mid A)$ of observing the network $G_i$, $i \in \{1,2\}$, is given by the product of two factors that consider edge attachment and edge detachment events, assuming these events occur independently with probabilities $P_A$ and $P_D$, respectively.

• Edge attachment:

\[ \prod_{(u,v) \notin E_0} \Bigg[ \prod_{\substack{(u',v') \notin E_i \\ f_i(u')=u,\ f_i(v')=v}} (1 - P_A) \cdot \prod_{\substack{(u',v') \in E_i \\ f_i(u')=u,\ f_i(v')=v}} P_A \Bigg] \]

• Edge detachment:

\[ \prod_{(u,v) \in E_0} \Bigg[ \prod_{\substack{(u',v') \in E_i \\ f_i(u')=u,\ f_i(v')=v}} (1 - P_D) \cdot \prod_{\substack{(u',v') \notin E_i \\ f_i(u')=u,\ f_i(v')=v}} P_D \Bigg] \]
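For a fixed candidate alignment, the logarithm of the score defined above (prior times the two likelihoods) can be evaluated directly. The sketch below is an illustration under assumptions: networks are given as edge lists, products run over unordered pairs of distinct ancestral nodes, and pairs of descendants of the same ancestor are ignored, a case the formulas above do not address explicitly. The value of `PE` in the example is arbitrary; the other parameter values are those reported in Section 6.

```python
import math
from itertools import combinations


def pme_log_score(V0, E0, networks, PE, Pd, PA, PD):
    """log P(A) + sum_i log P(G_i | A) for a fixed alignment.

    networks: list of (edges_i, f_i); f_i maps each extant node to its ancestor.
    """
    E0 = {frozenset(e) for e in E0}
    # A-priori ancestral network probability (Erdos-Renyi).
    score = sum(math.log(PE) if frozenset(p) in E0 else math.log(1 - PE)
                for p in combinations(V0, 2))
    for edges, f in networks:
        Ei = {frozenset(e) for e in edges}
        children = {a: [x for x in f if f[x] == a] for a in V0}
        # Gene duplication term.
        for a in V0:
            k = len(children[a])
            if k >= 1:
                score += (k - 1) * math.log(Pd)
            if k <= 1:
                score += math.log(1 - Pd)
        # Edge attachment (ancestral non-edges) and detachment (ancestral edges).
        for a, b in combinations(V0, 2):
            ancestral_edge = frozenset((a, b)) in E0
            for x in children[a]:
                for y in children[b]:
                    present = frozenset((x, y)) in Ei
                    if ancestral_edge:
                        score += math.log(1 - PD) if present else math.log(PD)
                    else:
                        score += math.log(PA) if present else math.log(1 - PA)
    return score


# Toy alignment (hypothetical): the interaction is present in network 1 only.
nets = [([("a1", "b1")], {"a1": "a", "b1": "b"}),
        ([], {"a2": "a", "b2": "b"})]
params = dict(PE=0.1, Pd=0.03, PA=0.0026, PD=0.9617)
with_edge = pme_log_score(["a", "b"], [("a", "b")], nets, **params)
without_edge = pme_log_score(["a", "b"], [], nets, **params)
print(with_edge, without_edge)
```

With these parameter values the ancestral network containing the interaction scores higher, because one retained edge plus one detachment is more probable under the model than one attachment.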

Our goal is to find an alignment that maximizes $P(G_1,G_2,A) = P(A) \cdot P(G_1 \mid A) \cdot P(G_2 \mid A)$. In the following we provide an integer linear programming (ILP) formulation of the problem. Consider a set of $n$ hypothetical nodes of the ancestral network, where $n = |V_1| + |V_2|$ is the maximal number of nodes in the ancestral network. With each node we associate a binary variable $z_i$ which is 1 if and only if node $i$ has some descendant node in the extant networks. With each vertex pair $(i,j)$ we associate a binary variable $t_{ij}$ which is 1 if and only if nodes $i$ and $j$ interact with each other in the ancestral network. To model the mappings $f_1$ and $f_2$ we define binary variables $x_{iu}$ and $y_{iv}$, where $x_{iu} = 1$ ($y_{iv} = 1$) if and only if $f_1(u) = i$ ($f_2(v) = i$). Finally, in order to consider gene duplications, we add binary variables $d_i^j$, $j \in \{1,2\}$, such that $d_i^j = 0$ if and only if $i$ has more than one descendant in $G_j$.

5.1 The ILP formulation

The constraints of the ILP are defined as follows:

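\[ t_{ij} \le z_i, \quad t_{ij} \le z_j \qquad 1 \le i < j \le n \]

to allow edges only between “true” vertices of the ancestral network.

\[
\begin{aligned}
\sum_{i=1}^{n} x_{iu} &= 1, && u \in V_1 \\
\sum_{i=1}^{n} y_{iv} &= 1, && v \in V_2
\end{aligned}
\]

to model the fact that each protein descends from a single ancestor.

\[
\begin{aligned}
x_{iu} &\le z_i, && 1 \le i \le n,\ u \in V_1 \\
\sum_{u \in V_1} x_{iu} &\ge z_i, && 1 \le i \le n \\
y_{iu} &\le z_i, && 1 \le i \le n,\ u \in V_2 \\
\sum_{v \in V_2} y_{iv} &\ge z_i, && 1 \le i \le n
\end{aligned}
\]

to model the fact that each true node of the ancestral network ($z_i = 1$) must have at least one descendant in each network and each dummy node of the ancestral network ($z_i = 0$) cannot have any descendants.

\[
\begin{aligned}
d_i^1 &\le 1 + z_i - x_{iu} - x_{iv}, && 1 \le i \le n,\ u,v \in V_1,\ u \ne v \\
d_i^1 &\ge 1 + z_i - \sum_{u \in V_1} x_{iu}, && 1 \le i \le n \\
d_i^2 &\le 1 + z_i - y_{iu} - y_{iv}, && 1 \le i \le n,\ u,v \in V_2,\ u \ne v \\
d_i^2 &\ge 1 + z_i - \sum_{u \in V_2} y_{iu}, && 1 \le i \le n
\end{aligned}
\]

to impose that nodes that have only one descendant have not undergone a duplication event.

Finally, we add the integer constraints:

\[ x_{iu},\ y_{iv},\ z_i,\ t_{ij},\ d_i^1,\ d_i^2 \in \{0,1\} \qquad 1 \le i,j \le n,\ u \in V_1,\ v \in V_2 \]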
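The effect of the two constraints on the duplication indicator can be checked by enumeration for a single ancestral node and a single network. The helper below is hypothetical and assumes that the upper-bound constraint ranges over pairs of distinct descendants.

```python
from itertools import combinations


def feasible_d(z, x):
    """Values of d in {0, 1} permitted by the duplication constraints.

    z: 0/1 indicator that the ancestral node is a true node.
    x: list of 0/1 values x_iu over the nodes u of one network.
    """
    upper = min([1 + z - a - b for a, b in combinations(x, 2)] + [1])
    lower = max(1 + z - sum(x), 0)
    return [d for d in (0, 1) if lower <= d <= upper]


print(feasible_d(1, [1, 0, 0]))  # single descendant
print(feasible_d(1, [1, 1, 0]))  # two descendants
print(feasible_d(0, [0, 0, 0]))  # dummy ancestral node
```

A node with at most one descendant is forced to have indicator 1, and a node with two or more descendants is forced to have indicator 0, which matches the definition of $d_i^j$ given before the constraints.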

The objective is to maximize $P(G_1,G_2,A)$ or, equivalently, to maximize $\log P(G_1,G_2,A)$. The latter is a sum of four terms:

• A-priori ancestral network probability:

\[ \varphi_E = \sum_{i<j} \Big[ \log(P_E) \cdot t_{ij} + \log(1 - P_E) \cdot (1 - t_{ij}) \Big] \]

• Gene duplication (for simplicity we specify only the sub-term involving $G_1$):

\[ \varphi_d = \sum_{i=1}^{n} \Big( \sum_{u \in V_1} x_{iu} - z_i \Big) \cdot \log(P_d) + \sum_{i=1}^{n} \log(1 - P_d) \cdot d_i^1 \]

• Edge attachment (for simplicity we specify only the sub-term involving $G_1$):

\[ \varphi_A = \sum_{i<j} (1 - t_{ij}) \cdot \Big[ \sum_{(u,v) \notin E_1} x_{iu} \cdot x_{jv} \cdot \log(1 - P_A) + \sum_{(u,v) \in E_1} x_{iu} \cdot x_{jv} \cdot \log(P_A) \Big] \]

• Edge detachment (for simplicity we specify only the sub-term involving $G_1$):

\[ \varphi_D = \sum_{i<j} t_{ij} \cdot \Big[ \sum_{(u,v) \in E_1} x_{iu} \cdot x_{jv} \cdot \log(1 - P_D) + \sum_{(u,v) \notin E_1} x_{iu} \cdot x_{jv} \cdot \log(P_D) \Big] \]

In order to make the problem linear we introduce the following additional binary variables with appropriate constraints: $p_{ijuv} = t_{ij} \cdot x_{iu} \cdot x_{jv}$; and $q_{ijuv} = (1 - t_{ij}) \cdot x_{iu} \cdot x_{jv}$.

5.2 Refinements and variable reduction

In some cases there are not enough interactions to support a match. To avoid an arbitrary

choice among identically-scored solutions, we choose the solution that agrees best with the

sequence similarity information. To this end, we add a small penalty to each ancestral-descendant connection whose value is $10^{-8} \cdot \log(S + 1)$, where $S$ is the bit-score of the two proteins.

Although PME naturally produces a many-to-many correspondence between orthologous

proteins, we focus here on its reduction to a one-to-one mapping to facilitate its comparison

to other methods. To this end, we rank all pairs of inter-species proteins that are predicted

to descend from the same common ancestor. For any potentially matched pair $(u,v)$, with

$f_1(u) = f_2(v) = i$, the score of $(u,v)$ is given by the score of the global alignment after

removing all the nodes that descend from i except for u and v (i.e., forcing the alignment to

match u and v). These scores are then fed to a maximum bipartite matching computation

to construct a 1-1 alignment.

The sequence similarity information allows us to greatly reduce the number of variables

considered. We start with a set $V = V_1 \cup V_2$ of hypothetical ancestral nodes. We build two relations $R_1 \subseteq V \times V_1$ and $R_2 \subseteq V \times V_2$ as follows: for each $i \in V$, we add to $R_1$ all pairs $(i,u)$ with $u \in V_1$ such that $u$ is sequence-similar to $i$ and $u \le i$. Analogously, we add to $R_2$ all pairs $(i,v)$ with $v \in V_2$ such that $v$ is sequence-similar to $i$ and $v \le i$. The search is then restricted to alignments whose ancestor-descendant pairs are in $R_1 \cup R_2$.

The relations R1 and R2 also allow us to reduce the number of possible edges of the

ancestral network. Consider a pair of nodes (u,v) of the ancestral network such that all

possible pairs of descendants of these nodes span non-edges. Clearly, in the optimal solution

(u,v) will be a non-edge. Since the networks are usually very sparse, this simple rule greatly

reduces the number of variables required to model the topology of the ancestral network

and, consequently, the number of variables introduced by the linearization. Although

non-edges contribute to the objective function, we can modify the latter so that the contribution

of non-edges is zero (by adding −log(1 − PE) to all ancestral vertex pairs). In a similar

manner, we can reduce the number of ancestor-descendant pairs that are considered in the

computation of edge attachment events.
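The pruning rule for ancestral edges can be sketched as follows. The candidate-descendant dictionaries stand for the relations $R_1$ and $R_2$; their representation and the function name are assumptions for illustration.

```python
def candidate_ancestral_edges(cands1, edges1, cands2, edges2):
    """Ancestral pairs that may be edges: some pair of possible descendants interacts.

    cands_k: dict ancestral node -> set of possible descendants in network k.
    edges_k: list of interactions (x, y) in network k.
    """
    keep = set()
    for cands, edges in ((cands1, edges1), (cands2, edges2)):
        # Invert the relation: extant node -> ancestral nodes it may descend from.
        owners = {}
        for anc, descs in cands.items():
            for d in descs:
                owners.setdefault(d, set()).add(anc)
        for x, y in edges:
            for a in owners.get(x, ()):
                for b in owners.get(y, ()):
                    if a != b:
                        keep.add(frozenset((a, b)))
    return keep


# Toy instance (hypothetical): only the ancestral pair (i, j) has interacting descendants.
cands1 = {"i": {"a"}, "j": {"b"}, "k": {"c"}}
cands2 = {"i": {"x"}, "j": {"y"}, "k": {"z"}}
kept = candidate_ancestral_edges(cands1, [("a", "b")], cands2, [])
print(kept)
```

Every ancestral pair outside the returned set can be fixed as a non-edge, so no $t_{ij}$ variable (and none of the associated linearization variables) needs to be created for it. The loop runs over observed interactions rather than over all ancestral pairs, which keeps the cost proportional to the sparse input.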

6 Experimental Results

To compare the different GNA methods we used the benchmark in [13], which focuses on

the pairwise global alignment of the PPI networks of yeast and fly, starting from an initial

clustering of the proteins into orthology families formed by the Inparanoid algorithm [8]. In

addition, we compared, under the same setting, the alignments of each of these networks to a

PPI network of worm. The worm network was constructed by collecting data from recently

published papers and public databases [19, 20, 21] and spanned 2,967 proteins and 4,852

interactions. The yeast network contained 4,393 proteins and 14,318 interactions; the fly

network contained 7,042 proteins and 20,719 interactions. We considered 2,244 Inparanoid

groups between yeast and fly, 1,833 groups between yeast and worm and 4,228 groups between

worm and fly.

We included in the comparison the following methods: ILP [11], MRF [13], IsoRank [7]

and PME (Section 5). We did not consider Gradient Ascent [12] and Hybrid RankProp [14]

in our tests. Gradient Ascent tries to approximate the same objective as the ILP method,

hence the latter should be superior to it. Hybrid RankProp was shown by its authors to

be equivalent in performance to the original RankProp method, which is based on sequence

only.

We implemented ILP, IsoRank and PME in Matlab and used ILOG CPLEX as an ILP

solver. For MRF, we report on the results published in the original paper [13]. The parameter that balances topology versus sequence similarity was set to c = 0.01 for both IsoRank and ILP in order to give higher weight to topology. For PME we used the following settings: the probabilities of attachment and detachment were set so as to obtain the same global rates of attachments and detachments estimated from the unambiguous clusters of Inparanoid ($P_A = 0.0026$; $P_D = 0.9617$). The probability of an edge in the ancestral network was estimated from the density of the two networks ($P_E = 3.32 \times 10^{-4}$). The probability of duplication was set to $P_d = 0.03$, with the results being robust to a wide range of values for this parameter (in the range $10^{-4}$ to 0.5). All the experiments were executed on a DELL server with 8 Quad-Core AMD Opteron processors and 16 GB RAM, running Ubuntu 9.04.

To evaluate the functional coherency of the aligned proteins, we considered two measures:

(i) the number of pairs that are classified as orthologs by HomoloGene [22], considered as

a gold standard; and (ii) a score based on the Gene Ontology (GO) [23], focusing on the

Biological Process and Molecular Function branches. To evaluate the significance of the
