# New directions for diffusion-based network prediction of protein function: Incorporating pathways with confidence

**Abstract**

Motivation: It has long been hypothesized that incorporating models of network noise as well as edge directions and known pathway information into the representation of protein–protein interaction (PPI) networks might improve their utility for functional inference. However, a simple way to do this has not been obvious. We find that diffusion state distance (DSD), our recent diffusion-based metric for measuring dissimilarity in PPI networks, has natural extensions that incorporate confidence, directions and can even express coherent pathways by calculating DSD on an augmented graph.
Results: We define three incremental versions of DSD which we term cDSD, caDSD and capDSD, where the capDSD matrix incorporates confidence, known directed edges, and pathways into the measure of how similar each pair of nodes is according to the structure of the PPI network. We test four popular function prediction methods (majority vote, weighted majority vote, multi-way cut and functional flow) using these different matrices on the Baker’s yeast PPI network in cross-validation. The best performing method is weighted majority vote using capDSD. We then test the performance of our augmented DSD methods on an integrated heterogeneous set of protein association edges from the STRING database. The superior performance of capDSD in this context confirms that treating the pathways as probabilistic units is more powerful than simply incorporating pathway edges independently into the network.
Availability: All source code for calculating the confidences, for extracting pathway information from KEGG XML files, and for calculating the cDSD, caDSD and capDSD matrices are available from http://dsd.cs.tufts.edu/capdsd
Contact:
lenore.cowen@tufts.edu or benjamin.hescott@tufts.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.

Vol. 30 ISMB 2014, pages i219–i227

BIOINFORMATICS doi:10.1093/bioinformatics/btu263

New directions for diffusion-based network prediction of protein

function: incorporating pathways with confidence

Mengfei Cao

1

,ChristopherM.Pietras

1

,XianFeng

1

,KathrynJ.Doroschak

2

,

Thomas Schaffner

1

,JisooPark

1

,HaoZhang

1

, Lenore J. Cowen

1,

*

and Benjamin J. Hescott

1,

*

1

Department of Computer Science, Tufts University, Medford, MA 02155, USA and

2

Department of Computer Science,

University of Minnesota, Minneapolis, MN 55455, USA

ABSTRACT

Motivation: It has long been hypothesized that incorporating models

of network noise as well as edge directions and known pathway in-

formation into the representation of protein–protein interaction (PPI)

networks might improve their utility for functional inference.

However, a simple way to do this has not been obvious. We find

that diffusion state distance (DSD), our recent diffusion-based metric

for measuring dissimilarity in PPI networks, has natural extensions that

incorporate confidence, directions and can even express coherent

pathways by calculating DSD on an augmented graph.

Results: We define three incremental versions of DSD which we term

cDSD, caDSD and capDSD, where the capDSD matrix incorporates

confidence, known directed edges, and pathways into the measure of

how similar each pair of nodes is according to the structure of the PPI

network. We test four popular function prediction methods (majority

vote, weighted majority vote, multi-way cut and functional flow) using

these different matrices on the Baker’s yeast PPI network in cross-

validation. The best performing method is weighted majority vote

using capDSD. We then test the performance of our augmented

DSD methods on an integrated heterogeneous set of protein associ-

ation edges from the STRING database. The superior performance of

capDSD in this context confirms that treating the pathways as prob-

abilistic units is more powerful than simply incorporating pathway

edges independently into the network.

Availability: All source code for calculating the confidences, for ex-

tracting pathway information from KEGG XML files, and for calculating

the cDSD, caDSD and capDSD matrices are available from http://dsd.

cs.tufts.edu/capdsd

Contact: lenore.cowen@tufts.edu or benjamin.hescott@tufts.edu

Supplementary information: Supplementary data are available at

Bioinformatics online.

1 INTRODUCTION

One of the most well-studied problems in computational network

biology is the prediction of protein functional labels from dis-

tance and neighborhood structure in the protein–protein inter-

action network (PPI network). In 2013, based on the observation

that paths through high-degree ‘hub’ nodes in the PPI network

were less informative than short paths through protein nodes

with fewer interaction partners, (Cao et al., 2013) introduce the

diffusion state distance (DSD) metric that is able to quantify

topological similarity in a PPI network in a more fine-grained

way. Diffusion-based methods had been previously proposed for

clustering similar proteins (Voevodski et al., 2009) and for

ranking candidate disease genes (Chen et al., 2009; Erten et al.

2011; Kohler et al., 2008; Vanunu et al., 2010), but by explicitly

taking an L1 norm of the vector of the random walks to all other

nodes in the network to measure the distance between nodes,

DSD is able to capture a more global view of the network

than other prior work we are aware of, with the exception of

Vavien (Erten et al. 2011) for candidate disease gene ranking,

and ISORANK-N (Liao et al., 2009), which also is based on a

global embedding, but for a very different problem (network

alignment).

Cao et al. (2013) showed that when a DSD-based distance is

substituted for ordinary next-hop shortest-path distance in four

classical network-based function prediction methods, functional

label prediction performance for the GO (Gene Ontology), as

well as all three levels of the MIPS (Munich Information

Center For Protein Sequences) ontology, improved across

the board in cross-validation experiments on both the

Saccharomyces cerevisiae and the S.pombe PPI networks.

However, these results were based only on a simple undirected

model of the PPI network, which additionally assumed that all

the edges listed in the BioGRID data were uniformly correct.

On the other hand, it is well-established both that there is noise

in the PPI interaction network data (Mering et al., 2002; Reguly

et al., 2006; Gandhi et al., 2006), and that some interactions are

naturally directed in the PPI network (Liu et al., 2009; Gitter

et al., 2011; Du et al., 2012). In addition, looking just at pairwise

interaction data as edges does not fully capture all the informa-

tion that is known about the PPI network. In particular, there is

increasingly available data on biological pathways, for example,

TGF- binds TGF- receptor 1, which phosphorylates Smad3,

which with importin-1 enters the nucleus and binds DNA to

regulate expression (Moustakas 2002).

In this article, we revisit the DSD metric we designed in earlier

work for function prediction in the ordinary undirected PPI net-

work. We find that its diffusion-based framework gives a natural

way to incorporate edge confidences and directed edges (when

known). However, the main contribution of this article is to

show that there is a way to capture the cohesiveness of known

pathways by calculating DSD on an augmented network, and

that this way of representing pathways results in better perform-

ance than just incorporating the pathway edges themselves for

most, but not all of the function prediction methods we study.

We show this first in cross-validation on the standard network

consisting of just experimentally verified physical interaction

edges from S.cerevisiae, and then on an integrative network

with heterogeneous protein association data edges derived from

the STRING database (Franceschini et al.,2013).

*To whom correspondence should be addressed.

ß The Author 2014. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which

permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact

journals.permissions@oup.com

1.1 Overview of DSD

PPI networks are known to be ‘small world’ networks in the

sense that they are small-diameter, and most nodes are close to

all other nodes. Thus any method that infers similarity based on

proximity will find that a large fraction of the network is prox-

imate to any typical node. In fact, this issue has already been

termed the ‘ties in proximity’ problem in the computational biol-

ogy literature (Arnau et al.,2005).

Furthermore, the fact that two particular nodes are adjacent

(i.e., have shortest-path distance 1) in a PPI network can signify

something very different than the adjacency of two other nodes.

For example, in PPI networks two nodes with many low-degree

neighbors in common should be thought of as ‘more similar’

than nodes with few low-degree neighbors in common; and

such nodes should also be thought of as ‘more similar’ than

two nodes whose common neighbors have high-degree. Thus,

characterizing node pairs based only on a shortest-path notion

of distance fails to capture important knowledge encoded in the

structure of the network.

In (Cao et al., 2013), DSD is defined on an undirected

connected simple graph. In particular, our PPI network is

defined with a vertex set V, containing a node for each verified

ORF, and an edge set E, containing an unweighted and undir-

ected edge for each physical interaction. We first calculate

He

{k}

(A,B) as the expected number of times that a random

walk starting at node A and proceeding for k steps, will visit

node B; then we further define a n-dimensional vector

He

fkg

ðv

i

Þ; 8v

i

2 V,where

He

fkg

ðv

i

Þ=ðHe

fkg

ðv

i

; v

1

Þ; He

fkg

ðv

i

; v

2

Þ; :::; He

fkg

ðv

i

; v

n

ÞÞ:

In what follows, the k-step DSD between two vertices u and v;

8u; v 2 V is defined as

DSD

fkg

ðu; vÞ=jjHe

fkg

ðuÞHe

fkg

ðvÞjj

1

;

where jjHe

fkg

ðuÞHe

fkg

ðvÞjj

1

denotes the L

1

norm of the He

vectors of u and v. As proved in (Cao et al.,2013),onthe

simple connected graph whose random walk one-step transition

probability matrix is diagonalizable and ergodic as a Markov

chain, the limit of DSD when k approaches infinity exists and

can be calculated as

lim

k!1

DSD

fkg

ðu; vÞ=jjðb

u

T

b

v

T

ÞðI P+CÞ

1

jj

1

;

where I is the identity matrix, C is the constant matrix in which

each row is a copy of

T

;

T

is the unique steady state distribu-

tion, and for any i 2 V, b

i

T

is the i-th basis vector, that is, the

row vector of all zeros except for a 1 in the i-th position, and

P=fp

ij

g

n

i;j=0

is the n-dimensional one-step transition probability

matrix where the (i, j)th entry is given by

p

ij

=

1

d

i

if ðv

i

; v

j

Þ2E

0 otherwise

;

8

<

:

where d

i

is the degree of node v

i

. In this work, we use the con-

verged DSD values as the original DSD calculation for

comparison.

1.2 New directions

In the first DSD paper, we modified four classical function pre-

diction methods (including Neighborhood Majority Vote

(Schwikowski et al.,2000),

2

Neighborhood (Hishigaki et al.,

2001), Multi-way Cut (Vazquez et al., 2003) and Functional

Flow (Nabieva et al., 2005)) to use this dissimilarity metric

rather than next-hop shortest-path distance as a dissimilarity

metric, and showed that performance improved across the

board. Now we extend the calculation of DSD to incorporate

confidence, then confidence and directed and undirected path-

way edges, then confidence, pathway edges and full biological

pathways. We present three new dissimilarity measures, which

we call cDSD, caDSD or capDSD, respectively, where capDSD

stands for confidence, augmented pathway diffusion state distance.

These measures can be substituted for original DSD in the four

classical function prediction methods we studied (or in any func-

tional prediction method that incorporates a pairwise dissimilar-

ity measure between nodes).

First, to define cDSD, similar to the approach suggested by

Gitter et al. (2011), we assign a confidence to each PPI inter-

action edge in BioGRID (Stark et al., 2006), based on the

number of publications in which that PPI appears, and whether

the reported experiments are high-throughput or low-through-

put. Given the formal definition of DSD, there is a natural way

to incorporate these confidences simply as edge weights, and the

k-step DSD calculation is generalized to a weighted matrix in the

natural way (see Section 2.1.3 for full details). We show that

incorporating confidence values in this way improves perform-

ance over the basic DSD method (which in turn improved the

performance compared to the corresponding method based on

shortest-path distances (Cao et al., 2013)) in cross-validation on

each of the classical network-based function prediction methods

we consider.

On top of the confidence values, we then seek to augment the

network by adding edges from the KEGG PATHWAY database

in two ways. We find that 2471 of these edges are not already in

BioGRID, and an additional 177 are in BioGRID, but we would

have assigned them lower confidence without the additional in-

formation that they also appeared in KEGG, so it is not surpris-

ing that adding in these edges improves our results as compared

to DSD and cDSD. In the first and simplest way, which we call

caDSD, we augment the graph by adding undirected and dir-

ected edges from the KEGG database; where edges of the types:

activation, inhibition, phosphorylation, dephosphorylation and

ubiquination are considered naturally directed as in (Liu et al.,

2009) and all other KEGG edges are considered undirected

(however, an undirected edge being included in the KEGG data-

base raises its edge weight because KEGG is manually curated).

However, we also create capDSD which creates an augmented

graph that represents the signaling pathways coherently using

new sets of nodes and edges. In this new augmented graph, path-

ways can be thought of as being represented by ‘controlled-

access highways’, in the sense that once the diffusion random

walk enters a pathway, it stays on that pathway with some

fixed probability r and only leaves that pathway to walk in the

regular PPI network (still augmented with directed edges, where

known, and confidence) with probability 1– r,wherethefixedr is

a parameter of the method. Just like DSD, capDSD is not a

i220

M.Cao et al.

function prediction method in itself, it is a dissimilarity matrix:

for each pair of nodes, capDSD gives a value that measures their

similarity in this (now augmented, confidence weighted) network.

For the best performing function prediction methods we test, we

find that adding in the KEGG pathway edges using the highway

approach is superior to just adding in the KEGG edges naively.

Furthermore, the performance increase is even stronger when

using an integrative network derived from the STRING data-

base (see Section 2.1.2).

Figure 1 shows an example of the modifications to the network

involved in computing cDSD, caDSD and finally capDSD. Of

the four different classical methods we test with all of DSD,

cDSD, caDSD and capDSD, we find that our best functio n pre-

diction method, over all three levels of the MIPS hierarchy is the

one that predicts v’s label based on the t closest neighbors in terms

of their values in the capDSD matrix, and has them vote on the

functional label of v, with a vote weight inversely proportional to

their capDSD value, assigning v the function with the highest

weighted vote. Significantly, the improvement is greater at the

lower (more specific) levels of the MIPS hierar chy.

2 MATERIALS AND METHODS

2.1 Datasets

2.1.1 Physical protein interaction network from BioGRID The

S.cerevisiae protein–protein physical interaction network is constructed

as follows: the list of 5064 verified ORFs downloaded from the SGD

website (Saccharomyces Genome Database, version date October 25,

2013) defines the nodes, and the 133 705 protein–protein physical inter-

actions from BioGRID (Stark et al., 2006) between nodes that are ver-

ified by at least one wet-lab experiment define the edges. After removing

edge redundancy, self-loops, and edges incident to unverified ORF nodes,

we extract the largest connected component and obtain a simple undir-

ected graph with n = 5001 nodes and m = 76 025 unique undirected

edges; we denote by G

0

(V

0

, E

0

, W

0

) this simple undirected graph with

unit-weight for all edges, where V

0

={v

1

,v

2

, ..., v

n

}andW

0

, the weight

matrix, is the n-dimensional square matrix with value 1 for entry (i, j)if

and only if (v

i

, v

j

)isinE

0

, and 0 otherwise.

2.1.2 Protein association network from STRING STRING

(Franceschini et al., 2013) is a database that integrates known and pre-

dicted protein associations from various sources, such as BioGRID

(Stark et al., 2006), BIND (Bader et al., 2003), DIP (Xenarios et al.,

2002), MINT (Licata et al., 2012), KEGG PATHWAY (Kanehisa and

Goto 2000) and gene co-expression data (Franceschini et al., 2013).

STRING assigns normalized confidence scores to many different types

of protein associations: some from experiments (physical and genetic pro-

tein interactions), or derived from co-expression, and others either

inferred by literature annotation or transferred from homology.

Because including edges inferred by literature annotation could invalidate

the separation of training and testing in our cross-validation experiments,

we could not use all the association categories in STRING. We extract all

protein associations from the ‘experiments’ and ‘co-expression’ categories

for yeast (with confidence score40 for at least one of the two categories),

where ‘experiments’ covers all physical and genetic protein interactions

and ‘co-expression’ refers to protein associations that are inferred from

similar transcriptional patterns in terms of gene co-expression levels. We

also want to include KEGG PATHWAY PPIs that have already been

incorporated in STRING; however, such information is mixed with and

cannot be separated from other data sources in the ‘database’ category,

including GO, which we do not want to include so as to avoid possible

overlapping between test data and training data in our function predic-

tion evaluation framework. Therefore, we directly extract association

links for pathway neighbors and subunits of the same enzyme/complex

from the KEGG PATHWAY database, the same fashion as what

STRING utilizes. We extract 454 600 protein–protein associations

(being sure to exclude homology-based transferred interologs) from

STRING version 9.05, release date: March 3, 2013 (Note that there is

also a more recent December 27, 2013 version 9.1 of STRING now

available, but it has no simple way to exclude interologs, so we used

the previous version.) We also include edges directly from KEGG (all

but 249 of these also appear in the portion of the STRING database we

use for our network; the discrepancy of 249 additional edges comes from

the fact that we use the December 2013 version of KEGG while STRING

version 9.05 uses the August 2012 version of KEGG). We further filter

the network by removing associations that are incident with at least one

unverified ORF from SGD. Afterward we compile the undirected graph

where a node corresponds to an ORF and an undirected edge is added if

there exists an association link between the two ORFs (we did not add

directed edges for the STRING experiment, since they were shown to

(a)

(b)

(c)

(d)

Fig. 1. An example of constructing auxiliary graphs for calculating dif-

ferent DSDs (with our BioGRID confidence scores). (a) The original PPI

network and two KEGG pathways; (b) the weight graph with PPI con-

fidence score as edge weights; (c) the directed graph with KEGG PPIs

added; and (d) the augmented graph by incorporating KEGG pathways

as weighted paths

i221

New directions for network-based protein function prediction

matter so little on the BioGRID experiment, see Table 3). The resulting

graph G

str

is undirected, connected, has diameter 5, and contains 5058

nodes and 404 358 edges.

2.1.3 PPI confidence assignment Because there is no confidence

score provided by BioGRID, we create confidence weights for

BioGRID PPI edges in G

0

using a scoring scheme similar to previous

work by (Gitter et al., 2011), according to the following premises:

Low-throughput experiments, due to their lower false positive rate,

are considered to provide more reliable PPIs than high-throughput

experiments.

If a PPI is verified experimentally by more experiments from curated

publications, we hold higher confidence in the existence of the PPI.

There are more than 7000 publications associated with the physical inter-

action PPI data we collect from BioGRID, making a manual assignment

of whether the experiment supporting the PPI is high- or low-throughput

highly impractical. Instead, we automatically and efficiently determine a

close proxy for this information by simply counting the number of differ-

ent PPIs that a particular publication vouches for in BioGRID. If there

are at least 100 PPIs associated with a particular publication, we classify

that publication’s endorsements as high-throughput and otherwise low-

throughput. In total, 7112 publications are classified as low-throughput

and 97 publications are classified as high-throughput. Note that these 97

high-throughput publications actually generate more than two-thirds of

the physical interactions. (We tried other cutoff values for distinguishing

low-throughput/high-throughput and the results were similar; in fact,

very few publications lie close to the 100 threshold; most low-throughput

publications have substantially less, and most high-throughput publica-

tions have substantially more.) If an interaction edge is endorsed by only

experiments of one type (either high- or low-throughput) we assign con-

fidence weights according to Table 1. If an interaction edge is endorsed by

both high confidence and low confidence experiments, we use the confi-

dence score from the low-throughput column in Table 1 plus 5% times

the number of high-throughput endorsements; however, if this value

exceeds 95%, we still assign a maximum confidence score of 95%.

For all pairs of nodes in G

0

, we can assign the confidence score as their

weight. We denote by W

conf

=fw

ij

g

n

i;j=1

the weight matrix, where w

ij

is the

confidence score for the node pair (v

i

, v

j

) (also denoted as w

vi,vj

when

confusion does not exist). Note that w

ij

=0; 8ðv

i

; v

j

Þ =2 E

0

and

w

ij

40; 8ðv

i

; v

j

Þ2E

0

.WedenotebyG

conf

ðV

conf

; E

conf

; W

conf

Þ this simple

undirected graph where V

conf

= V

0

, E

conf

= E

0

and W

conf

is defined

above.

For the edge weights in G

str

we simply take the confidence scores p

1

, p

2

and p

3

from STRING for each selected category: ‘experiments’,

‘co-expression’ and ‘database’ (Note that we assign 0.9 for ‘database’

confidence score if the association link is in the KEGG PATHWAY

PPIs, and 0 otherwise; the choice of 0.9 for KEGG PPIs is

similar to STRING’s.); then we calculate the combined confidence

score as p=1 ð1 p

1

Þð1 p

2

Þð1 p

3

Þ in the Bayesian scheme,

which is exactly how STRING (Franceschini et al., 2013) suggests indi-

vidual confidence scores be combined.

2.1.4 Functional pathway maps We use all 105 S.cerevisiae signaling

pathways from the KEGG PATHWAY database (Kanehisa and Goto,

2000) (version date: December 12, 2013) where there are 75 pathways from

the metabolism category, 21 from the genetic information processing cat-

egory, 3 from the environmental information processing category and 6

from the cellular processes category. Just as suggested in (Liu et al., 2009),

in the BioGRID experiments, we run both caDSD with all edges undir-

ected, and we also run the version of caDSD where we additionally con-

sider the following five protein relations that appear in the KEGG

database as directional: activation, inhibition, phosphorylation, depho-

sphorylation and ubiquination. Any PPIs extracted with only one of

these five types are considered directed, while all the other PPIs annotated

with types such as ‘compound’ are considered undirected. In total, there

are 206 directed PPIs and 6951 undirected PPIs separately, involving 1120

proteins in the KEGG PATHWAY database; since we only consider edges

of which both endpoints appear in the connected PPI network G

0

,we

extract 157 directed PPIs, the set of which is denoted by D, and 3374

undirected PPIs, the set of which is denoted by U, involving 1083 unique

ORFs total. Because the results for the caDSD adding so few directed

edges were very similar to the fully undirected version of caDSD, we do

not add directions to the edges in the STRING experiment.

2.1.5 Functional annotation We consider both the MIPS functional

catalogue (FunCat) (Ruepp et al., 2004) and GO annotations (Ashburner

et al., 2000). We use the latest version of FunCat (version 2.1) and the

first, second and third level functional categories, retaining only those

labels annotating at least three proteins in our dataset. We present results

for MIPS annotations at the first level (4443 proteins with 10 569 anno-

tations in 17 functional categories in BioGRID), second level (4428 pro-

teins with 12 378 annotations in 74 out of 80 functional categories

annotating at least 3 proteins in BioGRID) and third level (4061 proteins

with 9441 annotations in 154 out of 181 functional categories annotating

at least 3 proteins in BioGRID). We also present results for the popular

GO (Ashburner et al., 2000), where the variable depth hierarchy of the

annotation labels makes the evaluation of predicted labels more compli-

cated, in the Supplementary Material.

2.2 cDSD, caDSD and capDSD

2.2.1 cDSD: incorporating PPI confidence We build the undir-

ected weighted simple graph G

conf

(V

conf

, E

conf

, W

conf

)whereV

conf

= V

0

and E

conf

= E

0

are simply defined by assigning the confidence score to

all pairs of nodes in V

0

. The confidence scores are assigned as described in

Section 2.1.3. Let P

0

=fp

0

ij

g

n

i;j=0

be the n-dimensional one-step transition

matrix where the (i, j)th entry is given by

p

0

ij

=

w

ij

X

n

l=1

w

il

if ðv

i

; v

j

Þ2E

conf

0otherwise

:

8

>

<

>

:

Note that P

0

represents the probability to reach each neighbor in the

random walk. Then the definition of k-step transition probability

matrix P

0

fkg=P

0

k follows for all positive k.Itiseasytoshowthat

the expected number of times that a random walk starting at node v

i

and proceeding for k steps will visit node v

j

,denotedasHe

0

fkg

ðv

i

; v

j

Þ,

can be calculated as

X

k

l=0

p

0

flg

ij

,wherep

0

flg

ij

is the (i, j)th entry

of l-step transition probability matrix. The n-dimensional vector

He

0

fkg

ðv

i

Þ; 8v

i

2 V

conf

can be constructed accordingly. Therefore, when

we fix the number of random walk steps k, the definition of DSD with

PPI confidence follows:

cDSD

fkg

ðu; vÞ=jjHe

0

fkg

ðuÞHe

0

fkg

ðvÞjj

1

:

Table 1. Confidence score assignment for PPIs when either only low-

throughput or only high-throughput experiments are present

No. of experiments Low-throughput High-throughput

000

1 0.80 0.25

2 0.90 0.50

3 0.95 0.75

4 0.95 0.85

i222

M.Cao et al.

2.2.2 caDSD: adding KEGG PPIs We consider PPIs from KEGG

PATHWAY database highly reliable since they are manually drawn by

domain experts; for the BioGRID experiments, we will re-assign max-

imum confidence score 1 to these PPIs no matter whether or not the PPI

is present in the BioGRID database (For the STRING experiments, note

that every KEGG edge is already assigned a confidence value of at least

0.9 by cDSD (and maybe larger if there is additional independent evi-

dence) so we just retain cDSD confidence values on these edges).

Thus, based on the undirected graph G

conf

ðV

conf

; E

conf

; W

conf

Þ,theun-

directed edge set U and the directed edge set D from KEGG pathways,

we build a directed graph G

aug

ðV

aug

; E

aug

; W

aug

Þ,whereV

aug

=V

0

; E

aug

and W

aug

=fw

faugg

ij

g

n

i;j=0

are constructed as follows (we use hi to denote

directed edges compared to () for undirected edges):

(1) Initialize E

aug

by adding hv

i

; v

j

i and hv

j

; v

i

i with weight

w

faugg

ij

=w

faugg

ji

=w

fconfg

ij

; 8ðv

i

; v

j

Þ2E

conf

;

(2) For each edge ðv

i

; v

j

Þ2U,if(v

i

,v

j

) already exists in E

conf

,set

w

faugg

ij

=w

faugg

ji

=1, otherwise add hv

i

; v

j

i and hv

j

; v

i

i into E

aug

with

weight 1; and

(3) For each edge hv

i

; v

j

i2D,if(v

i

,v

j

) already exists in E

conf

,set

w

faugg

ij

=1, otherwise add hv

i

; v

j

i into E

aug

with weight 1.

Again, we define the one-step transition probability matrix P

aug

=

fp

faugg

ij

g

n

i;j=0

as follows:

p

faugg

ij

=

w

faugg

ij

=

X

n

l=1

w

faugg

il

if hv

i

; v

j

i2E

aug

;

0 otherwise:

(

Similarly we define the k-step transition probability matrix P

fkg

aug

=P

k

aug

and calculate the expected number of times that a random walk

starting at node v

i

and proceeding for k steps will visit node

v

j

; He

fkg

aug

ðv

i

; v

j

Þ=

X

k

l=0

p

faug;lg

ij

,wherep

faug;lg

ij

is the (i, j)th entry of

the l-step transition probability matrix P

flg

aug

.Thusthen-dimensional

vector He

fkg

aug

ðv

i

Þ; 8v

i

2 V

aug

follows similarly and when we fix the

number of random walk steps k, the definition of DSD with

KEGG PPIs is

caDSD

fkg

ðu; vÞ=jjHe

fkg

aug

ðuÞHe

fkg

aug

ðvÞjj

1

:

2.2.3 capDSD: the augmented graph with explicit pathways The

previous caDSD makes use of the fact that the PPIs from the KEGG

PATHWAY database are high-quality, and sometimes known to be dir-

ectional; however it incorporates the KEGG pathway information as

individual interaction edges and retains no notion of each pathway as a

cohesive whole. In particular, some graph paths may not be meaningful

at all when mapped to a chain of ORFs, while other graph paths corres-

pond to signaling pathways. We hypothesize that if we can make the

random walks used to calculate DSD values hew more tightly to the

known pathways, the resulting diffusion process might better capture

the notion of functional similarity. However, doing so directly would

destroy the ‘memoryless’ structure of the underlying random walk, and

make the probabilities too difficult to calculate. Our solution is to instead

build a new network, where nodes in pathways are replicated, into ordin-

ary and ‘highway’ versions, where the ‘highway’ version is chosen with

some probability, and if the ‘highway’ is taken, edge probabilities for the

highway nodes are set so that it is highly likely to continue along the

pathway. More specifically, we build a network G

path

ðV

path

; E

path

; W

path

Þ

where W

path

will be a mapping: W

path

: V

path

V

path

! R; 8a; b 2 V

path

(instead of an n-dimensional square matrix because the size of V

path

will

be different from n) as follows:

(1) Denote by {P

1

, P

2

, ..., P

g

}whereg is the number of pathways, the

set of pathways; denote by PE

1

, PE

2

, ..., PE

g

the sets of directed

edges from the g pathways where each undirected edge is

considered as two directed edges; denote by PV

1

,PV

2

, ...,PV

g

the sets of proteins involved in each of the g pathways where

each set is a subset of V

aug

, namely the ORF list;

(2) We initialize V

path

with fv

0

1

; v

0

2

; :::v

0

n

g by relabeling each ORF node

v

i

2 V

aug

with a superscript 0, which stands for the original PPI

network; we initialize W

path

as the empty map;

(3) We initialize E

path

by adding hv

0

i

; v

0

j

i with weight W

path

ðv

0

i

; v

0

j

Þ=

w

faugg

v

i

;v

j

for all hv

i

; v

j

i2E

aug

;

(4) For each pathway P

2fP

1

; P

2

; :::; P

g

g:

(a) For each protein v

i

2 PV

, add a pathway node v

i

into V

path

;

(b) For each pathway node v

i

2 V

path

:foreachedge

hv

i

; v

j

i2E

aug

, we add an edge hv

i

; v

0

j

i into E

path

with weight

w

faugg

ij

; and for each edge hv

j

; v

i

i2E

aug

,weaddanedgehv

0

j

; v

i

i

into E

path

with weight w

faugg

ji

; these newly added edges are

called cross edges; and

(c) For each edge hv

i

; v

j

i2PE

which we call a pathway edge, add

an edge hv

i

; v

j

i into E

path

, and the weight assignment will not

be set but the transition probability will be assigned specially in

Step 7 when all the pathways are processed.

(5) For each cross edge in the form of hv

0

i

; v

j

i2E

path

;

8i 2f1; 2; :::; ng; v

j

2 PV

;2f1; 2; :::; gg, boost the weight by

multiplying W

path

ðv

0

i

; v

j

Þ by the factor of m and update the

weight with the boosted value, where m is a multiplication factor

parameter;

(6) For all the directed node pairs hv

i

; v

j

i =2 E

path

; 8;

2f0; 1; :::; gg; v

i

; v

j

2 V

path

, assign 0 as the weight since we do

not have any evidence for the existence of the PPI pair hv

i

; v

j

i;

(7) Let N = jV

path

j, where N=n+

X

g

=1

jPV

j. Now we calculate the

N-dimensional one-step transition probability square matrix P

path

wherewedenotebyp

i

;j

as the one-step transition probability

from v

i

to v

j

; 8v

i

; v

j

2 V

path

:

(a) For each pathway node v

i

2 V

path

,where40, the pathway

edge hv

i

; v

j

i2E

path

, will have transition probability set as

p

i

;j

=r=d

i

,wherer 2ð0; 1Þ is a parameter and d

i

is the

number of pathway edges starting from v

i

; the cross edge

hv

i

; v

0

j

i2E

path

will have transition probability set as p

i

;j

0

=ð1

rÞW

path

ðv

i

; v

0

j

Þ=

X

hv

i

;v

0

l

i2E

path

W

path

ðv

i

; v

0

l

Þ if d

i

40, and

p

i

;j

0

=W

path

ðv

i

; v

0

j

Þ=

X

hv

i

;v

0

l

i2E

path

W

path

ðv

i

; v

0

l

Þ otherwise (no

edges across two pathway nodes from two different pathways

exist); and

(b) For each node v

0

i

2 V

path

, the transition probability will be set

as p

i

0

;j

=W

path

ðv

0

i

; v

j

Þ=

X

hv

0

i

;v

l

i2E

path

W

path

ðv

0

i

; v

l

Þ,ifhv

0

i

; v

j

i2

E

path

, and 0 otherwise.

Step 5 is used so that the probability of entering pathways can be

adjusted higher by setting the multiplication factor m41; in the Results

section, we report the results where m = 25. Step 7(a) is used so that the

total probability of staying on the same pathway after one transition

from a non-terminal pathway node (the node that has outgoing pathway

edges) will be r, which in our case we set as r = 0.7. We tried different

values for r and m empirically; and results are fairly robust to different

choices of r and m (results of weighted majority voting capDSD over

different choices of r and m appear in the Supplementary Material).

Given the one-step transition probability matrix P

path

as well as the

l-step transition probability matrix P

flg

path

=P

l

path

; 8l 0, we can calcu-

late the expected number of times that a random walk starting at

node v

i

and proceeding for k steps will visit node

v

j

; EXP

fkg

ðv

i

; v

j

Þ=

X

k

l=0

p

i

;j

.ThenwedefinetheHe value for each

i223

New directions for network-based protein function prediction

pair of ORF nodes v

i

; v

j

2 V

0

:

He

fkg

path

ðv

i

; v

j

Þ=

X

:v

j

2V

path

EXP

fkg

ðv

0

i

; v

j

Þ;

as well as the n-dimensional vector:

He

fkg

path

ðv

i

Þ=ðHe

fkg

path

ðv

i

; v

1

Þ; He

fkg

path

ðv

i

; v

2

Þ; :::; He

fkg

path

ðv

i

; v

n

ÞÞ:

The definition of DSD with external paths follows:

capDSD

fkg

ðv

i

; v

j

Þ=jjHe

fkg

path

ðv

i

ÞHe

fkg

path

ðv

j

Þjj

1

; 8v

i

; v

j

2 V

0

:

2.3 Evaluation

As shown in (5), the original DSD improves all the tested classical protein

function prediction algorithms in 2-fold cross-validation for functional

label prediction for all three levels of the MIPS hierarchy by simply

replacing the shortest-path distance with the DSD matrix, where the

best performing method overall was the DSD version of weighted major-

ity vote. In this work, we similarly evaluate four methods (majority vote,

weighted majority vote, multi-way cut and functional flow) using cDSD,

caDSD and capDSD as the distance metric. While the results in (Cao

et al., 2013) were based on the converged DSD as k !1,wehavenotyet

been able to prove convergence for our new cDSd, caDSD and capDSD

variants. Thus, in our experiments, we set the length of random walk step

k = 7 for all the three variants of DSDs (we also tested other values of k

and empirically observed that when k 5, the performance is almost

unchanged even though we have not been able to prove the convergence

of the variants of DSDs.)

We stress that in each of our experiments, the function prediction

method is unchanged, and does not explicitly incorporate confidence or

pathway information in any way, except in that it uses the values from the

cDSD, caDSD or capDSD matrix instead of from the DSD (or ordinary

shortest-path distance) matrix.

2.3.1 Cross-validation task We consider 2-fold cross-validation

tasks. In each of the 2-fold cross-validation tasks, we first randomly

split the annotated proteins into two sets. For each set, we use its anno-

tations as the training set to predict the annotations on proteins in the

other set. We then average the performance over the 2-folds of the cross-

validation. We conduct 10 runs of 2-fold cross-validation. For MIPS

function prediction we report the means and standard deviations of the

two performance measures over these 10 runs: accuracy and F1 score

(Cao et al., 2013). The accuracy is calculated as the percentage of proteins

that are assigned a correct function annotation (Schwikowski et al.,

2000). The F1 score for each protein function is calculated as (Darnell

et al.,2007)

F1=

2 precision recall

precision+recall

;

where precision and recall are calculated by looking at the top (in our

case, we present results for = 3) predicted annotations. We average F1

scores over the individual functions and obtain the overall F1 score for

each algorithm. Our GO (Ashburner et al., 2000) results take into account

partial matches based on the deep hierarchy of the GO labels according

to the methods of (Deng et al., 2003, 2004) and appear in the

Supplementary Material.

2.3.2 Neighborhood majority voting algorithm: weighted and

unweighted

These are the simplest of all function prediction methods.

Directly applying the concept of ‘guilt by association’, (Schwikowski

et al., 2000) consider for each protein u its neighboring proteins. Each

neighbor votes for their own annotations, and the majority is used as the

predicted functional label. To incorporate DSD, the neighborhood of u is

defined simply as the t nearest neighbors of u under the DSD metric.

Furthermore, two schemes are considered: an unweighted scheme where

all new neighbors vote equally, and a DSD weighted scheme where all

new neighbors get a vote proportional to the reciprocal of their DSD

distance. As in (Cao et al., 2013), we set t = 10.

Multi-way cut algorithm Similar to (Nabieva et al., 2005), we imple-

ment the minimal multi-way k-cut algorithm of (Vazquez et al., 2003)

whose motivation is to minimize the number of times that annotations

associated with neighboring proteins differ, by approximately solving the

integer linear programming problem:

maximize

X

ðu;vÞ2E;a2FUNC

X

u;v;a

subject to the constraints

X

a2FUNC

X

u;a

=1; X

u;v;a

X

u;a

; X

u;v;a

2f0; 1g

; X

v;a

2f0; 1g where the edge variables X

u,v,a

are defined for each function

a in the function set FUNC, whenever there exists an edge between pro-

teins u and v in the edge set E. X

u,v,a

is set to 1, if protein u and v both are

assigned function a, and 0 otherwise. The node variable X

u,a

are set to 1

when u is labeled with function a and 0 otherwise. The first constraint

insures that each protein is only given one annotation. The second con-

straint makes sure only annotations that appear among the vertices can

be assigned to the edges. While this problem is NP-hard, the ILP is

tractable in practice; in our case we use the IBM CPLEX solver (version

12.4, http:// www.ilog.com/ products/cplex/). For the DSD version of this

algorithm, we simply add additional edges between vertices whose DSD is

below a threshold . We set a global threshold D based on the average

DSD of all pairs, specifically we set D= c ,where is the average,

and is the standard deviation of the global set of DSD values among all

pairs of nodes in the graph. As in (Cao et al.,2013),wesetc =1.5.

Functional flow algorithm Nabieva et al. (2005) use a network flow

algorithm on the graph of protein interactions to label proteins. The

idea is to consider each protein having a known function annotation

as a ‘reservoir’ of that function, and to simulate flow of functional

association through the network to make predictions. We adapt the

approach to use DSD by creating an edge between each node pair,

with a weight inversely proportional to DSD. For computational effi-

ciency we do not create edges when the reciprocal of DSD is below a

small value. This global threshold for DSD values is set the same as in

the multi-way cut algorithm. As in the original functional flow, we

calculate flow through this new network at each time step. We denote

the size of the reservoir of function a at node u and time step i,tobe

R

a

i

ðuÞ. For a given function (annotation) a we initialize the reservoir

size at node u to be infinite if protein u has been annotated with

function a; otherwise we set it to be 0. More formally: R

a

0

ðuÞ=1 if

u is annotated with a and 0 otherwise. We then update the reservoir

over a sequence of time steps (we use six time steps, as in the original

version (Nabieva et al.,2005)):

R

a

t

ðuÞ=R

a

t1

ðuÞ+

X

v:ðu;vÞ2E

ðg

a

t

ðv; uÞg

a

t

ðu; vÞÞ;

where g

a

t

ðv; uÞ is the amount of flow a that moves from u to v at time

t. We incorporate DSD into the edge weight as follows:

g

a

t

ðu; vÞ=

0; if R

a

t1

ðuÞ5R

a

t1

ðvÞ

minð

1

DSDðu; vÞ

; flow

u;v

Þ otherwise:

;

8

>

<

>

:

where flow

u;v

=

1

DSDðu;vÞ

X

ðu;yÞ2E

1

DSDðu;yÞ

. The final functional score for node u and

function a is computed as the total amount of incoming flow.

i224

M.Cao et al.

3 RESULTS

3.1 Performance of function prediction methods and

their DSD variants on MIPS

Cao et al. (2013) show how to modify several classical function

prediction methods, including the four we study here (majority

vote, weighted majority vote, multi-way cut and functional flow)

to utilize the DSD pairwise dissimilarity metric in place of or-

dinary shortest-path distance. In this work, we use the same

DSD-based methods as in Cao et al. (2013), but instead substi-

tute the cDSD, caDSD and capDSD matrices to incorporate

confidence measures and pathways. Full MIPS results on

BioGRID data appear in Table 2, where we have two versions

of caDSD: one that adds directions to the 157 edges which are of

the five types identified by Gitter et al. (2011) as naturally dir-

ected, and one where all edges are left undirected. Table 3 then

gives the results on the integrative STRING database. Note that

for the STRING database, we already include all the KEGG

edges, so cDSD is equivalent to (undirected) caDSD, so this

merges the two lines in the table. GO results appear in the

Supplementary Material.

We observe that, on both BioGRID and STRING, over 10

runs of 2-fold cross-validation, the best method overall is

weighted majority vote with capDSD. For example, weighted

majority vote with capDSD achieves an average 68.90% accur-

acy and 51.61% F1 score on the first level of the MIPS hierarchy

on BioGRID, and an average 71.30% accuracy and 52.91% F1

score on the first level of the MIPS hierarchy using STRING.

Several other observations are interesting. On the BioGRID

data, substituting original DSD for the ordinary shortest-paths

metric improved all the function prediction methods we tested

across the board. On STRING, this was not the case: when

additional edges such as co-expression were added in, ordinary

DSD (without confidence weights) no longer improved the clas-

sical function prediction methods we tested with the exception of

functional flow, where there was a large improvement. But func-

tional flow did much worse overall on the STRING database

compared to BioGRID. This implies that when adding in add-

itional edges from sources that might be more weakly correlated

to functional transfer of annotation, it is crucial to include con-

fidence values. Once we go from unweighted DSD to DSD with

confidence, we again see improvements over classical methods.

Going from unweighted DSD to cDSD improves everything, but

it is even more crucial for STRING than for BioGRID to include

a confidence measure.

Now let us consider all the different ways to incorporate high-

confidence KEGG edges. In the BioGRID experiments, as re-

marked above, it is not surprising that caDSD and capDSD,

which use these edges perform better than cDSD, since not all

these edges appear already in BioGRID. In the STRING experi-

ment, these edges are already present in cDSD, so cDSD=caDSD

gives the naive way to put in these edges, whereas capDSD puts

them in as augmented pathways. In the BioGRID experiments, we

also experimentally tried assigning directions to some of the

Table 2. Summary of protein MIPS function prediction performance for the physical PPI network using DSD, cDSD, caDSD and capDSD compared

to the original methods in 10 runs of 2-fold cross-validation (as a percentage)

MIPS 1 MIPS 2 MIPS 3

Accuracy F1 score Accuracy F1 score Accuracy F1

Majority Vote (MV) 50.08 0.72 41.45 0.40 40.69 0.49 30.85 0.33 38.03 0.37 29.50 0.14

MV with original DSD 62.96 0.45 47.40 0.28 49.41 0.65 35.71 0.33 43.87 0.47 32.33 0.18

MV with cDSD 66.16 0.56 49.10 0.24 53.08 0.54 38.12 0.16 47.73 0.56 35.13 0.33

MV with caDSD (directed edges) 67.61 0.56 50.37 0.22 59.11 0.67 41.58 0.19 52.14 0.55 38.09 0.16

MV with caDSD (no directed edges) 67.61 0.42 50.36 0.24 59.11 0.57 41.57 0.25 52.13 0.56 38.07 0.21

MV with capDSD 67.60 0.37 50.28 0.27 59.46 0.57 41.58 0.22 52.97 0.59 38.190.23

Weighted MV (WMV) with original DSD 63.40 0.51 48.29 0.25 50.69 0.82 36.74 0.36 45.20 0.58 33.72 0.27

WMV with cDSD 67.07 0.45 50.12 0.35 54.82

0.56 39.53 0.18 49.56 0.49 36.71 0.32

WMV with caDSD (directed edges) 68.69 0.40 51.48 0.29 60.96 0.51 43.13 0.23 54.51 0.51 39.91 0.28

WMV with caDSD (no directed edges) 68.68 0.41 51.48 0.25 60.96 0.53 43.13 0.22 54.51 0.46 39.90 0.32

WMV with capDSD 68.90 0.49 51.61 0.21 61.82 0.59 43.54 0.26 56.16 0.59 40.42 0.35

Multi-way Cut (GMC) 55.31 0.41 42.18 0.29 42.02 0.43 28.21 0.36 36.69 0.50 24.98 0.21

GMC with original DSD 58.36 0.32 42.51 0.19 44.63 0.32 29.51 0.27 38.20 0.40 25.49 0.22

GMC with cDSD 61.11 0.37 42.85 0.23 47.11 0.35 30.52 0.25 40.83 0.61 26.66 0.22

GMC with caDSD (directed edges) 62.71 0.30 43.46 0.24 52.59 0.25 32.47 0.30 44.29 0.63 28.46 0.19

GMC with caDSD (no directed edges) 62.76

0.31 43.45 0.25 52.61 0.25 32.50 0.30 44.31 0.63 28.46 0.19

GMC with capDSD 62.44 0.31 43.43 0.17 52.30 0.46 32.48 0.31 44.18 0.59 28.34 0.32

Functional Flow (FF) 50.48 0.48 37.17 0.25 32.57 0.48 22.64 0.32 25.29 0.39 18.27 0.14

FF with original DSD 53.58 0.36 40.75 0.11 38.20 0.65 26.71 0.29 30.70 0.45 22.29 0.28

FF with cDSD 57.78 0.49 42.82 0.27 42.17 0.58 29.29 0.38 35.68 0.48 25.72 0.17

FF with caDSD (directed edges) 60.09 0.55 44.81 0.24 49.73 0.41 33.89 0.32 40.82 0.60 28.94 0.27

FF with caDSD (no directed edges) 60.18 0.47 44.80 0.20 49.67 0.51 33.89 0.28 40.82 0.51 28.97 0.23

FF with capDSD 58.98 0.53 43.80 0.27 49.32 0.61 33.32 0.29 41.04 0.33 28.83

0.33

Note: Weighted majority vote with capDSD (in bold) gives the best results over all three levels of the MIPS hierarchy.

i225

New directions for network-based protein function prediction

KEGG edges as well, as in the method of Gitter et al. (2011) (see

Methods section). However, we find that directing 157 edges is

much too small a number to affect results; as can be seen in Table

2, results are nearly identical to the undirected caDSD. We there-

fore used only undirected caDSD which is the same as cDSD for

the STRING experiments.

So it remains to answer the main question of the article,

whether using the augmented pathways as controlled-access

highways is a better way to incorporate pathway information

than just using individual edges. The best performing method,

weighted majority vote, improved things only very slightly (by

51–1.5 pp) for BioGRID, on different levels of the MIPS hier-

archy, with more improvement at the lower levels of the hier-

archy. However, on STRING, with the presence of more edges

that were more weakly correlated to function, the improvement

is much greater. In the STRING experiments (Table 3), going to

pathways (capDSD) improved weighted majority vote by over

1.5 pp on the first level of the MIPS hierarchy, by over 3 pp on

the second level of the MIPS hierarchy and by over 4 pp on the

third level of the MIPS hierarchy. Similar improvements are seen

for capDSD with unweighted majority vote and functional flow

on STRING, though these are not the best performing methods

overall, while performance of multi-way cut degrades with aug-

mented pathways. We next discuss why that might be the case.

4DISCUSSION

Incorporating confidence and pathways into our diffusion-based

distance metric DSD, we studied whether it was best to incorp-

orate pathway information as edges or as controlled-access high-

ways in an augmented graph. We showed that the augmented

graph improved the best function prediction method we tested,

weighted majority vote, especially in our experiments on the

STRING database, where there were additional edges whose

correlation with function was weaker. The performance of

other methods was not as clearly served by the augmented path-

ways; capDSD improved functional flow in the noisier STRING

setting, but not on BioGRID. The performance of multi-way cut

degraded across the board. We hypothesize that the methods

that will improve using capDSD versus just caDSD are those

that use only some sort of information about the local neighbor-

hood of a node to predict its function; here, making path-

ways ‘closer’ with highways is helpful, whereas the amount of

distortion in augmenting the graph causes too much noise for

more global methods such as multi-way cut. Functional flow, has

both local and global aspects, so its mixed performance would be

consistent with this theory.

Finally, the best modern function prediction methods are all

integrative methods, and may do something more sophisticated

than adding in data from other high-throughput data sources as

edges with different confidences (Sharan et al., 2005, 2007;

Borgwardt et al., 2005; Cozzetto et al., 2013; Dutkowski et al.,

2013). Thus the next step would be to integrate our results into a

hybrid method along these lines.

We note that all code for calculating the confidences, for ex-

tracting pathway information from KEGG XML files, and for

calculating the cDSD, caDSD and capDSD matrices is available

from http://dsd.cs.tufts.edu/capdsd.

ACKNOWLEDGEMENTS

Thanks to the CRA-W DREU program which supported K.J.D.

to spend the summer doing research with L.J.C. at Tufts. Thanks

to Mark Crovella, Donna Slonim and the entire Tufts BCB

group for helpful feedback.

Funding: J.P. was partially supported by NIH grant R01

HD076140 (to D. K. S.).

Conflict of interest: none declared.

Table 3. Summary of protein MIPS function prediction performance for the STRING integrative network G

str

using DSD, cDSD/caDSD and capDSD

compared to the original methods in 10 runs of 2-fold cross-validation (as a percentage)

MIPS 1 MIPS 2 MIPS 3

Accuracy F1 score Accuracy F1 score Accuracy F1

Majority Vote (MV) 65.71 0.36 49.50 0.25 53.95 0.47 37.96 0.19 46.17 0.50 33.75 0.33

MV with original DSD 64.93 0.56 48.55 0.42 50.99 0.35 36.10 0.24 44.47 0.35 31.85 0.22

MV with cDSD/caDSD 69.38 0.71 51.54 0.36 58.01 0.50 40.41 0.32 51.48 0.46 36.86 0.32

MV with capDSD 70.25 0.47 52.22 0.39 61.22 0.57 42.52 0.29 55.54 0.44 39.36 0.21

Weighted MV (WMV) with original DSD 65.25 0.45 49.15 0.44 52.19 0.42 37.10 0.29 45.64 0.41 33.00 0.16

WMV with cDSD/caDSD 69.67 0.56 52.20 0.37 59.41 0.42 41.62 0.26 53.21 0.37 38.29 0.28

WMV with capDSD 71.30 0.44 52.97 0.38 62.88 0.54 43.98 0.39 57.84 0.50 41.07 0.21

Multi-way Cut (GMC) 63.48 0.56 43.03 0.20 52.66

0.54 31.67 0.18 43.37 0.60 26.20 0.19

GMC with original DSD 63.29 0.68 42.80 0.23 52.34 0.56 31.60 0.21 43.59 0.33 26.39 0.18

GMC with cDSD/caDSD 65.18 0.38 43.39 0.16 53.59 0.47 31.89 0.18 44.46 0.36 26.50 0.17

GMC with capDSD 65.21 0.46 43.31 0.15 51.09 0.37 30.74 0.20 40.73 0.40 25.49 0.21

Functional Flow (FF) 39.91 0.77 31.61 0.25 22.26 0.53 17.25 0.21 18.48 0.49 14.26 0.09

FF with original DSD 47.44 0.42 36.46 0.18 29.46 0.30 21.06 0.25 23.08 0.21 16.68 0.16

FF with cDSD/caDSD 51.70 0.43 38.57 0.21 34.67 0.27 24.03 0.19 28.32 0.35 19.39 0.20

FF with capDSD 53.00 0.37 39.73 0.19 37.93 0.50 26.56 0.18 31.18 0.36 21.59 0.20

Note: Weighted majority vote with capDSD (in bold) gives the best results over all three levels of the MIPS hierarchy.

i226

M.Cao et al.

REFERENCES

Arnau,V. et al. (2005) Iterative cluster analysis of protein interaction data.

Bioinformatics, 21, 364–378.

Ashburner,M. et al. (2000) Gene Ontology: tool for the unification of biology. Nat.

Genet., 25, 25–29.

Bader,G.D. et al. (2003) BIND: the biomolecular interaction network database.

Nucleic Acids Res., 31, 248–250.

Borgwardt,K.M. et al. (2005) Protein function prediction via graph kernels.

Bioinformatics, 21 (Suppl. 1), i47–i56.

Cao,M. et al. (2013) Going the distance for protein function prediction: a new

distance metric for protein interaction networks. PLoS One, 8,e76339.

Chen,J. et al. (2009) Disease candidate gene identification and prioritization using

protein interaction networks. BMC Bioinformatics, 10, doi:10.1186/1471–2105–

10–73.

Cozzetto,D. et al. (2013) Protein function prediction by massive integration of

evolutionary analyses and multiple data sources. BMC Bioinformatics, 14

(Suppl. 3), S1.

Darnell,S.J. et al. (2007) An automated decision-tree approach to predicting protein

interaction hot spots. Prot. Struct. Funct. Bioinform., 68, 813–823.

Deng,M. et al. (2003) Assessment of the reliability of protein-protein interactions

and protein function prediction. Pacific Symposium on Biocomputing, 140–151.

Deng,M. et al. (2004) Mapping Gene Ontology to proteins based on protein–protein

interaction data. Bioinformatics, 20, 895–902.

Du,D. et al. (2012) Systematic differences in signal emitting and receiving revealed

by pagerank analysis of a human protein interactome. PLoS One, 7, e44872.

Dutkowski,J. et al. (2013) A Gene Ontology inferred from molecular networks. Nat.

Biotechnol, 31, 38–45.

Erten,S. et al. (2011) VAVIEN: an algorithm for prioritizing candidate disease genes

based on topological similarity of protein interaction networks. J. Comput. Biol.

,

18, 1561–1574.

Franceschini,A. et al. (2013) String v9. 1: protein-protein interaction networks, with

increased coverage and integration. Nucleic Acids Res., 41, D808–D815.

Gandhi,T. et al. (2006) Analysis of the human protein interactome and comparison

with yeast, worm and fly interaction datasets. Nat. Genet., 38, 285–293.

Gitter,A. et al. (2011) Discovering pathways by orienting edges in protein inter-

action networks. Nucleic Acids Res., 39, e22–e22.

Hishigaki,H. et al. (2001) Assessment of prediction accuracy of protein function

from protein-protein interaction data. Yeast, 18, 523–531.

Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto encyclopedia of genes and gen-

omes. Nucleic Acids Res., 28,27–30.

Kohler,S. et al. (2008) Walking the interactome for prioritization of candidate dis-

ease genes. Am.J.Hum.Genet., 82, 949–958.

Liao,C.-S. et al.IsoRankN: spectral methods for global alignment of multiple pro-

tein networks. Bioinformatics, 25, i253–i258.

Licata,L. et al. (2012) Mint, the molecular interaction database: 2012 update.

Nucleic Acids Res., 40, D857–D861.

Liu,W. et al. (2009) Proteome-wide prediction of signal flow direction in protein

interaction networks based on interacting domains. Mol. Cell. Proteom., 8,

2063–2070.

Mering,V.C. et al. (2002) Comparative assessment of large-scale data sets of protein-

protein interactions. Nature, 417, 399–403.

Moustakas,A. (2002) Smad signalling network. J. Cell Sci., 115, 3355–3356.

Nabieva,E. et al. (2005) Whole-proteome prediction of protein function via graph-

theoretic analysis of interaction maps. Bioinformatics, 21, 302–310.

Reguly,T. et al. (2006) Comprehensive curation and analysis of global interaction

networks in Saccharomyces cerevisiae. J. Biol.

, 5,11.

Ruepp,A. et al. (2004) The FunCat, a functional annotation scheme for system-

atic classification of proteins from whole genomes. Nucleic Acids Res., 32,

5539–5545.

Schwikowski,B. et al. (2000) A network of protein-protein interactions in yeast. Nat.

Biotechnol., 18, 1257–1261.

Sharan,R. et al. (2005) Conserved patterns of protein interaction in multiple species.

Proc. Natl Acad. Sci. USA, 102, 1974–1979.

Sharan,R. et al. (2007) Network-based prediction of protein function. Mol. Syst.

Biol., 3,88.

Stark,C. et al. (2006) BioGRID: a general repository for interaction datasets.

Nucleic Acids Res., 34 (Suppl. 1), D535–D539.

Vanunu,O. et al. (2010) Associating genes and protein complexes with disease via

network propogation. PLoS Comput. Biol., 6, e1000641.

Vazquez,A. et al. (2003) Global protein function prediction from protein-protein

interaction networks. Nat. Biotechnol., 21, 696–700.

Voevodski,K. et al. (2009) Spectral affinity in protein networks. BMC Syst. Biol., 3,

112.

Xenarios,I. et al. (2002) DIP, the database of interacting proteins: a research tool

for studying cellular networks of protein interactions. Nucleic Acids Res., 30,

303–305.

i227

New directions for network-based protein function prediction

- CitationsCitations9
- ReferencesReferences43

- "Protein–protein interaction (PPI) network STRING 9.0 database (Search Tool for the Retrieval of Interacting Genes) was used to gather direct and indirect protein–protein interactions (Franceschini et al. 2013). The database provided access to information on their neighborhood , gene fusions, co-occurrence, co-expression, experiments and literature mining (Cao et al. 2014). We established a PPI network based on a high confidence score of 0.700, which suggested that only interaction with high level of confidence were extracted from the database and deemed to be valid links for the PPI network. "

[Show abstract] [Hide abstract]**ABSTRACT:**In the present study, seven galacturonosyltransferase-like (GATL) genes (OsGATLs) in rice (Oryza sativa L.) were genome-widely identified and the chromosomal locations and the gene structures of which were characterized. Under normal condition, OsGATL2 and OsGATL3 are highly expressed in root, while OsGATL4 is highly expressed in stem and leaf. Many cis-elements related to stress response and plant hormone were found in the promoter sequence of each OsGATL. The expression patterns of these OsGATL genes under treatment with abscisic acid (ABA), drought and low temperature were assessed by qRT-PCR. The expression levels of most OsGATLs significantly increased following the treatments with drought or low temperature. In addition, physicochemical properties of OsGATLs and phylogenetic analysis with GATL from rice and several other species were performed. 3D structures and protein–protein interaction (PPI) network of OsGATLs were further predicted by Swiss-model and STRING 9.0 database, respectively. The identification and bioinformatic analysis of GATL family in rice could provide reference data for further study on their biological functions, especially in the responsiveness to hormones and stress signaling.- "This insight provided the basis for several diffusion-based methods [4] [12] [19] [8] that aim to predict characteristics of genes or proteins by using the diffusion states to better capture topological associations. Instead of simply using the probability in the diffusion state, the diffusion state distance (DSD) approach, using L1 distances between diffusion states, achieved the state-of-the-art performance on predicting protein functions on yeast interactomes [4] "

[Show abstract] [Hide abstract]**ABSTRACT:**Complex biological systems have been successfully modeled by biochemical and genetic interaction networks, typically gathered from high-throughput (HTP) data. These networks can be used to infer functional relationships between genes or proteins. Using the intuition that the topological role of a gene in a network relates to its biological function, local or diffusion based "guilt-by-association" and graph-theoretic methods have had success in inferring gene functions. Here we seek to improve function prediction by integrating diffusion-based methods with a novel dimensionality reduction technique to overcome the incomplete and noisy nature of network data. In this paper, we introduce diffusion component analysis (DCA), a framework that plugs in a diffusion model and learns a low-dimensional vector representation of each node to encode the topological properties of a network. As a proof of concept, we demonstrate DCA's substantial improvement over state-of-the-art diffusion-based approaches in predicting protein function from molecular interaction networks. Moreover, our DCA framework can integrate multiple networks from heterogeneous sources, consisting of genomic information, biochemical experiments and other resources, to even further improve function prediction. Yet another layer of performance gain is achieved by integrating the DCA framework with support vector machines that take our node vector representations as features. Overall, our DCA framework provides a novel representation of nodes in a network that can be used as a plug-in architecture to other machine learning algorithms to decipher topological properties of and obtain novel insights into interactomes.- "These methods can also be adapted to replace Eq. (7). Since our work focuses on how to replenish the missing labels and how to predict protein functions using incomplete hierarchical labels, how to more efficiently utilize the guilt-by-association rule and how to reduce noise in PPI networks to boost the accuracy (i.e., by enhancing the functional content [42], or by incorporating additional data sources [5,15,16]), is out of scope. Based on Eq. (6) and Eq. "

[Show abstract] [Hide abstract]**ABSTRACT:**Background Protein function prediction is to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e. no label is missing. But in real scenarios, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction.ResultsIn this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL in short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses a regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels.Conclusion The empirical study shows that it is important to consider the incomplete annotation for protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.

Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.

This publication is classified Romeo Yellow.

Learn more