Saccà et al. BMC Bioinformatics 2014, 15:103
http://www.biomedcentral.com/1471-2105/15/103

METHODOLOGY ARTICLE    Open Access

Improved multi-level protein–protein interaction prediction with semantic-based regularization

Claudio Saccà¹, Stefano Teso², Michelangelo Diligenti¹ and Andrea Passerini²*
Abstract
Background: Protein–protein interactions can be seen as a hierarchical process occurring at three related levels:
proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed
knowledge about which domains and residues are involved in a given interaction has extensive applications to
biology, including better understanding of the binding process and more efficient drug/enzyme design. Alas, most
current interaction prediction methods do not identify which parts of a protein actually instantiate an interaction.
Furthermore, they also fail to leverage the hierarchical nature of the problem, ignoring otherwise useful information
available at the lower levels; when they do, they do not generate predictions that are guaranteed to be consistent
between levels.
Results: Inspired by earlier ideas of Yip et al. (BMC Bioinformatics 10:241, 2009), in the present paper we view the
problem as a multi-level learning task, with one task per level (proteins, domains and residues), and propose a
machine learning method that collectively infers the binding state of all object pairs. Our method is based on
Semantic Based Regularization (SBR), a flexible and theoretically sound machine learning framework that uses First
Order Logic constraints to tie the learning tasks together. We introduce a set of biologically motivated rules that
enforce consistent predictions between the hierarchy levels.
Conclusions: We study the empirical performance of our method using a standard validation procedure, and
compare its performance against the only other existing multi-level prediction technique. We present results showing
that our method substantially outperforms the competitor in several experimental settings, indicating that exploiting
the hierarchical nature of the problem can lead to better predictions. In addition, our method is also guaranteed to
produce interactions that are consistent with respect to the protein–domain–residue hierarchy.
Background
Physical interactions between proteins are the workhorse of cell life and development [1], and play an extremely important role both in the mechanisms of disease [2] and in the design of new drugs [3]. In recent years, there has been enormous interest in reverse engineering the protein–protein interaction (PPI) networks of several species, particularly due to the availability of high-throughput experimental techniques, leading to an abundance of large databases on all aspects of PPIs [4].
*Correspondence: passerini@disi.unitn.it
²Dipartimento di Ingegneria e Scienza dell'Informazione, University of Trento, Trento, Italy
Full list of author information is available at the end of the article
Notwithstanding the increased availability of interac-
tion data, the natural question of whether two arbitrary
proteins interact, and why, is still open. The growing
literature on protein interaction prediction [4-6] is symp-
tomatic of the gap separating the amount of available data
and the effective size of the interaction network [7]. The
present paper is a contribution towards filling this gap.
Our work is based on the observation that physical interactions can be viewed at three levels of detail. At the highest level, two proteins interact to perform some function within a biological pathway (e.g. metabolism, signaling, regulation, etc.) [8]. At a lower level, the same interaction occurs between a pair of specific domains appearing in the proteins; the types of the domains involved characterize the functional semantics of the
interaction [9]. At the lowest level, the interaction is instantiated by the binding of a pair of protein interfaces, patches of solvent-accessible residues with compatible shapes and chemical properties [10]. The low-level features of the binding sites determine whether the interaction is transient or permanent, whether two proteins compete for interaction with a third one, etc. Figure 1 illustrates the multi-level mechanisms with an example taken from the PDB.
Despite the significance of low-level details in elucidating the mechanics of protein–protein interactions, most of the current experimental data comes from high-throughput screening techniques, such as yeast two-hybrid (Y2H) assays [11]. These techniques do not provide information on domain- or residue-level interactions, which require solving the three-dimensional structure of each protein–protein complex, an expensive and time-consuming task addressed by X-ray crystallography, NMR, or electron microscopy techniques [12]. As a consequence, protein–protein interaction data is under-characterized at the domain and residue levels: the current databases are relatively lacking when compared to the magnitude of the existing body of data about protein-level interactions [13]. At the time of writing, the PDB hosts 84,418 structures, but merely 4,210 resolved complexes (according to http://www.rcsb.org/pdb/statistics/holdings.do, retrieved on 2013/06/20). The latter cover only a tiny fraction of the interactions stored in databases such as BioGRID and MIPS.
From a purely biological perspective, predictions at different levels have several important applications. The network topology and individual features of protein interactions are an essential component of a wide range of biological tasks: inferring protein function [14] and localization [15], reconstructing signal and metabolic pathways [16], and discovering candidate targets for drug development [2]. Finer-granularity predictions at the domain level make it possible to discover affinities between domain types that can be carried over to other proteins [17,18]; domain–domain networks have also been assessed as being typically more reliable than their protein counterparts [13]. Finally, residue-level predictions, i.e., interface recognition, enable the detailed study of the principles of protein interactions, and are crucial for tasks such as rational drug design [3], metabolic reconstruction and engineering [19], and identification of hot-spots [20] in the absence of structure information.
Given the usefulness of knowing the details of protein–protein interactions at diverse levels of detail, and based on earlier ideas of Yip et al. [21], in this paper we address the problem of collectively predicting the binding state of all proteins, domains, and residues in a network. We call this task the multi-level protein interaction prediction problem (MLPIP for short).
From a computational point of view, the most important feature of the multi-level prediction problem is its inherently relational nature. Proteins, domains and residues are organized in a hierarchy, which dictates constraints on the binding state of pairs of objects at the different levels, as follows. On the one hand, whenever two proteins are bound, at least two of their domains must also be bound, and, similarly, there must be residues in the two domains that form an interface. On the other hand, if no residues of the two proteins interact, neither do their domains, nor the proteins themselves. In other words, predictions at different levels must be consistent.
In this paper we cast the multi-level prediction problem as a statistical-relational learning task, leveraging the latest developments in the field. Our prediction method is based on Semantic Based Regularization [22], an elegant semi-supervised prediction framework that combines the effectiveness of kernel machines with the expressivity of First Order Logic (FOL). The constraints described above are encoded as FOL rules, which are used to enforce consistent predictions at all levels of the interaction hierarchy. By computing multi-level predictions, our method can not only infer which protein pairs are likely to interact, but also provide details about how the interactions take place.
place. Our empirical evaluation shows the effectiveness o f
this constraint-based approach in boosting predictive per-
formance, achie ving substantial improvements over both
Figure 1 The protein–domain–residue hierarchy. Two bound proteins and their interacting domains and residues, captured in PDB complex
4IOP. The proteins are a Killer cell lectin-like receptor (in violet) and its partner, a C-type lectin domain protein (in blue). (Left) Interaction as visible
from the contact surface. (Center) The two C-type lectin domains instantiating the interaction. (Right) Effectively interacting residues in red.
Saccà et al. BMC Bioinformatics
2014, 15:103
Page 3 of 18
http://www.biomedcentral.com/1471-2105/15/103
an unconstrained baseline and the only existing alterna-
tive MLPIP metho d [21].
Problem definition
PPI networks are most naturally formalized as graphs, where nodes represent proteins and edges represent interactions. Given a set of features describing the properties of the proteins in the network (e.g. primary structure, localization, tertiary structure when available, etc.), inferring the PPI network topology amounts to determining those pairs of proteins that are likely to interact. This task is often cast as a pairwise classification problem, where a binary classifier takes as input a pair of proteins (or rather their feature-based representations) and predicts whether they interact or not. Standard binary classification methods, such as Support Vector Machines [23], can be used to implement the pairwise classifier. In this setting, the interaction depends only on the features of the two incident nodes, and is independent of all other nodes. Interactions between domains or residues can be predicted similarly.
The most straightforward way to address the MLPIP problem is to cast the three interaction prediction problems, for proteins, domains and residues respectively, as independent pairwise classification tasks. However, as previously discussed, these problems are clearly strongly related: two proteins interact via one or more domains, which in turn contain patches of residues that constitute the interaction surface. Ignoring these relationships can lead to heavily suboptimal, inconsistent predictions, where, e.g., two proteins are predicted to interact but none of their domains are predicted to be involved in this interaction. Making these relationships explicit and forcing predictors to satisfy consistency constraints is the key contribution of this work. In the machine learning community, this kind of scenario, characterized by multiple related prediction tasks, is usually cast as a statistical-relational learning problem [24,25], where the goal is to collectively classify the state of all objects of interest, taking into account the relations existing between them. The solution we adopt is grounded in this learning framework.
Overview of the proposed method
In this paper we propose solving the multi-level prediction problem by adapting a state-of-the-art statistical-relational learning framework, namely Semantic Based Regularization (SBR) [22]. SBR ties together multiple learning tasks, which are themselves addressed by kernel machines, using constraints expressing First Order Logic knowledge. In the following we give an overview of the SBR framework, also pictured in Figure 2; see Methods for further details.
Let $\mathcal{X}$ be a set of objects. In most scenarios, objects are typed, so that objects of the same type can be considered as belonging to the same group. In our setting, object types are proteins, domains and residues, with corresponding sets $\mathcal{X}_P$, $\mathcal{X}_D$ and $\mathcal{X}_R$ respectively. Predicates represent properties of objects or relationships between them. Depending on the scenario, some predicates are always known (called given predicates), while others are known only for a subset of the objects, and their value should be predicted when unknown (query or target predicates). The parentpd(p,d) predicate, for instance, specifies that domain $d \in \mathcal{X}_D$ is part of protein $p \in \mathcal{X}_P$, i.e. the predicate is true for all (p,d) pairs for which d is a domain of p, and false otherwise. The value of this predicate is known for all objects in our domain (note that there are indeed many proteins whose domains are unknown, but in this case there is no corresponding domain object in our data). The boundp(p,p') predicate specifies whether two proteins p and p' are interacting. This is one of the target predicates, whose truth value should be predicted for novel protein–protein pairs. Similar predicates are defined for domain- and residue-level bindings. Target predicates are modelled as binary classifiers, i.e. functions trained to predict the truth value of the predicate. Relationships between predicates can be introduced in order to enforce constraints known to hold in the domain. SBR makes it possible to exploit the full power of First Order Logic in doing this. As an example, the notion that two interacting proteins should have at least one pair of interacting domains can be modelled as (see Methods for details on First Order Logic notation):

∀(p,p') boundp(p,p') ⇒ ∃(d,d') boundd(d,d') ∧ parentpd(p,d) ∧ parentpd(p',d')
Each binary classifier is implemented in the SBR framework as a kernel machine [26]. The key component of kernel machines is the kernel function, which measures the similarity between objects in terms of their representations. A protein, for instance, can be represented as the sequence of its residues, plus additional information such as its subcellular localization and/or its phylogenetic profile. Having the same subcellular localization, for instance, should increase the similarity between two proteins, as should having a similar amino acid composition. Designing appropriate kernels is a crucial component of a successful predictor. A kernel machine is a function which predicts a certain property of an object $x$ in terms of a weighted sum of similarities to other objects for which the property is known, i.e.:

$$f(x) = \sum_i w_i \, K(x, x_i)$$
A kernel machine could for instance predict whether a protein is an enzyme or not (binary classification), in terms of weighted similarity to other proteins. Being similar to an enzyme $x_i$ will drive the prediction towards the positive (enzyme) class (positive weight $w_i$), while being similar to a non-enzyme $x_j$ will drive the prediction towards the opposite class (negative weight $w_j$).
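As an illustration, the decision function above takes only a few lines of code. This is a minimal sketch of the general idea, not the SBR implementation; the Gaussian RBF kernel and the toy enzyme data are assumptions made purely for the example.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.1):
    # Gaussian RBF kernel: similarity decays with squared distance.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_machine_score(x, support_xs, weights, kernel=rbf_kernel):
    # f(x) = sum_i w_i * K(x, x_i); the sign gives the predicted class,
    # the magnitude acts as the margin (confidence).
    return sum(w * kernel(x, x_i) for w, x_i in zip(weights, support_xs))

# Toy usage: one positive (enzyme) and one negative (non-enzyme) example.
enzyme = np.array([1.0, 0.2])
non_enzyme = np.array([0.1, 0.9])
score = kernel_machine_score(np.array([0.8, 0.3]), [enzyme, non_enzyme], [1.0, -1.0])
print("enzyme" if score > 0 else "non-enzyme")
```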
Figure 2 Visualization of the proposed method. (a) Kernel preparation at the three levels. A kernel is derived for each input feature (Left); the resulting matrices are summed up to obtain a per-object kernel (Middle), which is transformed into a pairwise kernel using Eq. (1). Here $N_p$ ($N_d$, $N_r$) is the number of individual proteins (respectively domains, residues) in the level, while $N_{pp}$ ($N_{dd}$, $N_{rr}$) is the number of protein (respectively domain, residue) interactions in the dataset. (b) Instantiation of all predicates (Table 1) over a pair of proteins p and p' and their parts. Circles represent proteins, domains and residues. Dotted lines indicate a parent-child relationship between objects, representing the parentpd and parentdr predicates. Solid lines link pairs of bound objects, i.e. objects for which the boundp, boundd or boundr predicates are true. (c) Visualization of the experimental pipeline. Given the pairwise kernels, the set of rules (Table 2), a set of example interactions, and a description of the protein–domain–residue hierarchy, SBR finds a prediction for the query predicates consistent with the rules.
In the interaction prediction setting, target predicates express properties of pairs of objects (proteins, domains or residues). We thus employ a pairwise kernel machine classifier to model each target predicate:

$$f(x, x') = \sum_i w_i \, K((x, x'), (x_i, x_i'))$$

Here the kernel function measures the similarity between two pairs of objects, so that, e.g., two proteins will be predicted as interacting if they are similar to protein pairs which are known to interact, and dissimilar from pairs known not to interact.
Given a kernel between objects $K(x, x')$, it is possible to construct a pairwise kernel by means of the following transformation [27]:

$$K((x_i, x_j), (x_k, x_l)) = K(x_i, x_k) \cdot K(x_j, x_l) + K(x_i, x_l) \cdot K(x_j, x_k) \qquad (1)$$

This transformation guarantees that, if the input function $K$ is a valid kernel, so is the resulting pairwise function.
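A sketch of this construction on a precomputed kernel matrix follows; the symmetrization over the two pair orderings is what makes the result independent of how each pair is written down. This is an illustrative implementation, not the code used in the paper.

```python
import numpy as np

def pairwise_kernel(K, pair_a, pair_b):
    # Lift an object-level kernel matrix K to pairs of objects, as in Eq. (1).
    i, j = pair_a
    k, l = pair_b
    return K[i, k] * K[j, l] + K[i, l] * K[j, k]

# Toy usage: a 4-object kernel matrix and two object pairs.
K = np.array([[1.0, 0.5, 0.2, 0.1],
              [0.5, 1.0, 0.3, 0.4],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.4, 0.6, 1.0]])
print(pairwise_kernel(K, (0, 1), (2, 3)))  # same value for (1, 0) vs (3, 2)
```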
As already explained, in SBR each target predicate is implemented as a kernel machine, and the state of a predicate for an uncharacterized pair of proteins can be inferred by querying the machine. Positive predictions correspond to true predicates, i.e. bound protein pairs, and negative predictions to false ones. The confidence of the kernel machine, also called the margin, embodies the confidence in the state of the predicate, that is, how strongly two proteins are believed to interact (or not). Given the output of the kernel machines for all target predicates, SBR uses the First Order Logic rules to condition the state of the correlated predicates. It does so by first translating the FOL rules into continuous constraints, which we discuss more thoroughly in Methods. The variables coming into play in the continuous constraints are the confidences of all target predicates (and the states of all given predicates) appearing in the equivalent FOL constraint. The amount of violation is reflected by the value of the continuous constraint: if the predicted predicates satisfy a FOL rule, the corresponding constraint will have a value equal to 1; on the other hand, the closer the constraint value is to zero, the more the FOL rule is violated.
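To make the translation concrete, the following sketch scores a grounded instance of the example rule from the Overview under product t-norm semantics. This is one common choice; the exact translation used by SBR is detailed in Methods, and the helper names here are ours.

```python
def t_and(values):
    # Product t-norm conjunction: fully true (1.0) only if all conjuncts are.
    result = 1.0
    for v in values:
        result *= v
    return result

def t_exists(values):
    # Soft existential quantifier: take the best-supported grounding.
    return max(values, default=0.0)

def t_implies(antecedent, consequent):
    # Residuum of the product t-norm: satisfied (1.0) when the consequent
    # is at least as true as the antecedent, partially violated otherwise.
    return 1.0 if antecedent <= consequent else consequent / antecedent

def p_to_d_satisfaction(bound_p, domain_groundings):
    # boundp(p,p') => EXISTS (d,d') boundd(d,d') ^ parentpd(p,d) ^ parentpd(p',d')
    # `domain_groundings` holds, for each candidate (d,d'), the truth values
    # of the three conjuncts in [0, 1].
    rhs = t_exists([t_and(conjuncts) for conjuncts in domain_groundings])
    return t_implies(bound_p, rhs)
```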
SBR computes a solution to the inference problem, i.e. deciding the truth values of all target predicates, that maximizes both the confidence of the individual predicates and the degree of satisfaction of all constraints. Informally, the optimal assignment to all predicates, i.e. the binding state of protein, domain and residue pairs, $y^*$, is a solution to the following optimization problem:

$$y^* = \arg\max_y \; \mathrm{consist}(y, f) + \mathrm{consist}(y, \mathrm{KB})$$
where the first term accounts for consistency between the inferred truth values and the confidences of the individual predictions, and the second incorporates information on the degree of satisfaction of the constraints built from the FOL knowledge. Contrary to standard kernel methods, this optimization problem is non-convex. This is commonly the case for complex statistical-relational learning tasks [24], and implies that we are restricted to finding local optima. SBR employs a two-stage learning process to make training effective even in the presence of local optima. In particular, the first stage of SBR learning takes into account only the fitting of the individual predictions to the supervised data. This learning task is convex and can be efficiently solved. The solution found in the first stage is used as the starting point for a second stage, where the FOL knowledge is also considered. This optimization strategy has been experimentally shown to find high-quality solutions without adding the computational burden of other non-convex optimization techniques [22].
SBR is a semi-supervised method [28], meaning that the set of target proteins is given beforehand and can be exploited during the learning stage to fine-tune the model. Semi-supervised learning is known to enhance prediction ability when appropriately used [29], and can be applied very naturally to PPI prediction, as the full set of proteins is always known.
To summarize, at each level the state of an uncharacterized pair of objects, e.g. proteins $p$ and $p'$, is mainly inferred from the similarity of the pair $(p, p')$ to other pairs that are known to interact or not, through the pairwise kernel function $K$ and the learned weights $w$. Thus the kernel propagates information horizontally within the same level. At the same time, the FOL constraints propagate information vertically between the levels, by keeping the interaction pattern along the protein–domain–residue hierarchy consistent.
Modeling multi-level interactions
As already explained, we use two distinct kinds of predicates: given predicates and target predicates. Given predicates encode a priori knowledge about the problem, in our case the structure of the multi-level object hierarchy. In particular, given a protein p and a domain d, the parentpd(p,d) predicate is true if and only if domain d occurs in protein p; the parentdr predicate is the analogue for domains and residues. This simple representation suffices to encode the whole protein–domain–residue hierarchy. To simplify the notation, we also introduce the
hasdom(p) predicate to encode the fact that protein p has at least one domain. More formally:

hasdom(p) := ∃d parentpd(p,d)

The hasdom predicate can be computed directly by SBR using the above definition; we instead pre-compute its value for all proteins for run-time efficiency.
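Pre-computing hasdom amounts to a single pass over the known parentpd facts. A minimal sketch (our own illustration, with parentpd given as a set of (protein, domain) pairs):

```python
def precompute_hasdom(proteins, parentpd):
    # hasdom(p) := EXISTS d . parentpd(p, d), evaluated once before learning.
    with_domains = {p for (p, _d) in parentpd}
    return {p: p in with_domains for p in proteins}

# Toy usage:
facts = {("P1", "D1"), ("P1", "D2"), ("P2", "D3")}
print(precompute_hasdom(["P1", "P2", "P3"], facts))  # P3 has no domains
```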
The boundp(p,p’) target predicate models the bind-
ing state of two distinct proteins. Its state is known for
certain protein pairs, i.e. those in the training set, and
our goal is to predict its state on the remaining ones.
The boundd(d,d’) predicate plays the same role for
domains. For a complete list of predicates, see Table 1. For
a visualization of the predicates instantiated over a protein
pair, see Figure 2-b.
In what follows we describe how to design inter-level FOL constraints that properly enforce consistency between predictions at different levels. We focus on modeling the constraints tying proteins and domains; it is easy to see that the ones between domains and residues can be modelled similarly (with one peculiar exception that will be pointed out later). Table 2 reports the complete list of rules.

Inter-level constraints can be seen as propagating information from the upper layer to the lower one and in the opposite direction. To model this mechanism, we use two distinct constraints: the P→D rule and the D→P rule. A simplified version of the P→D rule is:

∀(p,p') boundp(p,p') ⇒ ∃(d,d') boundd(d,d') ∧ parentpd(p,d) ∧ parentpd(p',d')
Intuitively, the rule means that whenever two proteins are bound (and therefore the left-hand side (LHS) of the implication is true) then there must be at least one pair of child domains that are bound (the right-hand side (RHS) is true). In classical First Order Logic the rule would also require that, whenever none of the child domains is bound (the RHS is false), then the parent proteins must not be bound (the LHS is false).
Table 1 Predicates

Target predicates
  boundp(p,p')    true iff the protein pair (p,p') is bound
  boundd(d,d')    true iff the domain pair (d,d') is bound
  boundr(r,r')    true iff the residue pair (r,r') is bound

Given predicates
  parentpd(p,d)   true iff protein p is parent of domain d
  parentdr(d,r)   true iff domain d is parent of residue r
  parentpr(p,r)   true iff protein p is parent of residue r
  hasdom(p)       true iff protein p has at least one domain
  hasres(d)       true iff domain d has at least one residue

List of predicates used by SBR.
Note that, in the above formulation, the rule is applied indiscriminately to all protein pairs, even to those that have no known child domains in the considered dataset. Therefore, the rule can be reformulated so as to enforce it only for those protein pairs that do in fact have child domains, using the hasdom predicate, as follows:

∀(p,p') hasdom(p) ∧ hasdom(p') ⇒ (boundp(p,p') ⇒ ∃(d,d') boundd(d,d') ∧ parentpd(p,d) ∧ parentpd(p',d'))

This is the complete P→D rule. Its left-hand side is always false for proteins without domains, making the rule always satisfied in this case (effectively disabling the effect of the rule on the learning process). We define the complementary D→P rule as follows:
∀(p,p') (∃(d,d') boundd(d,d') ∧ parentpd(p,d) ∧ parentpd(p',d')) ⇒ boundp(p,p')
This rule is applied to all protein pairs, demanding that if there is a pair of bound child domains then the proteins must be bound too, and, vice versa, that if the parent proteins are unbound then so are the domains. The P→D and D→P rules could be merged into a single equivalent rule using the double implication (⇔). However, the rules have been kept separate so that their effects on the results remain distinguishable and easier to analyze.
To simulate the unidirectional information propagation between levels, as done by Yip et al. [21] (see Related work), we modified how SBR converts logic implications by using the t-norm residuum, which states that a logic implication is true if the RHS is at least as true as the LHS. This modification also removes a bias in the translation of the implication that was affecting the original formulation of SBR, whose effect was often to move the LHS toward the false value. See Methods for details.
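For concreteness, under the product t-norm the residuum of an implication with antecedent truth value $a$ and consequent truth value $b$ takes the standard closed form (a textbook fuzzy-logic identity; the paper itself defers the formal definition to Methods):

$$(a \Rightarrow b) = \begin{cases} 1 & \text{if } a \le b, \\ b/a & \text{otherwise,} \end{cases}$$

so the implication is fully satisfied exactly when the RHS is at least as true as the LHS, and its degree of violation grows as the consequent falls below the antecedent.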
The constraints for domains and residues can be similarly defined, with one important exception. The P→D rule described above (correctly) requires at least one domain couple to be bound for each interacting protein pair. However, when two domains are bound, the interaction interface involves more than one residue pair: for instance, binding sites collected in the protein–protein docking benchmark version 3.0 [30] consist of 25 residues on average [31]. We integrate this observation in the D→R rule using the n-existential operator ∃_n in place of the regular existential (see Table 2 for the complete formulation), so that whenever two domains are bound, at least n pairs of their residues must be bound. Since interfaces in the employed dataset are typically 5 residues long, n = 5 has been used in the experiments. Our results demonstrate that this seemingly small modification has a rather extensive impact on the prediction of domain- and residue-level interactions; a sketch of one way to relax this quantifier is given after Table 2.
Table 2 Rules

P→D: ∀(p,p') hasdom(p) ∧ hasdom(p') ⇒ (boundp(p,p') ⇒ ∃(d,d') boundd(d,d') ∧ parentpd(p,d) ∧ parentpd(p',d'))
D→P: ∀(p,p') (∃(d,d') boundd(d,d') ∧ parentpd(p,d) ∧ parentpd(p',d')) ⇒ boundp(p,p')
D→R: ∀(d,d') hasres(d) ∧ hasres(d') ⇒ (boundd(d,d') ⇒ ∃_n(r,r') boundr(r,r') ∧ parentdr(d,r) ∧ parentdr(d',r'))
R→D: ∀(d,d') (∃(r,r') boundr(r,r') ∧ parentdr(d,r) ∧ parentdr(d',r')) ⇒ boundd(d,d')
P→R: Same as D→R, with proteins in place of domains
R→P: Same as R→D, with proteins in place of domains

List of FOL constraints used by SBR.
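One plausible relaxation of the n-existential quantifier scores the n best-supported groundings instead of just the single best one. The sketch below is our illustration of the idea, not SBR's exact formulation (which is given in Methods); with n = 1 it reduces to the ordinary soft existential.

```python
def n_exists(grounding_scores, n=5):
    # Soft n-existential: require at least n groundings (e.g. residue pairs)
    # to be true. Here we take the product of the n highest scores, so all of
    # the n best candidates must look bound for the quantifier to be satisfied.
    if len(grounding_scores) < n:
        return 0.0
    top_n = sorted(grounding_scores, reverse=True)[:n]
    score = 1.0
    for s in top_n:
        score *= s
    return score
```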
Related work
In this section we briefly summarize previous PPI prediction approaches using methods that are most closely related to the present paper: kernel methods, semi-supervised methods, and logic-based methods. For a broader exposition of interaction prediction methods, please refer to one of the several surveys on the subject [4,6,9,32].
The earliest attempt to employ kernel methods [26] for PPI prediction is the work of Bock et al. [33], which casts interaction prediction as pairwise classification, using amino-acid composition and physico-chemical properties alone. Ben-Hur et al. [27] extended this work by applying pairwise kernels and combining multiple data sources (primary sequence, Pfam domains, Gene Ontology annotations and interactions between orthologues). Subsequent publications focused primarily on aggregating more diverse sources, including phylogenetic profiles, genetic interactions, and subcellular localization and function [6]. Kernel machines have also been applied to the prediction of binding sites from sequence, as summarized in [10]. The appeal of supervised kernel methods is that they provide a proven and theoretically grounded set of techniques that can easily integrate various information sources, and can naturally handle noise in the data. However, they have two inherent limitations: (i) the binding state of two proteins is inferred independently from the state of all other proteins, and (ii) due to their supervised nature, they do not take advantage of unsupervised data, which is very abundant in the biological network setting.
Semi-supervised learning (SSL) techniques [28,29] attempt to solve these issues. In the SSL setting the set of target proteins is known in advance, meaning that the learning algorithm has access to their distribution in feature space. This way the inference task can be simplified by introducing unsupervised constraints that assign the same label to proteins that are, e.g., close enough in feature space, or linked in the interaction network, instantiating a form of information propagation. There are several works in the PPI literature that embed the known network topology using SSL constraints. Qi et al. [34] apply SSL methods to the special case of viral-host protein interactions, where supervised examples are extremely scarce. Using similar methods, You et al. [35] attempt to detect spurious interactions in a known network by projecting it onto a low-dimensional manifold. Other studies [36,37] applied SSL techniques to the closely related problems of gene–protein and drug–protein interaction prediction. Despite the ability of SSL to integrate topology information, no study so far has applied it to highly relational problems such as the MLPIP.
An alternative strategy for interaction prediction is Inductive Logic Programming (ILP) [38], a group of logic-based formalisms that extract rules explaining the likely underlying causes of interactions. ILP methods were studied in the work of Tran et al. [39] using a large number of features: SWISS-PROT keywords and enzyme properties, Gene Ontology functional annotations, gene expression, cell cycle and subcellular localization. Further advances in this direction, with a special focus on using domain information, can be found in [17,18]. The advantage of ILP methods over purely statistical methods is that they are inherently able to deal with relational information, making them ideal candidates for solving the MLPIP problem. Alas, contrary to kernel methods, they tend to
problem. Alas, contrary to kernel methods, they tend to
be very susceptible to noise, which is a very prominent
feature of interaction dataset, and are less effective in
Saccà et al. BMC Bioinformatics
2014, 15:103
Page 8 of 18
http://www.biomedcentral.com/1471-2105/15/103
exploiting complex feature representations, e.g. involv-
ing highly non-linear i nteractions between continuous
features.
Recently, some works have highlighted the importance of the multi-level nature of protein–protein interactions. Gonzalez et al. [40] propose a method to infer the residue contact matrix from a known set of protein interactions using SVMs; in contrast, our goal is to predict the interactions concurrently at all levels of the hierarchy. Another study [13] highlights the relevance of domain-level interactions, and the unfortunate lack of details thereof, and formulates a method to reinterpret a known PPI network in terms of its constituent domain interactions; the present work has a different focus and a more general scope.
Most relevant to this paper is the work of Yip et al. [21], where the authors propose a procedure to solve the MLPIP problem based on a mixture of different techniques. The idea is to decompose the problem into a sequence of three prediction tasks, which are solved iteratively. Given an arbitrary order of the three levels (e.g. proteins first, then domains, then residues), their procedure involves computing putative interactions at the first level (in this case proteins), then using the most confident predictions as novel training examples at the following level (i.e., domains). The procedure is repeated until a termination criterion is met.

Intra-level predictions are obtained with Support Vector Regression (SVR) [41]. In particular, each object has an associated SVR machine that models its propensity to bind any other object in the same level. The extrapolated values act as confidences for the predictions themselves. The mechanism for translating the most confident predictions at one level into training examples for the next level depends on the relative position of the two levels in the hierarchy. Downward propagation (e.g. from proteins to domains) simply associates to each novel example the same confidence as the parent prediction: in other words, if two proteins are predicted as bound with high confidence, all their domains will be considered bound with the same confidence. Upward propagation (e.g. from domains to proteins) is a bit more involved: the confidence assigned to the novel example (protein) is a noisy-OR combination of the confidences of all the involved child objects (domains).
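The noisy-OR combination is standard: the parent pair is considered unbound only if every child pair independently fails to bind. A small sketch of the rule as we understand it from the description above (variable names are ours):

```python
def noisy_or(child_confidences):
    # Confidence that the parent pair (e.g. a protein pair) is bound, given
    # the confidences that each child pair (e.g. domain pairs) is bound.
    p_all_fail = 1.0
    for c in child_confidences:
        p_all_fail *= (1.0 - c)
    return 1.0 - p_all_fail

print(noisy_or([0.9, 0.2]))  # 0.92: one confident child dominates
```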
While this method has been shown to work reasonably well, it is afflicted by several flaws. First of all, while the iterative procedure is grounded in co-training [42], the specific choice of components is not as theoretically sound. For instance, the authors apply regression techniques to a classification task, which may lead to sub-optimal results. The inter-level example propagation mechanisms are ad hoc, do not exploit all the information at each level (only the most confident predictions are propagated), and are designed merely to propagate information between levels, not to enforce consistency on the predictions. In particular, the downward propagation rule is rather arbitrary: it is not clear why all domains of bound proteins should themselves be bound with the same confidence. Finally, these rules, which are intimately tied to the specific implementation, are not defined using a formal language, and are therefore difficult to extend. For instance, it would be difficult to implement in said framework something similar to an n-existential propagation rule, which is extremely useful for dealing with residue interactions.
Semantic Based Regularization has many obvious advantages in this context. A first advantage is that it decouples the implementation of the predictors from how consistency among levels is defined. Indeed, consistency is implemented via a set of constraints, which are applied to the output of the predictors, with no limitation on which kind of predictors are used. For example, we used kernel machines as the basic machinery for implementing the predictors, where different state-of-the-art kernels can be used at the single levels, while still defining a single optimization problem.

Furthermore, SBR natively propagates the predictions of one level to the other levels. Since the predictions, and not the supervisions, are propagated, SBR can take advantage of the abundant unsupervised data. The availability of an efficient implementation of the n-existential quantifier is also a crucial advantage: if two proteins or domains are interacting, a small set of residues must be interacting as well. SBR does not simply propagate a generic prior to all the residues of a protein or domain, which could decrease the accuracy of the predictions for the negative supervisions. SBR instead performs a search process in order to select a subset of residue candidates on which to enforce the interaction. As shown in the experimental results, this greatly improves residue prediction accuracy. Finally, the circular dependencies that make learning difficult are dealt with in the context of a general and well-defined framework, which implements various heuristics to make training effective.
Results and discussion
Dataset
In this work we use the dataset of Yip et al. [21], described here for completeness. The dataset represents proteins, domains and residues using features gathered from a variety of different sources:

• Protein features include phylogenetic profiles derived from COG, subcellular localization, and cell cycle and environmental response gene expression; protein-pair features were extracted from Y2H and TAP-MS data. The gold standard of positive interactions was constructed by aggregating experimentally verified or structurally determined interactions taken from MIPS, DIP, and iPfam.
• At the domain level, the dataset includes features both for domain families and for domain instances, based on frequencies of domains within one or more species and on phylogenetic correlations of Pfam alignments. The gold standard of positive interactions was built from 3D structures of complexed proteins taken from PDB.
• Residue features consist of sequence-based properties, namely charge complementarity, Psi-Blast [43] profiles, predicted secondary structure, and predicted solvent accessibility.
Kernels computed from the individual features were combined additively into a single kernel function for each level, and then transformed into pairwise kernels using Equation (1); the resulting functions were used as inputs to SBR. A visualization of the process can be found in Figure 2.
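Schematically, the per-level preparation reduces to a sum of kernel matrices followed by the pairwise lift of Equation (1). The sketch below restates that pipeline; it is our illustration, assuming precomputed per-feature kernel matrices, not the paper's code.

```python
import numpy as np

def build_level_pairwise_kernel(feature_kernels, pairs):
    # Additively combine per-feature kernel matrices into one object-level
    # kernel, then lift it to the pairs of interest via Equation (1).
    K = np.sum(feature_kernels, axis=0)
    n = len(pairs)
    K_pair = np.empty((n, n))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            K_pair[a, b] = K[i, k] * K[j, l] + K[i, l] * K[j, k]
    return K_pair
```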
This procedure yields a dataset of 1681 proteins, 2389 domains, and 3035 residues, with a gold standard of 3201 positive (interacting) protein pairs, 422 domain pairs, and 2000 residue pairs. Since interaction experiments cannot determine which pairs do not interact, the gold standard of negative pairs is built by randomly sampling, at each level, a number of pairs that are not known to interact (i.e. not positive). This is a common approach to negative labeling in the PPI prediction literature [44]. To keep the dataset balanced, the number of sampled negative pairs is identical to the number of objects in the gold standard of positives. For more details on the dataset preparation, please refer to [21]. We further refined the dataset by running CD-HIT [45] with a 20% sequence similarity threshold, identifying 23 redundant proteins. These proteins were not used when comparing the methods' performance.
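A minimal sketch of the balanced negative-sampling scheme described above (our own illustration; the actual sampled pairs are those of Yip et al.):

```python
import random

def sample_negatives(objects, positive_pairs, seed=0):
    # Draw as many putative negatives as there are positives, from pairs
    # that are not known to interact. Pairs are unordered, hence frozenset.
    rng = random.Random(seed)
    positives = {frozenset(p) for p in positive_pairs}
    negatives = set()
    while len(negatives) < len(positives):
        pair = frozenset(rng.sample(objects, 2))
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]
```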
Turning our attention to the resulting dataset, we note that most of the supervision is located at the protein level: out of all possible interactions between pairs of proteins, which number (1681 × 1680)/2, 0.226% are known (either positive or negative). On the contrary, the other levels hold much less information: only 0.042% of all possible residue pairs, and 0.014% of all possible domain pairs, are in the dataset. The low number of residue pairs is due to (i) different requirements for experimentally determining the interactions at the three levels, i.e. whether the structure is available; and (ii) sampling choices made by Yip et al. [21].
Evaluation procedure
In this work we compare our method to that of Yip et al. [21], where the authors evaluated their method using a 10-fold cross-validation procedure. To keep the comparison completely fair, we repeated said procedure with SBR, reusing the very same train/test splits. Since correlated objects, e.g. a protein and its domains/residues, share information, the folds were structured so as to avoid such information leaking between train and test folds: this was achieved by keeping correlated objects in the same fold. In order not to bias the performance estimates, all redundant proteins were ignored, along with their domains and residues, when computing the results of both SBR and the method of Yip et al. The full experimental setup and instructions to replicate the experiments can be downloaded at http://sites.google.com/site/semanticbasedregularization/home/software/protein_interaction.
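Such leakage-free splitting can be reproduced with grouped cross-validation, assigning every example pair to the group of a parent protein so that correlated objects never straddle the train/test boundary. This is a simplified sketch (grouping each pair by one designated parent is our simplification; the experiments actually reused the original splits of Yip et al.):

```python
from sklearn.model_selection import GroupKFold

def leakage_free_folds(pairs, labels, parent_protein, n_splits=10):
    # `parent_protein[pair]` maps each example pair to a representative
    # parent protein; GroupKFold keeps all pairs of a group in one fold.
    groups = [parent_protein[p] for p in pairs]
    return GroupKFold(n_splits=n_splits).split(pairs, labels, groups)
```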
SBR has two scalar hyper-parameters that control the contribution of various parts of the objective function: $\lambda_c$ is the weight associated to the constraints (how much the current solution is consistent with respect to the rules) and $\lambda_r$ controls the model complexity (see the Methods section for more details). The $\lambda_r$ parameter was optimized on the first fold by training the model without the logic rules, and was then kept fixed for all the folds of the k-fold cross-validation. The resulting value is $\lambda_r = 0.1$. The $\lambda_c$ parameter was not optimized and was kept fixed at $\lambda_c = 1$. Note that further significant gains for the proposed method could be achieved by fine-tuning this meta-parameter. However, since the dataset from Yip et al. does not include a validation split, there was no sound way to optimize this parameter without looking at the test set or redefining the splits (making it difficult to compare against the results of Yip et al.). Therefore, we decided not to perform any tuning of this meta-parameter.
We computed three performance metrics: the Receiver Operating Characteristic (ROC) curve, the area under the ROC (AUCROC, or AUC for short), and the $F_1$ score. The ROC curve represents the relation between the false positive rate (FPR) and the true positive rate (TPR), and can be seen as the proportion of true positives gained by "paying" a given proportion of false positives. By definition, the ROC curve is monotonically non-decreasing; the steeper the curve, the better the predictions. The AUC measures the ability to correctly discriminate between positives and negatives, or alternatively, the ability to rank positives above negatives. It is independent of any classification threshold, and thus particularly fit to evaluate models over the whole spectrum of possible decision thresholds. The $F_1$ score is the harmonic mean of precision and recall. Contrary to the AUC, the $F_1$ takes into account the predicted label, but not its confidence (margin).
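Both metrics are readily computed from the per-fold margins; a sketch using scikit-learn (an assumption on our part, any equivalent implementation would do):

```python
from sklearn.metrics import roc_auc_score, f1_score

def evaluate_fold(y_true, margins, threshold=0.0):
    # AUC is threshold independent and uses the raw margins; F1 needs the
    # thresholded labels (positive iff the margin exceeds the threshold).
    auc = roc_auc_score(y_true, margins)
    f1 = f1_score(y_true, [int(m > threshold) for m in margins])
    return auc, f1
```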
We computed the average AUC and $F_1$ of our method and those of our competitor over all folds of the cross-validation; the results can be found in Tables 3 and 4. The ROC curves have been computed by collating the results of all test folds, and can be found in Figure 3.
Table 3 Results (AUC)

Level    | Indep. | P→D   | D→R   | P→R   | P↔D   | D↔R   | P↔R   | Full
Results for Yip et al. [21]
Proteins | 0.723  | –     | –     | –     | 0.722 | –     | 0.725 | 0.724
Domains  | 0.531  | 0.619 | –     | –     | 0.688 | 0.695 | –     | 0.673
Residues | 0.563  | –     | 0.542 | 0.549 | –     | 0.576 | 0.659 | 0.722
Results for SBR
Proteins | 0.808  | –     | –     | –     | 0.820 | –     | 0.819 | 0.820
Domains  | 0.605  | 0.814 | –     | –     | 0.837 | 0.896 | –     | 0.937
Residues | 0.591  | –     | 0.664 | 0.671 | –     | 0.675 | 0.673 | 0.676
Results for SBR-∃_n
Proteins | 0.808  | –     | –     | –     | 0.820 | –     | 0.819 | 0.821
Domains  | 0.605  | 0.814 | –     | –     | 0.837 | 0.895 | –     | 0.956
Residues | 0.591  | –     | 0.745 | 0.760 | –     | 0.778 | 0.772 | 0.797

Area under the ROC curve attained by Yip et al. [21], SBR, and SBR-∃_n (SBR equipped with the n-existential quantifier). P→D, D→R and P→R are the unidirectional settings; P↔D, D↔R and P↔R the bidirectional ones. A dash marks settings in which that level is predicted independently.
Since the ROC curves and $F_1$ scores are not present in [21], and the dataset is slightly smaller because of the redundancy elimination step we introduced, we had to compute their results on a local re-run of their experiment. As a result, the AUC values presented in Table 3 are slightly different from those reported in [21]. However, we note that the conclusions of our analysis would still hold if we had chosen to use the AUC values reported in [21].
Results
To evaluate the effects of the constraints on the performance of SBR, we performed three independent experiments using rules of increasing complexity. This setup follows closely that of Yip et al. [21].
Independent levels
As a baseline, we estimate the performance of our method when constraints are ignored. This is equivalent to the method of Yip et al. when no information flow between levels is allowed. The results can be found in the "Independent" column of Tables 3 and 4.

In the absence of constraints SBR reduces to standard $\ell_2$-regularized SVM classification: learning and inference become convex problems, and the method computes the globally optimal solution. Thus, the only differences between our method and the competitor are: (i) using classification versus regression, and (ii) using pairwise classification, instead of training a single model for each entity (protein, domain, residue). These differences alone produce a substantial increase in performance: the $F_1$ changes by about +0.05 in all three cases. The AUC of proteins and domains is improved by about +0.09 and +0.07, respectively, while residues are less affected, with a +0.03 difference.
Table 4 Results ($F_1$)

Level    | Indep. | P→D   | D→R   | P→R   | P↔D   | D↔R   | P↔R   | Full
Results for Yip et al. [21]
Proteins | 0.665  | –     | –     | –     | 0.665 | –     | 0.666 | 0.666
Domains  | 0.518  | 0.620 | –     | –     | 0.662 | 0.659 | –     | 0.661
Residues | 0.522  | –     | 0.510 | 0.514 | –     | 0.602 | 0.609 | 0.613
Results for SBR
Proteins | 0.718  | –     | –     | –     | 0.722 | –     | 0.722 | 0.723
Domains  | 0.568  | 0.693 | –     | –     | 0.696 | 0.731 | –     | 0.750
Residues | 0.579  | –     | 0.605 | 0.605 | –     | 0.605 | 0.605 | 0.602
Results for SBR-∃_n
Proteins | 0.717  | –     | –     | –     | 0.722 | –     | 0.722 | 0.722
Domains  | 0.568  | 0.693 | –     | –     | 0.696 | 0.729 | –     | 0.757
Residues | 0.579  | –     | 0.635 | 0.639 | –     | 0.641 | 0.644 | 0.650

$F_1$ values attained by Yip et al. [21], SBR, and SBR-∃_n (SBR equipped with the n-existential quantifier). Columns are as in Table 3; a dash marks settings in which that level is predicted independently.
Figure 3 ROC curves. ROC curves obtained with the 10-fold cross-validation procedure, for all experimental settings and all levels of the hierarchy. (Left) Results for Yip et al. (Right) Results for SBR-∃_n. (Top) ROC curves for protein-level predictions with different sets of constraints, from fully independent to fully connected levels. (Middle) Domain-level predictions. (Bottom) Residue-level predictions. Each plot includes multiple ROC curves, one for each experimental setting; see the legends for more details.
Unidirectional constraints
In the second experiment, we evaluate the effect of introducing unidirectional constraints between pairs of levels. In the P→D case only the P→D rule is active, meaning that bound protein pairs enforce positive domain pairs and negative domain pairs enforce negative protein pairs. The D→R and P→R cases are defined similarly. In all three cases, the level not appearing in the rule (e.g. the residue level in the P→D case) is predicted independently. This setup makes it easy to study the effects of propagating information from one level to the other without interference. The results can be found in the "Unidirectional" column of Tables 3 and 4. In the same column we also show the results of Yip et al. for the unidirectional flow setting, where examples are propagated from one level to the next but not vice versa. However, since
the competitor’s algorithm is iterative, information about
lower levels can indeed affect the upper levels in succes-
sive iterations.
The results show that introducing unidirectional constraints in SBR improves the predictions in all cases. In particular, using (predicted and known) protein interactions helps in inferring correct domain interactions, which improve by about +0.13 $F_1$ and +0.2 AUC (P→D case). Residues improve independently of whether protein- or domain-level information is used, with a +0.03 $F_1$ in both cases, and a +0.08/+0.07 AUC difference, respectively. Interestingly, proteins tend to help residue predictions slightly more than domains, despite the indirection between the two levels; this is likely an effect of the larger percentage of supervised pairs available.
Compared to SBR, the method of Yip et al. does not benefit as much from unidirectional information flow. Protein-level information improves domain predictions only (+0.1 $F_1$, +0.06 AUC for P→D), while residue predictions are worse than in the independent case (−0.05 and −0.04 AUC, and −0.01 $F_1$, in the D→R and P→R cases, respectively).
Bidirectional constraints
In the third experiment we study the impact of using bidirectional constraints between pairs of levels; the level not appearing in the rules is predicted independently, as above. In the P↔D case, both the P→D and D→P rules are active, meaning that the protein and domain levels are enforced to be fully consistent; the P↔R and D↔R cases are defined analogously. This experiment is comparable to the bidirectional flow setting of Yip et al. The results can be found in the "Bidirectional" column of Tables 3 and 4.

We observe that the new constraints have a positive effect on predictions at all three levels: proteins change from 0.808 AUC to 0.820, domains from 0.814 to 0.896, and residues from 0.671 to 0.673. In terms of $F_1$, the changes are from 0.718 to (up to) 0.722 for proteins, from 0.693 to 0.731 for domains, and no change for residues. The change is not as marked as between the independent and unidirectional experiments. Domains see the largest increase in performance (+0.08 AUC, +0.04 $F_1$), thanks in particular to the contribution of residue-level information, which is more abundant. Proteins and residues are less affected. The result is unsurprising for proteins, which hold most of the supervision and are thus (i) more likely to be predicted correctly in the independent setting, and (ii) less likely to be assisted by hints coming from the other, less supervised levels.
As for the method of Yip et al., the bidirectional flow mostly affects the domain and residue levels, whose improvements are +0.07 AUC/+0.04 $F_1$ and +0.11 AUC/+0.09 $F_1$, respectively; the change for protein interactions is negligible. Regardless of the relative performance increase, SBR largely outperforms the competitor in all configurations except one ($F_1$ of the P↔R case for residues).

We note that the fact that all three cases (P↔D, P↔R and D↔R) improve over both the independent and the unidirectional experiments shows not only that the bidirectional constraints are in fact sound, but also that, despite the increased computational complexity, SBR is still able to exploit them appropriately.
All constraints
In the final experiment we activate the P→D, D→P, D→R and R→D rules, as defined in Table 2, making all levels interact. This is the most complex setting, and produces fully consistent predictions throughout the hierarchy. It is comparable to the "PDR" bidirectional setting of Yip et al. The scores can be found in the "Full" column of Tables 3 and 4.

In this experiment the P→R and R→P constraints are not used. Direct information flow between proteins and residues is not needed, because it would be redundant: from a formal logic point of view, this corresponds to the observation that the logic rule expressing protein-to-residue consistency is implied by the other consistency rules. Indeed, we have experimentally verified that adding this propagation flow does not significantly affect the results.
In this experiment, protein predictions are stable with respect to the previous experiments, confirming the intuition that the abundance of supervision at this level makes it less likely to benefit from predictions at the other ones. On the contrary, domains see a large performance upgrade, from 0.896 to 0.937 AUC and from 0.731 to 0.750 $F_1$, when made to interact with both proteins and residues. The change for residues is instead only marginal.
The results for Yip et al. are mixed, with proteins faring almost identically to the previous experiment, domains showing a slight drop in AUC but an equally slight increase in $F_1$, and residues improving in AUC (+0.08) but not in $F_1$ (unchanged) over the bidirectional P↔R case. The improvement in residue prediction (in terms of AUC) stands in contrast with the results of SBR, and is the only case in which the method of Yip and colleagues works better than SBR. The issue lies within our formulation of the D→R rule: whenever two domains are bound, the rule is satisfied when at least one residue pair is bound. As already mentioned above, this is not realistic: protein interfaces span more than two residues, typically five or more. We therefore extended SBR to support the n-existential quantifier, which makes it possible to reformulate the D→R rule to take this observation into account (see the Methods section for more details on the n-existential quantifier). The new D→R rule, shown in Table 2, requires, for each pair of bound domains, at least n = 5 residue pairs
to be bound. We chose the constant n = 5 to be both realistic and, since the computational cost increases with n, small enough to be easily computable. We applied the same modification to the P→R rule.
The complete results for the resulting method, termed SBR-∃_n, can be found at the bottom of Tables 3 and 4. When comparing to standard SBR, i.e., without the n-existential, we see that the performance on residues consistently improves in all cases (unidirectional, bidirectional, and with all constraints activated), allowing SBR-∃_n to always outperform the method of Yip et al. by a significant margin on the residue interactions as well. As a side effect of the better residue predictions, thanks to the D→R and R→D rules domains also improve in the all-constraints experiment. In particular, in the "Full" experiment the improvement of SBR-∃_n over Yip et al. is +0.1/+0.26/+0.07 AUC and +0.06/+0.1/+0.04 $F_1$ for proteins, domains and residues respectively. We show in Figure 4 an example prediction obtained by SBR-∃_n for the VPS25 and VPS36 ESCRT-II complex subunits. The figure shows that, while the unconstrained (baseline) predictions are inconsistent, the addition of the constraints effectively makes the protein- and domain-level predictions correct and consistent, and enables SBR to improve the residue-level predictions.
Summing up, these results highlight the ability of SBR
to enforce constraints even with highly complex combi-
nations of rules, allowing the modeler to fully exploit the
flexibility and performance improvement offered by non-
standard FOL extensions like the n-existential operator.
Discussion
The results presented in the previous section offer a clear
perspective on the advantages of the proposed method.
By employing appropriate classification techniques and
training a single global pairwise model per level, rather
than relying on the less than optimal local (per-object)
regression models of Yip et al., a considerable improve-
ment was achieved even in the unconstrained experiment.
Furthermore, when enforcing consistency among the pro-
tein, domain and residue levels and using the n-existential
quantifier, the experimental results are significantly better
than both the unconstrained baseline and the correspond-
ing results of Yip and colleagues, at all levels and in all
experimental settings.
It is worth noting that, in the reported experiments, SBR performance improves monotonically as constraint complexity increases. This result is far from obvious, and confirms both that the biologically-motivated knowledge base is useful, and that SBR is able to effectively apply it. In contrast, the competitor's method does not always improve in a similar manner.

In general, the performance gain brought forth by inter-level propagation is not homogeneously distributed between the three levels. We register a large improvement for domains and residues, especially when SBR is used in conjunction with the n-existential quantifier. Proteins are less affected by consistency enforcement, most likely due to the availability of more supervised examples.
We note that the FOL rules have a twofold effect. Firstly, they propagate information between the levels, enabling predicted interactions at one level to help in inferring correct interactions at the other two levels. This is especially clear in the "Full" experiment with the n-existential quantifier: in this case, better residue-level predictions increase the overall quality of domain predictions as well. Secondly, the rules also guarantee that the predictions are consistent along the object hierarchy.
Summarizing, SBR is able to largely outperform the method of Yip and colleagues, and moreover enforces the predictions to be consistent among levels. As previously mentioned, the data taken from Yip et al. has some peculiarities worth discussing. First, it contains a low number of residue–residue interactions, partially due to design choices taken by [21]. Second, it is artificially balanced by including an appropriate number of non-interactions, while in a real-world case all possible pairs would qualify as candidates. We decided to keep the dataset as-is in order to facilitate a fair comparison with Yip et al. We postpone further analysis with other datasets to future work.
YJR102C
PF05871
00022
00029
00083
YLR417W
PF04132
00543
00546
00562
00548
00555
YJR102C
PF05871
00022
00029
00083
YLR417W
PF04132
00543
00546
00562
00548
00555
INDEPENDENT
FULL
Figure 4 Example prediction. Prediction for the interaction between two ESCRT-II complex subunits: VPS25 (YJR102C) and VPS36 (YLR417W). The
two proteins, their domains, and all the residue pairs in the dataset, are known to interact. Solid black lines indicate a predicted interaction, dotted
lines a non-interaction; residue pairs not connected by either a solid or dotted line are not present in the dataset. (Left) SBR-∃
n
predictions with no
constraints: the predictions at the three levels are inconsistent. (Right) SBR-∃
n
predictions with the full set of constraints: the protein- and
domain-level predictions are now both consistent and correct. A further r esidue pair is now correctly predicted as interacting.
Saccà et al. BMC Bioinformatics
2014, 15:103
Page 14 of 18
http://www.biomedcentral.com/1471-2105/15/103
by [21]. Second, it is artificially balanced by including an
appropriate number of non-interactions, while in a real-
world case all possible pairs would qualify as candidates.
We decided to keep the dataset as-is in order to facili-
tate a f air comparison with Yip et al..Wepostponefurther
analysis with other datasets to future work.
Conclusions
In this work we tackle the multi-level protein interaction prediction (MLPIP) problem, first introduced by Yip et al. [21], which requires establishing the binding state of all uncharacterized pairs of proteins, domains and residues. Compared to standard protein–protein interaction prediction, the MLPIP problem offers many advantages and opens up new challenges. The primary contribution of this paper is the extension and application to the MLPIP task of a state-of-the-art statistical relational learning technique called Semantic Based Regularization.
SBR is a flexible framework to inject domain knowledge into kernel machines. In this paper SBR has been used to tie together the protein, domain and residue interaction prediction tasks. In particular, the domain knowledge expresses that two proteins interact if and only if there is an interaction between at least one pair of domains of the two proteins, and, similarly, that two domains can interact if and only if at least some of their residues interact (see the formula sketch after this paragraph). While these tasks could be learned separately, tying them together has multiple advantages. First, the predictions become consistent and more accurate, as the predictions at one level help the predictions at the other levels. Second, the domain knowledge can also be enforced on the unsupervised data (proteins, domains and residues for which interactions are unknown). Unsupervised data is typically abundant in protein interaction prediction tasks but often neglected. This methodology makes it possible to leverage it effectively, significantly improving the prediction accuracy. Note also that, while resolved complexes are required during the training stage, no structural information is required to perform inference on novel proteins.
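In FOL notation, the protein-level part of this knowledge can be written along the following lines (a paraphrase with illustrative predicate names; BoundP, HasDom and BoundD are not necessarily the exact predicate names used in the paper):

$$\forall p_1 \forall p_2 \;\; \mathrm{BoundP}(p_1, p_2) \Leftrightarrow \exists d_1 \exists d_2 \;\; \mathrm{HasDom}(p_1, d_1) \land \mathrm{HasDom}(p_2, d_2) \land \mathrm{BoundD}(d_1, d_2)$$

An analogous rule ties domain-level interactions to residue-level ones.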
While other work in the literature has exploited the possibility of tying the predictions at multiple levels, the presented methodology employs a more principled inference process among the levels, where the domain knowledge can be exactly represented and precisely enforced. The experimental results confirm these theoretical advantages by showing significant improvements in domain and residue interaction prediction accuracy, both with respect to approaches performing independent predictions and to the only previous approach attempting to link the prediction tasks.
Given the flexibility offered by SBR, the proposed method can be extended in several ways. The simplest extension involves engineering a more refined rule set, for instance by introducing (soft) constraints between the binding states of consecutive residues, which are likely to share the same state. More ambitious goals, requiring a redesign of the experimental dataset, include encoding selected information sources, such as domain types, subcellular co-localization and Gene Ontology annotations, as First Order Logic constraints rather than as kernels, to better leverage their relational nature.
Methods
Kernel machines
Machine learning and statistical methods are very well understood in the linear case, where statistical learning theory can provide solutions that are optimal in terms of generalization performance. Unfortunately, non-linearity is often required to solve most applications, where exploiting complex dependencies is essential to predict some higher-level property of the objects. Kernel methods try to combine the classification power of non-linear methods with the optimality and computational efficiency of linear methods, by mapping the input patterns into a high-dimensional feature space where parameter optimization remains linear.
Kernel methods have a wide range of applications in many fields, and can be used for many different tasks, such as regression, clustering and classification (the latter being the main interest of this paper). In particular, the Representer Theorem [46] shows that a large class of problems admits solutions in terms of kernel expansions of the following form:
$$f(x) = \sum_{i=1}^{N} w_i K(x, x_i) \qquad (2)$$
where $x$ is the representation of the pattern and $K(x, x_i) = \langle \Phi(x), \Phi(x_i) \rangle$ is a kernel function, $\Phi(\cdot)$ being some mapping from the input space to the feature space. Intuitively, the kernel function measures the similarity between pairs of instances, and the prediction $f(x)$ for a novel instance is computed as a weighted similarity to the training instances $x_i$. There is a large body of literature on kernel machines; see e.g. [26] for an introduction.
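As a concrete illustration, the following Python sketch evaluates the kernel expansion of Equation 2 for a novel instance. All names and data are invented for illustration, and a Gaussian kernel is assumed purely as an example:

```python
import numpy as np

def gaussian_kernel(x, x_i, gamma=0.5):
    """An example kernel K(x, x_i); any valid kernel function could be used."""
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

def predict(x, train_X, w, kernel=gaussian_kernel):
    """Kernel expansion of Equation 2: f(x) = sum_i w_i * K(x, x_i)."""
    return sum(w_i * kernel(x, x_i) for w_i, x_i in zip(w, train_X))

# Toy usage with invented training instances and weights:
train_X = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
w = [0.7, -0.3]
print(predict(np.array([0.5, 0.5]), train_X, w))
```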
The optimization of the weights $w_i$ of the kernel machine can be formulated in different ways. Let $y_j \in \{-1, +1\}$ indicate the desired output for pattern $x_j$, let $\mathbf{w} = [w_1, \ldots, w_N]$ be the vector arranging the kernel machine parameters, and let $G$ be the Gram matrix, whose $(i, j)$ element is defined as $G_{ij} = K(x_i, x_j)$. Then $||f||^2 = \mathbf{w}^t G \mathbf{w}$, and it can be shown that the cost function

$$||f||^2 + \lambda \sum_{j=1}^{N} L\big(y_j, f(x_j)\big)$$

reduces to the formulation of hard-margin $L_2$ SVMs [23] if $L(\cdot)$ is the hinge loss and $\lambda \to \infty$. A very similar cost function has been employed to solve the protein, domain and residue interaction prediction tasks presented in this paper.
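A minimal sketch of this cost function, assuming a precomputed Gram matrix and the hinge loss (all names are illustrative, not the authors' implementation):

```python
import numpy as np

def hinge_loss(y, f):
    """Hinge loss L(y, f(x)) = max(0, 1 - y * f(x)), element-wise."""
    return np.maximum(0.0, 1.0 - y * f)

def regularized_cost(w, G, y, lam):
    """||f||^2 + lambda * sum_j L(y_j, f(x_j)), with ||f||^2 = w^T G w, f = G w."""
    f = G @ w
    return w @ G @ w + lam * hinge_loss(y, f).sum()
```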
First-order logic
Propositional logic is based on the basic concept of propositions, which can assume either a true or a false value. It is possible to perform operations on propositions by connecting them via the and (∧), or (∨) and not (¬) operators. In particular, given two propositions A, B, it holds that: A ∧ B is true iff A = true and B = true; A ∨ B is false iff A = false and B = false; and ¬A flips the current truth value of A. The operator ⇒ can be used to express a conditional statement: A ⇒ B expresses the fact that B must hold true whenever A is true. The sentence A ⇒ B is false iff A = true and B = false, and it can be expressed in terms of the other operators through the equivalence A ⇒ B ≡ ¬A ∨ B.
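This equivalence can be verified mechanically by enumerating all truth assignments, as in the following small sketch:

```python
from itertools import product

# Check the equivalence A => B == (not A) or B over all Boolean assignments.
for A, B in product([False, True], repeat=2):
    material = (not A) or B          # definition via the equivalence
    conditional = B if A else True   # "B must hold whenever A is true"
    assert material == conditional
print("A => B is equivalent to (not A) or B")
```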
First-order logic (FOL) extends propositional logic to compactly express generic properties of a class of objects, thanks to the use of predicates, variables and quantifiers. A variable can assume as value any object in some considered domain. A variable is said to be grounded once it is assigned a specific object. A predicate is a function that, taking as input some objects (or grounded variables), returns either true or false. Predicates can be connected with other predicates using the same operators defined for propositional logic. The universal quantifier (∀) expresses the fact that some proposition is true for any object, while the existential quantifier (∃) expresses the fact that some proposition is true for at least one object.
For example, let x be a variable and let Protein(x), Enzyme(x) and NonEnzyme(x) indicate three predicates expressing whether, given a grounding such as x = PDB1a3a, PDB1a3a is a protein, an enzyme, or a non-enzyme, respectively. The following FOL clause can be used to express that any protein is either an enzyme or it is not: ∀x Protein(x) ⇒ Enzyme(x) ∨ NonEnzyme(x). Variables and quantifiers can be combined. For example, given the predicate Protein(x), holding true if x is a protein, and ResidueOf(x, y), holding true if y is a residue of x, the following clause expresses the fact that any protein has at least one residue: ∀x Protein(x) ⇒ ∃y ResidueOf(x, y).
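To make the semantics concrete, the following sketch (with an invented toy domain) evaluates the clause ∀x Protein(x) ⇒ ∃y ResidueOf(x, y) over a small finite set of objects:

```python
# Toy domain: objects and predicate extensions (purely illustrative).
objects = ["PDB1a3a", "res1", "res2"]
protein = {"PDB1a3a"}
residue_of = {("PDB1a3a", "res1"), ("PDB1a3a", "res2")}

def Protein(x):
    return x in protein

def ResidueOf(x, y):
    return (x, y) in residue_of

# forall x: Protein(x) => exists y: ResidueOf(x, y)
clause_holds = all(
    (not Protein(x)) or any(ResidueOf(x, y) for y in objects)
    for x in objects
)
print(clause_holds)  # True for this toy domain
```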
Semantic-based regularization
Semantic Based Regularization (SBR) [22] is a general framework for injecting prior knowledge expressed in FOL into kernel machines for semi-supervised learning tasks. The prior knowledge is converted into a set of continuous constraints, which are enforced during training. The SBR framework is very general and makes it possible to employ the full expressiveness of FOL in the definition of the prior knowledge. The SBR framework also makes it possible to perform collective classification on the test set, in order to force the output to respect the logic knowledge.
Let us consider a multitask learning problem, where each task works on an input domain from which labeled and unlabeled examples are sampled. For example, in the case study presented in this paper, three separate tasks, for protein, domain and residue interactions, need to be jointly learned. Each input pattern is described via some representation that is relevant for solving the tasks at hand. Let us indicate with $T$ the total number of tasks, where task $k$ is implemented by a function $f_k$, which lives in an appropriate Reproducing Kernel Hilbert Space $\mathcal{H}_k$. In the following, $\mathbf{f} = [f_1, \ldots, f_T]$ indicates the vector collecting all task functions. The basic assumption of SBR is that the task functions are correlated, as they have to meet a set of constraints that can be expressed by the functionals $\phi_h : \mathcal{H}_1 \times \ldots \times \mathcal{H}_T \to [0, +\infty)$ such that $\phi_h(\mathbf{f}) = 0$, $h = 1, \ldots, H$, must hold for any valid choice of $f_k \in \mathcal{H}_k$, $k = 1, \ldots, T$. Following the classical penalty approach for constrained optimization, the constraints are embedded by adding a term that penalizes their violation:
$$\lambda_r \sum_{k=1}^{T} ||f_k||^2 + \sum_{k=1}^{T} \sum_{(x_j^k, y_j^k) \in \mathcal{L}_k} L\big(f_k(x_j^k), y_j^k\big) + \lambda_c \sum_{h=1}^{H} \phi_h(S, \mathbf{f}), \qquad (3)$$
where $L(\cdot)$ is a loss function measuring the distance of the function output from the desired one, $\mathcal{L}_k$ is the set of labeled examples for the $k$-th task, and $S$ is the considered sample of data points over which the functions are evaluated. In the experimental setting, $L(\cdot)$ has been set to the hinge function. It is possible to extend the Representer Theorem to show that the best solution of Equation 3 can be expressed as a kernel expansion of the form shown in Equation 2 [22]. Therefore, using kernel expansions, Equation 3 becomes:
$$\lambda_r \sum_{k=1}^{T} \mathbf{w}_k^t G_k \mathbf{w}_k + \sum_{k=1}^{T} L\big(G_k \mathbf{w}_k, \mathbf{y}_k\big) + \lambda_c \sum_{h=1}^{H} \phi_h(G_1 \mathbf{w}_1, \ldots, G_T \mathbf{w}_T), \qquad (4)$$
where $G_k$, $\mathbf{w}_k$, $f_k = G_k \mathbf{w}_k$ and $\mathbf{y}_k$ are, respectively, the Gram matrix, the weights, the function values over the data sample and the desired output column vector for the patterns in the domain of the $k$-th task. Evidence tasks do not need to be approximated, as they are fully known.
Optimization of the $\mathbf{w}_k$ parameters for the cost function in Equation 4 can be done using gradient descent. The constraints $\phi_h$ are non-linear in most interesting cases, like the one presented in this paper. Therefore, the cost function can present multiple local minima, making optimization difficult. SBR uses a two-step heuristic to address this problem: first it computes the global optimum for all predicates independently (setting $\lambda_c = 0$), which reduces to training convex kernel machines; then it introduces the constraints and proceeds to find a good solution using gradient descent.
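The two-step scheme can be sketched roughly as follows. All names here are hypothetical, `penalty` stands in for the constraint term $\sum_h \phi_h$, and finite-difference gradients are used only for brevity; a practical implementation would use analytic gradients:

```python
import numpy as np

def sbr_objective(ws, Gs, ys, lam_r, lam_c, penalty):
    """Equation 4: lam_r * sum_k w_k^T G_k w_k + sum_k hinge(G_k w_k, y_k)
    + lam_c * penalty(G_1 w_1, ..., G_T w_T)."""
    fs = [G @ w for G, w in zip(Gs, ws)]
    reg = lam_r * sum(w @ G @ w for G, w in zip(Gs, ws))
    loss = sum(np.maximum(0.0, 1.0 - y * f).sum() for y, f in zip(ys, fs))
    return reg + loss + lam_c * penalty(fs)

def numeric_grad(fun, ws, eps=1e-6):
    """Finite-difference gradient of fun w.r.t. each weight vector."""
    base, grads = fun(ws), []
    for k, w in enumerate(ws):
        g = np.zeros_like(w)
        for i in range(len(w)):
            ws_plus = [u.copy() for u in ws]
            ws_plus[k][i] += eps
            g[i] = (fun(ws_plus) - base) / eps
        grads.append(g)
    return grads

def train_sbr(Gs, ys, penalty, lam_r=0.1, lam_c=1.0, lr=1e-3, steps=200):
    ws = [np.zeros(G.shape[0]) for G in Gs]
    # Step 1: fit each task independently (lam_c = 0), a convex problem.
    obj0 = lambda ws_: sbr_objective(ws_, Gs, ys, lam_r, 0.0, penalty)
    for _ in range(steps):
        ws = [w - lr * g for w, g in zip(ws, numeric_grad(obj0, ws))]
    # Step 2: switch the constraints on and refine by gradient descent.
    obj1 = lambda ws_: sbr_objective(ws_, Gs, ys, lam_r, lam_c, penalty)
    for _ in range(steps):
        ws = [w - lr * g for w, g in zip(ws, numeric_grad(obj1, ws))]
    return ws
```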
In the following we show how to express first-order logic clauses in terms of the constraints $\phi_h$.
Translation of first-order logic clauses into real-valued
constraints
With no loss of generality, we restrict our attention to FOL clauses in prenex normal form (PNF), where all the quantifiers (∀, ∃) and their associated quantified variables are placed at the beginning of the clause. For example:
$$\underbrace{\forall v_1 \, \forall v_2}_{\text{Quantifiers}} \;\; \underbrace{A(v_1) \land B(v_2) \Rightarrow C(v_1)}_{\text{Quantifier-free expression}} \qquad (5)$$
Please note that the quantifier-free part of the expression is equivalent to an assertion in propositional logic for any given grounding of the quantified variables. As studied in the context of fuzzy logic and symbolic AI, different methods can be used to convert a propositional expression into a continuous function with [0, 1] input variables.
T-norms
T-norms [47] are commonly used in fuzzy logic [48] to generalize propositional logic expressions to real-valued functions of continuous variables. A continuous t-norm is a function $t : [0,1] \times [0,1] \to \mathbb{R}$ that is continuous, commutative, associative, monotonic, and features 1 as its neutral element (i.e. $t(a, 1) = a$). A t-norm fuzzy logic is defined by its t-norm $t(a_1, a_2)$, which models the logical AND, while the negation of a variable $\neg a$ is computed as $1 - a$. Once the functions corresponding to the logical AND and NOT have been defined, they can be composed to convert any arbitrary logic proposition into a continuous function. Many different t-norm logics have been proposed in the literature. For example, the product t-norm, used in the experimental section, yields the following mappings:
$$(a_1 \land a_2) \;\longmapsto\; t(a_1, a_2) = a_1 \cdot a_2$$
$$(\neg a_1) \;\longmapsto\; 1 - a_1$$
$$(a_1 \lor a_2) \;\longmapsto\; t(a_1, a_2) = a_1 + a_2 - a_1 \cdot a_2$$
Please note that the t-norm behaves as classical logic when the variables approach the values 0 (false) or 1 (true).
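As an illustration, the product t-norm turns the quantifier-free part of a clause such as $A(v_1) \land B(v_2) \Rightarrow C(v_1)$ into a continuous score. The sketch below assumes predicate outputs already squashed into [0, 1]; the numbers are invented:

```python
def t_and(a, b):      # product t-norm: logical AND
    return a * b

def t_not(a):         # negation
    return 1.0 - a

def t_or(a, b):       # derived OR: a + b - a*b
    return a + b - a * b

def implies(a, b):    # material implication via a => b == not(a) or b
    return t_or(t_not(a), b)

# Degree of satisfaction of A(v1) and B(v2) => C(v1) for one grounding:
a, b, c = 0.9, 0.8, 0.3
print(implies(t_and(a, b), c))  # 0.496
```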
The equivalence $a_1 \Rightarrow a_2 \equiv \neg a_1 \lor a_2$ can be used to represent implications (modus ponens) before performing the t-norm conversion. However, this process does not fully capture the inference process performed in a probabilistic or fuzzy logic context. Any t-norm has a corresponding binary operator ⇒, called the residuum, which is used in fuzzy logic to generalize implication to continuous variables. In particular, for a minimum t-norm, it