Effect of insertions and deletions (indels) on wirings in
protein-protein interaction networks: a large-scale study
Fereydoun Hormozdiari1,5, Michael Hsing2,5, Raheleh Salari1,5,
Alexander Schönhuth1, Simon K. Chan2,3, S. Cenk Sahinalp1,6, Artem Cherkasov4,6
1School of Computing Science, Simon Fraser University
2Bioinformatics Graduate Program, University of British Columbia
3Canada’s Michael Smith Genome Science Centre,
British Columbia Cancer Research Centre
4Division of Infectious Diseases, Department of Medicine,
University of British Columbia
5The authors contributed equally
6Corresponding authors: firstname.lastname@example.org, email@example.com
Although insertions and deletions (indels) are a common type of sequence variation, their
origin and their functional consequences have not yet been fully understood. It has been known
that indels preferably occur in the loop regions of the affected proteins. Moreover, it has re-
cently been demonstrated that indels are significantly more strongly correlated with functional
changes than substitutions. In sum, there is substantial evidence that indels, not substitutions,
are the predominant evolutionary factor when it comes to structural changes in proteins.
As a consequence, it is natural to hypothesize that sizable indels can modify protein
interaction interfaces, causing a gain or loss of protein-protein interactions, thereby signifi-
cantly rewiring the interaction networks. In this paper, we have analyzed this relationship in a
large-scale study. We have computed all paralogous protein pairs in S. cerevisiae (Yeast) and
D. melanogaster (Fruit Fly) and sorted the respective alignments according to whether they
contained indels of significant lengths as per a pair HMM based framework of a recent study.
We subsequently computed well known centrality measures for proteins that participated in
indel alignments (indel proteins) and those that did not. We found that indel proteins indeed
showed greater variation in terms of these measures. This demonstrates that indels have a sig-
nificant influence when it comes to rewiring of the interaction networks due to evolution, which
confirms our hypothesis. In general, this study may yield relevant insights into the functional
interplay of proteins and the evolutionary dynamics behind it.
Insertions/deletions (indels) and amino acid substitutions represent the two most common
types of sequence variation observed among evolutionarily related proteins (Thorne, 2000).
Unlike amino acid substitutions, whose mechanisms have been studied intensively (Yang et
al., 1994; Whelan et al., 2001), indels remain less understood and pose several unanswered
questions.
Previous indel surveys have been conducted for both DNA (Thorne et al., 1992; Gu and
Li, 1995; Lunter et al., 2007) and proteins (Pascarella and Argos, 1992; Sibanda and Thornton,
1993; Benner et al., 1993; Fechteler et al., 1995; Qian and Goldstein, 2001; Chang and Benner,
2004; Pang et al., 2005) to address specific questions for limited datasets such as indel length
distribution, sequence and structure composition as well as divergent evolution of proteins.
Recently, a large-scale indel analysis was performed in 136 bacterial and protozoan genomes,
and the results have shown that 5-10% of proteins in the studied species contained sizable
indels in comparison to homologous proteins in human (Cherkasov et al., 2006).
Examples of indel occurrence in conserved and essential proteins have also been reported
in recent studies (Nandan et al., 2003; Chan et al., 2007). Another interesting finding is that
indels are decisively involved in disease-causing mutational hot spots in DNA (Kondrashov
and Rogozin, 2004). Last but not least, a recent large-scale study revealed that indel length
and functional divergence are clearly correlated (Salari et al., 2008). This study also aimed
at laying a computational foundation of a novel large-scale approach to drug target search by
extending the previous indel studies of Cherkasov et al. (2005, 2006); Nandan et al. (2007).
Despite the common occurrence of indels as evolutionary sequence variation, no broadly
accepted hypotheses have been established to explain their origins and their functional con-
sequences. Previous structural analyses showed that indels occurred preferably in the protein
surface regions without disrupting the core protein folds (Benner et al., 1993; Pascarella and
Argos, 1992; Sibanda and Thornton, 1993; Fechteler et al., 1995). In addition, studies have
shown that protein-protein interactions were established mainly through non-covalent contacts
between protein surfaces (Gao et al., 2004; Deremble et al., 2005). Combining these observa-
tions, we hypothesize that sizable indels can modify protein interaction interfaces, causing a
gain or loss of protein-protein interactions. Thus, indels may “fine-tune” a certain number of
interactions on proteins, while keeping the other interactions intact, thereby optimizing cellular
behavior on a systemic level. In this paper, we will investigate this hypothesis and thoroughly
study the relationship between sizable indels and protein-protein interactions, that is, systemic
changes, by means of large-scale genomic and interaction data.
Recent advancements in high-throughput experiments have enabled the detection of protein-protein
interactions on a genome-wide scale, and generated a large amount of interaction data
for several species. Techniques such as the yeast two-hybrid assay and affinity purification
followed by mass spectrometry have been successfully used to identify protein interactions
in a number of organisms including Saccharomyces cerevisiae (Uetz et al., 2000; Ito et al.,
2001; Ho et al., 2002; Gavin et al., 2006; Krogan et al., 2006), Drosophila melanogaster (Giot
et al., 2003), Escherichia coli (Butland et al., 2005), and Caenorhabditis elegans (Li et al.,
2004). The corresponding interaction data are publicly accessible through databases such as
DIP (Salwinski et al., 2004) and IntAct (Hermjakob et al., 2004).
The available interaction data have facilitated recent protein interaction network (PIN)
studies, e.g. the investigation of topological properties of PINs (Barabasi et al., 2004; Albert
et al., 2005) and their generation mechanisms (Eisenberg et al., 2003; Bhan, 2002; Vazquez,
2003) in terms of evolution and/or human “sampling” of the nodes and edges of the PINs. All
of these studies aimed at improving a general understanding of complex cellular behaviour.
Our study on indels and their effects on protein-protein interactions may also contribute to
these discussions by improving the current understanding of protein interaction network evolution.
To the best of our knowledge, this study is the first to elucidate the impact of indels on the
complex wirings in PPI networks. To this end, we have computed all
paralogous protein pairs in Yeast and Fruit Fly and separated the pairs according to whether
or not their alignments contained indels of significant length (“indel alignments”), which are
more likely to reflect true evolutionary events, as determined by the statistical framework
presented in a recent study (Salari et al., 2008) (see subs. 2.2).
We demonstrate that proteins which participate in indel alignments have higher variation in
terms of well known centrality measures (Joy et al., 2005; Jeong et al., 2001). As it was shown
that the considered centrality measures can reliably monitor the relevance of proteins within
the complex wirings of PPI networks (Hormozdiari et al., 2007; Yu et al., 2007), this leads us
to conclude that indels indeed have a significant impact on systemic changes in the considered
2.1 Network Centrality Metrics
There are a large number of well known metrics with which to measure the topological central-
ity of nodes in PINs, hence to measure the biological relevance of the proteins associated with
the nodes. Here, we have opted for employing three different metrics to measure centrality of
a node v ∈ V of a network Net(V,E):
1. Deg(v) is the classical degree of v which is just the number of edges attached to the
node under consideration.
2. Betweenness Bet(v) is the number of shortest paths (between any two nodes x,y ∈ V )
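that pass through v. More formally, let $\sigma_{x,y}$ be the number of shortest paths from x ∈ V to
y ∈ V and $\sigma_{x,y}(v)$ the number of those shortest paths which pass through v. Then betweenness
is defined as
\[ \mathrm{Bet}(v) = \sum_{x \neq v \neq y} \frac{\sigma_{x,y}(v)}{\sigma_{x,y}} \qquad (1) \]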
Betweenness is computed by means of the algorithm proposed by Brandes (2001). Note
that it has been shown that nodes with high betweenness represent proteins with impor-
tant functional properties. It may come as a surprise that proteins of high betweenness
are even more essential than proteins of high degree. As pathways do not solely pass
through the shortest paths between their end points we finally study
3. Random walk betweenness RWBet(v), as introduced by Newman (2005), which is the
number of all paths through a node v as computed by a random walk procedure. In more
detail, given two nodes x,y ∈ V , random walks between x and y are computed and the
number of such walks that pass through v is divided by the number of all random walks
between x and y. The overall random walk betweenness is then given by the number
of all such random walks that pass through v divided by all random walks computed.
It can be interpreted as the probability that v is visited by arbitrary random walks. The
algorithm we use for computation of random walks is a classical procedure based on
known techniques from linear algebra (see Newman (2005) for details). It calculates the
random walk betweenness for all nodes in the network in running time of $O(n^3)$ (where
n is the size of the network) for sparse networks (PINs are sparse) and thereby requires
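The first two metrics, Deg(v) and Bet(v), can be illustrated with a minimal sketch. It assumes an unweighted, undirected network stored as a dictionary mapping each node to the set of its neighbours, and it counts every unordered pair {x, y} once; both choices, as well as the function names, are assumptions made for illustration and are not taken from the study. The betweenness routine follows the accumulation scheme of Brandes (2001); random walk betweenness is not covered here.

```python
from collections import deque


def degree(adj):
    """Deg(v): number of edges attached to each node."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


def betweenness(adj):
    """Shortest-path betweenness for an unweighted, undirected graph."""
    bet = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and remember the predecessors on those paths.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bet[w] += delta[w]
    # Every unordered pair was visited from both of its end points.
    return {v: b / 2.0 for v, b in bet.items()}
```

On a three-node path a-b-c, for instance, only the middle node b lies on a shortest path between two other nodes, so it is the only node with non-zero betweenness.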
2.2 Indel Length Statistics
In order to distinguish between gaps that are artifacts of the alignment procedure and those
that are likely to reflect insertions/deletions introduced by evolution, we have adopted
a pair HMM based approach that has recently been described in Salari et al. (2008). The
approach is centered around computing the probabilities
\[ P_{n,T}(I_A(x,y) \geq k) := P(I_A(x,y) \geq k \mid L_A(x,y) = n, (x,y) \in T) \qquad (2) \]
where T is a pool of protein pairs of interest, n and k are integers, A is an alignment procedure,
$L_A(x,y)$ is the length of the alignment of two proteins $x = x_1 \ldots x_m$, $y = y_1 \ldots y_n$
($x_i$, $y_j$ are amino acids), as computed by A, and $I_A(x,y)$
is the length of the largest indel in the alignment. Probabilities (2) are interpreted as the
probabilities that the largest indel in the alignment of x and y is of length at least k or, equivalently,
that the alignment contains an indel of length at least k, given that x and y have been drawn
from T such that the alignment of x and y is of length n. In this paper, A is the classical
Needleman-Wunsch algorithm and pools T contain protein pairs whose Needleman-Wunsch
alignment yields similar alignment similarity scores. This way, one can account for the fact that gaps
in high score alignments, hence between proteins that are more likely to be homologous, are
more trustworthy in terms of being evolutionary events.
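The two alignment quantities used above, the alignment length and the length of the largest indel, can be read off a gapped pairwise alignment as in the following sketch. It assumes that the alignment is given as two equal-length strings with "-" as the gap character, and that consecutive gap columns count as one indel regardless of which sequence carries the gap (in line with the combined Indel state described in the summary of the approach). The function name and input format are illustrative assumptions.

```python
def alignment_stats(aligned_x, aligned_y, gap="-"):
    """Return (alignment length, length of the longest run of gap columns)."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have equal length")
    longest = 0
    run = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == gap or b == gap:
            # The current column is a gap column: extend the running indel.
            run += 1
            longest = max(longest, run)
        else:
            # A match or substitution column ends the running indel.
            run = 0
    return len(aligned_x), longest
```

For the alignment of "AC--GT" with "ACTTGT" this yields an alignment length of 6 and a largest indel of length 2.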
The approach of Salari et al. (2008) can be summarized as follows:
1. The alignments referring to protein pairs from a pool T are used to train a standard
pair HMM for global alignments. We recall that Needleman-Wunsch alignments can be
interpreted as Viterbi paths in such a pair HMM (Durbin et al., 1999).
2. Deletion and insertion states of the underlying Markov chain of the resulting pair HMM
are collapsed into a combined ’Indel’ state.
3. $P_{n,T}(I_A(x,y) \geq k)$ is finally interpreted as the probability $P_{MC}(n,k)$¹ that a sequence of
length n, as generated by the Markov chain resulting from step 2, contains a consecutive
’Indel’ stretch of length at least k.
Efficient computation of probabilities $P_{MC}(n,k)$ for arbitrary choices of n and k such that
$1 \leq k \leq n \leq N_{max}$ (where $N_{max}$ is the maximum length of the alignments under consideration)
is guaranteed by a novel dynamic programming approach. See Salari et al. (2008) for details.
¹ $P_{MC}(n,k)$ corresponds to $P(C_{n,k})$ in Salari et al. (2008).
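To make the quantity in step 3 concrete, the following sketch computes the probability that a two-state Markov chain (Match and Indel) of length n contains a run of at least k consecutive Indel states. This is a simple illustrative dynamic program, not the algorithm of Salari et al. (2008); the parameter names p_init (probability of starting in Indel), p_mi (Match to Indel) and p_ii (Indel to Indel) are assumptions introduced here.

```python
def prob_indel_run(n, k, p_init, p_mi, p_ii):
    """P(a chain of length n contains >= k consecutive Indel states)."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    if k > n:
        return 0.0
    # alive[r]: probability that the chain currently ends in an Indel run
    # of length r (r = 0: last state is Match) and no run of length k
    # has occurred so far. hit collects the mass that has reached k.
    alive = [0.0] * k
    alive[0] = 1.0 - p_init
    hit = 0.0
    if k == 1:
        hit = p_init
    else:
        alive[1] = p_init
    for _ in range(n - 1):
        nxt = [0.0] * k
        for run, mass in enumerate(alive):
            p_indel = p_mi if run == 0 else p_ii
            nxt[0] += mass * (1.0 - p_indel)    # next state is Match
            if run + 1 >= k:
                hit += mass * p_indel           # run reaches length k
            else:
                nxt[run + 1] += mass * p_indel  # run grows by one
        alive = nxt
    return hit
```

The sketch needs O(nk) time for a single pair (n, k). With all three parameters set to 0.5 the chain behaves like a sequence of fair coin flips, and for n = 3 and k = 2 the result is 3/8, matching the three sequences IIM, MII and III out of eight.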
3.1 Data, Alignment Model, and Indel/Non-Indel Pairs
In our experiments we focused on Yeast and Fruit Fly. We downloaded all Core Yeast² and
Fruit Fly proteins together with their interactions from the Database of Interacting Proteins
(DIP; Salwinski et al., 2004). The corresponding Core Yeast PIN consisted of 2808 proteins
and 6459 interactions. The Fruit Fly PIN was composed of 7070 proteins and 22009 interactions.
We used the “GGSEARCH” tool from the FASTA sequence comparison package (Pearson
and Lipman, 1988) to calculate pairwise global alignments between all proteins of Core
Yeast and of Fruit Fly, respectively. As a substitution matrix, BLOSUM50 (default) was used. GGSEARCH
implements the classical Needleman-Wunsch alignment algorithm with affine gap penalties.
To ensure high quality of the alignments, we first discarded all alignments of less than
50% alignment similarity³. To further increase the likelihood that the alignments under
consideration represent paralogous protein pairs while still retaining sufficient amounts
of data, we moreover removed alignments of less than 20% identity or with an E-value greater
than $10^{-1}$. In order to remove redundant alignments we screened the NCBI GENE database
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene) for any cases of splice variants and
removed the respective protein pairs from our list. As a result, we obtained 1933 paralogous
protein pairs for Yeast and 6096 pairs of paralogous proteins for Fruit Fly. For each species,
we grouped the aligned protein pairs according to their similarity scores into ten different pools
and inferred the parameters of the corresponding Markov chains, according to the procedure
described in subsection 2.2 (see Salari et al. (2008) for more details). The idea of different score
pools is to ensure that indel length is not correlated to alignment similarity scores and also to
reflect that shorter indels tend to be more trustworthy in more similar alignments. Otherwise
indel pairs would more likely refer to low similarity score alignments, which would undermine the
core idea of our study. See Table 1 for the Markov chain parameters of the different pools for both
species.
² Core Yeast represents a subset of Yeast proteins whose interactions are of higher confidence and
therefore more reliable.
³ This is the number of matches and “good” substitutions as specified by GGSEARCH, divided by the length of the
alignment.
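The filtering and pooling steps described above can be sketched as follows. The thresholds (50% similarity, 20% identity, E-value of at most 0.1) and the number of pools (ten) are taken from the text; the record layout, the function names, and in particular the choice of equally sized pools by similarity rank are assumptions made for illustration, since the text does not specify how the pool boundaries were chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    protein_a: str
    protein_b: str
    similarity: float  # fraction of matches and "good" substitutions, in [0, 1]
    identity: float    # fraction of identical positions, in [0, 1]
    evalue: float


def keep(aln, min_similarity=0.50, min_identity=0.20, max_evalue=0.1):
    """True if the alignment passes all three quality filters."""
    return (aln.similarity >= min_similarity
            and aln.identity >= min_identity
            and aln.evalue <= max_evalue)


def pool_by_similarity(alignments, n_pools=10):
    """Split alignments into n_pools pools of near-equal size by similarity rank."""
    ranked = sorted(alignments, key=lambda aln: aln.similarity)
    pools = [[] for _ in range(n_pools)]
    for rank, aln in enumerate(ranked):
        # rank * n_pools // len(ranked) is always in 0 .. n_pools - 1.
        pools[rank * n_pools // len(ranked)].append(aln)
    return pools
```

A Markov chain would then be trained separately on the alignments of each pool.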
We would like to distinguish between two kinds of alignments, indel and non-indel alignments,
that is, alignments that contain indels of significant size and alignments that contain
only gaps whose lengths are significantly small. To be more precise, we are interested in
alignments of length n and maximal indel length k such that
\[ P_{MC}(n,k) \leq \theta_I \]
(indel alignments) or
\[ 1 - P_{MC}(n,k+1) \leq \theta_{NI} \]
(non-indel alignments), where $\theta_I$ and $\theta_{NI}$ are appropriate significance levels. Note that
$1 - P_{MC}(n,k+1)$ is just the probability that an alignment does not contain an indel of size
larger than k. In our experiments we have set $\theta_I = 0.2$ and $\theta_{NI} = 0.2$, which established
a suitable trade-off between the reliability of the occurring indels and the amounts of data
under consideration.
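The two threshold conditions translate directly into a small decision rule, sketched below. The function takes the two probabilities as precomputed inputs; its name, the label strings, and the third label for alignments that satisfy neither condition are assumptions made for illustration.

```python
def classify_alignment(p_at_least_k, p_at_least_k_plus_1,
                       theta_i=0.2, theta_ni=0.2):
    """Label an alignment whose largest indel has length k.

    p_at_least_k        -- probability of an indel of length >= k
    p_at_least_k_plus_1 -- probability of an indel of length >= k + 1
    """
    if p_at_least_k <= theta_i:
        # An indel this long is unlikely under the pool's Markov chain.
        return "indel"
    if 1.0 - p_at_least_k_plus_1 <= theta_ni:
        # Having no indel longer than k is unlikely: the gaps are short.
        return "non-indel"
    return "unclassified"
```

Since the probability of an indel of length at least k + 1 can never exceed that of an indel of length at least k, the two conditions cannot hold at the same time for thresholds of 0.2, so the order of the checks does not matter.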
3.2 Network Centrality and Indels
We computed the network centrality measures degree, betweenness and random walk betweenness
(see subs. 2.1) for all proteins from Core Yeast and Fruit Fly. Subsequently, we computed
the average of the differences of these measures within indel (I) and within non-indel (NI)
paralogous protein pairs. As a baseline, we also computed the overall averages of the differences
of these measures between all paralogous protein pairs (O). Results are displayed in Fig. 1.
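The comparison of the three groups (I, NI and O) amounts to averaging a per-pair centrality difference, as in the sketch below. It assumes that the difference of a pair is the absolute difference of the two centrality values, which the text does not state explicitly; the function name and the data layout are likewise illustrative.

```python
from statistics import mean


def mean_centrality_difference(pairs, centrality):
    """Average absolute centrality difference over a list of protein pairs."""
    return mean(abs(centrality[a] - centrality[b]) for a, b in pairs)


# Toy example with made-up degree values for four hypothetical proteins.
toy_degree = {"p1": 5, "p2": 1, "p3": 2, "p4": 2}
toy_indel_pairs = [("p1", "p2")]
toy_non_indel_pairs = [("p3", "p4")]
toy_all_pairs = toy_indel_pairs + toy_non_indel_pairs
```

In this toy example the indel pair differs by 4, the non-indel pair by 0, and the overall average is 2, which mirrors the ordering I above O above NI reported for the real data.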
Clearly, the differences between indel pairs are usually greater than those between non-indel
pairs and also greater than the overall average, for both species and all centrality metrics under
consideration. This is reflected in Fig. 1, where the curve of the indel pairs (red) is always
above that of the overall averages (blue), which in turn is above that of the non-indel pairs
(green). This confirms our hypothesis that paralogous proteins that evolved from each other by
the introduction of indels have higher variation in PPI network essentiality in comparison to
paralogous pairs whose alignments do not contain indels. Note that, by sorting the proteins
into different alignment similarity pools, we have ruled out that the greater differences between
indel pair proteins are due to lower alignment similarity scores; this matters because alignment
similarity is clearly correlated with indel length in general. Moreover, it is interesting to observe in Fig. 1
that the centrality plots for all classes (indel, non-indel, overall) of paralogous proteins exhibit
similar patterns across the different similarity values.
We have also listed the average differences between centrality values for Core Yeast and
Fruit Fly when sorting paralogous protein pairs according to sequence identity as given by the
corresponding alignments. Results are displayed in tables 2 (Core Yeast) and 3 (Fruit Fly).
We first recall that it is a well known fact that one can expect structural divergence only in
the twilight zone of between roughly 20% and 35% sequence identity (Rost, 1999). Clearly,
our results again confirm this well established hypothesis, as for protein pairs beyond 35%
sequence identity no significant changes in the centrality measure differences can be observed
in Tables 2 and 3. However, in the twilight zone, differences are significant, as determined by
Overall, our results confirm our core hypothesis.
4 Discussion and Outlook
The main question studied in this manuscript is the effect of indel sites on PPI networks. We
have hypothesized that, due to indel sites being located in close proximity of protein-protein
interaction interfaces, the occurrence of truly evolutionary indels can significantly modify protein-protein
bindings and interactions, leading to a loss of existing interactions or a gain of new
interactions.
The latter part of the above hypothesis was supported by our observation that indel protein
pairs had a greater variation of centrality measures than non-indel pairs, including degree,
betweenness and random walk betweenness (figure 1, table 2 and table 3). Our results imply
that indels have changed the degree of similar proteins in the studied PINs. In addition, as was
argued in Yu et al. (2007), betweenness is a more important indicator of a protein being a “key
connector protein with surprising functional and dynamic properties”, and hence essential, that
is, its removal leads to a collapse of the cellular interplay. Thus, by changing the betweenness
measure, indels could modify the essentiality of a protein in a PIN. The relationship between
indels and essentiality has been studied in our recent work (Chan et al., 2007), and the results
have shown that essential proteins are more likely to contain indels than non-essential proteins
in the studied species.
In conclusion, the study has shown an interesting result on the relationships between indels
and protein interaction networks in S. cerevisiae and D. melanogaster. Indel protein pairs had
greater variations of centrality measures than non-indel pairs; this suggested that indels can
decisively modify the interaction spectrum of an affected protein.
In a future study, we will focus on the characterization of indel sites in known protein
structures to better understand their role in network rewiring also in terms of the involved
Albert, R. 2005. Scale-free networks in cell biology. J. Cell Sci 118:4947-57
Barabasi, AL. and Oltvai, ZN. 2004. Network biology: understanding the cell’s functional
organization. Nat Rev Genet. 5:101-113.
Benner, S.A., Cohen, M.A., and Gonnet, G.H. 1993. Empirical and structural models for in-
sertions and deletions in the divergent evolution of proteins. J Mol Biol, 229: 1065-1082.
Bhan, A., Galas, D.J., Dewey, T.G. 2002. A duplication growth model of gene expression
networks. Bioinformatics, 18: 1486-1493.
Brandes, U. 2001. A faster algorithm for betweenness centrality. Journal of Mathematical
Sociology, 25: 163–177.
Butland, G., Peregrin-Alvarez, J.M., Li, J., Yang, W., Yang, X., Canadien, V., Starostine, A.,
Richards, D., Beattie, B., Krogan, N., et al. 2005. Interaction network containing conserved
and essential protein complexes in Escherichia coli. Nature, 433: 531-537.
Chan, S.K., Hsing, M., Hormozdiari, F., and Cherkasov A. 2007. Relationship between inser-
tion/deletion (indel) frequency of proteins and essentiality. BMC Bioinformatics, 8: 227.
Chang, M. S. S. and Benner, S. A. 2004. Empirical analysis of protein insertions and deletions
determining parameters for the correct placement of gaps in protein sequence alignments.
Journal of Molecular Biology, 341, 617-631.
Cherkasov, A., Nandan, D. and Reiner, N. E. 2005. Selective targetting of indel-inferred dif-
ferences in 3D structures of highly homologous proteins. Proteins: Structure, Function and
Bioinformatics, 58, 950-954.
Cherkasov, A., Lee, S.J., Nandan, D., and Reiner, N.E. 2006. Large-scale survey for potentially
targetable indels in bacterial and protozoan proteins. Proteins, 62: 371-380.
Deremble, C., and Lavery, R. 2005. Macromolecular recognition. Curr Opin Struct Biol., 15:
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. 1999. Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. Cambridge University Press.
Eisenberg, E., and Levanon, E.Y. 2003. Preferential attachment in the protein network evolu-
tion. Phys Rev Lett. 91:138701
Fechteler, T., Dengler, U., and Schomburg, D. 1995. Prediction of protein three-dimensional
structures in insertion and deletion regions: a procedure for searching data bases of repre-
sentative protein fragments using geometric scoring criteria. J Mol Biol, 253: 114-131.
Gao, Y., Wang, R., and Lai, L. 2004. Structure-based method for analyzing protein-protein
interfaces. J Mol Model, 10: 44-54.
Gavin, A.C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen,
L.J., Bastuck, S., Dumpelfeld, B., et al. 2006. Proteome survey reveals modularity of the
yeast cell machinery. Nature, 440: 631-636.
Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y.L., Ooi, C.E.,
Godwin, B., Vitols, E., et al. 2003. A protein interaction map of Drosophila melanogaster.
Science, 302: 1727-1736.
Gu, X., and Li, W.H. 1995. The size distribution of insertions and deletions in human and
rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol
Evol, 40: 464-473.
Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S.,
Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., et al. 2004. IntAct: an open source
molecular interaction database. Nucleic Acids Res. 32 Database issue: D452-D455.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor,
P., Bennett, K., Boutilier, K., et al. 2002. Systematic identification of protein complexes in
Saccharomyces cerevisiae by mass spectrometry. Nature, 415: 180-183.
Hormozdiari, F., Berenbrink, P., Przulj, N., and Sahinalp, S.C. 2007. Not all scale-free networks
are born equal: the role of the seed graph in PPI network evolution. PLoS Computational
Biology, e118. doi:10.1371/journal.pcbi.0030118.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. 2001. A comprehensive
two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A, 98:
Jeong, H., Mason, SP., Barabasi, AL., and Oltvai, ZN. 2001. Lethality and centrality in protein
networks. Nature 411:41-42.
Joy, MP., Brock, A., Ingber, DE., and Huang, S. 2005. High-Betweenness Proteins in Yeast
Protein Interaction Network. J. of Biomedicine and Biotechnology, 2:96-103.
Kondrashov, A. S. and Rogozin, I. B. 2004. Context of Deletions and Insertions in Human
Coding Sequences. Human Mutation, 23, 177-185.
Krogan, N.J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta,
N., Tikuisis, A.P., et al. 2006. Global landscape of protein complexes in the yeast Saccha-
romyces cerevisiae. Nature, 440: 637-643.
Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.O., Han, J.D.,
Chesneau, A., Hao, T., et al. 2004. A map of the interactome network of the metazoan C.
elegans. Science, 303: 540-543.
Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A. and Hein, J. 2007. Uncertainty
in homology inferences: Assessing and improving genomic sequence alignment. Genome
Research, 18, doi:10.1101/gr.6725608.
Nandan, D., Cherkasov, A., Sabouti, R., Yi, T., and Reiner, N.E. 2003. Molecular cloning,
biochemical and structural analysis of elongation factor-1 alpha from Leishmania donovani:
comparison with the mammalian homologue. Biochem Biophys Res Commun, 302: 646-