SPICi: A fast clustering algorithm for large biological networks

Lewis-Sigler Institute for Integrative Genomics and Department of Computer Science, Princeton University, Princeton, NJ 08544, USA.
Bioinformatics (Impact Factor: 4.98). 02/2010; 26(8):1105-11. DOI: 10.1093/bioinformatics/btq078
Source: PubMed


Clustering algorithms play an important role in the analysis of biological networks, and can be used to uncover functional modules and obtain hints about cellular organization. While most available clustering algorithms work well on biological networks of moderate size, such as the yeast protein physical interaction network, they either fail or are too slow in practice for larger networks, such as functional networks for higher eukaryotes. Since an increasing number of larger biological networks are being determined, the limitations of current clustering approaches curtail the types of biological network analyses that can be performed.
We present SPICi, a fast local network clustering algorithm. SPICi runs in O(V log V + E) time and O(E) space, where V and E are the number of vertices and edges in the network, respectively. We evaluate SPICi's performance on several existing protein interaction networks of varying size, and compare SPICi to nine previous approaches for clustering biological networks. We show that SPICi is typically several orders of magnitude faster than previous approaches and is the only one that can successfully cluster all test networks within a very short time. We demonstrate that SPICi has state-of-the-art performance with respect to the quality of the clusters it uncovers, as judged by its ability to recapitulate protein complexes and functional modules. Finally, we demonstrate the power of our fast network clustering algorithm by applying SPICi across hundreds of large context-specific human networks and identifying modules specific to single conditions.
Source code is available under the GNU Public License at http://compbio.cs.princeton.edu/spici.


Available from: Peng Jiang, Oct 21, 2014
  • Source
    • "include node 5 in the cluster. However, two vertices are more likely to be in the same module if the weight on the edge between them is higher (Jiang and Singh, 2010). In Fig. 1, we can see that the average weight with which node 4 is connected with nodes 2 and 3 is higher than the same with which node 5 is connected with nodes 1, 2 and 3. So, a cluster comprising the set {4, 2, 3} seems more meaningful than a cluster comprising the set {5, 1, 2, 3}. "
    ABSTRACT: Traditional clustering algorithms often exhibit poor performance for large networks. In contrast, greedy algorithms are found to be relatively efficient at uncovering functional modules from large biological networks. The quality of the clusters produced by these greedy techniques largely depends on the underlying heuristics employed. Heuristics based on different attributes and properties differ in the quality of the clusters they produce. This motivates us to design new heuristics for clustering large networks. In this paper, we propose two new heuristics and analyze their performance after incorporating them, in three different combinations, into a recently celebrated greedy clustering algorithm named SPICi. We have extensively analyzed the effectiveness of these new variants. The results are found to be promising.
    Computational biology and chemistry 09/2015; 59(Pt A):28-36. DOI:10.1016/j.compbiolchem.2015.05.007 · 1.12 Impact Factor
  • Source
    • "We use RandIndex for comparison with the results of 6 algorithms given in Yang et al. (2011). These 6 algorithms are RankClus (Sun et al. 2009), Walktrap (Pons and Latapy 2006), K-means (Dhillon et al. 2005), LinkCommunity (Ahn et al. 2010), SPICi (Jiang and Singh 2010), and Betweenness (Girvan and Newman 2002). We also use Normalized Mutual Information (NMI) for comparison with the results of another set of 6 algorithms given in Hajibagheri et al. (2013), which include GPSODM (Hajibagheri et al. 2013), GGADM (Hajibagheri et al. 2012), HA (Leung et al. 2009), MMC (van Dongen 2000), LPA (Raghavan et al. 2007), and Infomap (Rosvall and Bergstrom 2008). "
    ABSTRACT: Node centrality and vertex similarity in network graph topology are two of the most fundamental and significant notions for network analysis. Defining meaningful and quantitatively precise measures of them, however, is nontrivial but an important challenge. In this paper, we base our centrality and similarity measures on the idea of influence of a node and exploit the implicit knowledge of influence-based connectivity encoded in the network graph topology. We arrive at a novel influence diffusion model, which builds egocentric influence rings and generates an influence vector for each node. It captures not only the total influence but also its distribution that each node spreads through the network. A Shared-Influence-Neighbor (SIN) similarity defined in this influence space gives rise to a new, meaningful and refined connectivity measure for the closeness of any pair of nodes. Using this influence diffusion model, we propose a novel influence centrality for influence analysis and an Influence-Guided Spherical K-means (IGSK) algorithm for community detection. Our approach not only differentiates the influence ranking in a more detailed manner but also effectively finds communities in both undirected/directed and unweighted/weighted networks. Furthermore, it can be easily adapted to the identification of overlapping communities and individual roles in each community. We demonstrate its superior performance with extensive tests on a set of real-world networks and synthetic benchmarks.
    05/2015; 5(1). DOI:10.1007/s13278-015-0254-4
  • Source
    • "Since most previous approaches detect complexes based solely on the PPI network, we concentrate on testing the effectiveness of GMFTP using only the topological property first. We compare it to a representative set of approaches: AP [11], CFinder [50], ClusterONE [17], Linkcomm [38], MCL [9], MCODE [10], MINE [51], SPICi [12] and SR-MCL [32]. For the four algorithms (AP, ClusterONE, MCL and SPICi) which can handle weights, we implement them on both the weighted and the unweighted versions of the four networks (Collins, Gavin, Krogan core and Krogan extended) which include edge weights. "
    ABSTRACT: Background: Identification of protein complexes can help us get a better understanding of cellular mechanism. With the increasing availability of large-scale protein-protein interaction (PPI) data, numerous computational approaches have been proposed to detect complexes from the PPI networks. However, most of the current approaches do not consider overlaps among complexes or functional annotation information of individual proteins. Therefore, they might not be able to reflect the biological reality faithfully or make full use of the available domain-specific knowledge. Results: In this paper, we develop a Generative Model with Functional and Topological Properties (GMFTP) to describe the generative processes of the PPI network and the functional profile. The model provides a working mechanism for capturing the interaction structures and the functional patterns of proteins. By combining the functional and topological properties, we formulate the problem of identifying protein complexes as that of detecting a group of proteins which frequently interact with each other in the PPI network and have similar annotation patterns in the functional profile. Using the idea of link communities, our method naturally deals with overlaps among complexes. The benefits brought by the functional properties are demonstrated by real data analysis. The results evaluated using four criteria with respect to two gold standards show that GMFTP has a competitive performance over the state-of-the-art approaches. The effectiveness of detecting overlapping complexes is also demonstrated by analyzing the topological and functional features of multi- and mono-group proteins. Conclusions: Based on the results obtained in this study, GMFTP presents to be a powerful approach for the identification of overlapping protein complexes using both the PPI network and the functional profile. The software can be downloaded from http://mail.sysu.edu.cn/home/stsddq@mail.sysu.edu.cn/dai/others/GMFTP.zip.
    BMC Bioinformatics 06/2014; 15(1):186. DOI:10.1186/1471-2105-15-186 · 2.58 Impact Factor
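
The first excerpt above compares candidate nodes by the average weight of the edges that connect each of them to a cluster. The sketch below illustrates that comparison only. It is not the SPICi implementation, the function name `average_weight_to_cluster` is hypothetical, and the edge weights are made up for illustration because the figure the excerpt refers to is not reproduced here.

```python
def average_weight_to_cluster(weights, node, cluster):
    """Mean weight of the existing edges between `node` and members of `cluster`.

    `weights` maps frozenset({u, v}) to the weight of the undirected edge u-v.
    Pairs with no edge are skipped; 0.0 is returned if no edge exists at all.
    """
    found = [
        weights[frozenset((node, member))]
        for member in cluster
        if frozenset((node, member)) in weights
    ]
    return sum(found) / len(found) if found else 0.0


# Hypothetical weights (assumed for illustration, not taken from any paper).
weights = {
    frozenset((4, 2)): 0.9,
    frozenset((4, 3)): 0.8,
    frozenset((5, 1)): 0.4,
    frozenset((5, 2)): 0.5,
    frozenset((5, 3)): 0.3,
}

avg_node_4 = average_weight_to_cluster(weights, 4, {2, 3})
avg_node_5 = average_weight_to_cluster(weights, 5, {1, 2, 3})

# Under these assumed weights node 4 is the stronger candidate, which matches
# the excerpt's preference for the cluster {4, 2, 3} over {5, 1, 2, 3}.
better_candidate = 4 if avg_node_4 > avg_node_5 else 5
```

Storing each undirected edge under a `frozenset` key avoids having to record both (u, v) and (v, u); an adjacency-list representation would serve equally well on larger networks.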