Mining large networks with subgraph counting
Ilaria Bordino
Sapienza Università di Roma
Rome, Italy
bordino@dis.uniroma1.it
Debora Donato
Yahoo! Research
Barcelona, Spain
debora@yahoo-inc.com
Aristides Gionis
Yahoo! Research
Barcelona, Spain
gionis@yahoo-inc.com
Stefano Leonardi
Sapienza Università di Roma
Rome, Italy
Stefano.Leonardi@dis.uniroma1.it
Abstract
The problem of mining frequent patterns in networks has many applications, including analysis of complex networks, clustering of graphs, finding communities in social networks, and indexing of graphical and biological databases. Despite this wealth of applications, the current state of the art lacks algorithmic tools for counting the number of subgraphs contained in a large network.
In this paper we develop data-stream algorithms that approximate the number of all subgraphs of three and four vertices in directed and undirected networks. We use the frequency of occurrence of all subgraphs to prove their significance in order to characterize different kinds of networks: we achieve very good precision in clustering networks with similar structure. The significance of our method is supported by the fact that such high precision cannot be achieved when performing clustering based on simpler topological properties, such as degree, assortativity, and eigenvector distributions. We have also tested our techniques using swap randomization.
1 Introduction
Graphs are ubiquitous data representations that are used to model complex relations in a wide variety of applications, including biochemistry, neurobiology, ecology, social sciences, and information systems. One of the most basic tools for analysing graph structures and revealing the properties of the underlying data is finding frequent patterns in graphs. This task has numerous applications, like network characterization [24, 25], modeling complex networks [22], detecting anomalies [8], and indexing graph databases [34].
This work was partially supported by the EU within the 6th Framework Programme under contract 001907 “Dynamically Evolving, Large Scale Information Systems” (DELIS).
Although many interesting graph datasets are very large (a prominent example is the Web graph), much of the existing work has been restricted to the analysis of small networks [25]. Counting subgraphs on large datasets is a challenging computational task, due to the explosion of the number of candidate subgraphs involved. A natural approach to computing on huge data sets is to adopt the data-stream model [26]. The most basic subgraph-counting problem, counting the number of triangles in an undirected graph, has indeed been studied in this model [6, 10, 20]. However, to the best of our knowledge, the problem of designing efficient data-stream algorithms for counting other patterns in large-scale graphs has not been studied before.
Our contributions in this paper are summarized as follows:
  • We extend the techniques of Buriol et al. [10] for counting triangles in data streams, and we develop data-stream algorithms that approximate the number of all graph minors of three and four vertices in directed and undirected graphs.
  • We demonstrate the practical applicability of our algorithms by developing an optimized implementation and evaluating their performance on real networks of size up to one billion edges.
  • We perform extensive experiments that demonstrate the relevance and usefulness of our graph-minor counting algorithm for the task of recognizing families of networks.
  • We show that the precision obtained by clustering algorithms that use as features the distributions of the graph minors in the networks cannot be achieved with simpler topological features. We also assess the statistical significance of our method by swap randomization [15].
The rest of this paper is organized as follows. Section 2 describes previous work on mining frequent subgraphs and on data stream computation. In Section 3 we present our general algorithm for counting graph minors. In Section 4 we present the experimental results on the quality of the approximations and our network clustering algorithm. Finally, Section 5 is a short conclusion.
2 Related work
The problem of discovering frequent subgraphs has been studied extensively in the area of data mining [13, 18, 23, 33]. The algorithmic techniques for this problem are mostly based on the a-priori principle [5]. A key difference between our paper and the above line of research is that we focus on counting the occurrences of subgraphs in one single large graph.
Our algorithm for clustering networks is based on the distribution of minors and it is, to a large extent, inspired by the work of Milo et al. [24, 25], who search for those minors whose frequency is significantly higher than the frequency that one would expect in random networks with the same degree distribution. A similar idea of testing data mining results against random networks with a given degree distribution was also proposed in [15]. Kashtan et al. [21] have also proposed an algorithm that uses a randomly sampled set of subgraphs to estimate subgraph concentrations and to detect network motifs. However, this algorithm has been shown to have a bias for sampling certain subgraphs more often than others [32].
The problem of finding graphlets of three and four nodes in protein interaction networks was studied recently in [30]. Unlike our algorithms, the heuristics proposed in [30] do not have any performance guarantee on the processing time and storage requirements. Moreover, their algorithm is not scalable for large graphs.
The large body of work on data stream algorithms [17] contrasts with a lack of efficient solutions for many natural graph problems. Prior to this work, algorithms for counting triangles have been presented in [6, 20, 10, 11]. In this paper we extend the techniques of counting triangles to counting all minors of size 3 and 4 for directed and undirected graphs.
3 Algorithms for counting graph minors
Let $G = (V, E)$ be a graph, which can be either directed or undirected. Our basic model for the representation of $G$ is the “incidence” stream model, in which we assume that all edges incident to the same vertex appear consecutively in the stream. In the case of undirected graphs, every edge appears twice, once in the incidence list of each of its two endpoints. The ordering $v_1, \ldots, v_n$ of the vertices can be arbitrary. We denote by $d_i$ the degree of vertex $v_i$, that is, the number of vertices adjacent to $v_i$.
We present a general three-pass sampling algorithm for counting the number of occurrences of a specific minor $M$ of $c$ vertices in the graph $G$. We denote by $M = (X_M, Y_M)$ a prototype minor of $G$, which is a connected subgraph of $G$ having no multiple edges and no self-loops. Here $X_M = \{\bar{x}_1, \ldots, \bar{x}_c\}$ and $Y_M$ are the vertices and edges of $M$, respectively. The minors of size 3 and 4 for undirected and directed graphs are shown in Figures 1 and 2. The figures show all the possible minors, except for the case of directed minors of four nodes, where only four minors are shown out of 199 possible ones.
Figure 1. All minors of size 3 and 4 for undirected graphs
Figure 2. Minors of size 3 and 4 for directed graphs
We now describe in detail the algorithm for counting minors. First we fix a minor $M = (X_M, Y_M)$, whose number of occurrences we want to count in the graph $G$. The algorithm uses two basic concepts: (i) the concept of a prototype subgraph $S_M$ for $M$, and (ii) the concept of the sample space $\mathcal{S}$ for $M$ with respect to $S_M$.
The prototype subgraph $S_M = (X_M, Y'_M)$, $Y'_M \subseteq Y_M$, is simply a subgraph of $M$ defined on the same set $X_M$ of $c$ vertices as $M$. For example, if $M$ is the minor representing an undirected triangle, then $S_M$ can be a path of length 2. The sample space $\mathcal{S}$ is defined to be the set of all distinct subgraphs $S$ of $G$ that are isomorphic to $S_M$.
At a very intuitive level, $\mathcal{S}$ defines the set of “candidate places” in which $M$ can potentially appear in $G$. The algorithm samples such candidate places from $\mathcal{S}$, checks if $M$ actually appears in those places, and then uses the count of the occurrences of $M$ in the sample to estimate the number of occurrences of $M$ in $G$.
More formally, we define $\mathcal{S}$ to be the set of subgraphs $S = (X, Y)$ of $G$ for which there is a bijection $f : X \to X_M$, such that for each $x_1, x_2 \in X$ it is $(x_1, x_2) \in Y$ if and only if $(f(x_1), f(x_2)) \in Y'_M$. Given a subgraph $S = (X, Y)$ in $\mathcal{S}$, we define $\bar{Y}(X)$ to be the edges that are needed to extend $S$ to an occurrence of $M$, that is,
$$\bar{Y}(X) = \{(f^{-1}(\bar{x}_1), f^{-1}(\bar{x}_2)) : (\bar{x}_1, \bar{x}_2) \in Y_M \setminus Y'_M\}.$$
Finally, we denote by $|\mathcal{S}|$ the size of the sample space $\mathcal{S}$.
The general three-pass algorithm is the following:
SAMPLEMINOR
  1st Pass:
    Compute the size $|\mathcal{S}|$ of the sample space.
  2nd Pass:
    Uniformly choose a member $S = (X, Y)$ of the sample space.
  3rd Pass:
    Run the following test:
      if all edges in $\bar{Y}(X)$ are in the graph
        then $\beta = 1$
        else $\beta = 0$
  return $\beta$
The accuracy and the performance of the algorithms we propose rely crucially on the structure of the sample space and, consequently, on the choice of the prototype subgraph $S_M$. We observe that in order to provide a uniform sampling of all occurrences of $M$, we need to ensure that every single occurrence of $M$ in the graph can be detected by extending the same number of distinct subgraphs in the sample space. This number, which we denote by $n_M$, is defined to be the number of isomorphic mappings of the subgraph $S_M$ to the minor $M$. The specified requirement is clearly guaranteed by our method since we check if all the edges needed to extend a sampled subgraph to an occurrence of the minor $M$ exist in the graph.
We make a few remarks on the SAMPLEMINOR algorithm. The first pass requires designing a sample space whose size can be easily determined in a streaming pass over the graph. Thus, the choice of the prototype subgraph $S_M$ should be made in a way that ensures that the size of the sample space depends only on simple parameters, like the number of vertices, the number of edges, and the degree of every vertex. The second pass of the algorithm requires listing all samples of the sample space in linear order, and efficiently identifying the sample corresponding to some position in the order. This task is easy if, say, the sample space is formed by the set of all edges of the graph. In general, more complex enumeration schemes might be needed.
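As an illustration of the three steps, the following minimal Python sketch instantiates SAMPLEMINOR for undirected triangles, using a path of length 2 as the prototype subgraph. It is not the streaming implementation described in this paper: it assumes that the whole graph fits in memory as an adjacency dictionary, and the names `sample_minor_triangles`, `adj`, `num_samples` and `rng` are hypothetical. For this prototype $n_M = 3$, since each triangle contains three paths of length 2, and $|\mathcal{S}| = \sum_i \binom{d_i}{2}$. The fraction of successful samples is rescaled by $|\mathcal{S}| / n_M$, which is the estimator defined later in this section.

```python
import random


def sample_minor_triangles(adj, num_samples, rng):
    """Estimate the number of triangles of an undirected graph.

    adj maps every vertex to the set of its neighbours (assumed in-memory,
    unlike the streaming setting of the paper). The prototype subgraph is a
    path of length 2, so n_M = 3.
    """
    vertices = list(adj)
    # "Pass 1": |S| is the number of paths of length 2, sum of C(d_v, 2).
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in vertices]
    space_size = sum(weights)
    if space_size == 0:
        return 0.0

    hits = 0
    for _ in range(num_samples):
        # "Pass 2": a uniform path = centre drawn proportionally to
        # C(d_v, 2), then a uniform pair of distinct neighbours.
        centre = rng.choices(vertices, weights=weights, k=1)[0]
        a, b = rng.sample(sorted(adj[centre]), 2)
        # "Pass 3": beta = 1 iff the missing edge (a, b) is in the graph.
        if b in adj[a]:
            hits += 1

    n_m = 3  # each triangle extends three distinct paths of length 2
    return (hits / num_samples) * space_size / n_m


# Complete graph on 4 vertices: 12 paths of length 2, all of them close.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
estimate = sample_minor_triangles(k4, 100, random.Random(0))
```

On the complete graph on four vertices every sampled path closes into a triangle, so the sketch returns exactly 12 / 3 = 4, the true number of triangles. On sparser graphs the returned value is a random estimate whose quality depends on the number of samples.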
Now, let us denote by $T_M$ the number of occurrences of minor $M$ in the graph $G$. We recall that the objective of the algorithm is to provide a good estimation of $T_M$.
Lemma 3.1. The algorithm SAMPLEMINOR outputs a value $\beta$, which has expected value
$$E[\beta] = \frac{n_M \cdot T_M}{|\mathcal{S}|}.$$
Proof. The algorithm chooses a random element $S(X)$ of the sample space defined on a subset $X$ of $c$ vertices. Each of the $T_M$ minors can be obtained in $n_M$ different ways, i.e., starting from $n_M$ different samples. Since each choice of a $S(X)$ has the same probability, the probability of choosing a sample that can be extended to a minor is $\frac{n_M \cdot T_M}{|\mathcal{S}|}$.
A single run of the SAMPLEMINOR algorithm returns a binary value. To obtain an estimation of the expectation of the parameter $\beta$, we perform multiple runs of the SAMPLEMINOR algorithm. In particular, we start
$$s \geq \frac{3}{\epsilon^2} \cdot \frac{|\mathcal{S}|}{n_M \cdot T_M} \cdot \ln\left(\frac{2}{\delta}\right) \qquad (1)$$
parallel instances of SAMPLEMINOR and return the value
$$\widetilde{T}_M := \left(\frac{1}{s} \cdot \sum_{i=1}^{s} \beta_i\right) \cdot \frac{|\mathcal{S}|}{n_M}$$
as an estimate of $T_M$, the number of occurrences of the minor $M$ in the graph. For the quality of approximation provided by $\widetilde{T}_M$ we can show the following.
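Lemma 3.2. With probability $1 - \delta$ the following statement holds:
$$(1 - \epsilon) \cdot T_M < \widetilde{T}_M < (1 + \epsilon) \cdot T_M.$$
Proof. (Sketch) We use the Chernoff bounds
$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \geq (1 + \epsilon) E[\beta]\right] < e^{-\epsilon^2 \cdot E[\beta] \cdot s / 3}$$
and
$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \leq (1 - \epsilon) E[\beta]\right] < e^{-\epsilon^2 \cdot E[\beta] \cdot s / 2},$$
and the definition of $s$ from Equation (1) for the number of repetitions of the algorithm.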
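The step left implicit in the sketch can be spelled out as follows. This is a reconstruction under the usual assumptions of these Chernoff bounds (independent runs $\beta_1, \ldots, \beta_s$ and $0 < \epsilon \leq 1$), not text from the original proof. Substituting Lemma 3.1 into Equation (1) shows that each tail is at most $\delta / 2$, and a union bound gives the claim:

```latex
\begin{align*}
E[\beta] = \frac{n_M \cdot T_M}{|\mathcal{S}|}
  \quad &\Longrightarrow \quad
  s \geq \frac{3}{\epsilon^2 \cdot E[\beta]} \cdot \ln\left(\frac{2}{\delta}\right)
  \quad \text{(Equation (1))}, \\
e^{-\epsilon^2 \cdot E[\beta] \cdot s / 3}
  &\leq e^{-\ln(2/\delta)} = \frac{\delta}{2}, \\
e^{-\epsilon^2 \cdot E[\beta] \cdot s / 2}
  &\leq \left(\frac{\delta}{2}\right)^{3/2} \leq \frac{\delta}{2}
  \quad \text{(since } \delta / 2 \leq 1 \text{)}, \\
\Pr\left[\,\left|\frac{1}{s} \sum_{i=1}^{s} \beta_i - E[\beta]\right|
  \geq \epsilon \cdot E[\beta]\right]
  &< \frac{\delta}{2} + \frac{\delta}{2} = \delta .
\end{align*}
```

Multiplying the event inside the last probability by $|\mathcal{S}| / n_M$ turns $\frac{1}{s} \sum_i \beta_i$ into $\widetilde{T}_M$ and $E[\beta]$ into $T_M$, which is the statement of Lemma 3.2.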
We complete this section with the analysis of the running time of the algorithm. The time complexity of the first two passes of the algorithm cannot be stated in general, since they depend on the specific sampling strategy. However, if there is a constant-time method to access the $i$-th element of the sample space, the uniform selection of a sample can be implemented in constant time per edge of the graph by reservoir sampling [31], as we will discuss in the next section. For the third pass of the algorithm, if we implement the different instances of the SAMPLEMINOR algorithm independently of each other, we require $O\left(\frac{1}{\epsilon^2} \cdot \log\left(\frac{1}{\delta}\right) \cdot \frac{|\mathcal{S}|}{n_M T_M}\right)$ time to process each edge. However, as we will also describe in the next section, the cost of checking for $M$ in all instances of $S_M$ can be reduced to expected constant time per edge of the graph $G$ via hashing. We therefore conclude with the following theorem.
Theorem 1. There is a three-pass streaming algorithm to count the number of minors in incidence streams up to a multiplicative error of $1 \pm \epsilon$, with probability at least $1 - \delta$, which needs $O(s)$ memory cells and amortized expected update time $O\left(1 + s \cdot \frac{|V|}{|E|}\right)$, where
$$s \geq \frac{3}{\epsilon^2} \cdot \frac{|\mathcal{S}|}{n_M \cdot T_M} \cdot \ln\left(\frac{2}{\delta}\right).$$
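To give a feeling for the magnitude of $s$, the bound can be evaluated numerically. The helper below is only a sketch: the function name and the example values are hypothetical, and it assumes that $T_M$ (or a lower bound on it) is known, which in practice it is not, since $T_M$ is the quantity being estimated.

```python
import math


def required_samples(eps, delta, space_size, n_m, t_m):
    """Smallest integer s that satisfies the bound of Theorem 1.

    eps and delta are the error and failure parameters, space_size is |S|,
    n_m is n_M, and t_m is T_M (or an assumed lower bound on it).
    """
    bound = 3.0 / eps**2 * space_size / (n_m * t_m) * math.log(2.0 / delta)
    return math.ceil(bound)


# Hypothetical instance: 10^6 paths of length 2, 10^4 triangles (n_M = 3),
# 10% relative error with 95% confidence.
s = required_samples(eps=0.1, delta=0.05, space_size=10**6, n_m=3, t_m=10**4)
# s == 36889
```

The bound grows with $1 / \epsilon^2$ and with the ratio $|\mathcal{S}| / (n_M \cdot T_M)$, so a prototype subgraph with a small sample space relative to the number of occurrences of $M$ needs fewer samples.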
Table 1. The datasets used for experimental evaluation.
| Graph class | Type | # Instances | Max |V| (thousands) | Max |E| (thousands) |
| Synthetic [22, 7] | un/directed | 39 | 850 | 30 |
| Wikipedia [9] | un/directed | 7 | 350 | 5 000 |
| Webgraphs [3] | un/directed | 5 | 40 000 | 900 000 |
| Cellular [19] | directed | 43 | 3 | 7 |
| Citation [14] | directed | 3 | 4 000 | 16 000 |
| Food webs [1] | directed | 6 | 0.3 | 0.3 |
| Word adjacency [24] | directed | 4 | 30 | 30 |
| Author Collaboration [7, 28, 27] | undirected | 5 | 400 | 7 000 |
| Autonomous Systems (AS) [1] | undirected | 12 | 12 | 43 |
| Protein Interaction [2] | undirected | 3 | 4 | 16 |
| US Road [4] | undirected | 12 | 24 000 | 58 000 |
4 Experimental evaluation
We provide an optimized implementation of our algorithm for counting all the subgraphs of three and four nodes. Figure 3 shows the prototype subgraphs that we have used for sampling. In the implementation, we merge the first two passes of our algorithm by applying reservoir sampling [31] to choose elements at random for a sample set whose size is not known in advance. We also implement efficiently the third pass using a uniform hash function that requires linear space, as proposed in [29]. We evaluate our algorithms on an extensive collection of real and synthetic datasets, which we summarize in Table 1. All the datasets are made available.
Figure 3. Prototype sampling subgraphs
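Reservoir sampling [31] is what makes it possible to draw uniform samples without knowing the size of the sample space in advance. The sketch below shows the basic scheme on a generic stream of items. It is only an illustration with hypothetical names: it samples without replacement, whereas the independent SAMPLEMINOR instances of Section 3 would each keep their own one-element reservoir, and in the implementation the items are elements of the sample space rather than arbitrary values.

```python
import random


def reservoir_sample(stream, k, rng):
    """Return k items drawn uniformly at random from a stream of unknown length.

    Only the k retained items are stored, whatever the length of the stream.
    """
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item number index + 1 replaces a random slot with
            # probability k / (index + 1).
            j = rng.randint(0, index)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: keep 3 edges out of a stream of 100 edges.
edge_stream = ((u, u + 1) for u in range(100))
sample = reservoir_sample(edge_stream, 3, random.Random(42))
```

Because the decision for each item depends only on its position in the stream, the sample can be maintained in the same pass that computes the size of the sample space.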
4.1 Quality of approximation
We have performed extensive experiments to evaluate the accuracy of the counts produced by our methods. For the sake of concreteness and clarity of presentation, we present approximation results on counting the occurrences of one particular directed 3-node minor, which we call M11 and which is shown in Table 2. We have verified that the quality of approximation is similar for many other minors.
Table 2 reports the results on the quality of approximation of counting the occurrences of M11 in the 5 largest Webgraphs in our dataset. We run the algorithm three times, using 10K, 100K and 1M samples. For each run, Table 2 reports: (i) the estimated number $\widetilde{N}$ of occurrences of M11; (ii) the quality Qlt(%) of the result, expressed in terms of percentage deviation from the exact count: a positive value indicates an overestimation and a negative value indicates an underestimation; (iii) the running time in seconds.
Our algorithm obtains an approximation within roughly 5% of the exact count even when using only 10,000 samples (the largest deviation in Table 2 is 5.43%). For the two largest datasets, uk-2002 and uk-2005, which have roughly 200 and 940 million edges respectively, we do not indicate the quality of the approximation since we are not able to compute the exact value.
Alternative sampling strategies for undirected 4-node minors: For each undirected minor of 4 nodes we implement two of the three sampling strategies $S^{4u}_1$, $S^{4u}_2$, $S^{4u}_3$ presented in Figure 3. Strategy $S^{4u}_1$ is used for all minors. We then implement $S^{4u}_3$ as the alternative strategy for the minors that contain $S^{4u}_3$, while we use $S^{4u}_2$ for the rest of the minors.
None of the above sampling strategies can be considered a priori the best one for all cases. The reason is that, according to our analysis, the smaller the size of the sample space, the smaller the variance of the estimation and the better the strategy, since we need fewer samples to obtain a good approximation. We compare the accuracy of the alternative sampling strategies for 6 Wikipedia graphs and 6 AS graphs. The experimental results match the theoretical analysis: for all graphs the winning strategy is the one that uses the smallest sample space. We do not present the results due to lack of space.
4.2 Clustering networks
Inspired by previous work [25, 24], we use the distribution of subgraph counts for characterizing different types of networks. We use clustering to investigate whether features based on minors can be used to partition the input networks into meaningful families (e.g., web graphs, food-chain networks, protein interaction networks, etc.) and whether they outperform simpler features based on classical measures used in the literature to characterize complex networks [12].
Our approach is to represent graphs by vectors, where each coordinate corresponds to the number of occurrences of one particular minor. Let $G$ be a graph and let $t_n$ be the number of distinct minors of $n$ nodes. Let $|M_j|$ be the number of instances of the $j$-th minor, with $j = 1, \ldots, t_n$. We then represent $G$ by the vector $m(G) = \langle m_1, \ldots, m_{t_n} \rangle$, where the coordinate $m_j$ is the normalized number of occurrences of minor $M_j$. Then, given a set of graphs from different families, we cluster the graphs by clustering the vectors by which the graphs are represented.
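The construction of the feature vector can be sketched in a few lines. The text does not state which normalization is applied, so the sketch below assumes the simplest choice, dividing every count by the total number of counted occurrences. Both this choice and the function name are assumptions made for illustration only.

```python
def minor_feature_vector(counts):
    """Turn raw minor counts |M_1|, ..., |M_tn| into a normalized vector.

    Assumption: each coordinate is the fraction of all counted occurrences
    that belong to the corresponding minor. The paper only says that the
    counts are normalized, not how.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


# Hypothetical counts of the two undirected 3-node minors of a small graph:
# 6 open paths of length 2 and 2 triangles.
features = minor_feature_vector([6, 2])
# features == [0.75, 0.25]
```

Normalizing in this way makes graphs of very different sizes comparable, since each vector then describes the relative frequency of the minors rather than their absolute counts.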
For the clustering task we use Weka (http://www.cs.waikato.ac.nz/ml/weka). We use two clustering algorithms, the Expectation-maximization (EM) and the k-means algorithms [16], varying the number of clusters from 4 to 8. We compare how well the computed clusters match the original classes, adopting the classes-to-clusters evaluation method implemented in Weka. We label each instance with the type of network it belongs to. During the clustering, this label is ignored. In a second phase, the majority class in each cluster is determined.
Table 2. Results for minor M11 in the graphs extracted from 5 crawls of the Web domains .cnr, .eu, .in, .uk. The three column groups correspond to |S| = 10,000, |S| = 100,000 and |S| = 1,000,000.
| Graph | Nodes | Edges | Ñ (10K) | Qlt(%) (10K) | Time (10K) | Ñ (100K) | Qlt(%) (100K) | Time (100K) | Ñ (1M) | Qlt(%) (1M) | Time (1M) |
| cnr-2000 | 3.2·10^5 | 3.2·10^6 | 8 910 124 | -5.43 | 1.02 | 9 261 635 | -1.70 | 5.71 | 9 349 349 | -0.77 | 80.23 |
| eu-2005 | 8.6·10^5 | 1.9·10^7 | 137 742 793 | -4.50 | 4.85 | 140 576 158 | -2.54 | 9.82 | 142 990 124 | -0.87 | 34.86 |
| in-2004 | 1.38·10^6 | 1.69·10^7 | 820 098 732 | 4.84 | 4.00 | 797 908 977 | 2.01 | 9.70 | 780 542 774 | -0.21 | 69.92 |
| uk-2002 | 1.8·10^7 | 1.9·10^8 | 4.3·10^9 | - | 77.43 | 4.4·10^9 | - | 117.45 | 4.3·10^9 | - | 206.49 |
| uk-2005 | 3.9·10^7 | 9.36·10^8 | 3·10^10 | - | 302.64 | 2.9·10^10 | - | 354.23 | 2.9·10^10 | - | 649.93 |
Undirected graphs. The best result is achieved with the EM method for k = 7 clusters: more than 75% of the instances are correctly classified. Due to lack of space, results are deferred to an extended version of this paper.
Directed graphs. For the directed graphs we compare the following 5 groups of features:
1. Standard topological properties: We consider 16 different measures for every node, including indegree, outdegree, average indegree of successors, average outdegree of predecessors, assortativity, edge reciprocity, PageRank, and local number of triangles. In order to assign the considered microscopic measures to each graph, we compute the mean, the variance, the median, the 10th percentile and the 90th percentile of them, for a total number of 81 features.
2. Minors with 3 nodes, for a total number of 13 features.
3. Minors with 4 nodes containing a K12 clique, for a total number of 190 features.
4. All the minors of size 3 and 4, for a total number of 203 features.
5. All the properties listed above, for a total number of 81 + 203 = 284 features.
For all the listed cases, we run the k-means algorithm imposing k = 7 clusters. The matching matrices for the first 4 cases are shown in Table 3. In all the cases at least 5 out of 7 classes are correctly identified. We number from 0 to 6 the clusters that match the classes cellular, food-web, word, citation, wikipedia, Webgraph and synthetic, and we use the numbers 7 and above for the clusters matching no class.
In the first case, i.e., using standard topological properties, the k-means algorithm is able to correctly classify 74% of the instances. When using only minors of size 3, 77.78% of the instances are correctly classified. In the third case, i.e., using only minors of size 4, we correctly classify 84.26% of the instances. Using all the minors, 90.74% of the instances are correctly classified. It is worth observing that using all the 284 features together (classical + minors) does not improve the result achieved using only the minors-counting methodology.
The results show that minors outperform standard topological features and play a fundamental role for the characterization of complex networks.
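The reported percentages can be reproduced from the matching matrices of Table 3 under one plausible reading of the classes-to-clusters evaluation: every class is matched to a distinct cluster so that the number of agreeing instances is maximal. The sketch below applies this reading to matrix (4), treating the dashes of the table as zeros. The function name and the brute-force search over assignments are illustrative assumptions and are not taken from Weka.

```python
from itertools import permutations


def best_matching_accuracy(matrix):
    """Best one-to-one matching of classes (rows) to clusters (columns).

    Returns the number of instances on the matched cells and the resulting
    accuracy. Brute force, adequate for the 7 x 8 matrices of Table 3;
    requires at least as many clusters as classes.
    """
    n_classes = len(matrix)
    n_clusters = len(matrix[0])
    total = sum(sum(row) for row in matrix)
    best = max(
        sum(matrix[c][assignment[c]] for c in range(n_classes))
        for assignment in permutations(range(n_clusters), n_classes)
    )
    return best, best / total


# Matrix (4) of Table 3 (3- and 4-node minors), dashes replaced by 0.
matrix_4 = [
    [43, 0, 0, 0, 0, 0, 0, 0],   # cellular
    [0, 4, 0, 0, 0, 0, 0, 2],    # food-web
    [0, 0, 4, 0, 0, 0, 0, 0],    # word
    [0, 0, 0, 0, 3, 0, 0, 0],    # citation
    [0, 0, 0, 0, 7, 0, 0, 0],    # wikipedia
    [0, 1, 0, 0, 2, 1, 2, 0],    # Webgraph
    [0, 0, 0, 0, 0, 0, 39, 0],   # synthetic
]
correct, accuracy = best_matching_accuracy(matrix_4)
# correct == 98 out of 108 instances, i.e. 90.74%
```

Under this reading the citation and wikipedia classes compete for cluster 4, and only one of them can be counted as correct, which explains why 98 rather than 101 instances are considered correctly classified.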
4.3 Swap randomization
We assess the statistical significance of our method by comparing the number of minors found in a network with the number of minors observed in random networks with the same degree sequence, generated through a sequence of swaps between pairs of edges of the graph [15].
We generate 5 randomized networks for each of the 43 real directed cellular graphs. For each graph we perform 1000 · #{edges} swaps, a number much larger than what has been empirically considered sufficient to obtain a random graph [15]. The application of the EM algorithm with two clusters correctly separates all but one of the randomized networks from all the real ones. We then conclude that 3- and 4-node minors contain valuable information about the structure of the real networks.
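A swap between a pair of edges can be sketched as follows for a directed graph stored as a list of distinct edges. Two edges (a, b) and (c, d) are replaced by (a, d) and (c, b), which leaves the in-degree and out-degree of every vertex unchanged. The rule of rejecting swaps that would create a self-loop or a duplicate edge, as well as all names, are assumptions of this sketch and are not details taken from [15]. Note that `num_swaps` counts attempted swaps, including rejected ones.

```python
import random


def swap_randomize(edges, num_swaps, rng):
    """Randomize a directed graph while preserving all in- and out-degrees.

    edges is a non-empty list of distinct (source, target) pairs.
    """
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(num_swaps):
        i = rng.randrange(len(edge_list))
        j = rng.randrange(len(edge_list))
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Reject swaps that would create a self-loop or a duplicate edge.
        if i == j or a == d or c == b:
            continue
        if (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


# Hypothetical toy graph; the experiments above use 1000 * #{edges} swaps.
original = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
randomized = swap_randomize(original, 1000 * len(original), random.Random(7))
```

Since every accepted swap keeps the multiset of sources and the multiset of targets, the randomized graph has the same degree sequence as the original one, which is the property required by the significance test.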
5 Conclusions
We have proposed a suite of methods for approximating the number of 3- and 4-node minors for directed and undirected graphs. Our algorithms are based on random sampling. They can be used to estimate with high precision the number of minors in a graph using limited storage and constant processing time per item, and making three passes on the input graph.
We provide an optimized implementation of the algorithms and test them on networks extracted from more than 10 application domains. We then propose a network-clustering algorithm based on the frequency of occurrence of all minors, which achieves a precision far higher than that of clustering based on simpler topological properties. The quality of our data mining techniques has also been tested using swap randomization.
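Table 3. Matching matrices for the clustering using the k-means method with k = 7 clusters, using (1) classical topological properties; (2) only 3-node minors; (3) only 4-node minors; (4) 3- and 4-node minors. Rows are classes; columns are the clusters the instances are assigned to.
(1) Classical topological properties. Correct: 74%
| Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| cellular | 43 | 0 | 0 | - | 0 | - | 0 | 0 | 0 |
| food-web | 0 | 5 | 0 | - | 0 | - | 1 | 0 | 0 |
| word | 0 | 0 | 4 | - | 0 | - | 0 | 0 | 0 |
| citation | 0 | 0 | 0 | - | 3 | - | 0 | 0 | 0 |
| wikipedia | 0 | 0 | 0 | - | 7 | - | 0 | 0 | 0 |
| Webgraph | 0 | 0 | 0 | - | 6 | - | 0 | 0 | 0 |
| synthetic | 0 | 0 | 0 | - | 9 | - | 21 | 5 | 4 |
(2) Only 3-node minors. Correct: 77.78%
| Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| cellular | 31 | 0 | 0 | 0 | 0 | - | 0 | 12 |
| food-web | 0 | 6 | 0 | 0 | 0 | - | 0 | 0 |
| word | 0 | 0 | 4 | 0 | 0 | - | 0 | 0 |
| citation | 0 | 1 | 0 | 2 | 0 | - | 0 | 0 |
| wikipedia | 0 | 0 | 0 | 0 | 6 | - | 1 | 0 |
| Webgraph | 0 | 0 | 0 | 0 | 2 | - | 4 | 0 |
| synthetic | 0 | 0 | 0 | 4 | 0 | - | 35 | 0 |
(3) Only 4-node minors. Correct: 84.26%
| Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| cellular | 43 | 0 | - | - | 0 | 0 | 0 | 0 | 0 |
| food-web | 2 | 3 | - | - | 0 | 0 | 0 | 0 | 1 |
| word | 4 | 0 | - | - | 0 | 0 | 0 | 0 | 0 |
| citation | 0 | 0 | - | - | 3 | 0 | 0 | 0 | 0 |
| wikipedia | 0 | 0 | - | - | 7 | 0 | 0 | 0 | 0 |
| Webgraph | 0 | 0 | - | - | 0 | 2 | 3 | 0 | 1 |
| synthetic | 0 | 0 | - | - | 0 | 0 | 36 | 3 | 0 |
(4) 3- and 4-node minors. Correct: 90.74%
| Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| cellular | 43 | 0 | 0 | - | 0 | 0 | 0 | 0 |
| food-web | 0 | 4 | 0 | - | 0 | 0 | 0 | 2 |
| word | 0 | 0 | 4 | - | 0 | 0 | 0 | 0 |
| citation | 0 | 0 | 0 | - | 3 | 0 | 0 | 0 |
| wikipedia | 0 | 0 | 0 | - | 7 | 0 | 0 | 0 |
| Webgraph | 0 | 1 | 0 | - | 2 | 1 | 2 | 0 |
| synthetic | 0 | 0 | 0 | - | 0 | 0 | 39 | 0 |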
References
[1] COSIN Project. http://www.cosin.org.
[2] DIP database at the University of California. http://dip.doe-mbi.ucla.edu.
[3] Laboratory of Web Algorithmics, Università degli Studi di Milano. http://law.dsi.unimi.it.
[4] The 9th DIMACS Implementation Challenge on Shortest Paths. http://www.dis.uniroma1.it/~challenge9.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, 1994.
[6] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA, 2002.
[7] A. L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, October 1999.
[8] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web Spam. In AIRWeb, 2006.
[9] L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. Millozzi. Temporal evolution of the wikigraph. In Web Intelligence, 2006.
[10] L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler. Counting triangles in data streams. In PODS, 2006.
[11] L. Buriol, G. Frahling, S. Leonardi, and C. Sohler. Estimating clustering indexes in data streams. In ESA, 2007.
[12] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. Villas Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1), 2007.
[13] M. Deshpande, M. Kuramochi, and N. Wale. Frequent substructure-based approaches for classifying chemical compounds. TKDE, 17(8), 2005.
[14] J. Gehrke, P. Ginsparg, and J. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD Explorations Newsletter, 2003.
[15] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. In KDD, 2006.
[16] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
[17] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. External memory algorithms, 1999.
[18] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM, 2003.
[19] H. Jeong, B. Tombor, R. Albert, Z. Oltvai, and A. L. Barabási. The large-scale organization of metabolic networks. Nature, 407, 2000.
[20] H. Jowhari and M. Ghodsi. New streaming algorithms for counting triangles in graphs. In COCOON, 2005.
[21] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 20(11):1746–1758, July 2004.
[22] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In FOCS, 2000.
[23] M. Kuramochi and G. Karypis. An efficient algorithm for discovering frequent subgraphs. TKDE, 16(9), 2004.
[24] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, and U. Alon. Superfamilies of evolved and designed networks. Science, 303, 2004.
[25] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298, 2002.
[26] S. Muthukrishnan. Data streams: algorithms and applications. FTTCS, 1(2), 2005.
[27] M. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 2006.
[28] M. E. J. Newman. The structure of scientific collaboration networks. Proc Natl Acad Sci USA, 98(2), 2001.
[29] A. Östlin and R. Pagh. Uniform hashing in constant time and linear space. In STOC, 2003.
[30] N. Pržulj, D. G. Corneil, and I. Jurisica. Efficient estimation of graphlet frequency distributions in protein–protein interaction networks. Bioinformatics, 22(8), 2006.
[31] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 1985.
[32] S. Wernicke. Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 3(4), 2006.
[33] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In ICDM, 2002.
[34] X. Yan, P. Yu, and J. Han. Graph indexing: A frequent structure based approach. In SIGMOD, 2004.
... • Bioinformatics: The frequency or distribution of the occurrence of each different testing templates may characterize a protein-protein interaction network [3] [4], where repeated subgraphs are crucial in understanding cell physiology as well as developing new drugs. [5] • Computing kernel of other algorithms: Sub-tree counting is one of the computing kernels of bounded treewidth subgraph (such as circles, cactus graphs, series-parallel graphs etc.) counting problem [6] and also the kernel of network clustering [7]. Despite subgraph counting plays an important role in discovery of patterns in a graph network, counting the exact number of subgraphs of size k in a n-vertex network takes O(n k ) time [4], which is computationally challenging even for moderate values of n and k. ...
... al [16] as a global comparative measure based on the local structural characteristics of different networks. Bordino et al. [7] demonstrates that one can use the relative frequency of subgraphs within networks to distinguish and cluster different networks. ...
Preprint
Full-text available
Subgraph counting aims to count occurrences of a template T in a given network G(V, E). It is a powerful graph analysis tool and has found real-world applications in diverse domains. Scaling subgraph counting problems is known to be memory bounded and computationally challenging with exponential complexity. Although scalable parallel algorithms are known for several graph problems such as Triangle Counting and PageRank, this is not common for counting complex subgraphs. Here we address this challenge and study connected acyclic graphs or trees. We propose a novel vectorized subgraph counting algorithm, named Subgraph2Vec, as well as both shared memory and distributed implementations: 1) reducing algorithmic complexity by minimizing neighbor traversal; 2) achieving a highly-vectorized implementation upon linear algebra kernels to significantly improve performance and hardware utilization. 3) Subgraph2Vec improves the overall performance over the state-of-the-art work by orders of magnitude and up to 660x on a single node. 4) Subgraph2Vec in distributed mode can scale up the template size to 20 and maintain good strong scalability. 5) enabling portability to both CPU and GPU.
... Costs are thus comparatively cheap considering methods that identify specific connectivity patterns by counting occurrences of particular sub-graphs (e.g. [44,[48][49][50][51][52]); such motif-counts also scale at least linearly in network size, but they show exponentially growing costs as the size of the motif-pattern increases [50]. In practice this often means that counts can not be determined for patterns involving 10 nodes or more [53], which renders some domains computationally intractable for this approach, but eventually not for BtA. ...
Preprint
Full-text available
Complex networks have been characterised by their specific connectivity patterns (network motifs), but their building blocks can also be identified and described by node-motifs---a combination of local network features. One technique to identify single node-motifs has been presented by Costa et al. (L. D. F. Costa, F. A. Rodrigues, C. C. Hilgetag, and M. Kaiser, Europhys. Lett., 87, 1, 2009). Here, we first suggest improvements to the method including how its parameters can be determined automatically. Such automatic routines make high-throughput studies of many networks feasible. Second, the new routines are validated in different network-series. Third, we provide an example of how the method can be used to analyse network time-series. In conclusion, we provide a robust method for systematically discovering and classifying characteristic nodes of a network. In contrast to classical motif analysis, our approach can identify individual components (here: nodes) that are specific to a network. Such special nodes, as hubs before, might be found to play critical roles in real-world networks.
... Another approach utilizes graph kernels, which are kernel functions that compute inner products on graphs measuring their similarity (Ralaivola et al., 2005; Vishwanathan et al., 2010; Kashima et al., 2004; Kang et al., 2012; Hammond et al., 2013). Motif counting is an alternative frequently used technique (Milo et al., 2002; Pržulj, 2007; Ikehara & Clauset, 2017; Faust, 2006; Bordino et al., 2008; Janssen et al., 2012), which refers to the investigation of the distribution of small "graphlets" (e.g., all graphs of size 3 or 4). NetEmd, a network comparison method introduced by Wegner et al. (2018), is also based on the distribution of motifs. ...
Article
Full-text available
Data-driven analysis of complex networks has been in the focus of research for decades. An important area of research is to study how well real networks can be described with a small selection of metrics, and how well network models can capture the relations between graph metrics observed in real networks. In this paper, we apply machine-learning techniques to investigate these problems. We study 500 real-world networks along with 2000 synthetic networks generated by four frequently used network models with previously calibrated parameters to make the generated graphs as similar to the real networks as possible. This paper unifies several branches of data-driven complex network analysis, such as the study of graph metrics and their pair-wise relationships, network similarity estimation, model calibration, and graph classification. We find that the correlation profiles of the structural measures significantly differ across network domains and the domain can be efficiently determined using a small selection of graph metrics. The structural properties of the network models with fixed parameters are robust enough to perform parameter calibration. The goodness-of-fit of the network models highly depends on the network domain. By solving classification problems, we find that the models lack the capability of generating a graph with a high clustering coefficient and relatively large diameter simultaneously. On the other hand, models are able to capture exactly the degree-distribution-related metrics.
... We have implemented a globally optimal strategy that extracts and maintains only the best paths between every pair of nodes (Huang et al., 2009; Casas et al., 2011; Casas et al., 2013; Boukhayma and Boyer, 2015; Boukhayma and Boyer, 2017). This strategy corresponds to extracting the essential sub-graph from the complete digraph induced from the input sequences (Bordino et al., 2008). This method ensures the existence of at least one transition between any two nodes in the graph, which potentially yields a better use of the original data with fewer dead ends. ...
Article
Full-text available
This paper introduces Deep4D, a compact generative representation of shape and appearance from captured 4D volumetric video sequences of people. 4D volumetric video achieves highly realistic reproduction, replay and free-viewpoint rendering of actor performance from multiple view video acquisition systems. A deep generative network is trained on 4D video sequences of an actor performing multiple motions to learn a generative model of the dynamic shape and appearance. We demonstrate that the proposed generative model can provide a compact encoded representation capable of high-quality synthesis of 4D volumetric video with two orders of magnitude compression. A variational encoder-decoder network is employed to learn an encoded latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal motion, including skeletal motion capture data. This encoded latent space supports the representation of multiple sequences with dynamic interpolation to transition between motions. Therefore we introduce Deep4D motion graphs, a direct application of the proposed generative representation. Deep4D motion graphs allow real-time interactive character animation whilst preserving the plausible realism of movement and appearance from the captured volumetric video. Deep4D motion graphs implicitly combine multiple captured motions from a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail.
Article
This article proves strong lower bounds for distributed computing in the congest model, by presenting the bit-gadget: a new technique for constructing graphs with small cuts. The contribution of bit-gadgets is twofold. First, developing careful sparse graph constructions with small cuts extends known techniques to show a near-linear lower bound for computing the diameter, a result previously known only for dense graphs. Moreover, the sparseness of the construction plays a crucial role in applying it to approximations of various distance computation problems, drastically improving over what can be obtained when using dense graphs. Second, small cuts are essential for proving super-linear lower bounds, none of which were known prior to this work. In fact, they allow us to show near-quadratic lower bounds for several problems, such as exact minimum vertex cover or maximum independent set, as well as for coloring a graph with its chromatic number. Such strong lower bounds are not limited to NP-hard problems, as given by two simple graph problems in P, which are shown to require a quadratic and near-quadratic number of rounds. All of the above are optimal up to logarithmic factors. In addition, in this context, the complexity of the all-pairs-shortest-paths problem is discussed. Finally, it is shown that graph constructions for congest lower bounds translate to lower bounds for the semi-streaming model, despite being very different in its nature.
Article
We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analytical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.
Article
Full-text available
“Perhaps he could dance first and think afterwards, if it isn’t too much to ask him.” S. Beckett, Waiting for Godot. Given a labeled graph, the collection of k-vertex induced connected subgraph patterns that appear in the graph more frequently than a user-specified minimum threshold provides a compact summary of the characteristics of the graph, and finds applications ranging from biology to network science. However, finding these patterns is challenging, even more so for dynamic graphs that evolve over time, due to the streaming nature of the input and the exponential time complexity of the problem. We study this task in both incremental and fully-dynamic streaming settings, where arbitrary edges can be added or removed from the graph. We present TipTap, a suite of algorithms to compute high-quality approximations of the frequent k-vertex subgraphs w.r.t. a given threshold, at any time (i.e., point of the stream), with high probability. In contrast to existing state-of-the-art solutions that require iterating over the entire set of subgraphs in the vicinity of the updated edge, TipTap operates by efficiently maintaining a uniform sample of connected k-vertex subgraphs, thanks to an optimized neighborhood-exploration procedure. We provide a theoretical analysis of the proposed algorithms in terms of their unbiasedness and of the sample size needed to obtain a desired approximation quality. Our analysis relies on sample-complexity bounds that use Vapnik–Chervonenkis dimension, a key concept from statistical learning theory, which allows us to derive a sufficient sample size that is independent of the size of the graph. The results of our empirical evaluation demonstrate that TipTap returns high-quality results more efficiently and accurately than existing baselines.
Chapter
Mining frequent subgraphs from graph databases is a basic task with broad applications. Frequent subgraph mining is defined as finding all subgraphs that appear more often than a specified threshold value. It consists mainly of two steps: candidate generation and frequency calculation. In the candidate generation step, most existing work starts with a frequent edge or vertex to generate frequent candidate patterns. This process is not scalable because the number of candidate patterns generated is exponential. In this paper, an optimized algorithm is presented to generate candidate patterns for mining frequent subgraphs from a large single graph. The proposed algorithm starts with frequent subgraphs and extends candidates with them. It uses graph invariant properties and symmetries present in a graph to generate candidate subgraphs, thus reducing the enormous number of candidate subgraphs generated. Subgraphs are extended by adding another frequent subgraph determined by the symmetry mapping of the subgraph, thereby reducing the complexity involved in candidate generation and frequency counting. An evaluation study on datasets explores the strengths and limitations of the proposed work. The results confirm that this is an optimized approach to generating candidate subgraphs directly using invariant properties.
Article
Full-text available
We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
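The one-pass sampling setting described in this abstract can be illustrated with a minimal sketch of basic reservoir sampling. This is the simple per-record variant (commonly called Algorithm R), not Algorithm Z: the skip-based optimizations that give Algorithm Z its O(n(1 + log(N/n))) expected time are not reproduced here. The function name `reservoir_sample` and its parameters are illustrative assumptions, not taken from the paper.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], n: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to n records from a stream.

    The stream length N does not need to be known in advance, and only
    the n-slot reservoir is kept in memory.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, record in enumerate(stream, start=1):
        if seen <= n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Keep the new record with probability n / seen by drawing a
            # slot index uniformly from [0, seen) and replacing only if
            # the index falls inside the reservoir.
            slot = rng.randrange(seen)
            if slot < n:
                reservoir[slot] = record
    return reservoir
```

Calling `reservoir_sample(range(1000), 10)` returns 10 distinct records. This variant draws one random number per record, which is the cost that Algorithm Z is designed to avoid.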
Article
Full-text available
Each complex network (or class of networks) presents specific topological features which characterize its connectivity and highly influence the dynamics of processes executed on the network. The analysis, discrimination, and synthesis of complex networks therefore rely on the use of measurements capable of expressing the most relevant topological features. This article presents a survey of such measurements. It includes general considerations about complex network characterization, a brief review of the principal models, and the presentation of the main existing measurements. Important related issues covered in this work comprise the representation of the evolution of complex networks in terms of trajectories in several measurement spaces, the analysis of the correlations between some of the most traditional measurements, perturbation analysis, as well as the use of multivariate statistics for feature selection and network classification. Depending on the network and the analysis task one has in mind, a specific set of features may be chosen. It is hoped that the present survey will help the proper application and interpretation of measurements.
Article
Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.
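The two mechanisms named in this abstract, growth and preferential attachment, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' exact model: the seed graph (a clique on m + 1 vertices), the function name `preferential_attachment`, and the parameter `m` (edges added per new vertex) are all hypothetical choices made for the example.

```python
import random
from typing import List, Optional, Tuple


def preferential_attachment(
    num_vertices: int, m: int, seed: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Grow an undirected graph one vertex at a time.

    Each new vertex attaches to m distinct existing vertices, chosen with
    probability proportional to their current degree.
    """
    if m < 1 or num_vertices < m + 1:
        raise ValueError("need m >= 1 and num_vertices >= m + 1")
    rng = random.Random(seed)

    # Assumed seed graph: a clique on vertices 0..m.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]

    # Every vertex appears in this list once per incident edge, so a
    # uniform draw from it is a degree-proportional draw over vertices.
    endpoints = [x for edge in edges for x in edge]

    for new in range(m + 1, num_vertices):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges
```

Counting how often each vertex appears in the returned edge list gives the degree sequence, which can then be inspected for the heavy tail the abstract describes.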
Article
Our previous work (1) presented a phenomenological observation on real-world networks: They show distinct subgraph significance profiles (SP) when compared with randomized networks with the same degree sequence as the real networks. This observation calls for a theory: a model that