Mining large networks with subgraph counting∗

Ilaria Bordino (Sapienza Università di Roma, Rome, Italy), bordino@dis.uniroma1.it
Debora Donato (Yahoo! Research, Barcelona, Spain), debora@yahoo-inc.com
Aristides Gionis (Yahoo! Research, Barcelona, Spain), gionis@yahoo-inc.com
Stefano Leonardi (Sapienza Università di Roma, Rome, Italy), Stefano.Leonardi@dis.uniroma1.it
Abstract
The problem of mining frequent patterns in networks has
many applications, including analysis of complex networks,
clustering of graphs, finding communities in social net-
works, and indexing of graphical and biological databases.
Despite this wealth of applications, the current state of the
art lacks algorithmic tools for counting the number of sub-
graphs contained in a large network.
In this paper we develop data-stream algorithms that
approximate the number of all subgraphs of three and
four vertices in directed and undirected networks. We
use the frequency of occurrence of all subgraphs to characterize
different kinds of networks, achieving very good precision in
clustering networks with similar structure. The significance of our
method is supported by the fact that such high precision
cannot be achieved when performing clustering based on
simpler topological properties, such as degree, assortativ-
ity, and eigenvector distributions. We have also tested our
techniques using swap randomization.
1 Introduction
Graphs are ubiquitous data representations that are used
to model complex relations in a wide variety of applica-
tions, including biochemistry, neurobiology, ecology, social
sciences, and information systems. One of the most basic
tools for analysing graph structures and revealing the prop-
erties of the underlying data is finding frequent patterns in
graphs. This task has numerous applications, like network
characterization [24, 25], modeling complex networks [22],
detecting anomalies [8], and indexing graph databases [34].
∗This work was partially supported by the EU within the 6th Frame-
work Programme under contract 001907 “Dynamically Evolving, Large
Scale Information Systems” (DELIS)
Despite the fact that many interesting graph datasets are of
very large scale (a prominent example is the Web graph),
much of the existing work has been restricted to the analysis
of small networks [25]. Counting subgraphs on large
datasets is a challenging computational task, due to the ex-
plosion of the number of candidate subgraphs involved. A
natural approach for performing computations on huge
data sets is to adopt the data-stream model [26].
The most basic subgraph-counting problem, counting the
number of triangles in an undirected graph, has indeed
been studied in this model [6, 10, 20].
However, to the best of our knowledge, the problem of de-
signing efficient data stream algorithms for counting other
patterns in large-scale graphs has not been studied before.
Our contributions in this paper are summarized as fol-
lows:
- We extend the techniques of Buriol et al. [10] for counting
triangles in data streams, and we develop data-stream
algorithms that approximate the number of all graph minors
of three and four vertices in directed and undirected graphs.
- We demonstrate the practical applicability of our algorithms
by developing an optimized implementation and
evaluating their performance on real networks of size up to
one billion edges.
- We perform extensive experiments that demonstrate the
relevance and usefulness of our graph-minor counting algorithm
for the task of recognizing families of networks.
- We show that the precision obtained by clustering algorithms
that use as features the distributions of the graph minors
in the networks cannot be achieved with simpler topological
features. We also assess the statistical significance
of our method by swap randomization [15].
The rest of this paper is organized as follows. Section 2
describes previous work on mining frequent subgraphs and
on data stream computation. In Section 3 we present our
general algorithm for counting graph minors. In Section 4
we present the experimental results on the quality of the ap-
proximations and our network clustering algorithm. Finally,
Section 5 concludes the paper.
2 Related work
The problem of discovering frequent subgraphs has been
studied extensively in the area of data mining [13, 18, 23,
33]. The algorithmic techniques for this problem are mostly
based on the a-priori principle [5]. A key difference be-
tween our paper and the above line of research is that we
focus on counting the occurrences of subgraphs in one sin-
gle large graph.
Our algorithm for clustering networks is based on the
distribution of minors and it is, to a large extent, inspired by
the work of Milo et al. [24, 25], who search for those minors
whose frequency is significantly higher than the frequency
that one would expect in random networks with the same
degree distribution. A similar idea of testing data mining
results against random networks with a given degree dis-
tribution was also proposed in [15]. Kashtan et al. [21] have
also proposed an algorithm that uses a randomly sampled
set of subgraphs to estimate subgraph concentrations and
to detect network motifs. However, this algorithm has been
shown to have a bias for sampling certain subgraphs more
often than others [32].
The problem of finding graphlets of three and four nodes
in protein interaction networks was studied recently in [30].
Unlike our algorithms, the heuristics proposed
in [30] do not offer any performance guarantees on processing
time and storage requirements. Moreover, their algorithm
does not scale to large graphs.
The large body of work on data stream algorithms [17]
contrasts with a lack of efficient solutions for many natu-
ral graph problems. Previous to this work, algorithms for
counting triangles have been presented in [6, 20, 10, 11]. In
this paper we extend the techniques of counting triangles to
counting all minors of size 3 and 4 for directed and undi-
rected graphs.
3 Algorithms for counting graph minors
Let $G = (V, E)$ be a graph, which can be either directed
or undirected. Our basic model for the representation of $G$
is the "incidence" stream model, in which we assume that
all edges incident to the same vertex appear consecutively
in the stream. In the case of undirected graphs, every edge
appears twice, once in the incidence list of each of its endpoints.
The ordering $v_1, \ldots, v_n$ of the vertices can be arbitrary. We
denote by $d_i$ the degree of vertex $v_i$, that is, the number of edges incident to it.
We present a general three-pass sampling algorithm for
counting the number of occurrences of a specific minor $M$
of $c$ vertices in the graph $G$.
Figure 1. All minors of size 3 and 4 for undirected graphs
Figure 2. Minors of size 3 and 4 for directed graphs
We denote by $M = (X_M, Y_M)$ a prototype minor of $G$, which is a
connected subgraph of $G$ having no multiple edges and no self-loops.
Here $X_M = \{\bar{x}_1, \ldots, \bar{x}_c\}$ and $Y_M$ are the vertices and
edges of $M$, respectively. The minors of size 3 and 4
for undirected and directed graphs are shown in Figures 1
and 2. In the figures all the possible minors are shown, except
for the case of directed minors of four nodes, where
only four minors are shown out of the 199 possible ones. We
now describe in detail the algorithm for counting minors.
First we fix a minor $M = (X_M, Y_M)$, whose number of occurrences
we want to count in the graph $G$. The algorithm
uses two basic concepts: (i) the prototype subgraph $S_M$ for $M$,
and (ii) the sample space $S$ for $M$ with respect to $S_M$.
The prototype subgraph $S_M = (X_M, Y'_M)$, with $Y'_M \subseteq Y_M$,
is simply a subgraph of $M$ defined on the same set $X_M$ of $c$
vertices as $M$. For example, if $M$ is the minor representing
an undirected triangle, then $S_M$ can be a path of length 2.
The sample space $S$ is defined to be the set of all distinct
subgraphs $S$ of $G$ that are isomorphic to $S_M$.
At a very intuitive level, $S$ defines the set of "candidate
places" in which $M$ can potentially appear in $G$. The algorithm
samples such candidate places from $S$, checks whether $M$
actually appears in those places, and then uses the count of
the occurrences of $M$ in the sample to estimate the number
of occurrences of $M$ in $G$.
More formally, we define $S$ to be the set of subgraphs
$S = (X, Y)$ of $G$ for which there is a bijection $f : X \to X_M$
such that for each $x_1, x_2 \in X$ we have $(x_1, x_2) \in Y$
if and only if $(f(x_1), f(x_2)) \in Y'_M$. Given a subgraph
$S = (X, Y)$ in $S$, we define $\bar{Y}(X)$ to be the set of edges
needed to extend $S$ to an occurrence of $M$, that is,
$\bar{Y}(X) = \{(f^{-1}(\bar{x}_1), f^{-1}(\bar{x}_2)) : (\bar{x}_1, \bar{x}_2) \in Y_M \setminus Y'_M\}$.
Finally, we denote by $|S|$ the size of the sample space $S$.
The general three-pass algorithm is the following:

SAMPLEMINOR
  1st pass: compute the size $|S|$ of the sample space.
  2nd pass: uniformly choose a member $S = (X, Y)$ of the sample space.
  3rd pass: run the following test:
      if all edges in $\bar{Y}(X)$ are in the graph
        then $\beta = 1$
        else $\beta = 0$
      return $\beta$
The accuracy and the performance of the algorithms we propose
rely crucially on the structure of the sample space and,
consequently, on the choice of the prototype subgraph $S_M$.
We observe that in order to provide a uniform sampling of
all occurrences of $M$, we need to ensure that every single
occurrence of $M$ in the graph can be detected by extending
the same number of distinct subgraphs in the sample space.
This number, which we denote by $n_M$, is defined to be the
number of isomorphic mappings of the subgraph $S_M$ to the
minor $M$. The specified requirement is clearly guaranteed
by our method, since we check whether all the edges needed to
extend a sampled subgraph to an occurrence of the minor $M$
exist in the graph.
We make a few remarks on the SAMPLEMINOR algorithm.
The first pass requires designing a sample space
whose size can be easily determined in a streaming pass
over the graph. Thus, the choice of the prototype subgraph
$S_M$ should be made in a way that ensures that the size of
the sample space depends only on simple parameters, such as
the number of vertices, the number of edges, and the degree
of every vertex. The second pass of the algorithm requires
listing all samples of the sample space in a linear order, and
efficiently identifying the sample corresponding to some position
in the order. This task is easy if, say, the sample space
is formed by the set of all edges of the graph. In general,
more complex enumeration schemes might be needed.
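To make the three passes concrete, the following is a minimal sketch (not the optimized implementation discussed in Section 4) of the algorithm for the undirected triangle, using a path of length 2 as the prototype subgraph; the toy incidence-stream generator, the adjacency-list input, and all function names are our own illustrative assumptions.

```python
import random
from collections import defaultdict
from math import comb

def incidence_stream(adj):
    """Toy incidence stream: edges incident to a vertex appear consecutively,
    so every undirected edge is emitted once per endpoint."""
    for v in adj:
        for u in adj[v]:
            yield (v, u)

def sample_minor_triangle(adj, rng):
    """One run of a SAMPLEMINOR-style estimator for the undirected triangle,
    with a length-2 path as the prototype subgraph S_M."""
    # 1st pass: |S| = sum over vertices of C(d_v, 2) length-2 paths.
    deg = defaultdict(int)
    for v, _ in incidence_stream(adj):
        deg[v] += 1
    space = sum(comb(d, 2) for d in deg.values())
    if space == 0:
        return 0.0

    # 2nd pass: pick the `target`-th length-2 path while streaming.
    target = rng.randrange(space)
    path = None                            # the sampled path (a, v, b)
    count, prev_v, prev_nbrs = 0, None, []
    for v, u in incidence_stream(adj):
        if v != prev_v:
            prev_v, prev_nbrs = v, []      # incidence list of a new vertex starts
        if path is None and count + len(prev_nbrs) > target:
            path = (prev_nbrs[target - count], v, u)
        count += len(prev_nbrs)            # each new neighbour closes len(prev_nbrs) paths
        prev_nbrs.append(u)
    a, v, b = path

    # 3rd pass: beta = 1 iff the closing edge (a, b) appears in the stream.
    beta = int(any({x, y} == {a, b} for x, y in incidence_stream(adj)))

    # n_M = 3 for the triangle: each triangle contains three length-2 paths.
    return beta * space / 3.0

# Averaging many runs approximates the triangle count T_M (Lemma 3.1 below).
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2]}    # a single triangle
rng = random.Random(0)
est = sum(sample_minor_triangle(adj, rng) for _ in range(200)) / 200
print(round(est, 2))                        # close to 1.0
```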
Now, let us denote by $T_M$ the number of occurrences of
minor $M$ in the graph $G$. We recall that the objective of the
algorithm is to provide a good estimate of $T_M$.

Lemma 3.1. The algorithm SAMPLEMINOR outputs a value $\beta$ with expected value
$$E[\beta] = \frac{n_M \cdot T_M}{|S|}.$$

Proof. The algorithm chooses a random element $S(X)$ of
the sample space, defined on a subset $X$ of $c$ vertices. Each
of the $T_M$ minors can be obtained in $n_M$ different ways, i.e.,
starting from $n_M$ different samples. Since each choice of
$S(X)$ has the same probability, the probability of choosing
a sample that can be extended to a minor is $\frac{n_M \cdot T_M}{|S|}$.
A single run of the SAMPLEMINOR algorithm returns
a binary value. To obtain an estimate of the expectation
of the parameter $\beta$, we perform multiple runs of the
SAMPLEMINOR algorithm. In particular, we start
$$s \geq \frac{3}{\epsilon^2} \cdot \frac{|S|}{n_M \cdot T_M} \cdot \ln\left(\frac{2}{\delta}\right) \qquad (1)$$
parallel instances of SAMPLEMINOR and return the value
$$\widetilde{T}_M := \left(\frac{1}{s} \cdot \sum_{i=1}^{s} \beta_i\right) \cdot \frac{|S|}{n_M}$$
as an estimate of $T_M$, the number of occurrences of the minor
$M$ in the graph. For the quality of approximation provided
by $\widetilde{T}_M$ we can show the following.
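As a quick numeric illustration of Equation (1) and of the estimator, the following hedged sketch computes the number of parallel instances and combines their binary outcomes; the function names are ours, and since $T_M$ is not known in advance we plug in an assumed lower bound for it when sizing the experiment.

```python
from math import ceil, log

def required_samples(eps, delta, space, n_M, T_M_lower):
    """Number of parallel SAMPLEMINOR instances suggested by Equation (1).
    T_M is unknown in advance, so a lower bound is used in its place."""
    return ceil(3.0 / eps**2 * space / (n_M * T_M_lower) * log(2.0 / delta))

def estimate(betas, space, n_M):
    """The estimator: (mean of the binary outcomes) * |S| / n_M."""
    return sum(betas) / len(betas) * space / n_M

# Example: triangles (n_M = 3) in a graph with |S| = 10**6 length-2 paths,
# assuming at least 10**4 triangles, eps = 0.1 and delta = 0.05.
print(required_samples(0.1, 0.05, 10**6, 3, 10**4))   # about 36 889 instances
print(estimate([1, 0, 0, 1], 10**6, 3))               # 166 666.67 (toy beta values)
```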
Lemma 3.2. With probability $1 - \delta$ the following statement holds:
$$(1 - \epsilon) \cdot T_M < \widetilde{T}_M < (1 + \epsilon) \cdot T_M.$$

Proof. (Sketch) We use the Chernoff bounds
$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \geq (1 + \epsilon)\, E[\beta]\right] < e^{-\epsilon^2 \cdot E[\beta] \cdot s / 3}$$
and
$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \leq (1 - \epsilon)\, E[\beta]\right] < e^{-\epsilon^2 \cdot E[\beta] \cdot s / 2},$$
together with the definition of $s$ from Equation (1) for the number of
repetitions of the algorithm.
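To spell out the step (our own addition, using $E[\beta]$ from Lemma 3.1 and the choice of $s$ in Equation (1)):
$$\epsilon^2 \cdot E[\beta] \cdot \frac{s}{3} \;\geq\; \epsilon^2 \cdot \frac{n_M T_M}{|S|} \cdot \frac{1}{3} \cdot \frac{3}{\epsilon^2} \cdot \frac{|S|}{n_M T_M} \cdot \ln\frac{2}{\delta} \;=\; \ln\frac{2}{\delta},$$
so the upper tail is at most $e^{-\ln(2/\delta)} = \delta/2$; the lower tail, whose exponent is divided by 2 instead of 3, is even smaller, and a union bound over the two tails gives total failure probability at most $\delta$.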
We complete this section with the analysis of the running
time of the algorithm. The time complexity of the first two
passes of the algorithm cannot be stated in general, since
it depends on the specific sampling strategy. However, if
there is a constant-time method to access the $i$-th element
of the sample space, the uniform selection of a sample can
be implemented in constant time per edge of the graph by
reservoir sampling [31], as we will discuss in the next section.
For the third pass of the algorithm, if we implement the
different instances of the SAMPLEMINOR algorithm independently
of each other, we require $O\left(\frac{1}{\epsilon^2} \cdot \log\left(\frac{1}{\delta}\right) \cdot \frac{|S|}{n_M T_M}\right)$
time to process each edge. However, as we will also describe
in the next section, the cost of checking for $M$ in all
instances of $S_M$ can be reduced to expected constant time
per edge of the graph $G$ via hashing. We therefore conclude
with the following theorem.

Theorem 1. There is a three-pass streaming algorithm to
count the number of minors in incidence streams up to a
multiplicative error of $1 \pm \epsilon$, with probability at least $1 - \delta$,
which needs $O(s)$ memory cells and amortized expected
update time $O\left(1 + s \cdot \frac{|V|}{|E|}\right)$, where
$$s \geq \frac{3}{\epsilon^2} \cdot \frac{|S|}{n_M \cdot T_M} \cdot \ln\left(\frac{2}{\delta}\right).$$
Table 1. The datasets used for experimental evaluation.
Graph class                      Type         # Instances   Max |V| (thousands)   Max |E| (thousands)
Synthetic [22, 7]                un/directed  39            850                   30
Wikipedia [9]                    un/directed  7             350                   5 000
Webgraphs [3]                    un/directed  5             40 000                900 000
Cellular [19]                    directed     43            3                     7
Citation [14]                    directed     3             4 000                 16 000
Food webs [1]                    directed     6             0.3                   0.3
Word adjacency [24]              directed     4             30                    30
Author collaboration [7, 28, 27] undirected   5             400                   7 000
Autonomous Systems (AS) [1]      undirected   12            12                    43
Protein interaction [2]          undirected   3             4                     16
US Road [4]                      undirected   12            24 000                58 000
4 Experimental evaluation
We provide an optimized implementation of our algo-
rithm for counting all the subgraphs of three and four nodes.
Figure 3 shows the prototype subgraphs that we have used
for sampling.

Figure 3. Prototype sampling subgraphs

In the implementation, we merge the first two
passes of our algorithm by applying reservoir sampling [31]
to choose elements at random from a sample space whose size
is not known in advance. We also implement the
third pass efficiently using a uniform hash function that requires linear
space, as proposed in [29]. We evaluate our algorithms on
an extensive collection of real and synthetic datasets, which
we summarize in Table 1. All the datasets are made available.
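As an illustration of this merged first pass, here is a minimal sketch of classic reservoir sampling (Vitter's Algorithm R [31]); it is not our optimized implementation, and the stream of candidate subgraphs is abstracted as an arbitrary Python iterable.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep k uniform samples from a stream of unknown length;
    each stream item is handled in O(1) time."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # uniform index in {0, ..., i}
            if j < k:
                reservoir[j] = item         # replace a current sample
    return reservoir

# In our setting the "items" would be the candidate prototype subgraphs
# generated while scanning the incidence stream, so the size of the
# sample space never needs to be known in advance.
print(reservoir_sample(range(1000), 5))
```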
4.1 Quality of approximation
We have performed extensive experiments to evaluate
the accuracy of the counts produced by our methods. For
the sake of concreteness and clarity of presentation, we
present approximation results on counting the occurrences
of one particular directed 3-node minor, which we call M11
and which is shown in Table 2. We have verified that the quality
of approximation is similar for many other minors.
Table 2 reports the results on the quality of approximation
of counting the occurrences of M11 in the 5 largest
Webgraphs in our dataset. We run the algorithm three times,
using 10K, 100K and 1M samples. For each run, Table 2
reports: (i) the estimated number Ñ of occurrences of M11;
(ii) the quality Qlt(%) of the result, expressed in terms of
percentage deviation from the exact count: a positive value
indicates an overestimation and a negative value indicates
an underestimation; (iii) the running time in seconds.
Our algorithm obtains an approximation within 5% of the exact count
even when using only 10,000 samples. For the two largest
datasets, uk-2002 and uk-2005, which have 200 and 940
million edges respectively, we do not indicate the quality
of the approximation since we are not able to compute the
exact value.
Alternative sampling strategies for undirected 4-node
minors: For each undirected minor of 4 nodes we implement
two of the three sampling strategies $S^{4u}_1$, $S^{4u}_2$, $S^{4u}_3$
presented in Figure 3. Strategy $S^{4u}_1$ is used for all minors.
We then implement $S^{4u}_3$ as an alternative strategy for the minors
that contain $S^{4u}_3$, while we use $S^{4u}_2$ for the remaining minors.
None of the above sampling strategies can be considered a priori
the best one for all cases. The reason is that,
according to our analysis, the smaller the size of the sample space,
the smaller the variance of the estimate and the better the strategy,
since fewer samples are needed to obtain a
good approximation. We compare the accuracy of the alternative
sampling strategies for 6 Wikipedia graphs and 6 AS
graphs. The experimental results match the theoretical analysis:
for all graphs the winning strategy is the one that uses
the smallest sample space. We do not present the results due
to lack of space.
4.2 Clustering networks
Inspired by previous work [25, 24], we use the distribu-
tion of subgraph counts for characterizing different types of
networks. We use clustering to investigate whether features
based on minors can be used to partition the input networks
into meaningful families (e.g., web graphs, food-chain networks,
protein interaction networks) and whether they outperform
simpler features based on classical measures used
in the literature to characterize complex networks [12].
Our approach is to represent graphs by vectors, where
each coordinate corresponds to the number of occurrences
of one particular minor. Let $G$ be a graph and let $t_n$ be the
number of distinct minors of $n$ nodes. Let $|M_j|$ be the number
of instances of the $j$-th minor, with $j = 1, \ldots, t_n$. We
then represent $G$ by the vector $m(G) = \langle m_1, \ldots, m_{t_n} \rangle$,
where the coordinate $m_j$ is the normalized number of occurrences
of minor $M_j$. Then, given a set of graphs from
different families, we cluster the graphs by clustering the
vectors that represent them.
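A minimal sketch of this representation (the normalization to relative frequencies and the toy count values are our own assumptions; in practice the counts would come from the streaming estimates of Section 3):

```python
def minor_feature_vector(minor_counts):
    """Represent a graph by its minor-count distribution,
    normalized here to relative frequencies."""
    total = float(sum(minor_counts))
    return [c / total for c in minor_counts] if total > 0 else list(minor_counts)

# Toy example: estimated counts of the 13 directed 3-node minors for one graph.
counts = [120, 4, 0, 33, 7, 0, 0, 2, 81, 5, 9, 0, 1]
print([round(x, 3) for x in minor_feature_vector(counts)])
```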
Table 2. Results for minor M11 in the graphs extracted from 5 crawls of the Web domains .cnr, .eu, .in, .uk

                                      |S| = 10,000                   |S| = 100,000                  |S| = 1,000,000
Graph     Nodes      Edges        Ñ            Qlt(%)   Time     Ñ            Qlt(%)   Time     Ñ            Qlt(%)   Time
cnr-2000  3.2·10^5   3.2·10^6     8 910 124    -5.43    1.02     9 261 635    -1.70    5.71     9 349 349    -0.77    80.23
eu-2005   8.6·10^5   1.9·10^7     137 742 793  -4.50    4.85     140 576 158  -2.54    9.82     142 990 124  -0.87    34.86
in-2004   1.38·10^6  1.69·10^7    820 098 732   4.84    4.00     797 908 977   2.01    9.70     780 542 774  -0.21    69.92
uk-2002   1.8·10^7   1.9·10^8     4.3·10^9     -        77.43    4.4·10^9     -        117.45   4.3·10^9     -        206.49
uk-2005   3.9·10^7   9.36·10^8    3·10^10      -        302.64   2.9·10^10    -        354.23   2.9·10^10    -        649.93

For the clustering task we use Weka (http://www.cs.waikato.ac.nz/ml/weka).
We use two clustering algorithms, Expectation-Maximization (EM) and
k-means [16], varying the number of clusters
from 4 to 8. We compare how well the computed clusters
match the original classes, adopting the classes-to-clusters
evaluation method implemented in Weka. We label each
instance with the type of network it belongs to. During the
clustering, this label is ignored. In a second phase, the majority
class in each cluster is determined.
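A simplified re-implementation of this classes-to-clusters evaluation (not Weka's code; the function and variable names are ours) looks as follows:

```python
from collections import Counter

def classes_to_clusters_accuracy(true_labels, cluster_ids):
    """Map each cluster to its majority class, then count how many
    instances fall in a cluster whose majority class is their own."""
    majority = {}
    for c in set(cluster_ids):
        members = [t for t, k in zip(true_labels, cluster_ids) if k == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    correct = sum(1 for t, k in zip(true_labels, cluster_ids) if majority[k] == t)
    return correct / len(true_labels)

# Toy example: 6 graphs, 2 clusters.
print(classes_to_clusters_accuracy(
    ["web", "web", "cellular", "cellular", "cellular", "web"],
    [0, 0, 1, 1, 1, 0]))          # 1.0
```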
Undirected graphs. The best result is achieved with the
EM method for k = 7 clusters: more than 75% of the in-
stances are correctly classified. Due to lack of space, results
are deferred to an extended version of this paper.
Directed graphs. For the directed graphs we compare the
following 5 groups of features:
1. Standard topological properties: We consider 16 different
   measures for every node, including indegree,
   outdegree, average indegree of successors, average
   outdegree of predecessors, assortativity, edge reciprocity,
   PageRank, and local number of triangles. In
   order to assign the considered microscopic measures
   to each graph, we compute their mean, variance, median,
   10th percentile and 90th percentile,
   for a total number of 81 features.
2. Minors with 3 nodes, for a total number of 13 features.
3. Minors with 4 nodes containing a K12 clique, for a
   total number of 190 features.
4. All the minors of size 3 and 4, for a total number of 203
   features.
5. All the properties listed above, for a total number of
   81 + 203 = 284 features.
For all the listed cases, we run the k-means algorithm imposing
k = 7 clusters. The matching matrices for the first
4 cases are shown in Table 3. In all the cases at least 5 out
of 7 classes are correctly identified. We number from 0 to 6
the clusters that match the classes cellular, food-web, word,
citation, wikipedia, Webgraph and synthetic, and we use
numbers of 7 and above for the clusters matching no class.
In the first case, i.e., using standard topological properties,
the k-means algorithm is able to correctly classify
74% of the instances. When using only minors of size 3,
77.78% of the instances are correctly classified. In the third
case, i.e., using only minors of size 4, we correctly classify
84.26% of the instances. Using all the minors, 90.74% of
the instances are correctly classified. It is worth observing
that using all the 284 features together (classical + minors)
does not improve the result achieved using only the minor-counting
features.
The results show that minors outperform standard topo-
logical features and play a fundamental role in the characterization
of complex networks.
4.3 Swap randomization
We assess the statistical significance of our method by
comparing the number of minors found in a network with
the number of minors observed in random networks with
the same degree sequence, generated through a sequence of
swaps between pairs of edges of the graph [15].
We generate 5 randomized networks for each of the 43
real directed cellular graphs. For each graph we perform
$1000 \cdot \#\{\text{edges}\}$ swaps, a number much larger than what has
been empirically considered sufficient to obtain a random
graph [15]. The application of the EM algorithm with two
clusters correctly separates all but one of the randomized
networks from all the real ones. We then conclude that 3-
and 4-node minors contain valuable information about the
structure of the real networks.
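For reference, a minimal sketch of the degree-preserving edge-swap randomization we rely on [15]; the routine below assumes a simple directed graph given as a list of edges, and the rejection test for self-loops and duplicate edges is our own simplification.

```python
import random

def swap_randomize(edges, swaps_per_edge=1000, rng=random.Random(0)):
    """Randomize a simple directed graph while preserving every vertex's
    in- and out-degree: repeatedly pick two edges (a, b), (c, d) and rewire
    them to (a, d), (c, b) when no self-loop or duplicate edge is created."""
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue                       # swap would break the simple-graph property
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

# Each randomized network keeps every vertex's in- and out-degree unchanged.
print(swap_randomize([(1, 2), (2, 3), (3, 1), (1, 3)], swaps_per_edge=10))
```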
5 Conclusions
We have proposed a suite of methods for approximating
the number of 3- and 4-node minors for directed and undi-
rected graphs. Our algorithms are based on random sam-
pling. They can be used to estimate with high precision the
number of minors in a graph using limited storage and constant
per-item processing time, while making three passes over
the input graph.
We provide an optimized implementation of the algo-
rithms and test them on networks extracted from more
than 10 application domains. We then propose a network-
clustering algorithm based on the frequency of occurrence
of all minors, which achieves a precision far higher than that of
clustering based on simpler topological properties.
The quality of our data mining techniques has also
been tested using swap randomization.
Table 3. Matching matrices for clustering with the k-means method (k = 7) using (1) classical topological properties; (2) only 3-node minors; (3) only 4-node minors; (4) 3- and 4-node minors.

(1) Classical topological properties (74% correct)
assigned to   0   1   2   3   4   5   6   7   8
cellular     43   0   0   -   0   -   0   0   0
food-web      0   5   0   -   0   -   1   0   0
word          0   0   4   -   0   -   0   0   0
citation      0   0   0   -   3   -   0   0   0
wikipedia     0   0   0   -   7   -   0   0   0
Webgraph      0   0   0   -   6   -   0   0   0
synthetic     0   0   0   -   9   -  21   5   4

(2) Only 3-node minors (77.78% correct)
assigned to   0   1   2   3   4   5   6   7
cellular     31   0   0   0   0   -   0  12
food-web      0   6   0   0   0   -   0   0
word          0   0   4   0   0   -   0   0
citation      0   1   0   2   0   -   0   0
wikipedia     0   0   0   0   6   -   1   0
Webgraph      0   0   0   0   2   -   4   0
synthetic     0   0   0   4   0   -  35   0

(3) Only 4-node minors (84.26% correct)
assigned to   0   1   2   3   4   5   6   7   8
cellular     43   0   -   -   0   0   0   0   0
food-web      2   3   -   -   0   0   0   0   1
word          4   0   -   -   0   0   0   0   0
citation      0   0   -   -   3   0   0   0   0
wikipedia     0   0   -   -   7   0   0   0   0
Webgraph      0   0   -   -   0   2   3   0   1
synthetic     0   0   -   -   0   0  36   3   0

(4) 3- and 4-node minors (90.74% correct)
assigned to   0   1   2   3   4   5   6   7
cellular     43   0   0   -   0   0   0   0
food-web      0   4   0   -   0   0   0   2
word          0   0   4   -   0   0   0   0
citation      0   0   0   -   3   0   0   0
wikipedia     0   0   0   -   7   0   0   0
Webgraph      0   1   0   -   2   1   2   0
synthetic     0   0   0   -   0   0  39   0
References
[1] COSIN Project. http://www.cosin.org.
[2] DIP database at the University of California. http://dip.doe-
mbi.ucla.edu.
[3] Laboratory of Web Algorithmics, Università degli Studi di
Milano. http://law.dsi.unimi.it.
[4] The 9th DIMACS Implementation Challenge on Shortest
Paths. http://www.dis.uniroma1.it/~challenge9.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining asso-
ciation rules. In VLDB, 1994.
[6] Z. Bar-Yosseff, R. Kumar, and D. Sivakumar. Reductions in
streaming algorithms, with an application to counting trian-
gles in graphs. In SODA, 2002.
[7] A. L. Barabási and R. Albert. Emergence of scaling in ran-
dom networks. Science, 286(5439):509–512, October 1999.
[8] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and
R. Baeza-Yates. Link-based characterization and detection
of Web Spam. In AIRWeb, 2006.
[9] L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. Mil-
lozzi. Temporal evolution of the wikigraph. In Web Intelli-
gence, 2006.
[10] L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-
Spaccamela, and C. Sohler. Counting triangles in data
streams. In PODS, 2006.
[11] L. Buriol, G. Frahling, S. Leonardi, and C. Sohler. Estimat-
ing clustering indexes in data streams. In ESA, 2007.
[12] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. Villas Boas.
Characterization of complex networks: A survey of measurements.
Advances in Physics, 56(1), 2007.
[13] M. Deshpande, M. Kuramochi, and N. Wale. Frequent
substructure-based approaches for classifying chemical com-
pounds. TKDE, 17(8), 2005.
[14] J. Gehrke, P. Ginsparg, and J. Kleinberg. Overview of the
2003 KDD Cup. SIGKDD Explorations Newsletter, 2003.
[15] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas. As-
sessing data mining results via swap randomization. In KDD,
2006.
[16] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements
of Statistical Learning. Springer, 2001.
[17] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Com-
puting on data streams. External memory algorithms, 1999.
[18] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent
subgraphs in the presence of isomorphism. In ICDM, 2003.
[19] H. Jeong, B. Tombor, R. Albert, Z. Oltvai, and A. Barabasi.
The large-scale organization of metabolic networks. Nature,
407, 2000.
[20] H. Jowhari and M. Ghodsi. New streaming algorithms for
counting triangles in graphs. In COCOON, 2005.
[21] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon. Efficient
sampling algorithm for estimating subgraph concentrations
and detecting network motifs. Bioinformatics, 20(11):1746–
1758, July 2004.
[22] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar,
A. Tomkins, and E. Upfal. Stochastic models for the web
graph. In FOCS, 2000.
[23] M. Kuramochi and G. Karypis. An efficient algorithm for
discovering frequent subgraphs. TKDE, 16(9), 2004.
[24] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr,
I. Ayzenshtat, M. Sheffer, and U. Alon. Superfamilies of
evolved and designed networks. Science, 303, 2004.
[25] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan,
D. Chklovskii, and U. Alon. Network motifs: simple
building blocks of complex networks. Science, 298, 2002.
[26] S. Muthukrishnan. Data streams: algorithms and applica-
tions. FTTCS, 1(2), 2005.
[27] M. Newman. Finding community structure in networks using
the eigenvectors of matrices. Physical Review E, 2006.
[28] M. E. J. Newman. The structure of scientific collaboration
networks. Proc Natl Acad Sci USA, 98(2), 2001.
[29] A. Ostlin and R. Pagh. Uniform hashing in constant time and
linear space. In STOC, 2003.
[30] N. Pržulj, D. G. Corneil, and I. Jurisica. Efficient estimation
of graphlet frequency distributions in protein–protein inter-
action networks. Bioinformatics, 22(8), 2006.
[31] J. S. Vitter. Random sampling with a reservoir. ACM Transactions
on Mathematical Software, 11(1), 1985.
[32] S. Wernicke. Efficient detection of network motifs.
IEEE/ACM Trans. Comput. Biol. Bioinformatics, 3(4), 2006.
[33] X. Yan and J. Han. gSpan: Graph-based substructure pattern
mining. In ICDM, 2002.
[34] X. Yan, P. Yu, and J. Han. Graph indexing: A frequent struc-
ture based approach. In SIGMOD, 2004.