Advanced Review
Triangle counting in large
networks: a review
Mohammad Al Hasan* and Vachik S. Dave
Counting and enumeration of local topological structures, such as triangles, is an
important task for analyzing large real-life networks. For instance, triangle count
in a network is used to compute transitivity—an important property for under-
standing graph evolution over time. Triangles are also used for various other
tasks completed for real-life networks, including community discovery, link pre-
diction, and spam filtering. The task of triangle counting, though simple, has
gained wide attention in recent years from the data mining community. This is
due to the fact that most of the existing algorithms for counting triangles do not
scale well to very large networks with millions (or even billions) of vertices. To
circumvent this limitation, researchers proposed triangle counting methods that
approximate the count or run on distributed clusters. In this paper, we discuss
the existing methods of triangle counting, ranging from sequential to parallel,
single-machine to distributed, exact to approximate, and off-line to streaming.
We also present experimental results of performance comparison among a set of
approximate triangle counting methods built under a unified implementation
framework. Finally, we conclude with a discussion of future work in this direction. © 2017 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2018, 8:e1226. doi: 10.1002/widm.1226
INTRODUCTION
Network data appear in many domains, including social, communication, and information sciences. Although the networks in these domains differ in their structural composition, some topological structures, triangles in particular, appear in abundance across networks in all of these domains. The abundance of triangles in real-life networks motivated scientists to devise metrics such as the clustering coefficient [1] and the transitivity ratio [2] to characterize and analyze networks. The existence of triangles in social networks has also been studied and explained through social science theories such as homophily [3] and transitivity. A key computational task for all these studies is to count the number of triangles in a network, which is the focus of this work.
There are many real-life applications of triangle counting. The best known, of course, is computing the transitivity ratio (or simply transitivity) of a network, which is defined as the ratio between the number of closed triples (three times the number of triangles) and the number of all triples (paths of length two) in the network. Given that the number of triples can be computed directly from the degrees of the vertices, computing transitivity reduces to the task of triangle counting. The clustering coefficient is a similar metric, but it is defined for a single vertex of a network: for a vertex u, its clustering coefficient is the fraction of pairs of u's neighbors that are themselves neighbors. Both clustering coefficient and transitivity have been used as key metrics in network analysis and network evolution models [4].
Triangle count has also been used for several other nonobvious applications. Becchetti et al. [5] have used the distribution of local triangles for detecting web spam. Specifically, they have shown that the distribution of local triangle frequency of spam hosts is significantly different from that of nonspam hosts. The distribution of triangles is also used to uncover
*Correspondence to: alhasan@iupui.edu
Department of Computer Science, IUPUI, Indianapolis, IN, USA
Conflict of interest: The authors have declared no conflicts of inter-
est for this article.
Volume 8, March/April 2018 © 2017 Wiley Periodicals, Inc. 1 of 19
hidden thematic structure in the World Wide Web. Eckmann and Moses have shown that connected regions of the web graph that are dense in triangles represent a common topic [6]. Bar-Yossef et al. [7] have used triangle counts for query plan optimization in databases. Overlapping triangles (or, more generally, k-cliques) have been used for community discovery [8].
Triangle counting, though it appears to be an algorithmically simple task, has attracted many contributions over the years from scientists in diverse domains, including data mining and graph theory. While earlier works [9,10] mainly focus on asymptotic computational complexity, recent works treat real-life execution time as a major consideration, motivated by the enormous size of real-life networks, whose vertex counts range from millions to billions. To achieve efficiency, approximate triangle counting through sampling has been a very active direction in many recent works [11–16]. Researchers have also pursued efficiency through algorithms that run in multi-core or distributed environments [13,14,17]. Some variants of triangle counting algorithms have been inspired by data access constraints: algorithms have been proposed for data access scenarios that differ from traditional random memory access, such as restricted access [12] and streaming data access [16,18].
The computational complexity of a triangle counting algorithm is a good indicator of its efficiency, but in practice the execution times of two algorithms can differ widely even if they have the same computational complexity. The main reason is the hidden constant of the complexity bound, which depends on various properties of the input graph. Sparsity is one such property. Large real-life networks are very sparse: the number of edges is typically a constant multiple of the number of vertices; in other words, the average degree of a vertex is constant. Another important property is that the degree distribution of real-life networks is skewed. Although the average degree of a network is constant, a few vertices typically have a very large degree. This phenomenon is commonly known as a power-law degree distribution [19], and it significantly affects the performance of a triangle counting algorithm.
In this paper, we provide a thorough review of triangle counting algorithms. We group the existing methods based on their computation model or data access patterns. Then, we discuss the algorithms by comparing and contrasting their time complexity. Finally, we show some experimental results that compare the performance of some of these algorithms. The following section provides definitions of various concepts related to the task of counting triangles. For the reader's convenience, Table 1 lists the notation used throughout the paper.
BACKGROUND
G(V,E) is a graph where V is the set of vertices and E is the set of edges. We use n and m for representing the number of vertices (|V|) and the number of edges (|E|). Each vertex in the graph can be uniquely identified by a number between 1 and n. The assignment of identifiers can be arbitrary, but it is fixed. We also consider that G is simple, connected, and undirected. Because G is simple, between a pair of vertices u and v there exists at most one edge, which we denote by (u,v) where u < v. For a vertex u, we use d(u) to denote the degree of u, adj(u) to denote the set of u's neighboring vertices, and inc(u) to denote the edges that are incident to u. Likewise, for an edge e, we use inc(e) to denote the vertices incident to the edge e. It is easy to see that $\sum_{u \in V} d(u) = 2m$. The maximum degree over the vertices is denoted by $d_{\max}$.
Triples and Triangles
A (connected) triple (u,v,w) at a vertex v is a path of length two for which v is the center vertex. If the other two vertices (u and w) are also connected by an edge, the triple is called a closed triple (triangle); otherwise it is called an open triple. A triangle actually contains three closed triples, one centered on each of its vertices.
We use the symbol $\Pi_v$ to represent the set of triples that are centered at the vertex v. The set of triples in a graph G = (V,E) is $\Pi$, which is the union of the sets of triples at each of its nodes, i.e., $\Pi = \bigcup_{v \in V} \Pi_v$. If the degree of each of the vertices is known, the total number of triples can be computed efficiently as below:
$$|\Pi| = \sum_{v \in V} |\Pi_v| = \sum_{v \in V} \binom{d(v)}{2}. \qquad (1)$$
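As a minimal sketch of Eq. (1), the triple count can be obtained from the degree sequence alone. The dictionary-of-sets graph representation, the function name `count_triples`, and the toy graph are assumptions made for illustration only.

```python
from math import comb

def count_triples(adj):
    """Total number of triples: sum of C(d(v), 2) over all vertices, as in Eq. (1)."""
    return sum(comb(len(nbrs), 2) for nbrs in adj.values())

# Hypothetical toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
total_triples = count_triples(toy)
```

No edge existence test is needed here, which is why the denominator of transitivity is cheap to compute compared with the triangle count itself.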
Based on whether the triple is open or closed (in terms of its induced embedding in the graph G), we partition the set $\Pi$ into $\Pi^{\angle}$ (open triples) and $\Pi^{\Delta}$ (closed triples). Note that each of the nodes of a triangle in a graph G contributes one distinct triple to the set $\Pi^{\Delta}$. We use $\Lambda$ to represent the set of distinct triangles in a graph G. Clearly, the size of $\Pi^{\Delta}$ is three times the size of $\Lambda$, as the former contains three
Advanced Review wires.wiley.com/dmkd
copies of a distinct triangle, each centered at one of the triangle vertices. Mathematically, $|\Lambda| = \frac{1}{3}|\Pi^{\Delta}|$.
To represent the sets of open and closed triples centered at a vertex v, we use $\Pi_v^{\angle}$ and $\Pi_v^{\Delta}$, respectively. If t(G) is the number of triangles in the graph G, then
$$t(G) = |\Lambda| = \frac{1}{3}|\Pi^{\Delta}| = \frac{1}{3}\sum_{v \in V} |\Pi_v^{\Delta}|. \qquad (2)$$
Counting, Enumeration, and Sampling of
Triangles
For a given graph G, triangle counting is the task of obtaining the number t(G) as defined in Eq. (2). Triangle enumeration, on the other hand, is the task of listing the members of Λ, i.e., all unique triangles in a given graph. Enumeration is a costlier task than counting because the former solves the latter immediately, but the latter does not necessarily solve the former. Nevertheless, many real-life applications need the triangles themselves rather than just their total count, so both counting and enumeration tasks stand on their own merit. Finally, sampling of triangles is the task of obtaining a subset of Λ, where the size of the subset is typically a user-defined parameter. Depending on the sampling algorithm, the triangles in the sample can be chosen uniformly (each triangle is sampled with equal probability) or with a biased probability. Sometimes we are only interested in the number of triangles that are incident to a given vertex. This task is known as local triangle counting. The local triangle count is needed to compute the clustering coefficient of a given vertex.
(Local) Clustering Coefficient
Clustering coefficient is a metric denoting the clustering tendency of the vertices in a graph. When the metric is defined on a vertex of a graph, it is called the local clustering coefficient. For a given vertex u, its local clustering coefficient C(u) is the fraction of pairs of u's neighbors that are themselves neighbors. Mathematically,
$$C(u) = \frac{|\{(v,w) : (v,w) \in E \wedge v,w \in adj(u)\}|}{|adj(u)|\,(|adj(u)|-1)/2}. \qquad (3)$$
The average of the local clustering coefficient over the vertices is called the clustering coefficient of the network.
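The definition in Eq. (3) can be sketched directly in code. The dictionary-of-sets representation and the function names are illustrative assumptions, and returning 0 for vertices of degree below 2 is a convention chosen here because Eq. (3) is undefined in that case.

```python
from itertools import combinations

def local_clustering(adj, u):
    """Fraction of pairs of u's neighbors that are joined by an edge, as in Eq. (3)."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs exist
    linked = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return linked / (k * (k - 1) / 2)

def average_clustering(adj):
    """Clustering coefficient of the network: mean of the local values."""
    return sum(local_clustering(adj, u) for u in adj) / len(adj)

# Hypothetical toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

In this toy graph vertex 3 has three neighbor pairs of which only (1, 2) is connected, so its local value is one third, whereas vertices 1 and 2 each have a single, connected neighbor pair.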
Transitivity
Newman et al. [20] defined the transitivity of a graph G as the number of closed triples divided by the number of all triples over the entire network. We use γ(G) to denote the transitivity of G:
$$\gamma(G) = \frac{|\Pi^{\Delta}|}{|\Pi|} = \frac{|\Pi^{\Delta}|}{|\Pi^{\angle}| + |\Pi^{\Delta}|}. \qquad (4)$$
Using Eqs. (2) and (4), the triangle count t(G) of a network can be obtained from the transitivity of the network as below:
$$t(G) = \frac{1}{3}\gamma(G)\,|\Pi|. \qquad (5)$$
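The relation between Eqs. (4) and (5) can be checked on a small example. The sketch below assumes a dictionary-of-sets graph; the helper names `closed_triples`, `all_triples`, and `transitivity` are hypothetical.

```python
from itertools import combinations
from math import comb

def closed_triples(adj):
    """Size of the closed-triple set: one closed triple per (center, neighbor pair) with an edge."""
    return sum(1 for v in adj for u, w in combinations(adj[v], 2) if w in adj[u])

def all_triples(adj):
    """Size of the full triple set, computed from degrees as in Eq. (1)."""
    return sum(comb(len(nbrs), 2) for nbrs in adj.values())

def transitivity(adj):
    """gamma(G) of Eq. (4); defined here as 0 when the graph has no triples."""
    total = all_triples(adj)
    return closed_triples(adj) / total if total else 0.0

# Hypothetical toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
gamma = transitivity(toy)
triangles = gamma * all_triples(toy) / 3  # Eq. (5)
```

The toy graph has five triples, three of which are closed, so the single triangle is recovered from the transitivity through Eq. (5).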
Metropolis–Hastings (MH) Algorithm
Several approximate triangle counting methods sample triangles or triples using random walk-based indirect sampling strategies, also known as Markov Chain Monte Carlo (MCMC) sampling. The Metropolis–Hastings (MH) algorithm is a variant of MCMC; its goal is to draw samples from some distribution π(x), called the target distribution, where π(x) = f(x)/K. Here, f(x) is any function that assigns a nonnegative real value to a population object x, denoting its desirability with regard to sampling, and K is a normalization constant that makes the sum of π(x) over the population objects equal to 1. Typically, K is unknown or difficult to compute.
The MH algorithm is used together with a random walk to perform MCMC sampling. For this, the MH
TABLE 1 | Summary of the Notations
Notation    Meaning
n           Number of vertices
m           Number of edges
d(u)        Degree of vertex u
adj(u)      Set of neighboring vertices of the vertex u
inc(u)      Set of edges incident to vertex u
Π           Set of all triples
Π_v         Set of triples centered at vertex v
Π^∠         Set of all open triples
Π^Δ         Set of all closed triples
Λ           Set of distinct triangles
t(G)        Number of triangles in the graph G
γ(G)        Transitivity of the graph G
N(u)        Sorted neighbors of vertex u
A           Adjacency matrix
d_max       Max degree of a vertex in the graph
WIREs Data Mining and Knowledge Discovery Triangle counting in large networks
algorithm draws a sequence of samples from the target distribution as follows:
1. It picks an initial state (say, x) satisfying f(x) > 0.
2. From the current state x, it samples a state y using a distribution q(x,y), referred to as the proposal distribution.
3. Then, it calculates the acceptance probability α(x,y) (Eq. (6)) and accepts the proposed move to y with probability α(x,y); otherwise it stays at x. Steps 2 and 3 are repeated until the Markov chain reaches its stationary distribution.
$$\alpha(x,y) = \min\left\{\frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)},\,1\right\} = \min\left\{\frac{f(y)\,q(y,x)}{f(x)\,q(x,y)},\,1\right\}. \qquad (6)$$
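The three steps above can be sketched as a short loop. This is a generic illustration, not the sampler of any specific triangle counting method: the function names, the five-state ring, and the choice f(x) = x + 1 are assumptions made only to have a small runnable example.

```python
import random
from collections import Counter

def metropolis_hastings(f, propose, q, x0, steps, rng):
    """Run `steps` MH transitions from x0 and return the list of visited states."""
    if f(x0) <= 0:
        raise ValueError("the initial state must satisfy f(x0) > 0")
    x = x0
    chain = []
    for _ in range(steps):
        y = propose(x, rng)
        # Acceptance probability of Eq. (6); K cancels, so only f is needed.
        alpha = min(1.0, (f(y) * q(y, x)) / (f(x) * q(x, y)))
        if rng.random() < alpha:
            x = y  # accept the proposed move; otherwise stay at x
        chain.append(x)
    return chain

# Hypothetical toy population: states 0..4 arranged on a ring.
def ring_propose(x, rng):
    """Move one step clockwise or counterclockwise with equal probability."""
    return (x + rng.choice((-1, 1))) % 5

def ring_q(x, y):
    """Proposal probability q(x, y) of the ring walk."""
    return 0.5 if (x - y) % 5 in (1, 4) else 0.0

chain = metropolis_hastings(lambda x: x + 1, ring_propose, ring_q, 0, 300000, random.Random(7))
frequencies = Counter(chain)
```

With f(x) = x + 1 the normalization constant is K = 15, yet the loop never uses it; the visit frequencies of a long chain should approach (x + 1)/15 for each state.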
Importance Sampling
Importance sampling (IS) is a sampling strategy that is used to estimate the expectation of a function f(x) relative to some distribution $p(x) = \tilde{p}(x)/K$, called the target distribution, whereas the samples are actually obtained from a different distribution q(x), called the proposal distribution. IS is useful when it is easier to sample from the distribution q but we need the expectation with respect to a different distribution p. For instance, for triangle counting, we want to obtain triple samples from a uniform distribution, i.e., the target distribution p is uniform, but it may be easier to sample triples from a biased distribution, say q. Using the idea of IS, the expectation of f(x) with respect to the target distribution can be estimated from S samples $x_1, \ldots, x_S$ drawn from q as
$$E_p[f(x)] \approx \sum_{i=1}^{S} f(x_i)\,w(x_i), \qquad (7)$$
where
$$w(x_i) = \frac{\tilde{p}(x_i)/q(x_i)}{\sum_{j=1}^{S} \tilde{p}(x_j)/q(x_j)}. \qquad (8)$$
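A minimal sketch of the estimator in Eqs. (7) and (8) follows. The function name and the toy setting (a uniform target over the integers 0 to 9, sampled through a proposal biased toward large values) are assumptions for illustration.

```python
import random

def importance_sampling_estimate(f, p_tilde, q, samples):
    """Self-normalized IS estimate of E_p[f(x)] from samples drawn from q."""
    ratios = [p_tilde(x) / q(x) for x in samples]  # unnormalized weights
    total = sum(ratios)                            # denominator of Eq. (8)
    return sum(f(x) * r for x, r in zip(samples, ratios)) / total

# Hypothetical toy setting: the target is uniform on 0..9 (unnormalized value 1),
# while the proposal picks x with probability proportional to x + 1.
states = range(10)
q_weights = [x + 1 for x in states]
q_norm = sum(q_weights)
rng = random.Random(11)
samples = rng.choices(states, weights=q_weights, k=50000)
estimate = importance_sampling_estimate(
    lambda x: x,                   # f(x) = x, whose uniform mean is 4.5
    lambda x: 1.0,                 # unnormalized target; K is never needed
    lambda x: (x + 1) / q_norm,    # proposal probability q(x)
    samples,
)
```

Although the proposal oversamples large values, the weights correct the bias, so the estimate should be close to the uniform mean of 4.5.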
ORGANIZATION OF THE REVIEW
We organize this review based on classification of the
triangle counting methods as depicted in Figure 1.
Our first level classification of triangle counting
methods is based on data (graph) access pattern. We
consider two kinds of data access patterns: random
access and restricted access.
Random access methods assume that the entire
network is available in the memory in an adjacency
vector data structure (or in other format) and we also
know the size of the network—the number of vertices
(n) and the number of edges (m). These random access
methods are further divided into three sub-categories:
(1) exact triangle counting; (2) approximate triangle
[FIGURE 1 | Classification of triangle counting works. Random access: exact counting (with enumeration, without enumeration); approximate counting (graph sparsification, triple sampling, vertex/edge sampling, linear algebra-based method); distributed and parallel counting. Restricted access: random walk over vertices, random walk over triples, triangle counting on streaming data.]
counting; (3) distributed and parallel triangle counting. Methods in the first sub-category, i.e., the exact triangle counting methods, provide the actual count of triangles in the given input network, which can be obtained with or without enumerating each triangle. Methods in the second sub-category are approximate methods, which calculate triangle counts with an acceptable error. In return, approximate methods are much faster than exact counting methods. Most of the approximate counting methods are based on different sampling approaches. Methods in the final sub-category run on distributed and parallel platforms. Such methods have recently become popular for triangle counting, as they can provide exact or approximate triangle counts for huge networks that cannot be stored in the main memory of a single machine.
In the real world, many networks are not fully accessible, so random access-based methods for triangle counting are not an option for them. Such networks can only be crawled, i.e., an analyst can only explore the neighbors of the currently visited node. For restricted access, it is assumed that access to one seed vertex or a collection of seed vertices of the network is available so that the crawling can be initiated. Another assumption is that the network is connected, or that the largest (giant) connected component of the network covers the majority of the vertices and the part of the network outside the giant component can be ignored. Because the network is connected, one can attempt to crawl the entire network using graph traversal methods and save the network (in memory or on disk) for counting triangles with random access-based methods. However, we assume that the network is very large (say, the Internet) and does not fit in main memory. So, random access-based methods do not work on such networks, or at the least, they are highly inefficient due to frequent I/O access. In such restricted access scenarios, one cannot obtain an exact count of triangles, but a random walk over the network provides a viable option for approximate triangle counting.
Another type of restricted access is streaming data access, where the graph data appear as a stream of edges. Limited memory does not allow all the edges to be stored, so a triangle counting method must store a judiciously selected sample of edges or some form of summary statistics computed over the edges. Edges that appear in the stream are lost if they are not saved. Because streaming data access works with a sample of edges, it provides only an approximate count of triangles in a graph.
In the following section, we discuss exact triangle counting algorithms, followed by a discussion of approximate triangle counting algorithms. Then, we discuss triangle counting algorithms that work for restricted access and streaming access scenarios. After that, we discuss some triangle counting methods that work on distributed or parallel platforms. Thereafter, we present experimental results from a comparison among a collection of approximate triangle counting methods. Lastly, we discuss two other related counting tasks before concluding the paper.
EXACT TRIANGLE COUNTING WITH
RANDOM ACCESS
We first discuss triangle counting algorithms under the random memory access assumption. Under this assumption, we can obtain the adjacency vector of any vertex in O(1) time. We also assume that, in the adjacency vector of a vertex u, the neighbors of u, adj(u), are sorted. So, the existence of an edge (u,v) can be answered in O(lg n) time using binary search on that vector. Another option is to use a hash table of edges to answer the edge existence query in expected O(1) time. Note that even if we use binary search for answering the edge existence query, the complexity O(lg n) is only a worst-case bound, which applies to a very high degree vertex. Given that the average degree of real-life networks is constant, and that triangle counting issues edge existence queries over a very large number of small adjacency lists and only occasionally over a few large ones, we can amortize the cost of the costly binary searches over the large number of cheap searches and assume that the cost of an edge existence query is constant.
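Both edge existence tests described above can be sketched with the Python standard library. The representation (a dictionary mapping each vertex to a sorted list of neighbors) and the function names are assumptions for illustration.

```python
from bisect import bisect_left

def has_edge_sorted(adj_sorted, u, v):
    """Binary search for v in the sorted adjacency vector of u: O(lg d(u)) time."""
    nbrs = adj_sorted[u]
    i = bisect_left(nbrs, v)
    return i < len(nbrs) and nbrs[i] == v

def build_edge_set(adj_sorted):
    """Hash set of edges, each stored once as (u, v) with u < v."""
    return {(u, v) for u, nbrs in adj_sorted.items() for v in nbrs if u < v}

def has_edge_hashed(edge_set, u, v):
    """Expected O(1) lookup in the hash set of edges."""
    return (min(u, v), max(u, v)) in edge_set

# Hypothetical toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
toy_sorted = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
toy_edges = build_edge_set(toy_sorted)
```

The hash set costs O(m) extra memory, whereas the binary search reuses the sorted adjacency vectors that are already stored.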
A brute-force triangle counting algorithm can be designed by enumerating all distinct three-vertex sets {u,v,w} (not necessarily connected) in a network and then testing whether the three vertices form a triangle. Because the number of such three-vertex sets is $\Theta(n^3)$, the brute-force complexity of triangle counting is $\Theta(n^3)$. Note that such an algorithm not only counts the triangles but also iterates over (or lists) them. Moreover, any algorithm that iterates over each of the triangles has a worst-case complexity of $\Omega(n^3)$, because the maximum possible number of triangles in a graph of n vertices is exactly $\binom{n}{3}$, which is realized when the given graph is a clique.
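The brute-force procedure can be sketched in a few lines, assuming a dictionary-of-sets graph with comparable vertex identifiers; the function name is hypothetical.

```python
from itertools import combinations

def brute_force_triangles(adj):
    """List every three-vertex set whose three pairs are all edges."""
    vertices = sorted(adj)
    return [
        (u, v, w)
        for u, v, w in combinations(vertices, 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    ]

# Hypothetical examples: a clique on four vertices and a triangle with a pendant vertex.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

For the clique every one of the four three-vertex sets is a triangle, which matches the worst case discussed above.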
Over the years, many triangle counting methods with better runtime performance have been proposed. Specifically, the methods that count but do not list all the triangles have worst-case time complexity much better than $\Theta(n^3)$. Based on this observation, we discuss the algorithms in two groups. The first group of algorithms only provides a count of triangles without listing (or enumerating) them. The second group of algorithms lists the triangles. Our discussion of exact triangle counting algorithms is brief; Schank's PhD thesis [21] is an excellent reference on exact triangle counting methods.
Triangle Counting Without Enumeration
The earliest methods of triangle counting without enumeration are based on matrix multiplication of the adjacency matrix. It is easy to see that, if A is the adjacency matrix of an undirected network G, the diagonal element $A^3[i,i]$ contains the total number of closed walks of length 3 that begin and end at vertex i. A triangle is counted as a closed walk starting and ending at each of its three vertices, and in an undirected graph each such closed walk is counted twice (counterclockwise and clockwise); thus, the total number of triangles in a graph G is $t(G) = \frac{1}{6}\mathrm{Tr}(A^3)$. The complexity of this algorithm is $\Theta(n^3)$; however, a fast matrix multiplication algorithm can be used to obtain a better algorithm, which runs in $\Theta(n^{\omega})$, where the current best value of ω, the exponent of matrix multiplication, is around 2.373 [22]. However, the hidden constants of many of the fast matrix multiplication algorithms are large, which makes these algorithms not much better (if not worse) than the traditional $\Theta(n^3)$ matrix multiplication algorithm for counting triangles in large real-life graphs.
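A minimal sketch of the trace formula follows, using plain nested lists and the traditional cubic multiplication rather than a fast matrix multiplication algorithm. Vertices are assumed to be numbered 0 to n − 1 here, and the function names are illustrative.

```python
def adjacency_matrix(n, edges):
    """Dense symmetric 0/1 adjacency matrix for vertices 0..n-1."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    return A

def matmul(X, Y):
    """Traditional cubic-time product of two square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangles_by_trace(A):
    """t(G) = Tr(A^3) / 6, where Tr(A^3) is the sum over i, j of A^2[i][j] * A[j][i]."""
    n = len(A)
    A2 = matmul(A, A)
    trace_A3 = sum(A2[i][j] * A[j][i] for i in range(n) for j in range(n))
    return trace_A3 // 6

# Hypothetical example: the clique on four vertices.
k4_edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
k4_matrix = adjacency_matrix(4, k4_edges)
```

Only the diagonal of the cube is needed, so the sketch computes the square once and then takes an element-wise sum instead of a second full product. The dense matrix still requires memory quadratic in n, which is the footprint issue that matters for large sparse graphs.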
Alon et al. [10] proposed a triangle counting algorithm (hereafter called AYZ), which runs in $O(m^{2\omega/(\omega+1)})$ time. In this algorithm, the authors first define $\Delta = m^{(\omega-1)/(\omega+1)}$ and call a vertex high degree if its degree is higher than Δ; otherwise it is a low degree vertex. There are at most mΔ paths of length two whose intermediate vertex is low degree, and all of these paths can be checked for triangles in O(mΔ) time. The remaining triangles consist entirely of high degree vertices. As there are at most 2m/Δ high degree vertices, triangles involving only those vertices can be found in $O((m/\Delta)^{\omega})$ time using matrix multiplication. The overall complexity of this method is then $O(m\Delta + (m/\Delta)^{\omega}) = O(m^{2\omega/(\omega+1)})$. Because AYZ uses matrix multiplication as a part of the method, it also belongs to the non-enumeration-based triangle counting methods. Note that if ω = 3 (which is the case for traditional matrix multiplication), the complexity of the AYZ method is $O(m^{3/2})$.
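The degree-splitting idea can be sketched as follows. This is an illustration in the spirit of AYZ rather than the authors' implementation: it uses the traditional cubic matrix product instead of a fast one, takes the threshold `delta` as a parameter, and relies on a tie-breaking rule (each triangle with a low degree vertex is counted at its smallest low degree vertex) that is an assumption introduced here to avoid double counting.

```python
def matmul(X, Y):
    """Traditional cubic-time product of two square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def count_triangles_degree_split(adj, delta):
    """Count triangles by treating low and high degree vertices separately."""
    low = {v for v in adj if len(adj[v]) <= delta}
    high = [v for v in adj if v not in low]
    count = 0
    # Part 1: paths of length two whose center v is a low degree vertex.
    for v in low:
        nbrs = sorted(adj[v])
        for i, u in enumerate(nbrs):
            if u in low and u < v:
                continue  # this triangle is counted at the smaller low vertex u
            for w in nbrs[i + 1:]:
                if w in low and w < v:
                    continue
                if w in adj[u]:
                    count += 1
    # Part 2: triangles made of high degree vertices only, via Tr(A_H^3) / 6.
    index = {v: i for i, v in enumerate(high)}
    h = len(high)
    A = [[0] * h for _ in range(h)]
    for v in high:
        for u in adj[v]:
            if u in index:
                A[index[v]][index[u]] = 1
    A2 = matmul(A, A)
    trace_A3 = sum(A2[i][j] * A[j][i] for i in range(h) for j in range(h))
    return count + trace_A3 // 6

# Hypothetical example: a clique on {0, 1, 2, 3} plus vertex 4 joined to 0 and 1.
g = {0: {1, 2, 3, 4}, 1: {0, 2, 3, 4}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 1}}
```

Any threshold yields the same count; following the text, `delta` could be set to `m ** ((omega - 1) / (omega + 1))` to balance the two parts.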
Triangle Counting With Enumeration
Enumeration-based triangle counting algorithms list all the triangles, after which counting becomes a trivial task. The obvious advantage of an enumeration-based method is that it returns the list of all the triangles, which can be used for downstream tasks such as community discovery [8] or spam filtering [5]. In addition, enumeration-based methods are preferred over matrix multiplication-based methods even for solving the counting task, because matrix multiplication-based methods suffer from a large memory footprint.
One of the earliest triangle enumeration methods was proposed by Itai and Rodeh [9]. This method was actually proposed to find just one triangle, but it can easily be extended to list all the triangles. The algorithm first finds a spanning tree $T(V,E_T)$ of the graph G(V,E) and then, for each edge $(u,v) \in E_T$, checks whether $(pred(u), v) \in E$ (pred stands for the predecessor of a node in the tree); if so, it emits (u, v, pred(u)) as a triangle. It also checks whether $(pred(v), u) \in E$; if so, it outputs (u, v, pred(v)) as a triangle. The edges of T are then removed from G and the process is repeated by building a new spanning tree of the updated graph. The algorithm terminates when no more edges exist in G. Each iteration takes O(m) time, and it can be shown that there are at most $O(\sqrt{m})$ iterations, so the complexity of this method is $O(m^{3/2})$. However, this algorithm needs to modify the graph data structure, which is costly; hence its real-life execution time is not competitive with other methods.
A better algorithm is to enumerate over vertex pairs that are adjacent to a given vertex v. As shown in Algorithm 1, this method iterates over the vertices through the variable v. For each pair of vertices u and w from adj(v), it checks whether an edge exists between u and w; if so, {u,v,w} forms a triangle. The cumulative sum of the number of triangles is returned after dividing it by 3. The division is required because each triangle is counted three times, once for each iteration in which one of its vertices is chosen in the outermost loop. Because the algorithm iterates over the vertices of a network, it is also known as the NodeIterator algorithm. The amount of work done at each vertex is $\Theta(d(v)^2)$, so the complexity of the method is $O(n\,d_{\max}^2)$. For networks whose degree distribution is highly skewed, say if the maximum degree of a network grows linearly with the number of vertices, this bound becomes $O(n^3)$; even for a star network, whose triangle count is equal to 0, the algorithm tests $\Theta(n^2)$ vertex pairs at the center vertex.
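A minimal sketch of the NodeIterator idea follows, assuming a dictionary-of-sets graph so that each edge existence test takes expected constant time. It is written from the prose description above and is not a transcription of Algorithm 1.

```python
from itertools import combinations

def node_iterator(adj):
    """Count triangles by testing every pair of neighbors of every vertex."""
    count = 0
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:   # edge existence test for the pair (u, w)
                count += 1
    return count // 3         # each triangle is found once per vertex

# Hypothetical examples: a clique on four vertices and a star with five leaves.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
```

On the star, all ten neighbor pairs of the center are tested even though none of them closes a triangle, which illustrates the wasted work on skewed degree distributions.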
Instead of iterating over the vertices, triangles can also be counted by iterating over the edges (Algorithm 2). While iterating over the edges, the algorithm counts the number of triangles to which each edge contributes. Such an algorithm is known as the EdgeIterator method. The complexity of the EdgeIterator algorithm is $O(m\,d_{\max})$.
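The EdgeIterator idea can be sketched as below, assuming each vertex maps to a sorted list of neighbors so that the common neighbors of an edge's endpoints can be found with a linear merge. The function names are illustrative, and the code follows the prose rather than Algorithm 2 line by line.

```python
def sorted_intersection_size(a, b):
    """Number of common elements of two sorted lists, found by a linear merge."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common

def edge_iterator(adj_sorted):
    """Count triangles by intersecting the adjacency lists of each edge's endpoints."""
    count = 0
    for u, nbrs in adj_sorted.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                count += sorted_intersection_size(nbrs, adj_sorted[v])
    return count // 3  # each triangle is found once per edge

# Hypothetical example: the clique on four vertices.
k4_sorted = {i: [j for j in range(4) if j != i] for i in range(4)}
```

Each merge costs at most d(u) + d(v) steps, which is where the bound in terms of m and the maximum degree comes from.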
Both node iterator and edge iterator algorithms iterate over each of the potential two-length paths and check whether it forms a triangle. To avoid duplicate counting and reduce counting time, strategies can be adopted so that each triple is checked exactly once [23,24]. For the node iterator algorithm, we can simply sort the nodes by degree and enforce the ordering v < u < w in Algorithm 1. This ensures that each triangle is counted only once, by its smallest degree vertex, in the variable $count_v$. In that case, the division by 3 in Line 11 of Algorithm 1 is not needed. The sort order also improves the running time of the algorithm by not considering many two-length paths at all. For example, in a star graph, the terminal nodes are degree 1 nodes and the star node is the highest degree node. Using this sort order, none of the triples centered at the star node needs to be tested, and the method can return a triangle count of 0 for a star graph without testing any of its triples. A similar optimization can also be pursued for the edge iterator algorithm; for example, if the adjacency lists of the vertices are sorted, when computing the intersection of the sets $adj_1$ and $adj_2$ (Line 5 of Algorithm 2), we can restrict the intersection operation such that the third node of a triangle, $x \in adj(u) \cap adj(v)$, satisfies x > max{u,v}.
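The degree-ordered variant of the node iterator can be sketched as follows. Breaking degree ties by vertex identifier and returning the number of tested pairs alongside the count are assumptions added here so that the saving on a star graph is visible.

```python
from itertools import combinations

def ordered_node_iterator(adj):
    """Count each triangle once, at its lowest ranked vertex in degree order."""
    # Rank vertices by (degree, id); the id breaks ties so the order is total.
    order = sorted(adj, key=lambda x: (len(adj[x]), x))
    rank = {v: r for r, v in enumerate(order)}
    count = 0
    tested = 0
    for v in adj:
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            tested += 1
            if w in adj[u]:
                count += 1
    return count, tested  # no division by 3 is needed

# Hypothetical examples: a star with five leaves and a clique on four vertices.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
```

On the star the center has no higher ranked neighbor and every leaf has only one, so no pair is tested at all, in line with the example in the text.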
When redundant counting is avoided and the cost of the edge existence test is O(1), the time complexity of triangle enumeration is bounded by the total number of triples in a graph, which is equal to $\sum_{v \in V} \binom{d(v)}{2}$. We can also bound the triple count in terms of the edge count (m) with a careful analysis. Say we divide the vertices into two groups: high degree, having degree $> \sqrt{m}$, and low degree, the remaining. For the low degree vertices, the maximum number of possible triples centered at these vertices is obtained by packing all edges onto as few low degree vertices as possible. Because the total number of edges is m, we can pack them onto $\sqrt{m}$ vertices each having degree $\sqrt{m}$. Thus, the number of triples that are centered at a low degree vertex is at most $O(\sqrt{m}\,(\sqrt{m})^2) = O(m^{3/2})$. On the other hand, the number of high degree vertices is at most $2m/\sqrt{m}$, as each of these vertices has a degree of at least $\sqrt{m}$; then the number of triples consisting of high degree vertices is at most $O((\sqrt{m})^3) = O(m^{3/2})$. Thus the total number of triples is $O(m^{3/2}) + O(m^{3/2}) = O(m^{3/2})$. Because the efficient versions of both node iterator and edge iterator test each of the triples in a network exactly once, the complexity of both of these algorithms is bounded by $O(m^{3/2})$.
There is a set of recent variants of edge iterator: Forward [21] by Schank and new-listing [24] by Latapy. Forward orders the vertices by increasing degree. Latapy's method sorts the adjacency lists and uses iterators to compute set intersections efficiently. The complexity of these methods remains $O(m^{3/2})$, but real-world execution time may be smaller. Schank [21] has performed a thorough comparison among various exact triangle counting methods over a large number of real-life and synthetic networks.
APPROXIMATE TRIANGLE COUNTING
The time complexity of the node iterator and edge iterator algorithms is $O(m^{3/2})$. For very large graphs with hundreds of millions of edges, this cost may still be too high. So, in recent years, approximate triangle counting algorithms have become popular. Methods for approximate triangle counting do not list (enumerate) the triangles; rather, they give an approximate count of triangles, sometimes with an approximation guarantee. Their execution time is also smaller, typically by orders of magnitude. For many applications, a good approximation obtained in a much shorter running time is adequate; for instance, to analyze the evolution pattern of a network, an approximate transitivity result is generally acceptable.
In the existing literature, there are a few variants of approximate triangle counting methods. Algorithms in the first variant are based on uniform triangle sampling by graph sparsification, algorithms in the second variant are based on triple sampling, and algorithms in the third variant are based on a stochastic version of node or edge iterator. There also exists an approximate triangle counting method that uses ideas from linear algebra.
Graph Sparsification-Based Methods
The idea of graph sparsification-based methods for triangle counting is to sparsify the graph by probabilistically deleting a subset of its edges and then to extrapolate the triangle count of the original graph from the exact triangle count of the sparse graph. Because the sparse graph is much smaller than the original graph, triangle counting in the sparse graph typically takes a fraction of the time, which makes the approximate method substantially faster than an exact counting method. Such a method can also be viewed as a uniform triangle sampling-based method, because the triangles in the sparse network are sampled with a uniform probability over all the triangles in the original network.
Tsourakakis et al.$^{25}$ proposed one of the earliest approximate triangle counting methods, called DOULION, which works by graph sparsification. Given a graph $G(V,E)$, DOULION keeps each edge of $G$ with probability $p$ and removes it with probability $1-p$ to generate a sparse graph, $G_s$. It then runs an exact triangle counting method on $G_s$ to obtain $t(G_s)$, the exact triangle count of $G_s$. Each triangle in the original graph is retained in the sparse graph only when all three of its edges are retained, which happens with probability $p^3$. Hence, the probability of sampling a triangle from $G$ in $G_s$ is $p^3$, and thus the estimated count of the triangles in the original graph is $\hat{t}(G) = (1/p^3)\, t(G_s)$. For large networks with millions of vertices, a $p$ value as small as 0.01 can provide a very good approximate triangle count. A $p$ value of 0.01 yields almost a 100 times speed-up in the running time over the running time of an exact counting method.
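The DOULION recipe described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `count_triangles` and `doulion_estimate` are our own, the graph is assumed to be a simple undirected edge list with comparable node ids, and the exact counter is a plain set-intersection edge iterator.

```python
import random
from collections import defaultdict


def count_triangles(edges):
    """Exact triangle count of a simple undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Every triangle is seen once per edge, i.e., three times in total.
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return closed // 3


def doulion_estimate(edges, p, rng=None):
    """Keep each edge independently with probability p, count exactly, rescale by 1/p^3."""
    rng = rng or random.Random()
    sparse = [e for e in edges if rng.random() < p]
    return count_triangles(sparse) / p**3
```

With `p = 1.0` the sketch degenerates to exact counting; for smaller `p` a single run is noisy, so averaging several runs (or using a larger graph) is needed before the estimate becomes stable.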
Pagh and Tsourakakis$^{26}$ proposed another graph sparsification method for approximate triangle counting, which they named ‘colorful triangle counting.’ For each vertex, this method assigns a color between 1 and $N$ uniformly at random and retains only those edges for which both endpoints have the same color; all other edges are removed. As in DOULION, an exact counting algorithm is used to find the number of triangles, $t(G_s)$, in the sparse network $G_s$. If $p = 1/N$, each triangle in the original network is retained in the sparse network with probability $p^2$. This is so because when two of the edges of a triangle are monochromatic, the third edge is necessarily monochromatic as well, and the probability of retaining two edges is $p^2$. So the estimated count of the triangles in the original graph for this method is $\hat{t}(G) = (1/p^2)\, t(G_s)$. As in DOULION, the sparse graph $G_s$ contains each edge with probability $p$, but unlike DOULION, this method retains (or samples) each triangle with probability $p^2$ instead of $p^3$. As this method samples more triangles for the same $p$ value, it has better accuracy than DOULION. Pagh and Tsourakakis used variance analysis to prove probabilistic bounds on the approximation ratio of triangle estimation using this method. They also proposed a MapReduce-based distributed implementation of this algorithm.
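A minimal single-machine sketch of the coloring idea is given below, under the same assumptions as before (simple undirected edge list, comparable node ids). The names `colorful_estimate` and `num_colors` are hypothetical, and the sketch does not reproduce the MapReduce implementation mentioned above.

```python
import random
from collections import defaultdict


def count_triangles(edges):
    """Exact triangle count of a simple undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return closed // 3


def colorful_estimate(edges, num_colors, rng=None):
    """Color vertices uniformly, keep monochromatic edges, rescale by N^2 = 1/p^2."""
    rng = rng or random.Random()
    color = {}
    for u, v in edges:
        for x in (u, v):
            if x not in color:
                color[x] = rng.randrange(num_colors)
    mono = [(u, v) for u, v in edges if color[u] == color[v]]
    return count_triangles(mono) * num_colors**2
```

Setting `num_colors = 1` keeps every edge and returns the exact count, which is a convenient sanity check.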
Very recently, Etemadi et al.$^{27}$ proposed another method, which is an adaptation of DOULION. Similar to DOULION, this method samples the edges of a given graph $G$ with a uniform probability $p$ to obtain the sparse graph $G_s$. However, besides counting triangles in $G_s$, it also checks whether the missing edge of each open triple in $G_s$ exists in the original graph $G$; if yes, that partial triangle is counted together with the actual triangles in $G_s$. A triangle in $G$ has a $p^2$ probability of being counted in $t(G_s)$, because a triangle in $G$ is accounted for in $G_s$ as long as two of its edges are retained in $G_s$. So, the estimated count of the triangles in the original graph using this method is $\hat{t}(G) = (1/p^2)\, t(G_s)$. By estimating the variance of the estimator, the authors also provided a way to choose the value of $p$ for achieving a targeted range of relative standard error. They also proved that, compared to their method, DOULION always needs more samples to achieve the same level of accuracy. The accuracy of this method is comparable to that of Pagh and Tsourakakis’s method, because both methods sample a triangle with the same probability.
Note that all the graph sparsification-based methods need an exact triangle counting algorithm that runs on the sparse network. So, their execution time depends on the performance of the exact triangle counting algorithm. Clearly, Etemadi et al.’s method is the costliest among these, because besides counting triangles in the sparse network $G_s$, it also needs to query the graph data structure of $G$ to determine whether the missing edge of each open triple in $G_s$ exists.
Triple Sampling-Based Method
The graph sparsification-based methods sample triangles, but there is another family of approximate triangle counting algorithms that sample triples instead of triangles. A triple sampling method leads to a triangle approximation algorithm by computing an unbiased estimate of transitivity (defined in the Background). This estimate is equal to the fraction of triples that are closed out of all the sampled triples. If $T$ is a set of sampled triples and $\hat{\gamma}$ is the estimate of transitivity, then
$$\hat{\gamma} = \frac{\sum_{t \in T} I[t \text{ is closed}]}{|T|}, \tag{9}$$
where $I$ is an indicator random variable. Then, using Eq. (5), an approximate estimate of the total number of triangles in the graph is equal to $(1/3)\,\hat{\gamma}\,|\Pi|$, where $|\Pi| = \sum_{i=1}^{n} \binom{d(u_i)}{2}$ is the total number of triples in the given graph (see Eq. (1)). Such a method has been used in several works.$^{12,16,28}$
A key requirement for computing an unbiased estimate of transitivity is to sample triples from a uniform distribution, which is a non-obvious task. The simplest approach to sample a triple is to select a node $v$ uniformly and then select two of $v$’s neighbors ($u$ and $w$) uniformly. This method samples a triple $\langle u, v, w \rangle$ with $v$ as the center node. However, this method does not sample a triple uniformly, because the number of triples centered at a node $v$ is $|\Pi_v| = \binom{d(v)}{2}$, which is nonuniform over the vertices of a general graph. Hence, the triple $\langle u, v, w \rangle$ is sampled with probability $1/(n|\Pi_v|) \propto 1/|\Pi_v|$. So, the triples that are centered around high degree vertices will be under-sampled and those that are centered around low degree vertices will be over-sampled.
Schank and Wagner$^{28}$ proposed the earliest algorithm for approximating the transitivity of a network by sampling triples uniformly. Their idea is as follows: first, sample a vertex $v$ with probability proportional to the number of triples centered at that vertex. Then, with uniform probability, return one of the triples that are centered at $v$. If $\Pi$ is the set of all triples in $G$ and $\Pi_v$ is the set of triples centered at node $v$, then $|\Pi| = \sum_{i=1}^{n} |\Pi_{u_i}|$ and $|\Pi_v| = \binom{d(v)}{2}$. For uniform triple sampling, we sample the center vertex $v$ with probability $|\Pi_v|/|\Pi|$ and then return the triple $\langle u, v, w \rangle$ by uniformly selecting one of the triples in $\Pi_v$. Thus, the probability of sampling the triple $\langle u, v, w \rangle$ is uniform,
$$P\big(\text{triple } \langle u, v, w \rangle \text{ is sampled}\big) = \frac{|\Pi_v|}{|\Pi|} \cdot \frac{1}{\binom{d(v)}{2}} = \frac{1}{|\Pi|}, \tag{10}$$
as desired. Schank and Wagner then used the set of sampled triples to approximate the transitivity using Eq. (9). However, one can easily obtain an approximate count of triangles in a graph $G$, $t(G)$, by using the estimated transitivity in the following equation:
$$\hat{t}(G) = \frac{1}{3}\,\hat{\gamma}\,|\Pi|; \tag{11}$$
the value of $|\Pi|$ is known, as is shown in Eq. (1). Kolda et al.$^{16}$ reinvented the same method in a later work. Both Schank and Wagner and Kolda et al. proved approximation error bounds by using Hoeffding’s concentration inequality.
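The two-stage uniform triple sampler and the estimates of Eqs. (9) and (11) can be sketched as follows. This is a hedged illustration rather than the cited authors' code: the function name `estimate_triangles_by_triples` is our own, the graph is assumed to fit in memory as adjacency sets, and center vertices are drawn with the standard library's weighted `choices`.

```python
import random
from collections import defaultdict


def estimate_triangles_by_triples(edges, num_samples, rng=None):
    """Estimate t(G) as (1/3) * gamma_hat * |Pi| from uniformly sampled triples."""
    rng = rng or random.Random()
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Only vertices with at least two neighbors are the center of some triple.
    nbrs = {v: list(ns) for v, ns in adj.items() if len(ns) >= 2}
    nodes = list(nbrs)
    # |Pi_v| = C(d(v), 2) triples are centered at v.
    weights = [len(nbrs[v]) * (len(nbrs[v]) - 1) // 2 for v in nodes]
    total_triples = sum(weights)
    if total_triples == 0:
        return 0.0
    closed = 0
    # Stage 1: pick the center v with probability |Pi_v| / |Pi|.
    for v in rng.choices(nodes, weights=weights, k=num_samples):
        # Stage 2: pick one of the triples centered at v uniformly.
        u, w = rng.sample(nbrs[v], 2)
        if w in adj[u]:
            closed += 1
    gamma_hat = closed / num_samples      # Eq. (9)
    return gamma_hat * total_triples / 3  # Eq. (11)
```

On a complete graph every triple is closed, so the sketch returns the exact count regardless of the sample size; on other graphs the error shrinks as `num_samples` grows.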
In a recent article, Al Hasan$^{29}$ provided methodologies for obtaining an unbiased estimate of transitivity even when the sampling algorithm samples the triples from a non-uniform distribution. He used the idea of IS (discussed in the Background Section) for this task. For instance, the simplest (but biased) triple sampling method that we discussed at the beginning of this subsection samples each triple with probability inversely proportional to the number of triples centered at the first sampled vertex, $v$. If we want an unbiased estimate of transitivity, we need a uniform target distribution. But the triples are sampled from a distribution that is proportional to $1/|\Pi_v|$, so by using Eq. (7) of IS an unbiased estimate of transitivity can be computed as follows. Suppose we have a triple sample set $T = \{t_i = \langle u_i, v_i, w_i \rangle\}_{1 \le i \le |T|}$, where $v_i$ is the center node of the triple $t_i$. The importance weight is
$$w(t_i) = \frac{|\Pi_{v_i}|}{\sum_{j=1}^{|T|} |\Pi_{v_j}|}. \tag{12}$$
Now, the unbiased estimate of transitivity is simply:
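$$\hat{\gamma} = \sum_{t_i \in T} w(t_i)\, I[t_i \text{ is closed}] = \frac{\sum_{t_i \in T} |\Pi_{v_i}|\, I[t_i \text{ is closed}]}{\sum_{t_j \in T} |\Pi_{v_j}|}. \tag{13}$$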
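As an illustration of Eq. (13), the sketch below draws triples with the simple (biased) sampler, that is, a uniform center followed by a uniform pair of its neighbors, and corrects the bias with the weights $|\Pi_{v_i}|$. The function name is hypothetical, and we assume that centers are drawn uniformly among vertices with at least two neighbors, since other vertices are the center of no triple.

```python
import random
from collections import defaultdict


def transitivity_by_importance_sampling(edges, num_samples, rng=None):
    """Weighted transitivity estimate of Eq. (13) from non-uniformly sampled triples."""
    rng = rng or random.Random()
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    nbrs = {v: list(ns) for v, ns in adj.items() if len(ns) >= 2}
    centers = list(nbrs)
    if not centers:
        return 0.0
    weighted_closed = 0
    weight_total = 0
    for _ in range(num_samples):
        v = rng.choice(centers)          # uniform center (biased toward low-degree triples)
        u, w = rng.sample(nbrs[v], 2)    # uniform triple centered at v
        weight = len(nbrs[v]) * (len(nbrs[v]) - 1) // 2  # |Pi_v|
        weight_total += weight
        if w in adj[u]:
            weighted_closed += weight
    return weighted_closed / weight_total
```

For a triangle with a two-edge tail attached (three closed triples out of six), the weighted estimate concentrates around 0.5, whereas the unweighted fraction of closed samples from the same sampler would concentrate around 0.58.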
Approximation by Vertex or Edge Sampling
Another simple approximate triangle counting algorithm can be built by using a probabilistic counterpart of the node iterator or edge iterator method. In the exact node iterator algorithm, we count the number of triangles by summing the count of triangles that are incident to each of the nodes. Instead of summing the count over all the vertices, we can simply sum over a fraction $p$ of uniformly sampled vertices and then approximate the total count by dividing the sum by $p$. This provides an approximate node iterator-based triangle counting algorithm. Likewise, we can build an approximate edge iterator-based algorithm by counting triangles over a fraction $p$ of the edges. Rahman and Al Hasan$^{13}$ proposed this method and showed that, for large networks, it achieves very good accuracy. They also showed that an edge iterator-based sampling method achieves better accuracy than a node iterator-based sampling method. Note that these methods do not sample the triangles uniformly, but the estimation is still unbiased because the expectation of the triangle count is taken over the edges, which are sampled uniformly.
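A sampled edge iterator can be sketched as below. Two assumptions in this sketch are ours and not taken from the cited work: each edge is included independently with probability $p$ (rather than drawing an exact $p$ fraction), and the sum of per-edge triangle counts is divided by 3 as well as by $p$, because every triangle is incident to three edges.

```python
import random
from collections import defaultdict


def sampled_edge_iterator(edges, p, rng=None):
    """Sum common-neighbor counts over sampled edges and rescale by 1 / (3p)."""
    rng = rng or random.Random()
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    unique_edges = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    total = 0
    for u, v in unique_edges:
        if rng.random() < p:
            # Number of triangles incident to the edge (u, v).
            total += len(adj[u] & adj[v])
    return total / (3 * p)
```

With `p = 1.0` every edge is visited and the function returns the exact count.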
Linear Algebra-Based Method
We have shown earlier that if $A$ is the adjacency matrix of an undirected network $G$, the total number of triangles in $G$ is $t(G) = (1/6)\,\mathrm{Tr}(A^3)$. If $\lambda_i$ is an eigenvalue of $A$, then $\lambda_i^3$ is an eigenvalue of $A^3$; hence, $\mathrm{Tr}(A^3) = \sum_{i=1}^{n} \lambda_i^3$. Note that, because $A$ is symmetric, all the $\lambda_i$ are real numbers. Exact counting requires all the eigenvalues of the adjacency matrix $A$, which is a costly task.
Fortunately, real-life networks exhibit a power-law property, and because of this the eigenvalues of their adjacency matrices are also skewed, typically following a power law. So if the eigenvalues are sorted by their absolute value (in other words, by their contribution to the sum), we can approximate the triangle count by taking only the top-$k$ eigenvalues. Note that, using the Lanczos method, the top-$k$ eigenvalues of a matrix can be easily computed in an incremental manner. This idea has been used in the EigenTriangle algorithm,$^{11}$ which accepts an adjacency matrix and a tolerance parameter. The tolerance parameter is used as a stopping criterion, such that the ratio of $|\lambda_i^3|$ to $\sum_{j=1}^{n} \lambda_j^3$ has to be above the tolerance parameter. Although elegant, this method has two key limitations. First, its time-accuracy trade-off is much poorer than that of other recently proposed approximate triangle counting methods. Second, no approximation guarantee is available regarding the accuracy of the method; in other words, there is no obvious relation between the tolerance parameter and the counting accuracy, so the parameter cannot be set to achieve a desired level of approximation.
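As a small worked example of the trace identity and of truncating the spectrum (our own illustration, not taken from the cited work), consider the complete graph $K_4$, whose adjacency matrix has eigenvalues $3, -1, -1, -1$:

```latex
\begin{align*}
\mathrm{Tr}(A^3) &= \sum_{i=1}^{4} \lambda_i^3 = 3^3 + 3\,(-1)^3 = 24,
  \qquad t(K_4) = \tfrac{1}{6}\cdot 24 = 4, \\
\hat{t}_{k=1}(K_4) &= \tfrac{1}{6}\cdot 3^3 = 4.5 .
\end{align*}
```

Keeping only the largest eigenvalue therefore overestimates the four triangles of $K_4$ by 12.5%; in graphs with a more skewed spectrum the discarded terms contribute proportionally less.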
TRIANGLE COUNTING IN
RESTRICTED ACCESS SCENARIO
As discussed before, for a restricted access network, random walk-based approximation methods are the most suitable. Rahman and Al Hasan$^{12}$ proposed a collection of random walk-based methods for approximating the transitivity of a network in a restricted access scenario. If the total number of triples ($|\Pi|$) is known, then these methods can be used for approximating the triangle count. Below we discuss two of the methods, one performing a random walk over the vertices and the other performing a random walk over the triples.
Random Walk Over Nodes
Earlier we have shown that an approximate triangle counting algorithm can be obtained by first sampling a vertex $v$ and then uniformly sampling one of the triples, $\langle u, v, w \rangle$, centered at vertex $v$. A random walk variant of this algorithm performs the first sampling task, i.e., sampling $v$, by a random walk over the graph. A crucial task, though, is to design the random walk in such a way that the vertex $v$ is sampled from a distribution that enables unbiased triangle counting.
A simple random walk, which chooses the next vertex uniformly from the neighbors of the currently visited vertex, has the stationary distribution $\pi(\cdot) = d(\cdot)/2m$, where $d(\cdot)$ is the degree of a vertex. So, if we perform a simple random walk and return a triple $t = \langle u, v, w \rangle$ uniformly among all the triples centered at the currently visited vertex, $v$, the probability of sampling the triple $t$ is equal to $\frac{d(v)}{2m} \times \frac{1}{\binom{d(v)}{2}}$, which is proportional to $1/(d(v)-1)$ and hence not uniform. Because unbiased triangle counting by transitivity approximation requires uniform triple sampling, we can use the idea of IS that we have discussed earlier.
If we have a triple sample set $T = \{t_i = \langle u_i, v_i, w_i \rangle\}_{1 \le i \le |T|}$, where $v_i$ is the center node of the triple $t_i$, then
$$\hat{\gamma} = \sum_{t_i \in T} w(t_i)\, I[t_i \text{ is closed}] = \frac{\sum_{t_i \in T} (d(v_i) - 1)\, I[t_i \text{ is closed}]}{\sum_{t_j \in T} (d(v_j) - 1)}; \tag{14}$$
here, we used $w(t_i) = \frac{d(v_i) - 1}{\sum_{t_j \in T} (d(v_j) - 1)}$, using Eq. (8). Once an unbiased estimate of transitivity is obtained, the triangle count can be approximated using Eq. (11).
However, Rahman and Al Hasan$^{12}$ did not use IS in their solution; rather, they used the MH algorithm. Their solution, named Vertex-MCMC, works as follows: use the MH algorithm to design a random walk whose stationary distribution is $\pi(\cdot) \propto \binom{d(\cdot)}{2}$. Say the random walk of a Vertex-MCMC-based triple sampler is visiting a vertex $a$. To use the MH algorithm, we need a proposal distribution ($q$) to make a trial move; Vertex-MCMC chooses $q$ to be uniform over the neighbors of $a$; in other words, it chooses one of the vertices (say, $b$) from the adjacency list of $a$ uniformly. Therefore, the proposal distribution is $q(a,b) = 1/d(a)$, where $q(a,b)$ represents the probability that the adjacent node $b$ is selected from the node $a$. Similarly, $q(b,a) = 1/d(b)$. Now, using Eq. (6), the acceptance probability of the proposal move is as shown in Eq. (15).
$$\alpha(a,b) = \min\left\{1,\ \frac{\binom{d(b)}{2}\,\frac{1}{d(b)}}{\binom{d(a)}{2}\,\frac{1}{d(a)}}\right\} = \min\left\{1,\ \frac{d(b)-1}{d(a)-1}\right\}. \tag{15}$$
The above MCMC random walk ensures that each
vertex is sampled from the target distribution, which
is proportional to the number of triples at each ver-
tex. So, while visiting a vertex vusing the above ran-
dom walk, a uniform triple sampler simply returns a
triple hu,v,wi, which is one of the triples centered at
v, selected with uniform probability over all such
triples.
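The walk defined by Eq. (15) can be sketched as follows. This is a simplified illustration, not the cited implementation: the function names are hypothetical, the graph is assumed to be available as a dictionary of neighbor lists (in a real restricted access setting the neighbor list would come from a query), the start vertex is assumed to have degree at least 2, and no burn-in or thinning is applied.

```python
import random


def vertex_mcmc_triples(adj, start, num_steps, rng=None):
    """MH walk over vertices with target proportional to C(d(v), 2); one triple per step."""
    rng = rng or random.Random()
    current = start  # assumed to have at least two neighbors
    samples = []
    for _ in range(num_steps):
        proposal = rng.choice(adj[current])
        # Eq. (15): accept with probability min(1, (d(b) - 1) / (d(a) - 1)).
        accept = min(1.0, (len(adj[proposal]) - 1) / (len(adj[current]) - 1))
        if rng.random() < accept:
            current = proposal
        # Return a uniform triple centered at the currently visited vertex.
        u, w = rng.sample(adj[current], 2)
        samples.append((u, current, w))
    return samples


def closed_fraction(adj, samples):
    """Fraction of sampled triples <u, v, w> whose end points u and w are adjacent."""
    return sum(1 for u, _, w in samples if w in adj[u]) / len(samples)
```

A proposal to a degree-1 vertex has acceptance probability 0, so the walk never leaves the set of vertices that are the center of at least one triple.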
Random Walk Over Triples
Instead of sampling triples in two stages (first sample a vertex, and then a triple that is centered at that vertex), we can also sample triples directly. Rahman and Al Hasan$^{12}$ also proposed such an approach, which they call triple-MCMC. In this method, a random walk is designed that walks over the space of triples in a network. To facilitate this walk, a neighborhood graph is defined over the set of triples. Any reasonable neighbor definition works; Rahman and Al Hasan consider two triples to be neighbors if they have two vertices in common. For example, the triples $\langle 1, 2, 3 \rangle$ and $\langle 2, 3, 4 \rangle$ are neighbors because they have two common vertices, $\{2, 3\}$. Starting from an arbitrary triple, the random walk continues over the triples by moving from one triple to a neighbor triple based on the above neighborhood definition. The set of possible neighbors of the currently visited triple can be computed on the fly by finding the other triples that can be obtained by replacing exactly one of the vertices of the current triple. Thus, the walk resembles sampling of dependent triples, where a sampled triple shares two vertices with the previously sampled triple.
For the purpose of triangle counting, the triples need to be sampled uniformly. But a simple random walk does not guarantee this, because the triples have different degrees in the neighborhood graph on which the random walk proceeds. To ensure uniform sampling, Rahman and Al Hasan proposed to adopt the MH algorithm. Let us assume that the random walk is visiting a triple $t$. For MH’s proposal distribution (say $q$), they choose one of the triples from $t$’s neighborhood (say, $s$) uniformly. So, $q(t,s) = 1/|\Gamma(t)|$, where $\Gamma(t)$ is the set of neighbors of the triple $t$. Using Eq. (6), the acceptance probability of the proposal move is obtained as shown below:
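$$\alpha(t,s) = \min\left\{\frac{1 \cdot \frac{1}{|\Gamma(s)|}}{1 \cdot \frac{1}{|\Gamma(t)|}},\ 1\right\} = \min\left\{\frac{|\Gamma(t)|}{|\Gamma(s)|},\ 1\right\}. \tag{16}$$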
Once a desired number of triples has been sampled, the fraction of closed triples over all the sampled triples provides an unbiased estimate of transitivity, from which an approximate count of triangles can be returned by using Eq. (11). Note that for this method also, instead of using MH, one can perform a simple random walk and then use IS to obtain an unbiased estimate of transitivity. Note also that a random walk-based triple sampling method can approximate transitivity, but not the triangle count unless the total number of triples, $|\Pi|$, is available.
TRIANGLE COUNTING ON
STREAM DATA
For many datasets, graphs are too large to fit in main
memory, but it is easier to access a graph as
streaming edges, such that the edges appear on the stream in an arbitrary order. Even if a graph fits in the main memory, a streaming edge access model of the graph is preferred for some computation models, such as MapReduce. The main restriction in a streaming access model is that we cannot save all the edges of a graph in memory, so each edge must be processed immediately as it appears on the stream. For such a restricted access model, it is allowed to go over the edge stream of a graph multiple times (aka, a multi-pass streaming algorithm). Going over the input graph stream multiple times may appear inefficient, but if the graph does not fit in main memory, making multiple passes is much cheaper than trying to access a large number of random vertices (or the adjacency lists of random vertices) on the disk.
The earliest streaming triangle counting method was proposed by Bar-Yossef et al.$^{7}$ Their method is based on stream-reduction, a general idea for computation over data streams, which is proposed in the same paper. The stream-reduction idea can be used for approximating frequency moments over a data stream, which they used for approximating triangles. Unfortunately, their method is mostly of theoretical interest and is not practical for approximating triangles in real-world networks.
Buriol et al.$^{18}$ proposed several methods for triangle counting over an edge stream. Their first method is a three-pass algorithm. In the first pass, the number of edges ($m$) and the number of vertices ($n$) are counted simply by using two counters. In the second pass, an edge $e = (a,b)$ is sampled uniformly from the set of edges. Also, a vertex $v \in V \setminus \{a,b\}$ is chosen uniformly. This leaves us with a triple (not necessarily connected) $\langle a, b, v \rangle$. Then in the third pass, the method simply tests whether $(a,v) \in E \wedge (b,v) \in E$; if yes, then $\beta = 1$, otherwise $\beta = 0$. In this way, $\beta$ is an estimate of the fraction of triangles over all possible edge-plus-a-vertex combinations. If $T_1$ is the number of disconnected triples (three vertices with exactly one edge among them), $T_2$ is the number of connected open triples, and $T_3$ is the number of triangles in the graph, then $T_1 + 2T_2 + 3T_3$ is equal to $m(n-2)$, the population size. Besides, the expectation of $\beta$ is $E[\beta] = \frac{3T_3}{T_1 + 2T_2 + 3T_3} = \frac{3T_3}{m(n-2)}$. So, an unbiased triangle estimate is equal to $(1/3)\, m(n-2)\, E[\beta]$. This estimate can be improved by running $s$ copies of this sampling and averaging their corresponding estimates $\{\beta_i\}_{1 \le i \le s}$. Thus, the final estimate of the triangle count is $\frac{m(n-2)}{3s} \sum_{i=1}^{s} \beta_i$. Because each sample takes only $O(1)$ space, for $s$ samples the total space is bounded by $O(s)$, which is linear in the number of samples but independent of the size of the network. Chernoff’s inequality can be used to prove a probabilistic bound on the approximation result.
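The three passes can be simulated in Python as sketched below. Several simplifications here are our own assumptions and not part of the original algorithm: the stream is a re-iterable list of distinct undirected edges without self-loops, the vertex set is collected in the first pass (so this sketch uses $O(n)$ extra memory rather than two counters), and the function name `buriol_three_pass` is hypothetical.

```python
import random
from collections import defaultdict


def buriol_three_pass(edges, num_copies, rng=None):
    """Estimate t(G) as m(n-2)/(3s) * sum(beta_i) using three passes over `edges`."""
    rng = rng or random.Random()
    # Pass 1: count the edges and the vertices.
    m = 0
    vertices = set()
    for a, b in edges:
        m += 1
        vertices.update((a, b))
    n = len(vertices)
    if n < 3:
        return 0.0
    vertex_list = list(vertices)
    # Pass 2: each copy picks a uniform edge position and a uniform third vertex.
    wanted = defaultdict(list)
    for i in range(num_copies):
        wanted[rng.randrange(m)].append(i)
    samples = [None] * num_copies
    for pos, (a, b) in enumerate(edges):
        for i in wanted.get(pos, ()):
            v = rng.choice(vertex_list)
            while v == a or v == b:
                v = rng.choice(vertex_list)
            samples[i] = (a, b, v)
    # Pass 3: beta_i = 1 iff both (a, v) and (b, v) appear on the stream.
    need = defaultdict(list)
    for i, (a, b, v) in enumerate(samples):
        need[frozenset((a, v))].append(i)
        need[frozenset((b, v))].append(i)
    hits = [0] * num_copies
    for x, y in edges:
        for i in need.get(frozenset((x, y)), ()):
            hits[i] += 1
    beta_sum = sum(1 for h in hits if h == 2)
    return m * (n - 2) * beta_sum / (3 * num_copies)
```

On a complete graph every sampled edge-plus-vertex combination is a triangle, so the sketch returns the exact count for any number of copies; on sparse graphs most $\beta_i$ are zero and many copies are needed.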
The above three-pass algorithm can be converted to a two-pass algorithm by combining the first two passes into a single pass. That is, counting $n$ and $m$ and uniformly sampling the edge $(a,b)$ and the vertex $v$ can actually be done in the same pass using reservoir sampling.$^{30}$ The key idea of reservoir sampling for sampling an object (uniformly) from a stream of objects is to keep a running count of the number of objects as new objects are seen in the stream. The first object in the stream is always saved in the reservoir, but the $i$-th object replaces the object in the reservoir only with probability $1/i$. When the stream ends, the object in the reservoir is the sampled object, chosen uniformly from the stream without prior knowledge of the number of objects in the stream.
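A size-one reservoir, as described above, takes only a few lines. The sketch below is generic (the function name `reservoir_sample_one` is our own) and works on any iterable, including a generator that can be consumed only once.

```python
import random


def reservoir_sample_one(stream, rng=None):
    """Return one item chosen uniformly from a stream of unknown length (None if empty)."""
    rng = rng or random.Random()
    chosen = None
    for i, item in enumerate(stream, start=1):
        # The i-th item replaces the reservoir content with probability 1/i;
        # for i = 1 this always holds, so the first item is always stored.
        if rng.randrange(i) == 0:
            chosen = item
    return chosen
```

The $i$-th item survives to the end with probability $\frac{1}{i} \cdot \frac{i}{i+1} \cdots \frac{k-1}{k} = \frac{1}{k}$ for a stream of $k$ items, which is why the final choice is uniform.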
Buriol et al.$^{18}$ actually proposed a one-pass version of their three-pass algorithm, which combines the work of all three passes in one pass, as follows. Say the edge $e$ appears in the stream while $(a,b)$ (the uniformly sampled edge) and $v$ (the uniformly sampled vertex) are in the reservoir. If $e = (a,v)$, set the boolean variable $x = 1$, and if $e = (b,v)$, set the boolean variable $y = 1$. Once the stream ends, if $x = y = 1$, set $\beta = 1$; otherwise $\beta = 0$. $\beta = 1$ represents the fact that we have sampled a triangle $(a,b,v)$, where $(a,b)$ is the uniformly sampled edge and $v$ is the uniformly sampled vertex, both in the reservoir. However, $E[\beta] = \frac{T_3}{T_1 + 2T_2 + 3T_3}$, which is one-third (note the missing 3 in the numerator) of the $E[\beta]$ of the three-pass method. This is due to the fact that, for the one-pass version, the triangle $\langle a, b, v \rangle$ is counted (i.e., $\beta = 1$) only if the edge $(a,b)$ appears on the stream before the edges $(b,v)$ and $(a,v)$ (the probability of this event is 1/3); on the other hand, the three-pass version counts the triangle for any ordering of the three edges. Besides this, the three-pass and one-pass methods are identical. So, similar to the three-pass method, the estimate of triangles is equal to $\frac{m(n-2)}{s} \sum_{i=1}^{s} \beta_i$ if $s$ parallel copies of the sampling are run together.
Jha et al.$^{15}$ provided another one-pass streaming triangle counting algorithm, which is a stream variant of triple sampling. Say, for a graph $G$, $e_1, e_2, \ldots, e_m$ is a sequence of distinct edges. Then $\{G_t\}_{1 \le t \le m}$ are the graphs at time $t$, formed by the edge set $\{e_i \mid i \le t\}$; clearly $G_m = G$. With the arrival of an edge on the stream, the method performs an update on the estimate of the triangles in graph $G_t$. Once the stream ends, the estimated value approximates the count of triangles in the entire graph, $G_m = G$. The main
idea of this method is to use two reservoir arrays $R_e$ and $R_w$ of size $s_e$ and $s_w$, respectively. For the streaming graph $G_t$, $R_e$ stores a uniform sample of $s_e$ edges from the graph $G_t$ and $R_w$ stores a uniform sample of $s_w$ triples from the graph $G_t$. With the arrival of the edge $e_t$, the method first counts the number of triples in $R_w$ that the edge $e_t$ completes. Then it updates both $R_e$ and $R_w$ to maintain uniform sampling. $R_e$ is updated by inserting $e_t$ in $R_e$ probabilistically (by following reservoir sampling’s idea); if this insertion is successful, then $R_w$ is updated by inserting (again probabilistically by reservoir sampling) the new triples that are formed by the edge $e_t$ with the already existing edges in $R_e$. Also, for any state of $R_e$ and $R_w$, the total number of triples (tot_triples) formed by the edges in $R_e$ is computed. It is easy to see that if the triples in $R_w$ are sampled uniformly over all the triples, the fraction of triples that are closed in $R_w$ (denoted as $\rho$) approximates the ratio $t(G)/|\Pi|$. But, to obtain the triangle count from this ratio, we also need to know the total number of triples in $G$ ($|\Pi|$), which they estimate by using the birthday paradox. The expected number of triangles is then $\left[\rho\, t^2 / \big(s_e (s_e - 1)\big)\right] \times \text{tot\_triples}$.
DISTRIBUTED AND PARALLEL
TRIANGLE COUNTING
We can use a streaming method to approximate the triangle count of huge graphs that do not fit in the main memory of conventional machines. However, if we strive for exact triangle counting on such large graphs, distributed computing provides a viable option. In this section, we give an overview of parallel and distributed triangle counting methods. The parallel methods use multi-core machines and the distributed methods run on the MapReduce platform.$^{31}$
One of the earliest works on distributed triangle counting using the MapReduce framework was proposed by Suri and Vassilvitskii.$^{17}$ They proposed a MapReduce variant of an efficient node iterator algorithm, which is shown in Algorithm 3. This algorithm has two rounds: the first round generates all length-two paths in the graph from the edge list, in parallel. The second round counts how many of the length-two paths generated in the first round have a closing edge in the graph. To accomplish this, the second round takes the output of the first round (denoted as Type 1 input in Algorithm 3) along with the original edge list (denoted as Type 2 input in Algorithm 3) as inputs. Suri and Vassilvitskii$^{17}$ also proposed a graph partition-based MapReduce algorithm, which first partitions the graph and then runs an exact triangle counting method on each partition, in parallel. Later, Park and Chung$^{32}$ identified redundant computation in Suri and Vassilvitskii’s method and proposed another partitioning method called Triangle Type Partitioning. Pagh and Tsourakakis$^{26}$ proposed a MapReduce version of their edge sampling-based method; however, this method provides only an approximate count, as it is based on sampling.
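The two-round structure can be imitated on a single machine with dictionaries standing in for the shuffle phase. The sketch below is not Algorithm 3 itself: the function name is hypothetical, and the (degree, id) ordering used to emit each length-two path only from its lowest-ranked vertex is our assumption about how an efficient node iterator avoids counting a triangle more than once.

```python
from collections import defaultdict
from itertools import combinations


def mapreduce_triangle_count(edges):
    """Single-process imitation of a two-round MapReduce node iterator."""
    edges = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(x):
        return (degree[x], x)

    # Round 1 map: key every edge by its lower-ranked end point.
    groups = defaultdict(list)
    for u, v in edges:
        lo, hi = sorted((u, v), key=rank)
        groups[lo].append(hi)
    # Round 1 reduce: emit every length-two path centered at the key.
    paths = [((min(u, w), max(u, w)), center)
             for center, higher in groups.items()
             for u, w in combinations(higher, 2)]

    # Round 2 map: key paths (Type 1) and original edges (Type 2) by end-point pair.
    edge_marker = object()
    by_pair = defaultdict(list)
    for pair, center in paths:
        by_pair[pair].append(center)
    for pair in edges:
        by_pair[pair].append(edge_marker)
    # Round 2 reduce: a path is a triangle iff its closing edge is present.
    return sum(len(vals) - 1 for vals in by_pair.values() if edge_marker in vals)
```

In a real deployment the two `defaultdict` groupings would be performed by the framework's shuffle, and the reducers of each round would run in parallel on different keys.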
Arifuzzaman et al.$^{33}$ proposed a distributed memory-based parallel algorithm for triangle counting using the message passing interface. This algorithm partitions the graph based on disjoint subsets of nodes (core nodes) and generates an induced subgraph from each subset of nodes and their neighborhood. Each induced subgraph is assigned to a machine, and triangles are counted independently on each machine for the corresponding core nodes. Last, it combines the results from all the machines to get the global triangle count.
Kim et al.$^{34}$ proposed a disk-based framework for triangle counting using a multi-core CPU. They categorize the triangles into two types: internal triangles,
for which the adjacency lists of two connected nodes are in the main memory, and external triangles, for which only one adjacency list is in the main memory. This framework stores adjacency lists in a slotted page structure on disk and uses asynchronous reads to load the required pages into buffer memory. The buffer memory is split such that it contains pages (adjacency lists) corresponding to both types of triangles (internal and external) at the same time. The triangles are counted when the adjacency lists are in the buffer, and to avoid redundancy each page is loaded into the buffer exactly once. In this framework, the two types of triangles are counted by two separate threads, and for maximum utilization of the CPU, it also uses thread morphing if one of the threads completes its work and terminates. The framework also uses OpenMP to employ additional available threads for internal triangle counting and later uses thread morphing, if required.
Shun and Tangwongsan$^{35}$ proposed a multi-core parallel algorithm for shared memory machines. The proposed algorithm has two steps: in the first step, each node is ranked based on its degree and a ranked adjacency list is generated for each node, which contains only nodes ranked higher than the current node; the second step counts triangles from the ranked adjacency list of each node. For the first step, after ranking each node, the generation of the ranked adjacency lists is easily parallelizable, as this task is independent for each node. For the second step, an array is created to hold the locally counted triangles; the size of the array is the total size of the ranked adjacency lists of all nodes. Lastly, the actual triangle count is the sum of the values in the array. Rahman and Al Hasan$^{13}$ also proposed a multi-core parallel algorithm for triangle counting, which distributes the loop of the node/edge iterator algorithms across multiple cores.
EXPERIMENTAL COMPARISONS
Schank$^{21}$ performed a thorough experimental comparison among different exact triangle counting methods. In this survey, we make a thorough comparison among various approximate triangle counting methods. For the comparison, we consider two sparsification-based methods: ‘DOULION’ and ‘Colorful triangle counting.’ For both methods, the triangles in the sparse network are counted exactly by using the efficient edge iterator algorithm. So, we call these methods doulion_Edgeiter and color_Edgeiter, respectively. We also consider two triple sampling-based methods: ‘Direct Sampling’ (Eq. (10)) and uniform sampling with importance weight adjustment, hereby named Uniform_Importance (Eq. (13)). The sampling version of the edge iterator,$^{13}$ hereby named Sampled_edgeIter, is also considered. We further consider two of the restricted access-based methods: random walk over nodes with importance weight adjustment, hereby named randWalk_importance (Eq. (14)), and ‘vertexMCMC’,$^{12}$ i.e., MCMC walk over nodes. We implemented all the above methods ourselves using identical graph data structures and an identical edge existence query module. This ensures fairness of the comparison. We intentionally omitted some of the approximate triangle counting methods from this experiment after we found that their performance is substantially poorer than the performance of the methods we report here.
For the experiment, we use four large graphs collected from KONECT, the Koblenz Network Collection (http://konect.uni-koblenz.de/networks/). The first, ‘as-skitter,’ is a network of autonomous systems on the Internet, where autonomous systems are nodes and connections between them are edges. ‘flickr,’ ‘livejournal,’ and ‘orkut’ are social networks, where each node is a user and an edge between two users shows friendship between them. The basic statistics of the datasets are shown in Table 2, where $|V|$, $|E|$, and $t(G)$ are the number of vertices, the number of edges, and the number of triangles in the graph, and time (ms) is the time taken in milliseconds by the edge iterator with hashing method, which is one of the fastest exact triangle counting methods.
Comparing approximate triangle counting
methods is tricky, as many of these methods are
sampling-based methods and hence, they have error-
runtime trade-off, i.e., if we take more samples, the
error decreases but runtime increases, and vice-versa.
Besides, the population from which these methods
sample are different; some sample triples, some sam-
ple triangles, and some sample edges. So it is not easy
to simply compare the error of these methods using a
unified sampling factor. So, we report both error and
runtime of a method as a point on a graph. For each
method, we take three points by running them for
three different sampling factor values. For a given
method, we connect the three points by a piece-wise
linear curve. To obtain the data of this graph, the
TABLE 2 |Basic Statistics of the Datasets
Dataset |
V
||
E
|
t
(
G
) Time (ms)
as-skitter 1.69M 11.09M 28.77M 38, 989.82
flickr 1.72M 15.55M 548.17M 174, 216.12
liveJournal 5.20M 48.71M 310.87M 231, 541.28
orkut 3.07M 117, 19M 627.58M 867, 634.33
Advanced Review wires.wiley.com/dmkd
14 of 19 © 2017 Wiley P e r i o d i c a l s , Inc. Vol u m e 8 , M a r c h / A pril 2018
error is computed as percentage error, which is equal to
$\frac{|\text{exact-count} - \text{approx-count}| \times 100}{\text{exact-count}}$,
and runtime is reported in milliseconds.
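The error measure and the log-log coordinates used for plotting can be written out as a small Python sketch. The function names are hypothetical, and the handling of a zero error (where the logarithm is undefined) is left to the caller.

```python
import math


def percent_error(exact_count, approx_count):
    """Percentage error of an approximate triangle count."""
    if exact_count <= 0:
        raise ValueError("exact_count must be positive")
    return abs(exact_count - approx_count) * 100.0 / exact_count


def log_point(runtime_ms, error_pct):
    """Map (runtime, error) to log10 coordinates.

    math.log10 raises ValueError for non-positive input, so an exact
    estimate (zero error) cannot be placed on a log-scaled axis.
    """
    return math.log10(runtime_ms), math.log10(error_pct)
```

For instance, an estimate of 190 against an exact count of 200 gives an error of 5%.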
For the sparsification-based methods and edge sampling-based
methods, we use the sampling probability p ∈ {0.001, 0.01, 0.1}.
For both the full access- and restricted access-based triple
sampling methods, the number of samples |T| is selected from
{0.001%, 0.01%, 0.1%} of |Π|, where |Π| is the total number of
triples. These values of p and |T| help us understand how time and
error are related for the approximation methods. For all the
methods, we use a serial version, although they can easily be
parallelized to run faster. We run all our experiments on a machine
with a 2.3 GHz AMD processor, 128 GB RAM, and the Red Hat
Enterprise Server Release 7.3 OS. For every data point, the value
is computed by running the corresponding method 10 times and taking
the average of those runs.
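To make the protocol concrete, the sketch below applies DOULION-style edge sparsification for a given p and averages the estimate over 10 runs. It assumes the standard DOULION estimator (keep each edge independently with probability p, count triangles in the sparsified graph, and scale by 1/p^3) and a list of distinct undirected edges; the function names and the seeding scheme are illustrative, not taken from the experimental framework.

```python
import random
from collections import defaultdict


def count_triangles(edges):
    """Exact triangle count via neighbor-set intersection on each edge."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return closed // 3  # each triangle is seen from its three edges


def sparsified_estimate(edges, p, rng):
    """Keep each edge with probability p, then scale the count by 1/p^3."""
    kept = [edge for edge in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3


def averaged_estimate(edges, p, runs=10, seed=0):
    """Average the sparsified estimate over several independent runs."""
    rng = random.Random(seed)
    return sum(sparsified_estimate(edges, p, rng) for _ in range(runs)) / runs
```

The scaling factor follows from the fact that a triangle survives sparsification only if all three of its edges are kept, which happens with probability p^3.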
In Figure 2, we show four charts, one for each of the datasets.
Each chart has seven piece-wise linear curves, each representing
one of the approximate triangle counting methods. Each curve has
three points, representing log(error) versus log(runtime) of a
method for three sampling rates (p values). All the curves have a
negative slope, showing the inverse relationship between error and
runtime, i.e., at the lowest sampling rate a method has the
smallest runtime but the largest error. The data point that is
closest to the origin is the best, as it has both small error and
small runtime.
If we observe the graphs carefully, we can conclude that both
sparsification-based methods take more time than the other
approximation methods to achieve comparable accuracy. For lower
values of p, DOULION has very high error; the colorful sampling
method consistently achieves lower error than DOULION in that case,
but takes more time. Among the other approximation methods, almost
all take a similar amount of time but have different error values.
The best approximation method is ‘Direct Sampling,’
FIGURE 2 | Comparison of approximation methods. Panels: (a) results
for as-skitter, (b) results for flickr, (c) results for
liveJournal, and (d) results for orkut. In each panel, the
horizontal axis is Log10 (ms), the vertical axis is Log10 (%error),
and the curves are Direct_sampling, Uniform_importance,
Doulion_edgeIter, Sampled_edgeIter, Color_edgeIter,
RandWalk_importance, and VertexMCMC.
which always provides the lowest error. The Sampled_edgeIter method
also performs consistently well, and on some occasions it is faster
than the ‘Direct Sampling’ method, but with a higher error. The
indirect sampling-based methods (MCMC and importance weight-based
sampling) are fast, but they generally have a higher error than the
direct sampling-based method.
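For readers who want a concrete picture of direct triple sampling, the following Python sketch draws triples (wedges) uniformly at random under full graph access and scales the closed fraction to a triangle estimate. We assume here that ‘Direct Sampling’ corresponds to this uniform triple sampling scheme; the function name, the seeding, and the use of integer-like hashable vertex ids are our own choices.

```python
import random
from collections import defaultdict


def estimate_triangles_by_triple_sampling(edges, num_samples, seed=0):
    """Estimate t(G) from the fraction of uniformly sampled triples that are closed."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # A vertex of degree d is the center of d*(d-1)/2 triples.
    centers = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    total_triples = sum(weights)
    if total_triples == 0 or num_samples <= 0:
        return 0.0

    rng = random.Random(seed)
    neighbor_lists = {v: list(adj[v]) for v in centers}
    closed = 0
    # Choosing the center proportionally to its triple count, then two
    # distinct neighbors uniformly, yields a uniform triple.
    for center in rng.choices(centers, weights=weights, k=num_samples):
        a, b = rng.sample(neighbor_lists[center], 2)
        if b in adj[a]:
            closed += 1

    # Each triangle contains exactly three closed triples.
    return (closed / num_samples) * total_triples / 3.0
```

On a complete graph every sampled triple is closed, so the estimate equals the exact count regardless of the number of samples.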
OTHER RELATED COUNTING TASKS
Although the triangle counting task has received enormous
attention, there are other counting tasks that count higher order
graphical structures beyond triangles. An obvious extension of a
triangle is a k-clique, a complete graph with k vertices, and a
related counting task is to count the distinct k-cliques in a given
graph for a user-chosen value of k. However, counting k-cliques is
much more difficult than counting triangles, as the number of
k-cliques can grow exponentially with the value of k. An efficient
sequential algorithm for k-clique counting can be obtained by
modifying the well-known Bron–Kerbosch algorithm [36]. However,
this algorithm does not scale to large real-life networks. To
address this lack of scalability, Finocchi et al. [37] proposed a
distributed k-clique counting algorithm that runs on MapReduce.
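The k-clique counting task can be illustrated with a short recursive sketch in Python. This is not the modified Bron–Kerbosch algorithm or the MapReduce method cited above; it is a simple ordered-extension scheme that we use only to show what is being counted, and it assumes comparable vertex ids.

```python
from collections import defaultdict


def count_k_cliques(edges, k):
    """Count k-cliques by growing cliques in increasing vertex-id order."""
    if k < 1:
        raise ValueError("k must be at least 1")
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Keep only higher-id neighbors so that each clique is generated once.
    higher = {u: {v for v in adj[u] if v > u} for u in adj}

    def extend(candidates, size):
        if size == k:
            return 1
        if len(candidates) < k - size:  # not enough vertices left to finish
            return 0
        total = 0
        for v in candidates:
            # Candidates must stay adjacent to every vertex chosen so far.
            total += extend(candidates & higher[v], size + 1)
        return total

    return extend(set(adj), 0)
```

With k = 3 this reduces to triangle counting, and the rapid growth of the recursion tree with k reflects the difficulty discussed above.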
Graphlet counting is another related task, which has become very
popular in recent years because of its wide applicability in
different domains for various tasks, including network
classification [38], biological network comparison [39,40], image
classification [41], and building graph kernels for
chemoinformatics [42]. For a given k, all possible connected graph
topologies with k vertices are collectively referred to as
graphlets. For undirected graphs, there are 2 size-3 graphlets
(open triples and closed triples), 6 size-4 graphlets, and 21
size-5 graphlets. The number of distinct graphlets increases
exponentially with the size of the graphlet. The counting task is
to obtain the total number of distinct induced occurrences of all
graphlets (of a given size) in a given network.
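For the smallest case, size-3 graphlets, the induced counts follow directly from the degree sequence and the triangle count, as the Python sketch below shows. The function name and the returned dictionary keys are hypothetical; the identity used is that every triangle accounts for three of the (not necessarily induced) two-edge paths.

```python
from collections import defaultdict


def count_size3_graphlets(edges):
    """Return induced counts of open triples and closed triples (triangles)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Triangles: each one is seen once from each of its three edges.
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    triangles = closed // 3

    # Two-edge paths centered at each vertex, induced or not.
    wedges = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())

    # Removing the three wedges inside every triangle leaves the induced open triples.
    return {"open_triple": wedges - 3 * triangles, "closed_triple": triangles}
```

A triangle with one pendant edge, for example, has one closed triple and two induced open triples.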
In the existing literature, several works solve the graphlet
counting problem exactly; examples include FANMOD [43], RAGE [44],
GRAFT [38], and ESCAPE [45]. The earliest among these, FANMOD and
RAGE, are very slow, mainly because they use an enumeration-based
approach. GRAFT is faster than these two, as it only enumerates
tree graphlets and then counts the other graphlets by efficient
edge existence checks. Hočevar and Demšar [46] provided an
efficient method, named ORCA, which does not enumerate all
graphlets, but counts a subset of graphlets and calculates the
other graphlet counts using a combinatorial approach. However, ORCA
does not scale well to huge real-world graphs with millions of
nodes or edges. Recently, Ahmed et al. [47] provided a highly
efficient and scalable method, namely PGD, for graphlet counting by
utilizing graphlet transitions. A graphlet transition relates two
graphlets through the addition or removal of an edge, which helps
to calculate the count of one graphlet from the count of another,
smaller graphlet. PGD is scalable, but works only for graphlets of
up to four vertices. Pinar et al. [45] proposed a method (ESCAPE)
that can count five-vertex graphlets very efficiently. Similar to
PGD, ESCAPE calculates the counts of four- and five-vertex
graphlets using the counts of a specific set of other (mostly
smaller) graphlets.
As in the case of triangle counting, approximate methods have also
been popular for graphlet counting. Rand-ESU (available in the
FANMOD library) is one of the earliest approximate graphlet
counting methods, but its accuracy is poor. In recent years,
Bhuiyan et al. [48] proposed a method called GUISE, which uses MCMC
sampling to obtain uniform samples of graphlets through a random
walk. GUISE samples graphlets of up to size 5, but Saha and Al
Hasan [49] generalized the method so that it can sample graphlets
of any size. Wang et al. [50] proposed an improved and more
efficient method based on random walk. Rahman et al. [51] proposed
an edge sampling-based approximation method (GRAFT), which aligns a
sampled edge with a specific edge of a graphlet and then enumerates
all embeddings of the graphlet. Jha et al. [52] proposed a
three-path sampling-based method for approximate counting of size-4
graphlets, which has been shown to be more efficient than GUISE and
GRAFT. Recently, Bressan et al. [53] proposed a color coding-based
approach and showed its superiority over MCMC-based methods.
CONCLUSIONS
Triangles play a very important role in network analysis. In social
networks, triangles represent transitivity, which is important for
understanding network evolution over time. In biological networks,
several motifs representing various biological pathways have been
found to be triangles. Due to the importance of triangles,
enumerating and counting them in large networks are important
tasks. Both enumeration and counting of triangles have been studied
for a long time, but in recent years, there has
been a renewed interest in triangle counting methods
considering approximate counting, parallel and dis-
tributed implementation, and restricted and stream-
ing data access scenarios.
In this survey, we discuss the existing methods of triangle
counting, ranging from sequential to parallel, single-machine to
distributed, exact to approximate, and off-line to streaming. We
place more emphasis on the recent methods, specifically on the
methods for approximate triangle counting by sampling. We also
present an experimental comparison of the approximate triangle
counting methods, implemented with uniform data structures. Our
results show that triple sampling-based methods are superior to
other approximate triangle counting methods, in terms of both
accuracy and runtime. Future work in this direction will consider
counting higher order graphical structures having more than three
vertices. Some works on counting higher order structures have
already emerged, but given that it is a very active area of
research, we expect that many more studies will appear in the near
future.
ACKNOWLEDGMENT
This research is supported partly by NSF-1149851
grant and a Research Award from CareerBuilder.
REFERENCES
1. Watts DJ, Strogatz S. Collective dynamics of ‘small-world’
networks. Nature 1998, 393:440–442.
2. Luce RD, Perry AD. A method of matrix analysis of
group structure. Psychometrika 1949, 14:95–116.
3. McPherson M, Smith-Lovin L, Cook JM. Birds of a
feather: homophily in social networks. Annu Rev Soc
2001, 27:415–444.
4. Aggarwal C, Subbian K. Evolutionary network analy-
sis: a survey. ACM Comput Surv 2014,
47:10:1–10:36. https://doi.org/10.1145/2601412.
5. Becchetti L, Boldi P, Castillo C, Gionis A. Efficient
semi-streaming algorithms for local triangle counting
in massive graphs. In: Proc. of 4th ACM SIGKDD,
2008, 6–24.
6. Eckmann JP, Moses E. Curvature of co-links uncovers
hidden thematic layers in the world wide web. Proc
Natl Acad Sci U S A 2002, 99:5825–5829.
7. Bar-Yossef Z, Kumar R, Sivakumar D. Reductions in streaming
algorithms, with an application to counting triangles in graphs.
In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on
Discrete Algorithms, SODA ’02, Philadelphia, PA, USA, 2002,
623–632. Society for Industrial and Applied Mathematics.
ISBN: 0-89871-513-X. Available at:
http://dl.acm.org/citation.cfm?id=545381.545464.
8. Palla G, Dereny I, Farkas I, Vicsek T. Uncovering the
overlapping community structure of complex networks
in nature and society. Nature 2005, 435:814–818.
9. Itai A, Rodeh M. Finding a minimum circuit in a
graph. In: Proceedings of the Ninth Annual ACM Sym-
posium on Theory of Computing, STOC ’77,
1977, 1–10.
10. Alon N, Yuster R, Zwick U. Finding and counting
given length cycles. Algorithmica 1997, 17:209–223.
11. Tsourakakis CE. Fast counting of triangles in large real
networks without counting: algorithms and laws. In: 2008 IEEE 8th
International Conference on Data Mining, 2008, 608–617.
12. Rahman M, Al Hasan M. Sampling triples from restricted networks
using MCMC strategy. In: Proceedings of the 23rd ACM International
Conference on Information and Knowledge Management, CIKM 2014,
Shanghai, China, 3–7 November, 2014, 1519–1528.
10.1145/2661829.2662075.
13. Rahman M, Al Hasan M. Approximate triangle counting algorithms
on multi-cores. In: Proceedings of the 2013 IEEE International
Conference on Big Data, Santa Clara, CA, USA, 6–9 October, 2013,
127–133. 10.1109/BigData.2013.6691744
14. Tsourakakis CE, Kang U, Miller GL, Faloutsos C.
Doulion: counting triangles in massive graphs with a
coin. In: Proceedings of the Fifteen ACM SIGKDD
International Conference on Knowledge Discovery in
Data Mining, 2009.
15. Jha M, Seshadhri C, Pinar A. A space efficient stream-
ing algorithm for triangle counting using the birthday
paradox. In: Proceedings of the 19th ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, KDD ’13, New York, NY, USA,
ACM, 2013, 589–597. ISBN: 978-1-4503-2174-7.
10.1145/2487575.2487678
16. Kolda TG, Pinar A, Seshadhri C. Triadic measures on
graphs: the power of wedge sampling. In: SIAM Data
Mining, SIAM, 2013, 10–18.
17. Suri S, Vassilvitskii S. Counting triangles and the curse
of the last reducer. In: Proceedings of the 20th Interna-
tional Conference on World Wide Web, WWW ’11,
2011, 607–614.
18. Buriol LS, Frahling G, Leonardi S, Marchetti-Spaccamela A,
Sohler C. Counting triangles in
data streams. In: Proceedings of the Twenty-fifth ACM
SIGMOD-SIGACT-SIGART Symposium on
Principles of Database Systems, PODS ’06, New York,
NY, USA. ACM, 2006, 253–262. ISBN: 1-59593-318-
2. 10.1145/1142351.1142388
19. Barabasi A-L, Albert R. Emergence of scaling in ran-
dom networks. Science 1999, 286:509–512.
20. Newman MEJ, Watts DJ, Strogatz SH. Random graph
models of social networks. Proc Natl Acad Sci U S A
2002, 99(suppl 1):2566–2572.
21. Schank T. Algorithmic aspects of triangle-based net-
work analysis. PhD Thesis, Department of Computer
Science, University of Karlsruhe, 2007.
22. Le Gall F. Powers of tensors and fast matrix multiplica-
tion. In: Proceedings of the 39th International Sympo-
sium on Symbolic and Algebraic Computation, 2014.
23. Schank T, Wagner D. Finding, counting and listing all
triangles in large graphs, an experimental study. In:
Proceedings of the 4th International Conference on
Experimental and Efficient Algorithms, WEA ’05,
2005, 606–609.
24. Latapy M. Main-memory triangle computations for
very large (sparse (power-law)) graphs. Theor Comput
Sci 2008, 407:458–473.
25. Tsourakakis CE, Kang U, Miller GL, Faloutsos C.
Doulion: counting triangles in massive graphs with a
coin. In: Proceedings of the 15th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and
Data Mining, KDD ’09, 2009, 837–846.
26. Pagh R, Tsourakakis CE. Colorful triangle counting
and a mapreduce implementation. Inf Process Lett
2012, 112:277–281.
27. Etemadi R, Lu J, Tsin YH. Efficient estimation of tri-
angles in very large graphs. In: Proceedings of the 25th
ACM International on Conference on Information and
Knowledge Management, CIKM ’16, 2016,
1251–1260.
28. Schank T, Wagner D. Approximating clustering-
coefficient and transitivity. J Graph Algorithms Appl
2005, 9:265–275.
29. Al Hasan M. Chapter 5: Methods and applications of
network sampling. In: Gupta A, Capponi A, eds. Opti-
mization Challenges in Complex, Networked and
Risky Systems. Catonsville, MD: INFORMS; 2016,
115–139. https://doi.org/10.1287/educ.2016.0147.
30. Vitter JS. Random sampling with a reservoir. ACM
Trans Math Softw 1985, 11:37–57.
31. Dean J, Ghemawat S. Mapreduce: simplified data pro-
cessing on large clusters. In: Proc. of the 6th confer-
ence on Operating Systems Design and
Implementation –Volume 6, 2004, 137–149.
32. Park H-M, Chung C-W. An efficient mapreduce algo-
rithm for counting triangles in a very large graph. In:
Proceedings of the 22nd ACM International Conference
on Information & Knowledge Management,
CIKM ’13, 2013, 539–548.
33. Arifuzzaman S, Khan M, Marathe M. Patric: a parallel
algorithm for counting triangles in massive networks.
In: Proceedings of the 22nd ACM International Conference
on Information & Knowledge Management,
CIKM ’13, 2013, 529–538.
34. Kim J, Han W-S, Lee S, Park K, Yu H. Opt: a new
framework for overlapped and parallel triangulation in
large-scale graphs. In: Proceedings of the 2014 ACM
SIGMOD International Conference on Management
of Data, SIGMOD ’14, 2014, 637–648.
35. Shun J, Tangwongsan K. Multicore triangle computations without
tuning. In: 2015 IEEE 31st International Conference on Data
Engineering, April 2015, 149–160.
36. Bron C, Kerbosch J. Algorithm 457: finding all cliques
of an undirected graph. Commun ACM 1973,
16:575–577. https://doi.org/10.1145/362342.362367.
37. Finocchi I, Finocchi M, Fusco EG. Clique counting in mapreduce:
algorithms and experiments. J Exp Algorithmics 2015,
20:1.7:1–1.7:20. https://doi.org/10.1145/2794080.
38. Rahman M, Bhuiyan MA, Al Hasan M. GRAFT: an efficient graphlet
counting method for large graph analysis. IEEE Trans Knowl Data Eng
2014, 26:2466–2478. https://doi.org/10.1109/TKDE.2013.2297929.
39. Hayes W, Sun K, Pržulj N. Graphlet-based measures
are suitable for biological network comparison. Bioin-
formatics 2013, 29:483.
40. Pržulj N. Biological network comparison using graph-
let degree distribution. Bioinformatics 2007, 23:e177.
41. Zhang L, Hong R, Gao Y, Ji R, Dai Q, Li X. Image
categorization by learning a propagated graphlet path.
IEEE Trans Neural Netw Learn Syst 2016,
27:674–685.
42. Kashima H, Saigo H, Hattori M, Tsuda K. Graph ker-
nels for chemoinformatics. In: Chemoinformatics and
Advanced Machine Learning Perspectives: Complex
Computational Methods and Collaborative Techni-
ques. Hershey, PA: IGI Global; 2010, 1.
43. Wernicke S, Rasche F. Fanmod: a tool for fast network
motif detection. Bioinformatics 2006, 22:1152–1153.
44. Marcus D, Shavitt Y. Rage –a rapid graphlet enumer-
ator for large networks. Comput Netw 2012,
56:810–819.
45. Pinar A, Seshadhri C, Vishal V. Escape: efficiently
counting all 5-vertex subgraphs. In: Proceedings of the
26th International Conference on World Wide Web,
WWW ’17, 2017, 1431–1440.
46. Hočevar T, Demšar J. A combinatorial approach to
graphlet counting. Bioinformatics 2014, 30:559–565.
47. Ahmed NK, Neville J, Rossi RA, Duffield N. Efficient
graphlet counting for large networks. In: Proceedings
of the 2015 IEEE International Conference on Data
Mining (ICDM), ICDM ’15, 2015, 1–10. ISBN: 978-1-
4673-9504-5.
48. Bhuiyan MA, Rahman M, Rahman M, Al Hasan M. GUISE: uniform
sampling of graphlets for large graph analysis. In: 2012 IEEE 12th
International Conference on Data Mining, December, 2012, 91–100.
10.1109/ICDM.2012.87.
49. Saha TK, Al Hasan M. Fs³: a sampling based method for top-k
frequent subgraph mining. Stat Anal Data Min 2015, 8:245–261.
https://doi.org/10.1002/sam.11277.
50. Wang P, Lui JCS, Ribeiro B, Towsley D, Zhao J, Guan X.
Efficiently estimating motif statistics of large networks. ACM
Trans Knowl Discov Data 2014, 9:8:1–8:27. ISSN: 1556-4681.
51. Rahman M, Bhuiyan M, Al Hasan M. Graft: An
approximate graphlet counting algorithm for large
graph analysis. In: Proceedings of the 21st ACM Inter-
national Conference on Information and Knowledge
Management, CIKM ’12, 2012, 1467–1471.
52. Jha M, Seshadhri C, Pinar A. Path sampling: a fast and
provable method for estimating 4-vertex subgraph
counts. In: Proceedings of the 24th International Con-
ference on World Wide Web, WWW ’15, 2015,
495–505.
53. Bressan M, Chierichetti F, Kumar R, Leucci S,
Panconesi A. Counting graphlets: space vs time. In:
Proceedings of the Tenth ACM International Confer-
ence on Web Search and Data Mining, WSDM ’17,
2017, 557–566.