Counting and enumeration of local topological structures, such as triangles, is an important task for analyzing large real‐life networks. For instance, triangle count in a network is used to compute transitivity—an important property for understanding graph evolution over time. Triangles are also used for various other tasks completed for real‐life networks, including community discovery, link prediction, and spam filtering. The task of triangle counting, though simple, has gained wide attention in recent years from the data mining community. This is due to the fact that most of the existing algorithms for counting triangles do not scale well to very large networks with millions (or even billions) of vertices. To circumvent this limitation, researchers proposed triangle counting methods that approximate the count or run on distributed clusters. In this paper, we discuss the existing methods of triangle counting, ranging from sequential to parallel, single‐machine to distributed, exact to approximate, and off‐line to streaming. We also present experimental results of performance comparison among a set of approximate triangle counting methods built under a unified implementation framework. Finally, we conclude with a discussion of future works in this direction. WIREs Data Mining Knowl Discov 2018, 8:e1226. doi: 10.1002/widm.1226 This article is categorized under: • Algorithmic Development > Structure Discovery
Advanced Review
Triangle counting in large networks: a review
Mohammad Al Hasan* and Vachik S. Dave
© 2017 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2018, 8:e1226. doi: 10.1002/widm.1226
INTRODUCTION
Network data appear in many domains, including social, communication, and information sciences. Although the networks in these domains differ in their structural composition, some topological structures, specifically triangles, appear in abundance across networks in all of these domains. The abundance of triangles in real-life networks motivated scientists to invent metrics, such as the clustering coefficient [1] or the transitivity ratio [2], to characterize and analyze networks. The existence of triangles in social networks has also been studied and explained through various social science theories, such as homophily [3] and transitivity. A key computational task for all these studies is to count the number of triangles in a network, which is the focus of this work.
There are many real-life applications of triangle counting. The best known among them is computing the transitivity ratio (or simply transitivity) of a network, which is defined as the ratio between the counts of triangles and triples (paths of length two) in a network. Given that the number of triples can be computed simply from the degrees of the vertices of a network, transitivity computation becomes equivalent to the task of triangle counting. The clustering coefficient is a similar metric, but its value is defined for a given vertex of a network: for a vertex u, its clustering coefficient is the fraction of pairs of u's neighbors that are themselves neighbors. Both the clustering coefficient and transitivity have been used as key metrics for network analysis and network evolution models [4].
Triangle count has also been used for several other non-obvious applications. Becchetti et al. [5] used the distribution of local triangles for detecting web spam. Specifically, they showed that the distribution of local triangle frequency of spam hosts is significantly different from that of non-spam hosts. The distribution of triangles is also used to uncover hidden thematic structure in the World Wide Web. Eckmann and Moses showed that connected regions of the web graph that are dense in triangles represent a common topic [6]. Bar-Yossef et al. [7] used triangle count for query plan optimization in databases. Overlapping triangles (or, more generally, k-cliques) have been used for community discovery [8].
*Correspondence to: alhasan@iupui.edu
Department of Computer Science, IUPUI, Indianapolis, IN, USA
Conflict of interest: The authors have declared no conflicts of interest for this article.
Volume 8, March/April 2018 © 2017 Wiley Periodicals, Inc.
Triangle counting, though it appears to be a simple task algorithmically, has attracted many contributions over the years from scientists in diverse domains, including data mining and graph theory. While earlier works [9,10] mainly focused on asymptotic computational complexity, in recent works real-life execution time has been a major consideration, motivated by the enormous size of real-life networks, whose vertex counts range from millions to billions. To achieve efficiency, approximate triangle counting through sampling has been a very active direction in many recent works [11–16]. Researchers have also tried to achieve efficiency through algorithms that run on multi-core or distributed environments [13,14,17]. Some variants of triangle counting algorithms have also been inspired by data access constraints. For example, triangle counting algorithms have been proposed for data access scenarios that differ from traditional random memory access, such as restricted access [12] and streaming data access [16,18].
The computational complexity of a triangle counting algorithm is a good indicator of its efficiency, but in practice the execution times of two algorithms can differ widely even if they have the same computational complexity. The main reason is the hidden constant of the computational complexity, which depends on various properties of the input graph. Sparsity is one such property. Large real-life networks are very sparse: the number of edges is typically a constant factor of the number of vertices; in other words, the average degree of a vertex is constant. Another important property is that the degree distribution of real-life networks is skewed. Although the average degree of a network is constant, there always exist a few vertices that have a very large degree. This phenomenon is commonly known as power-law degree distribution [19], and it significantly affects the performance of a triangle counting algorithm.
In this paper, we provide a thorough review of triangle counting algorithms. We group the existing methods based on their computation model or data access patterns. Then, we discuss the algorithms by comparing and contrasting their time complexity. Finally, we show some experimental results that compare the performance of some of these algorithms. The following section provides definitions of various concepts related to the task of counting triangles. For the reader's convenience, Table 1 lists the notations used throughout the paper.
BACKGROUND
G(V, E) is a graph where V is the set of vertices and E is the set of edges. We use n and m to represent the number of vertices (|V|) and the number of edges (|E|), respectively. Each vertex in the graph can be uniquely identified by a number between 1 and n. The assignment of identifiers can be arbitrary, but it is fixed. We also assume that G is simple, connected, and undirected. Because G is simple, there exists at most one edge between a pair of vertices u and v, which we denote by (u, v) where u < v. For a vertex u, we use d(u) to denote the degree of u, adj(u) to denote the set of u's neighboring vertices, and inc(u) to denote the edges that are incident to u. Likewise, for an edge e, we use inc(e) to denote the vertices incident to the edge e. It is easy to see that $\sum_{u \in V} d(u) = 2m$. The maximum degree over the vertices is denoted by $d_{\max}$.
Triples and Triangles
A (connected) triple (u, v, w) at a vertex v is a path of length two for which v is the center vertex. If the other two vertices (u and w) are also connected by an edge, the triple is called a closed triple (triangle); otherwise it is called an open triple. A triangle actually contains three closed triples, one centered on each of its vertices.
We use the symbol $\Pi_v$ to represent the set of triples that are centered at the vertex v. The set of triples in a graph G = (V, E) is $\Pi$, which is the union of the sets of triples at each of its nodes, i.e., $\Pi = \bigcup_{v \in V} \Pi_v$. If the degree of each of the vertices is known, the total number of triples can be computed efficiently as below:

$$|\Pi| = \sum_{v \in V} |\Pi_v| = \sum_{v \in V} \binom{d(v)}{2}. \tag{1}$$

Based on whether the triple is open or closed (in terms of its induced embedding in the graph G), we partition the set $\Pi$ into $\Pi^{\wedge}$ (open triples) and $\Pi^{\Delta}$ (closed triples). Note that each of the nodes of a triangle in a graph G contributes one distinct triple to the set $\Pi^{\Delta}$. We use $\Lambda$ to represent the set of distinct triangles in a graph G. Clearly, the size of $\Pi^{\Delta}$ is three times the size of $\Lambda$, as the former contains three copies of each distinct triangle, one centered at each of the triangle's vertices. Mathematically, $|\Lambda| = \frac{1}{3}|\Pi^{\Delta}|$.
To represent the sets of open and closed triples centered at a vertex v, we use $\Pi^{\wedge}_v$ and $\Pi^{\Delta}_v$, respectively. If t(G) is the number of triangles in the graph G, then

$$t(G) = |\Lambda| = \frac{1}{3}|\Pi^{\Delta}| = \frac{1}{3}\sum_{v \in V} |\Pi^{\Delta}_v|. \tag{2}$$

Counting, Enumeration, and Sampling of Triangles
For a given graph G, triangle counting is the task of obtaining the number t(G) as defined in Eq. (2). Triangle enumeration, on the other hand, is the task of enumerating the members of $\Lambda$, i.e., listing all unique triangles in a given graph. Enumeration is a costlier task than counting because the former solves the latter immediately, but the latter does not necessarily solve the former. Nevertheless, many real-life applications need the triangles themselves rather than just their total count, so both counting and enumeration tasks stand on their own merit. Finally, sampling of triangles is the task of obtaining a subset of $\Lambda$, where the size of the subset is typically a user-defined parameter. Depending on the sampling algorithm, the triangles in the sample set can be chosen uniformly (each triangle is sampled with equal probability) or with a biased probability. Sometimes we are only interested in the count of triangles that are incident to a given vertex. This task is known as local triangle counting. The local triangle count is needed to compute the clustering coefficient of a given vertex.
(Local) Clustering Coefficient
The clustering coefficient is a metric denoting the clustering tendency of the vertices in a graph. When the metric is defined on a vertex of a graph, it is called the local clustering coefficient. For a given vertex u, its local clustering coefficient C(u) is the fraction of pairs of u's neighbors that are themselves neighbors. Mathematically,

$$C(u) = \frac{\left|\{(v, w) : (v, w) \in E \wedge v, w \in \mathrm{adj}(u)\}\right|}{|\mathrm{adj}(u)|\,(|\mathrm{adj}(u)| - 1)/2}. \tag{3}$$

The average of the local clustering coefficient over the vertices is called the clustering coefficient of the network.
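The definition in Eq. (3) can be illustrated with a minimal Python sketch. It assumes the graph is stored as a dictionary `adj` that maps each vertex to the set of its neighbors; the function name `local_clustering` and the convention C(u) = 0 for vertices with fewer than two neighbors are assumptions made for this illustration.

```python
from itertools import combinations


def local_clustering(adj, u):
    """Local clustering coefficient C(u) following Eq. (3).

    adj: dict mapping each vertex to the set of its neighbors
    (an assumed representation of a simple undirected graph).
    """
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs, so C(u) = 0
    # Count the pairs of neighbors of u that are themselves connected.
    closed = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return closed / (k * (k - 1) / 2)


# Small example: triangle 1-2-3 plus a pendant vertex 4 attached to vertex 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
```

In this example, vertex 1 has three neighbor pairs, of which only (2, 3) is connected, so `local_clustering(adj, 1)` evaluates to 1/3.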
Transitivity
Newman et al. [20] defined the transitivity of a graph G as the number of closed triples divided by the number of all triples over the entire network. We use $\gamma(G)$ to denote the transitivity of G:

$$\gamma(G) = \frac{|\Pi^{\Delta}|}{|\Pi|} = \frac{|\Pi^{\Delta}|}{|\Pi^{\wedge}| + |\Pi^{\Delta}|}. \tag{4}$$

Using Eqs. (2) and (4), the triangle count t(G) of a network can be obtained from the transitivity of the network as below:

$$t(G) = \frac{1}{3}\gamma(G)\,|\Pi|. \tag{5}$$

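The relations among Eqs. (1), (2), (4), and (5) can be checked numerically with a short Python sketch. The dictionary-of-sets representation `adj` and the function names are assumptions made for this illustration.

```python
def count_triples(adj):
    """|Pi| from Eq. (1): sum over the vertices of C(d(v), 2)."""
    return sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())


def transitivity(adj, num_triangles):
    """gamma(G) from Eq. (4), using |Pi^Delta| = 3 * t(G) from Eq. (2)."""
    triples = count_triples(adj)
    return 3 * num_triangles / triples if triples else 0.0


def triangles_from_transitivity(adj, gamma):
    """t(G) from Eq. (5)."""
    return gamma * count_triples(adj) / 3


# Triangle 1-2-3 plus a pendant vertex 4 attached to vertex 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
```

This graph has degrees 3, 2, 2, and 1, hence 3 + 1 + 1 + 0 = 5 triples; with its single triangle, the transitivity is 3/5.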
TABLE 1 | Summary of the Notations

| Notation | Meaning |
|---|---|
| n | Number of vertices |
| m | Number of edges |
| d(u) | Degree of vertex u |
| adj(u) | Set of neighboring vertices of the vertex u |
| inc(u) | Set of edges incident to vertex u |
| $\Pi$ | Set of all triples |
| $\Pi_v$ | Set of triples centered at vertex v |
| $\Pi^{\wedge}$ | Set of all open triples |
| $\Pi^{\Delta}$ | Set of all closed triples |
| $\Lambda$ | Set of distinct triangles |
| t(G) | Number of triangles in the graph G |
| $\gamma(G)$ | Transitivity of the graph G |
| N(u) | Sorted neighbors of vertex u |
| A | Adjacency matrix |
| $d_{\max}$ | Max degree of a vertex in the graph |

Metropolis–Hastings (MH) Algorithm
Several approximate triangle counting methods sample triangles or triples using random walk-based indirect sampling strategies, also known as Markov Chain Monte Carlo (MCMC) sampling. The Metropolis–Hastings (MH) algorithm is a variant of MCMC; its goal is to draw samples from some distribution $\pi(x)$, called the target distribution, where $\pi(x) = f(x)/K$. Here, f(x) is any function that assigns a nonnegative real value to a population object x denoting its desirability with regard to sampling, and K is a normalization constant that makes the sum of $\pi(x)$ over the population objects equal to 1. Typically, K is unknown or difficult to compute.
The MH algorithm is used together with a random walk to perform MCMC sampling. For this, the MH algorithm draws a sequence of samples from the target distribution as follows:
1. It picks an initial state (say, x) satisfying f(x) > 0.
2. From the current state x, it samples a state y using a distribution q(x, y), referred to as the proposal distribution.
3. Then, it calculates the acceptance probability $\alpha(x, y)$ (Eq. (6)) and accepts the proposed move to y with probability $\alpha(x, y)$. The process continues until the Markov chain reaches a stationary distribution.

$$\alpha(x, y) = \min\left\{\frac{\pi(y)\,q(y, x)}{\pi(x)\,q(x, y)},\, 1\right\} = \min\left\{\frac{f(y)\,q(y, x)}{f(x)\,q(x, y)},\, 1\right\}. \tag{6}$$

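The three MH steps can be sketched in Python for a random walk over the vertices of a graph. This is a minimal illustration under stated assumptions rather than a method taken from the reviewed works: the proposal picks a neighbor uniformly at random, so q(x, y) = 1/d(x), and the names `acceptance` and `mh_vertex_walk` as well as the dictionary-of-sets graph `adj` are hypothetical.

```python
import random


def acceptance(f_x, f_y, q_xy, q_yx):
    """Acceptance probability alpha(x, y) of Eq. (6), written with f in place of pi."""
    return min((f_y * q_yx) / (f_x * q_xy), 1.0)


def mh_vertex_walk(adj, f, start, steps, rng):
    """Metropolis-Hastings walk over the vertices of a graph.

    Assumed proposal: move to a uniformly chosen neighbor, so q(x, y) = 1/d(x).
    f: unnormalized target weight of a vertex; f(start) must be positive.
    """
    x = start
    samples = []
    for _ in range(steps):
        y = rng.choice(sorted(adj[x]))  # step 2: propose a neighbor of x
        alpha = acceptance(f(x), f(y), 1.0 / len(adj[x]), 1.0 / len(adj[y]))
        if rng.random() < alpha:  # step 3: accept with probability alpha
            x = y
        samples.append(x)  # on rejection the chain stays at x
    return samples


adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
# A constant f gives a uniform target distribution over the vertices.
samples = mh_vertex_walk(adj, lambda v: 1.0, start=1, steps=1000, rng=random.Random(0))
```

With a constant f, the acceptance probability reduces to min{d(x)/d(y), 1}, so moves toward high-degree vertices are rejected more often; this corrects the degree bias of the plain random walk. The normalization constant K never appears, because it cancels in the ratio of Eq. (6).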
Importance Sampling
Importance sampling (IS) is a sampling strategy used to estimate the expectation of a function f(x) relative to some distribution $p(x) = \tilde{p}(x)/K$, called the target distribution, when the samples are actually obtained from a different distribution q(x), called the proposal distribution. IS is useful when it is easier to sample from the distribution q but we need the expectation with respect to a different distribution p. For instance, for triangle counting, we want to obtain triple samples from a uniform distribution, i.e., the target distribution p is uniform, but it may be easier to sample triples from a biased distribution, say q. Using the idea of IS, the expectation of f(x) with respect to the target distribution is estimated from S samples $x_1, \dots, x_S$ drawn from q as

$$E_p[f(x)] \approx \sum_{i=1}^{S} f(x_i)\,w(x_i), \tag{7}$$

where

$$w(x_i) = \frac{\tilde{p}(x_i)/q(x_i)}{\sum_{j=1}^{S} \tilde{p}(x_j)/q(x_j)}. \tag{8}$$

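Eqs. (7) and (8) translate directly into a few lines of Python. The sketch below is illustrative only; the function name `importance_estimate` and the toy distributions are assumptions, and the samples are fixed by hand so that the arithmetic can be followed.

```python
def importance_estimate(samples, f, p_tilde, q):
    """Self-normalized importance sampling estimate, Eqs. (7) and (8)."""
    ratios = [p_tilde(x) / q(x) for x in samples]
    total = sum(ratios)
    weights = [r / total for r in ratios]  # w(x_i) in Eq. (8)
    return sum(f(x) * w for x, w in zip(samples, weights))


# Toy example: the target is uniform over {0, 1}, but the samples were
# drawn from a biased proposal with q(0) = 0.25 and q(1) = 0.75.
samples = [0, 1, 1, 1]
q_table = {0: 0.25, 1: 0.75}
estimate = importance_estimate(
    samples,
    f=lambda x: x,
    p_tilde=lambda x: 1.0,  # unnormalized uniform target
    q=lambda x: q_table[x],
)
```

In this toy case the plain sample mean is 0.75, whereas the weighted estimate is 0.5, the mean of f under the uniform target. The unnormalized ratios are 4, 4/3, 4/3, and 4/3, which sum to 8 and give weights 1/2, 1/6, 1/6, and 1/6.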
ORGANIZATION OF THE REVIEW
We organize this review based on the classification of triangle counting methods depicted in Figure 1. Our first-level classification of triangle counting methods is based on the data (graph) access pattern. We consider two kinds of data access patterns: random access and restricted access.
FIGURE 1 | Classification of triangle counting works. [Tree diagram. Triangle counting splits into random access and restricted access. Random access: exact counting (with enumeration, without enumeration); approximate counting (graph sparsification, triple sampling, vertex/edge sampling, linear algebra-based method); distributed and parallel counting. Restricted access: random walk over vertices, random walk over triples, triangle counting on streaming data.]
Random access methods assume that the entire network is available in memory in an adjacency vector data structure (or in another format) and that the size of the network, i.e., the number of vertices (n) and the number of edges (m), is known. These random access methods are further divided into three sub-categories: (1) exact triangle counting; (2) approximate triangle counting; and (3) distributed and parallel triangle counting. Methods in the first sub-category, the exact triangle counting methods, provide the actual count of triangles in the given input network, which can be obtained with or without enumeration of each triangle. Methods in the second sub-category are approximate methods, which calculate triangle counts with an acceptable error in the count but are much faster than exact counting methods. Most of the approximate counting methods are based on different sampling approaches. Methods in the final sub-category run on distributed and parallel platforms. For triangle counting, such methods have recently become popular, as they can provide exact or approximate triangle counts for huge networks that cannot be stored in the main memory of a single machine.
In the real world, there are many networks that are not fully accessible, so random access-based methods for triangle counting are not an option for them. Such networks can only be crawled, i.e., an analyst can only explore the neighbors of the currently visited node. For restricted access, it is assumed that access to one seed vertex or a collection of seed vertices of the network is available so that the crawling can be initiated. Another assumption is that the network is connected, or that the largest (giant) connected component of the network covers the majority of the vertices and the part of the network outside the giant component can be ignored. Because the network is connected, one can attempt to crawl the entire network using graph traversal methods and save the network (in memory or on disk) for counting triangles with random access-based methods. However, we assume that the network is very large (say, the Internet network) and does not fit in main memory. So, random access-based methods do not work on such networks, or at the least are highly inefficient due to frequent I/O access. In such restricted access scenarios, one cannot obtain an exact count of triangles, but a random walk over the network provides a viable option for approximate triangle counting.
Another type of restricted access is streaming data access, where the graph data appears as a stream of edges. Limited memory does not allow all the edges to be stored in memory, so a triangle counting method must store a judiciously selected sample of edges or some form of summary statistics computed over the edges. Edges that appear in the stream are lost if they are not saved. Because streaming data access works with a sample of edges, it only provides an approximate count of triangles in a graph.
In the following section, we discuss exact triangle counting algorithms, followed by a discussion of approximate triangle counting algorithms. Then, we discuss triangle counting algorithms that work in restricted access and streaming access scenarios. After that, we discuss some triangle counting methods that work on distributed or parallel platforms. Thereafter, we present experimental results from a comparison among a collection of approximate triangle counting methods. Lastly, we discuss two other related counting tasks before concluding the paper.
EXACT TRIANGLE COUNTING WITH RANDOM ACCESS
We first discuss triangle counting algorithms under the random memory access assumption. Under this assumption, we can obtain the adjacency vector of any vertex in O(1) time. We also assume that, in the adjacency vector of a vertex u, the neighbors of u, adj(u), are sorted. So, the existence of an edge (u, v) can be checked in O(lg n) time using binary search on that vector. Another option is to use a hash table of edges to answer the edge existence query in expected O(1) time. Note that even if we use binary search for answering the edge existence query, the complexity O(lg n) is only a worst-case time complexity, which applies to a very high-degree vertex. Given that the average degree of real-life networks is constant, and that triangle counting issues the edge existence query over a very large number of small adjacency lists and only occasionally over a few large adjacency lists, we can amortize the cost of the costly binary searches over a large number of cheap searches and assume that the cost of an edge existence query is constant.
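Both ways of answering the edge existence query can be sketched in Python. The representation (a dictionary `adj_sorted` of sorted neighbor lists) and the function names are assumptions made for this illustration; `bisect_left` is the binary search routine from Python's standard library.

```python
from bisect import bisect_left


def has_edge_sorted(adj_sorted, u, v):
    """Binary search in the sorted adjacency vector of u: O(lg d(u)) time."""
    neighbors = adj_sorted[u]
    i = bisect_left(neighbors, v)
    return i < len(neighbors) and neighbors[i] == v


def build_edge_set(adj_sorted):
    """Hash set of edges stored as (u, v) with u < v, for expected O(1) lookups."""
    return {(u, v) for u, nbrs in adj_sorted.items() for v in nbrs if u < v}


def has_edge_hashed(edge_set, u, v):
    """Look up the edge in its canonical (smaller, larger) orientation."""
    return (min(u, v), max(u, v)) in edge_set


adj_sorted = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
edges = build_edge_set(adj_sorted)
```

The hash set trades extra memory, one entry per edge, for constant expected lookup time, while the binary search needs no storage beyond the sorted adjacency vectors.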
A brute-force triangle counting algorithm can be designed by enumerating all distinct three-vertex sets {u, v, w} (not necessarily connected) in a network and then testing whether the three vertices form a triangle. Because the number of such three-vertex sets is $\Theta(n^3)$, the brute-force complexity of triangle counting is $\Theta(n^3)$. Note that such an algorithm not only counts the triangles but also iterates over (or lists) them. Moreover, any algorithm that iterates over each of the triangles has a worst-case complexity of $\Theta(n^3)$, because the maximum possible number of triangles in a graph of n vertices is exactly $\binom{n}{3}$, which is realized when the given graph is a clique.
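The brute-force procedure can be written in a few lines of Python. The sketch assumes a dictionary `adj` mapping each vertex to the set of its neighbors; the function name is hypothetical.

```python
from itertools import combinations


def brute_force_triangles(adj):
    """List every triangle by testing all three-vertex sets: Theta(n^3) tests."""
    triangles = []
    for u, v, w in combinations(sorted(adj), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            triangles.append((u, v, w))
    return triangles


# A clique on 4 vertices contains C(4, 3) = 4 triangles.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
path = {1: {2}, 2: {1, 3}, 3: {2}}
```

On the 4-clique every three-vertex set passes the test, which is the worst case described above; on the path no set does, yet the same number of sets per vertex triple is still examined.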
Over the years, many triangle counting methods with better runtime performance have been proposed. Specifically, the methods that count but do not list all the triangles have worst-case time complexity much better than $\Theta(n^3)$. Using this observation, we discuss the algorithms in two groups. The first group of algorithms only provides a count of triangles without listing (or enumerating) them. The second group of algorithms, on the other hand, lists the triangles. Our discussion of exact triangle counting algorithms is brief; Schank's PhD thesis [21] is an excellent reference on methods for exact triangle counting.
Triangle Counting Without Enumeration
The earliest methods of triangle counting without enumeration are based on matrix multiplication of the adjacency matrix. It is easy to see that, if A is the adjacency matrix of an undirected network G, the diagonal element $A^3[i, i]$ contains the total number of closed walks of length 3 that begin and end at vertex i. A triangle is counted as a closed walk starting and ending at each of its three vertices, and, for an undirected graph, each closed walk is counted twice (counterclockwise and clockwise); thus, the total number of triangles in a graph G is $t(G) = \frac{1}{6}\mathrm{Tr}(A^3)$. The complexity of this algorithm is $\Theta(n^3)$; however, a fast matrix multiplication algorithm can be used to obtain a better algorithm, which runs in $\Theta(n^{\omega})$, where the current best value of $\omega$, the exponent of matrix multiplication, is around 2.373 [22]. However, the hidden constants of many of the fast matrix multiplication algorithms are large, which makes these algorithms not much better (if not worse) than the traditional $\Theta(n^3)$ matrix multiplication algorithm for counting triangles in large real-life graphs.
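The trace formula can be checked on a small graph with plain cubic-time multiplication. The Python sketch below is only an illustration of $t(G) = \frac{1}{6}\mathrm{Tr}(A^3)$; the list-of-lists matrix and the function name are assumptions, and no fast matrix multiplication is involved.

```python
def triangles_by_trace(adj_matrix):
    """t(G) = Tr(A^3) / 6 using plain cubic-time matrix multiplication."""
    n = len(adj_matrix)
    # a2 = A * A
    a2 = [
        [sum(adj_matrix[i][k] * adj_matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    # Tr(A^3) = sum_i sum_j a2[i][j] * A[j][i]; the full A^3 is not needed.
    trace = sum(a2[i][j] * adj_matrix[j][i] for i in range(n) for j in range(n))
    return trace // 6


# Adjacency matrices of a 4-clique and of a path on three vertices.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

In the 4-clique each vertex starts 3 × 2 = 6 closed walks of length 3, so the trace is 24 and the count is 4. The sketch also shows the memory cost of this approach: it stores an n × n matrix regardless of how sparse the graph is.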
Alon et al. [10] proposed a triangle counting algorithm (hereafter called AYZ) that runs in $O(m^{2\omega/(\omega+1)})$ time. In this algorithm, the authors first define $\Delta = m^{(\omega-1)/(\omega+1)}$ and call a vertex high degree if its degree is higher than $\Delta$; otherwise it is a low-degree vertex. There are at most $m\Delta$ paths of length two whose intermediate vertex is low degree, and all of these paths can be checked for triangles in $O(m\Delta)$ time. The remaining triangles consist only of high-degree vertices. As there are at most $2m/\Delta$ high-degree vertices, triangles involving only those vertices can be found in $O((m/\Delta)^{\omega})$ time using matrix multiplication. The overall complexity of this method is then $O(m\Delta + (m/\Delta)^{\omega}) = O(m^{2\omega/(\omega+1)})$. Because AYZ uses matrix multiplication as a part of the method, it also belongs to the non-enumeration-based triangle counting methods. Note that, if $\omega = 3$ (which is the case for traditional matrix multiplication), the complexity of the AYZ method is $O(m^{3/2})$.
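The choice of the threshold can be verified by substituting $\Delta = m^{(\omega-1)/(\omega+1)}$ into the two terms of the bound; the short derivation below uses only the quantities defined above and shows that the threshold balances both terms.

$$
m\Delta = m^{1 + \frac{\omega-1}{\omega+1}} = m^{\frac{2\omega}{\omega+1}},
\qquad
\left(\frac{m}{\Delta}\right)^{\omega} = \left(m^{1 - \frac{\omega-1}{\omega+1}}\right)^{\omega} = \left(m^{\frac{2}{\omega+1}}\right)^{\omega} = m^{\frac{2\omega}{\omega+1}}.
$$

For $\omega = 3$ the exponent is $6/4 = 3/2$, and for $\omega = 2.373$ it is $4.746/3.373 \approx 1.41$.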
Triangle Counting With Enumeration
Enumeration-based triangle counting algorithms list all the triangles, after which counting becomes a trivial task. The obvious advantage of an enumeration-based method is that it returns the list of all the triangles, which can be used for downstream tasks such as community discovery [8] or spam filtering [5]. In addition, enumeration-based methods are preferred over the matrix multiplication-based methods even for solving the counting task, because matrix multiplication-based methods suffer from a large memory footprint.
One of the earliest triangle enumeration methods was proposed by Itai and Rodeh [9]. This method was originally proposed to find just one triangle, but it can easily be extended to list all the triangles. The algorithm first finds a spanning tree $T(V, E_T)$ of the graph G(V, E); then, for each edge $(u, v) \in E_T$, it checks whether $(\mathrm{pred}(u), v) \in E$ (pred denotes the predecessor of a node in the tree); if so, it emits (u, v, pred(u)) as a triangle. It also checks whether $(\mathrm{pred}(v), u) \in E$; if so, it outputs (u, v, pred(v)) as a triangle. The edges of T are then removed from G and the process is repeated by building a new spanning tree of the updated graph. The algorithm terminates when no edges remain in G. Each iteration takes O(m) time, and it can be shown that there are at most $O(\sqrt{m})$ iterations, so the complexity of this method is $O(m^{3/2})$. However, this algorithm needs to modify the graph data structure, which is costly; hence its real-life execution time is not competitive with that of other methods.
A better algorithm is to enumerate over vertex pairs that are adjacent to a given vertex v. As shown in Algorithm 1, this method iterates over the vertices through the variable v. For each pair of vertices u and w from adj(v), it checks whether an edge exists between u and w; if so, {u, v, w} forms a triangle. The cumulative sum of the triangle counts is then returned after dividing it by 3. The division is required because each triangle is counted three times, once for each of its vertices when that vertex is chosen in the outermost loop. Because the algorithm iterates over the vertices of a network, it is also known as the NodeIterator algorithm. The amount of work done at each vertex is $\Theta(d(v)^2)$, so the complexity of the method is $O(n\,d_{\max}^2)$. For networks whose degree distribution is highly skewed, say if the maximum degree of the network grows linearly with the number of vertices, this bound becomes $O(n^3)$; even a star network, for which the triangle count is 0, requires $\Theta(n^2)$ pair checks at its center vertex.
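The NodeIterator idea described above can be sketched in Python as follows. This is an illustration of the description in the text, not a transcription of Algorithm 1; the dictionary-of-sets graph `adj` and the function name are assumptions.

```python
from itertools import combinations


def node_iterator(adj):
    """Count triangles by testing every pair of neighbors of every vertex."""
    total = 0
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:  # edge existence test, expected O(1) with sets
                total += 1
    return total // 3  # each triangle is found once at each of its three vertices


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
```

On the star graph the function returns 0, but only after testing all C(4, 2) = 6 pairs of leaves at the center vertex, which is the wasted work that degree skew causes.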
Instead of iterating over the vertices, triangles can also be counted by iterating over the edges (Algorithm 2). While iterating over the edges, the algorithm counts the number of triangles to which each edge contributes. Such an algorithm is known as the EdgeIterator method. The complexity of an EdgeIterator algorithm is $O(m\,d_{\max})$.
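A minimal Python sketch of the EdgeIterator idea follows. It illustrates the description in the text rather than Algorithm 2 itself; the dictionary-of-sets graph `adj` and the function name are assumptions, and Python's set intersection stands in for the intersection of the two adjacency lists.

```python
def edge_iterator(adj):
    """Count triangles by intersecting the neighbor sets of each edge's endpoints."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge (u, v) once
                total += len(adj[u] & adj[v])  # every common neighbor closes a triangle
    return total // 3  # each triangle is found once for each of its three edges


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
```

In the 4-clique each of the 6 edges has 2 common neighbors, giving 12 before the division and 4 triangles after it.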
Both the node iterator and edge iterator algorithms iterate over each of the potential two-length paths and check whether it forms a triangle. To avoid duplicate counting and reduce counting time, strategies can be adopted so that each triple is checked exactly once [23,24]. For the node iterator algorithm, we can simply sort the nodes by degree and enforce an ordering on nodes v < u < w in Algorithm 1. This ensures that each triangle is counted only once, by its smallest-degree vertex, in the variable $count_v$. In that case, the division by 3 in Line 11 of Algorithm 1 is not needed. The sort order also improves the running time of the algorithm by not considering many two-length paths at all. For example, in a star graph, the terminal nodes have degree 1 and the star node has the highest degree. Using this sort order, none of the triples centered at the star node needs to be tested for a triangle, and the method can return 0 as the triangle count of a star graph without testing any of its triples. A similar optimization can also be pursued for the edge iterator algorithm; for example, if the adjacency lists of the vertices are sorted, then when computing the intersection of the sets $adj_1$ and $adj_2$ (Line 5 of Algorithm 2), we can restrict the intersection operation such that the third node of a triangle, $x \in adj(u) \cap adj(v)$, satisfies x > max{u, v}.
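The degree-ordered variant of the node iterator can be sketched in Python as below. The sketch is an illustration under stated assumptions: vertices are ranked by (degree, identifier), with the identifier as a tie-break chosen for this example, and the function also returns the number of neighbor pairs it tested so that the saving is visible.

```python
from itertools import combinations


def ordered_node_iterator(adj):
    """Count each triangle once, at its lowest-ranked vertex.

    Vertices are ranked by (degree, id); the id tie-break is an assumption,
    any fixed tie-break works. Returns (triangle count, pairs tested).
    """
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: r for r, v in enumerate(order)}
    total = 0
    tested = 0
    for v in adj:
        # Only neighbors ranked above v are considered, enforcing v < u < w.
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            tested += 1
            if w in adj[u]:
                total += 1
    return total, tested  # no division by 3 is needed


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
```

On the star graph every leaf has a single higher-ranked neighbor and the center has none, so no pair is tested at all, in line with the example in the text.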
When redundant counting is avoided and the cost of the edge existence test is O(1), the time complexity of triangle enumeration is bounded by the total number of triples in a graph, which is equal to $\sum_{v \in V} \binom{d(v)}{2}$. We can also bound the triple count in terms of the edge count (m) with a careful analysis. Say we divide the vertices into two groups: high degree, having degree $> \sqrt{m}$, and low degree, the remaining vertices. For the low-degree vertices, the maximum number of possible triples centered at these vertices is obtained by packing all edges into as few low-degree vertices as possible. Because the total number of edges is m, we can pack them into $\sqrt{m}$ vertices each having degree $\sqrt{m}$. Thus, the number of triples that are centered at a low-degree vertex is at most $O(\sqrt{m}\,(\sqrt{m})^2) = O(m^{3/2})$. On the other hand, the number of high-degree vertices is at most $2m/\sqrt{m}$, as each of these vertices has degree at least $\sqrt{m}$; hence the number of triples consisting of high-degree vertices is at most $O((\sqrt{m})^3) = O(m^{3/2})$. Thus the total number of triples is $O(m^{3/2}) + O(m^{3/2}) = O(m^{3/2})$. Because the efficient versions of both node iterator and edge iterator test each of the triples in a network exactly once, the complexity of both of these algorithms is bounded by $O(m^{3/2})$.
There is a set of recent variants of edge iterator: Forward [21] by Schank and new-listing [24] by Latapy. Forward orders the vertices by increasing degree. Latapy's method sorts the adjacency lists and uses iterators to efficiently compute set intersections. The complexity of these methods still remains $O(m^{3/2})$, but real-world execution time may be smaller. Schank [21] has performed a thorough comparison among various exact triangle counting methods over a large number of real-life and synthetic networks.
APPROXIMATE TRIANGLE COUNTING
The time complexity of the node iterator and edge iterator algorithms is $O(m^{3/2})$. For very large graphs with hundreds of millions of edges, this cost may still be too high. So, in recent years, approximate triangle counting algorithms have become popular. Methods for approximate triangle counting do not list (enumerate) the triangles; rather, they give an approximate count of triangles, sometimes with an approximation guarantee. Their execution time is also smaller, typically by orders of magnitude. For many applications, trading exactness for a much shorter running time is acceptable; for instance, to analyze the evolution pattern of a network, an approximate transitivity result is generally acceptable.
In existing literatures, there exist a few variants
of approximate triangle counting methods. Algo-
rithms in rst variant are based on uniform triangle
sampling by graph sparsication, the algorithms in
the second variant are based on triple sampling, and
the algorithms in the third variant are based on a sto-
chastic version of node or edge iterator. There also
exists an approximate triangle counting method,
which uses ideas from linear algebra.
Graph Sparsification-Based Methods
The idea of a graph sparsification-based method for triangle counting is to sparsify the graph by probabilistically deleting a subset of its edges and then to extrapolate the triangle count of the original graph from the exact triangle count of the sparse graph. Because the sparse graph is much smaller than the original graph, triangle counting in the sparse graph can typically be performed in a fraction of the time, which makes the approximate method substantially faster than an exact counting method. Such a method can also be viewed as a uniform triangle sampling-based method, because the triangles in the sparse network are sampled with a uniform probability over all the triangles in the original network.
Tsourakakis et al. [25] proposed one of the earliest approximate triangle counting methods, called DOULION, which works by graph sparsification. Given a graph $G(V, E)$, DOULION keeps each edge of $G$ with probability $p$ and removes it with probability $1 - p$ to generate a sparse graph, $G_s$. It then runs an exact triangle counting method on $G_s$ to obtain $t(G_s)$, the exact triangle count of $G_s$. Each triangle in the original graph is retained in the sparse graph when all three of its edges are retained, for which the probability is $p^3$. Hence, the probability of sampling a triangle from $G$ in $G_s$ is $p^3$ and thus, the estimated count of the triangles in the original graph is $\hat{t}(G) = (1/p^3)\, t(G_s)$. For large networks with millions of vertices, a $p$ value as small as 0.01 can provide a very good approximate triangle count. A $p$ value of 0.01 yields an almost 100-fold speed-up in running time over an exact counting method.
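The sparsify-count-rescale recipe of DOULION can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the authors' implementation: the names `exact_triangles` and `doulion_estimate` are hypothetical, the input is assumed to be a list of distinct undirected edges, and a plain edge-iterator pass serves as the exact counter.

```python
import random
from collections import defaultdict


def exact_triangles(edges):
    """Exact count via an edge-iterator pass (each triangle is seen from 3 edges)."""
    adj = defaultdict(set)
    unique = set()
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
            unique.add(frozenset((u, v)))
    total = 0
    for edge in unique:
        u, v = tuple(edge)
        total += len(adj[u] & adj[v])  # common neighbors close a triangle
    return total // 3


def doulion_estimate(edges, p, seed=None):
    """Keep each edge independently with probability p, count, scale by 1/p^3."""
    rng = random.Random(seed)
    sparse = [e for e in edges if rng.random() < p]
    return exact_triangles(sparse) / p ** 3
```

With `p = 1.0` every edge is kept and the estimate coincides with the exact count; smaller values of `p` trade accuracy for a smaller sparse graph.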
Pagh and Tsourakakis [26] proposed another graph sparsification approach for approximate triangle counting, which they named "colorful triangle counting." For each vertex, this method assigns a color between 1 and $N$ uniformly at random and retains only those edges for which both endpoints have the same color; all other edges are removed. As in DOULION, an exact counting algorithm is used to find the number of triangles, $t(G_s)$, in the sparse network $G_s$. If $p = 1/N$, each triangle in the original network is retained in the sparse network with probability $p^2$. This is so because when two of the edges of a triangle are monochromatic, the third edge is necessarily monochromatic as well, and the probability of retaining two edges is $p^2$. So the estimated count of the triangles in the original graph for this method is $\hat{t}(G) = (1/p^2)\, t(G_s)$. Like DOULION, the sparse graph $G_s$ contains each edge with probability $p$, but unlike DOULION, this method retains (or samples) each triangle with probability $p^2$ instead of $p^3$. As this method samples more triangles for the same $p$ value, it has better accuracy than DOULION. Pagh and Tsourakakis used variance analysis to prove probabilistic bounds on the approximation ratio of triangle estimation using this method. They also proposed a MapReduce-based distributed implementation of this algorithm.
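The coloring step described above can be sketched as follows. This is an illustrative single-machine sketch, not the authors' code: the names `exact_triangles` and `colorful_estimate` and the parameter `num_colors` (playing the role of $N$) are assumptions made here, and the input is assumed to be a list of distinct undirected edges.

```python
import random
from collections import defaultdict


def exact_triangles(edges):
    """Exact count via an edge-iterator pass (each triangle is seen from 3 edges)."""
    adj = defaultdict(set)
    unique = set()
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
            unique.add(frozenset((u, v)))
    total = 0
    for edge in unique:
        u, v = tuple(edge)
        total += len(adj[u] & adj[v])
    return total // 3


def colorful_estimate(edges, num_colors, seed=None):
    """Color vertices uniformly, keep monochromatic edges, scale by 1/p^2."""
    rng = random.Random(seed)
    color = {}
    for u, v in edges:
        for x in (u, v):
            if x not in color:
                color[x] = rng.randrange(num_colors)  # uniform color per vertex
    monochromatic = [(u, v) for u, v in edges if color[u] == color[v]]
    p = 1.0 / num_colors
    return exact_triangles(monochromatic) / p ** 2
```

With `num_colors = 1` all edges are monochromatic, so the estimate reduces to the exact count.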
Very recently, Etemadi et al. [27] proposed another method, which is an adaptation of DOULION. Similar to DOULION, this method also samples the edges of a given graph $G$ with a uniform probability $p$ to obtain the sparse graph $G_s$. However, besides counting triangles in $G_s$, it also checks whether the missing edge of each open triple in $G_s$ exists in the original graph $G$; if yes, that partial triangle is also counted along with the actual triangles in $G_s$. A triangle in $G$ has probability $p^2$ of being counted in $t(G_s)$, because a triangle in $G$ is accounted for in $G_s$ as long as two of its edges are retained in $G_s$. So, the estimated count of the triangles in the original graph using this method is $\hat{t}(G) = (1/p^2)\, t(G_s)$. By estimating the variance of the estimate, the authors also provided a way to choose the value of $p$ for achieving a targeted range of relative standard error. They have also proved that, compared to their method, DOULION always needs more samples to achieve the same level of accuracy. The accuracy of this method is comparable to that of Pagh et al.'s method, because both methods sample a triangle with the same probability.
Note that all the graph sparsification-based methods need an exact triangle counting algorithm that runs on the sparse network. So, their execution time depends on the performance of the exact triangle counting algorithm. Clearly, Etemadi et al.'s method is the costliest among these, because besides counting triangles in the sparse network $G_s$, it also needs to query the graph data structure of $G$ to determine the existence of the missing edge of each open triple in $G_s$.
Triple Sampling-Based Method
The graph sparsification-based methods sample triangles, but there is another family of approximate triangle counting algorithms, which samples triples instead of triangles. A triple sampling method leads to a triangle approximation algorithm by computing an unbiased estimate of transitivity (defined in the Background). This estimate is equal to the fraction of triples that are closed out of all the sampled triples. If $T$ is a set of sampled triples, and $\hat{\gamma}$ is the estimate of transitivity,
$$\hat{\gamma} = \frac{\sum_{t \in T} I(t \text{ is closed})}{|T|}, \tag{9}$$
here $I$ is an indicator random variable. Then using Eq. (5), an approximate estimate of the total number of triangles in the graph is equal to $(1/3)\,\hat{\gamma}\,|\Pi|$, where $|\Pi| = \sum_{i=1}^{n} \binom{d(u_i)}{2}$ is the total number of triples in the given graph (see Eq. (1)). Such a method has been used in several works [12,16,28].
A key requirement for computing an unbiased estimate of transitivity is to sample triples from a uniform distribution, which is a non-obvious task. The simplest approach to sample a triple is to select a node $v$ uniformly and then select two of $v$'s neighbors ($u$ and $w$) uniformly. This method samples a triple $\langle u, v, w \rangle$ with $v$ as the center node. However, this method does not sample a triple uniformly, because the number of triples centered at a node $v$ is $|\Pi_v| = \binom{d(v)}{2}$, which is nonuniform over the vertices for a general graph. Hence the triple $\langle u, v, w \rangle$ is sampled with probability $1/(n\,|\Pi_v|) \propto 1/|\Pi_v|$. So, the triples that are centered around high degree vertices will be under-sampled and those that are centered around low degree vertices will be over-sampled.
Schank and Wagner [28] have proposed the earliest algorithm for approximating the transitivity of a network by sampling triples uniformly. Their idea is as follows: first, sample a vertex $v$ with probability proportional to the number of triples centered around that vertex. Then, with uniform probability, return one of the triples that are centered around $v$. If $\Pi$ is the set of all triples in $G$ and $\Pi_v$ is the set of triples centered at node $v$, then $\Pi = \bigcup_{i=1}^{n} \Pi_i$ and $|\Pi_v| = \binom{d(v)}{2}$. For uniform triple sampling, we sample the center vertex $v$ with probability $|\Pi_v|/|\Pi|$ and then return the triple $\langle u, v, w \rangle$ by uniformly selecting one of the triples in $\Pi_v$. Thus, the probability of sampling the triple $\langle u, v, w \rangle$ is uniform,
$$P(\text{triple } \langle u, v, w \rangle \text{ is sampled}) = \frac{|\Pi_v|}{|\Pi|} \cdot \frac{1}{\binom{d(v)}{2}} = \frac{1}{|\Pi|}, \tag{10}$$
as desired. Schank and Wagner then used the set of sampled triples to approximate the transitivity using Eq. (9). However, one can easily obtain an approximate count of triangles in a graph $G$, $t(G)$, by using the estimated transitivity in the following equation:
$$\hat{t}(G) = \frac{1}{3}\,\hat{\gamma}\,|\Pi|; \tag{11}$$
the value of $|\Pi|$ is known, as is shown in Eq. (1). Kolda et al. [16] have reinvented the same method in a later work. Both Schank et al. and Kolda et al. have proved approximation error bounds by using a Hoeffding bound-based concentration inequality.
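The two-stage uniform triple sampler and the estimate of Eq. (11) can be sketched as below. This is a minimal sketch assuming the whole graph is available in memory as an edge list; the function name `estimate_triangles_by_triples` is hypothetical, and sampling is done with replacement for simplicity.

```python
import random
from collections import defaultdict
from math import comb


def estimate_triangles_by_triples(edges, num_samples, seed=None):
    """Estimate t(G) as (1/3) * gamma_hat * |Pi| from uniformly sampled triples."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    nbrs = {v: list(ns) for v, ns in adj.items()}
    # Only vertices of degree >= 2 are the center of at least one triple.
    centers = [v for v in nbrs if len(nbrs[v]) >= 2]
    weights = [comb(len(nbrs[v]), 2) for v in centers]  # |Pi_v|
    total_triples = sum(weights)                        # |Pi|
    if total_triples == 0:
        return 0.0
    closed = 0
    # Stage 1: pick the center v with probability |Pi_v| / |Pi|.
    for v in rng.choices(centers, weights=weights, k=num_samples):
        # Stage 2: pick one triple centered at v uniformly.
        u, w = rng.sample(nbrs[v], 2)
        if w in adj[u]:
            closed += 1
    gamma_hat = closed / num_samples      # Eq. (9)
    return gamma_hat * total_triples / 3  # Eq. (11)
```

On a complete graph every triple is closed, so the estimate equals the exact triangle count regardless of the random draws.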
In a recent article, Al Hasan [29] has provided methodologies for obtaining an unbiased estimate of transitivity even for the case when the sampling algorithm samples the triples from a non-uniform distribution. He has used the idea of IS (discussed in the Background Section) for this task. For instance, the simplest (but biased) triple sampling method that we discussed at the beginning of this subsection samples each triple with probability inversely proportional to the number of triples centered at the first sampled vertex, $v$. If we want an unbiased estimate of transitivity, we need to have a uniform target distribution. But the triples are sampled from a distribution that is proportional to $1/|\Pi_v|$, so by using Eq. (7) of IS an unbiased estimate of transitivity can be computed as below. Suppose we have a triple sample set $T = \{t_i = \langle u_i, v_i, w_i \rangle\}_{1 \le i \le |T|}$, where $v_i$ is the center node of the triple $t_i$. The importance weight is
$$w(t_i) = \frac{|\Pi_{v_i}|}{\sum_{j=1}^{|T|} |\Pi_{v_j}|}. \tag{12}$$
Now, the unbiased estimate of transitivity is simply:
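$$\hat{\gamma} = \sum_{t_i \in T} w(t_i)\, I(t_i \text{ is closed}) = \frac{\sum_{t_i \in T} |\Pi_{v_i}|\, I(t_i \text{ is closed})}{\sum_{t_j \in T} |\Pi_{v_j}|}. \tag{13}$$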
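The importance-weighted estimate above can be sketched for the simple (biased) sampler as follows. The function name `transitivity_with_importance_weights` is hypothetical, and one simplification is an assumption of this sketch: the center is drawn uniformly only among vertices of degree at least two, since other vertices have no triples; the sampling probability remains proportional to $1/|\Pi_v|$.

```python
import random
from collections import defaultdict
from math import comb


def transitivity_with_importance_weights(edges, num_samples, seed=None):
    """Self-normalized importance sampling estimate of transitivity."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    nbrs = {v: list(ns) for v, ns in adj.items()}
    centers = [v for v in nbrs if len(nbrs[v]) >= 2]
    if not centers:
        return 0.0
    weighted_closed = 0
    weight_sum = 0
    for _ in range(num_samples):
        v = rng.choice(centers)          # uniform center (biased toward low degree)
        u, w = rng.sample(nbrs[v], 2)    # uniform pair of neighbors of v
        weight = comb(len(nbrs[v]), 2)   # unnormalized weight |Pi_v|
        weight_sum += weight
        if w in adj[u]:
            weighted_closed += weight
    # Ratio of weighted closed triples to total weight, as in the estimate above.
    return weighted_closed / weight_sum
```

The triangle count then follows from the transitivity estimate and $|\Pi|$ in the same way as for the uniform sampler.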
Approximation by Vertex or Edge Sampling
Another simple approximate triangle counting algorithm can be built by using a probabilistic counterpart of the node iterator or edge iterator method. In the exact node iterator algorithm, we count the number of triangles by summing the count of triangles that are incident to each of the nodes. Instead of summing the count over all the vertices, we can simply sum over a fraction $p$ of uniformly sampled vertices, and then approximate the total count by dividing the sum by $p$. This provides an approximate node iterator-based triangle counting algorithm. Likewise, we can build an approximate edge iterator-based algorithm by counting triangles over a fraction $p$ of the edges. Rahman and Al Hasan [13] have proposed this method and they have shown that for large networks, this method achieves very good accuracy. They have also shown that an edge iterator-based sampling method achieves better accuracy than a node iterator-based sampling method. Note that these methods do not sample the triangles uniformly, but the estimation is still unbiased because the expectation of the triangle count is taken over the edges, which are sampled uniformly.
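A sampled edge iterator can be sketched as follows. This is an illustration under our own assumptions rather than the implementation of Rahman and Al Hasan: the name `sampled_edge_iterator` is hypothetical, each edge is included independently with probability `p` (instead of drawing an exact fraction of the edges), and the sum is additionally divided by 3 because every triangle is incident to three edges.

```python
import random
from collections import defaultdict


def sampled_edge_iterator(edges, p, seed=None):
    """Estimate the triangle count from a random subset of the edges."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    unique = set()
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
            unique.add(frozenset((u, v)))
    total = 0
    for edge in unique:
        if rng.random() < p:                 # keep this edge in the sample
            u, v = tuple(edge)
            total += len(adj[u] & adj[v])    # triangles incident to (u, v)
    return total / (3 * p)
```

Unlike the sparsification-based methods, the intersection here is computed on the full adjacency sets, so only the outer loop is subsampled.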
Linear Algebra-Based Method
We have shown earlier that if $A$ is the adjacency matrix of an undirected network $G$, the total number of triangles in $G$ is $t(G) = (1/6)\,\mathrm{Tr}(A^3)$. If $\lambda_i$ is an eigenvalue of $A$, then $\lambda_i^3$ is an eigenvalue of $A^3$; hence, $\mathrm{Tr}(A^3) = \sum_{i=1}^{n} \lambda_i^3$. Note that, because $A$ is symmetric, all the $\lambda_i$ are real numbers. Exact counting requires all the eigenvalues of the adjacency matrix $A$, which is a costly task.
Fortunately, real-life networks have the power-law property and, due to this fact, the eigenvalues of their adjacency matrices are also skewed, typically following a power law. So if the eigenvalues are sorted by their absolute value (in other words, by their contribution to the sum), we can approximate the triangle count by taking only the top-$k$ eigenvalues. Note that using the Lanczos method, the top-$k$ eigenvalues of a matrix can be easily computed in an incremental manner. This idea has been used in the EigenTriangle algorithm [11], which accepts an adjacency matrix and a tolerance parameter. The tolerance parameter is used as a stopping criterion, such that the ratio of $|\lambda_i^3|$ and $\sum_{i=1}^{n} \lambda_i^3$ has to be above the tolerance parameter. Although elegant, this method has two key limitations. First, the time-accuracy trade-off of this method is much poorer than that of other recently proposed approximate triangle counting methods. Second, no approximation guarantee is available regarding the accuracy of the method; in other words, there is no obvious relation between the tolerance parameter and counting accuracy that would allow it to be set to achieve a desired level of approximation.
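As a small worked check of the trace identity and of the effect of truncation, consider the complete graph $K_4$ (an example chosen here for illustration, not one of the networks studied in this review). Its adjacency matrix has eigenvalues $3, -1, -1, -1$, and it contains $\binom{4}{3} = 4$ triangles. Using all eigenvalues recovers the exact count, whereas keeping only the top eigenvalue ($k = 1$) overestimates it:

```latex
\begin{align*}
t(K_4) &= \frac{1}{6}\sum_{i=1}^{4}\lambda_i^3
        = \frac{1}{6}\left(3^3 + 3\cdot(-1)^3\right)
        = \frac{24}{6} = 4,\\
\hat{t}_{k=1}(K_4) &= \frac{1}{6}\cdot 3^3 = 4.5 .
\end{align*}
```

The spectrum of $K_4$ is not skewed, so the truncation error is large here; the approximation is intended for networks whose leading eigenvalues dominate the sum.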
TRIANGLE COUNTING IN
RESTRICTED ACCESS SCENARIO
As discussed before, random walk-based approximation methods are the most suitable for restricted access networks. Rahman and Al Hasan [12] have proposed a collection of random walk-based methods for approximating the transitivity of a network in a restricted access scenario. If the total number of triples ($|\Pi|$) is known, then these methods can be used for approximating the triangle count. Below we discuss two of the methods, one performing a random walk over the vertices, and the other performing a random walk over the triples.
Random Walk Over Nodes
Earlier we have shown that an approximate triangle counting algorithm can be obtained by first sampling a vertex $v$ and then uniformly sampling one of the triples, $\langle u, v, w \rangle$, centered at vertex $v$. A random walk variant of this algorithm performs the first sampling task, i.e., sampling $v$, by a random walk over the graph. A crucial task, though, is to design the random walk in such a way that the vertex $v$ is sampled from a distribution that enables unbiased triangle counting.
A simple random walk, which chooses the next vertex uniformly from the neighbors of the currently visited vertex, has a stationary distribution $\pi(\cdot) = d(\cdot)/2m$, where $d(\cdot)$ is the degree of a vertex. So, if we perform a simple random walk and return a triple $t = \langle u, v, w \rangle$ uniformly among all the triples centered at the currently visited vertex, $v$, the probability of sampling the triple $t$ is equal to $\frac{d(v)}{2m} \times \frac{1}{\binom{d(v)}{2}}$, which is proportional to $1/(d(v) - 1)$, not uniform. Because unbiased triangle counting by transitivity approximation requires uniform triple sampling, we can use the idea of IS that we have discussed earlier.
If we have a triple sample set $T = \{t_i = \langle u_i, v_i, w_i \rangle\}_{1 \le i \le |T|}$, where $v_i$ is the center node of the triple $t_i$, then
$$\hat{\gamma} = \sum_{t_i \in T} w(t_i)\, I(t_i \text{ is closed}) = \frac{\sum_{t_i \in T} (d(v_i) - 1)\, I(t_i \text{ is closed})}{\sum_{t_j \in T} (d(v_j) - 1)}; \tag{14}$$
here, we used $w(t_i) = \frac{d(v_i) - 1}{\sum_{t_j \in T} (d(v_j) - 1)}$, using Eq. (8). Once an unbiased estimate of transitivity is obtained, the triangle count can be approximated using Eq. (11).
However, Rahman et al. [12] did not use IS in their solution; rather, they used the MH algorithm. Their solution, named Vertex-MCMC, works as follows: use the MH algorithm to design a random walk whose stationary distribution is $\pi(\cdot) \propto \binom{d(\cdot)}{2}$. Say the random walk of a Vertex-MCMC-based triple sampler is visiting a vertex $a$. To use the MH algorithm, we need a proposal distribution ($q$) to make a trial move; Vertex-MCMC chooses $q$ to be uniform over the neighbors of $a$; in other words, it chooses one of the vertices (say, $b$) from the adjacency list of $a$ uniformly. Therefore, the proposal distribution is $q(a, b) = 1/d(a)$, where $q(a, b)$ represents the probability of an adjacent node $b$ being selected from the node $a$. Similarly, $q(b, a) = 1/d(b)$. Now, using Eq. (6), the acceptance probability of the proposed move is as shown in Eq. (15).
$$\alpha(a, b) = \min\left\{1, \frac{\binom{d(b)}{2} \cdot \frac{1}{d(b)}}{\binom{d(a)}{2} \cdot \frac{1}{d(a)}}\right\} = \min\left\{1, \frac{d(b) - 1}{d(a) - 1}\right\}. \tag{15}$$
The above MCMC random walk ensures that each vertex is sampled from the target distribution, which is proportional to the number of triples at each vertex. So, while visiting a vertex $v$ using the above random walk, a uniform triple sampler simply returns a triple $\langle u, v, w \rangle$, which is one of the triples centered at $v$, selected with uniform probability over all such triples.
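The walk defined by Eq. (15) can be sketched as below. This is a simplified illustration, not the Vertex-MCMC implementation of Rahman et al.: the names `acceptance` and `vertex_mcmc_transitivity` are hypothetical, the graph is held in memory instead of being crawled, no burn-in or thinning is applied, and a vertex of degree one (which has no triples) always accepts the proposed move, which is a choice made for this sketch.

```python
import random
from collections import defaultdict


def acceptance(deg_a, deg_b):
    """MH acceptance probability min{1, (d(b) - 1) / (d(a) - 1)} of Eq. (15)."""
    if deg_a <= 1:
        return 1.0  # sketch-only choice: leave a vertex that has no triples
    return min(1.0, (deg_b - 1) / (deg_a - 1))


def vertex_mcmc_transitivity(edges, start, num_steps, seed=None):
    """Fraction of closed triples among those sampled along the MH walk."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    nbrs = {v: list(ns) for v, ns in adj.items()}
    current = start
    closed = 0
    sampled = 0
    for _ in range(num_steps):
        proposal = rng.choice(nbrs[current])  # q is uniform over the neighbors
        if rng.random() < acceptance(len(nbrs[current]), len(nbrs[proposal])):
            current = proposal
        if len(nbrs[current]) >= 2:
            # Return one triple centered at the current vertex uniformly.
            u, w = rng.sample(nbrs[current], 2)
            sampled += 1
            if w in adj[u]:
                closed += 1
    return closed / sampled if sampled else 0.0
```

Moves toward higher-degree neighbors are always accepted, while moves toward lower-degree neighbors are accepted with probability $(d(b) - 1)/(d(a) - 1)$.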
Random Walk Over Triples
Instead of sampling triples in two stages (first sample a vertex, and then a triple that is centered at that vertex), we can also sample triples directly. Rahman and Al Hasan [12] have also proposed such an approach, which they call triple-MCMC. In this method, a random walk is designed that walks over the space of triples in a network. To facilitate this walk, a neighborhood graph is defined over the set of triples. Any reasonable neighbor definition works; Rahman et al. consider two triples as neighbors if they have two vertices in common. For example, the triples $\langle 1, 2, 3 \rangle$ and $\langle 2, 3, 4 \rangle$ are neighbors because they have two common vertices, $\{2, 3\}$. Starting from an arbitrary triple, the random walk continues over the triples by moving from one triple to a neighbor triple based on the above neighborhood definition. The set of possible neighbors of the currently visited triple can be computed on the fly by finding the other triples that can be obtained by replacing exactly one of the vertices of the current triple. Thus, the walk resembles sampling of dependent triples, where a sampled triple shares two vertices with the previously sampled triple.
For the purpose of triangle counting, the triples need to be sampled uniformly. But a simple random walk does not guarantee this because the triples have different degrees in the neighborhood graph on which the random walk proceeds. To ensure uniform sampling, Rahman et al. proposed to adopt the MH algorithm. Let's assume that the random walk is visiting a triple $t$. For MH's proposal distribution (say $q$), they choose one of the triples from $t$'s neighborhood (say, $s$) uniformly. So, $q(t, s) = 1/|\Gamma(t)|$. Here, $\Gamma(t)$ is the set of neighbors of the triple $t$. Using Eq. (6), the acceptance probability of the proposed move is obtained as shown below:
$$\alpha(t, s) = \min\left\{\frac{1 \cdot \frac{1}{|\Gamma(s)|}}{1 \cdot \frac{1}{|\Gamma(t)|}}, 1\right\} = \min\left\{\frac{|\Gamma(t)|}{|\Gamma(s)|}, 1\right\}. \tag{16}$$
Once a desired number of triples has been sampled, the fraction of closed triples over all the sampled triples provides an unbiased estimate of transitivity, from which an approximate count of triangles can be returned by using Eq. (11). Note that for this method also, instead of using MH, one can perform a simple random walk and then use IS to obtain an unbiased estimate of transitivity. Note also that a random walk-based triple sampling method can approximate transitivity, but not the triangle count unless the total number of triples, $|\Pi|$, is available.
TRIANGLE COUNTING ON
STREAM DATA
For many datasets, graphs are too large to fit in main
memory, but it is easier to access a graph as
streaming edges, such that the edges appear on the stream in an arbitrary order. Even if a graph fits in the main memory, a streaming edge access model is preferred for some computation models, such as MapReduce. The main restriction in a streaming access model is that we cannot save all the edges of a graph in memory, so the statistics of each edge must be processed instantaneously as the edge appears on the stream. For such a restricted access model, it is allowed to go over the edge stream of a graph multiple times (aka a multi-pass streaming algorithm). Going over the input graph stream multiple times may appear inefficient, but if the graph does not fit in main memory, making multiple passes is much cheaper than trying to access a large number of random vertices (or the adjacency lists of random vertices) on the disk.
The earliest streaming triangle counting method was proposed by Bar-Yossef et al. [7]. Their method is based on stream-reduction, a general idea for computation over data streams, which is also proposed in the same paper. The stream-reduction idea can be used for approximating frequency moments over a data stream, which they used for approximating triangles. Unfortunately, their method is mostly of theoretical interest and is not practical for approximating triangles in real-world networks.
Buriol et al. [18] have proposed several methods for triangle counting over an edge stream. Their first method is a three-pass algorithm. In the first pass, the number of edges ($m$) and the number of vertices ($n$) are counted simply by using two counters. In the second pass, an edge $e = (a, b)$ is sampled uniformly from the set of edges. Also, a vertex $v \in V \setminus \{a, b\}$ is chosen uniformly. This leaves us with a triple (not necessarily connected) $\langle a, b, v \rangle$. Then in the third pass, the method simply tests whether $(a, v) \in E \wedge (b, v) \in E$; if yes, then $\beta = 1$, otherwise $\beta = 0$. In this way, $\beta$ is an estimate of the triangles over all possible edge-plus-a-vertex combinations. If $T_1$ is the number of disconnected triples, $T_2$ is the number of connected open triples, and $T_3$ is the number of triangles in the graph, then $T_1 + 2T_2 + 3T_3$ is equal to $m(n - 2)$, the population size. Besides, we also have the expectation of $\beta$, $E[\beta] = \frac{3T_3}{T_1 + 2T_2 + 3T_3} = \frac{3T_3}{m(n - 2)}$. So, an approximate unbiased triangle estimate is equal to $(1/3)\, m(n - 2)\, E[\beta]$. This estimate can be improved by running $s$ copies of this sampling and averaging their corresponding estimates $\{\beta_i\}_{1 \le i \le s}$. Thus, the final estimate of the triangle count is $\frac{m(n - 2)}{3s} \sum_{i=1}^{s} \beta_i$. Because each sample only takes $O(1)$ space, for $s$ samples the total space is bounded by $O(s)$, which is linear in the number of samples but independent of the size of the network. Chernoff's inequality can be used to prove a probabilistic bound on the approximation result.
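The three passes can be sketched over an in-memory edge list as follows. This is an illustrative sketch, not the code of Buriol et al.: the name `buriol_three_pass` is hypothetical, the stream is assumed to be a re-iterable sequence of distinct edges without self-loops, `num_samples` (the $s$ of the text) is assumed to be at least one, and the vertex set is kept in memory for simplicity, which a genuine streaming implementation would avoid.

```python
import random
from collections import defaultdict


def buriol_three_pass(edge_stream, num_samples, seed=None):
    """Estimate the triangle count as m(n - 2) / (3s) * sum of beta_i."""
    rng = random.Random(seed)

    # Pass 1: count the edges (m) and the vertices (n).
    m = 0
    vertices = set()
    for u, v in edge_stream:
        m += 1
        vertices.update((u, v))
    n = len(vertices)
    if m == 0 or n < 3:
        return 0.0
    vertex_list = list(vertices)

    # Pass 2: for every sample pick a uniform edge (a, b), identified by its
    # position in the stream, and a uniform vertex v outside {a, b}.
    slots_by_index = defaultdict(list)
    for slot in range(num_samples):
        slots_by_index[rng.randrange(m)].append(slot)
    samples = [None] * num_samples
    for index, (a, b) in enumerate(edge_stream):
        for slot in slots_by_index.get(index, ()):
            v = rng.choice(vertex_list)
            while v == a or v == b:
                v = rng.choice(vertex_list)
            samples[slot] = (a, b, v)

    # Pass 3: beta_i = 1 iff both (a, v) and (b, v) appear in the stream.
    waiting = defaultdict(list)
    for slot, (a, b, v) in enumerate(samples):
        waiting[frozenset((a, v))].append(slot)
        waiting[frozenset((b, v))].append(slot)
    found = [0] * num_samples
    for u, v in edge_stream:
        for slot in waiting.get(frozenset((u, v)), ()):
            found[slot] += 1
    beta_sum = sum(1 for hits in found if hits == 2)

    return m * (n - 2) * beta_sum / (3 * num_samples)
```

On a complete graph every edge-plus-vertex combination is a triangle, so every $\beta_i$ equals 1 and the estimate reduces to $m(n - 2)/3$, the exact count.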
The above three-pass algorithm can be converted to a two-pass algorithm by combining the first two passes into a single pass. That is, counting $n$ and $m$ and uniformly sampling the edge $(a, b)$ and the vertex $v$ can actually be done in the same pass using reservoir sampling [30]. The key idea of reservoir sampling for sampling an object (uniformly) from a stream of objects is to keep a running count of the number of objects as new objects are seen in the stream. The first object in the stream is always saved in the reservoir, but the $i$th object replaces the object in the reservoir only with probability $1/i$. When the stream ends, the object in the reservoir is the sampled object, chosen uniformly from the stream without prior knowledge of the number of objects in the stream.
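Reservoir sampling with a reservoir of size one, as described above, can be sketched as follows; the function name `reservoir_sample_one` is an assumption of this sketch.

```python
import random


def reservoir_sample_one(stream, seed=None):
    """Return (sampled_object, number_of_objects_seen) for a one-pass stream."""
    rng = random.Random(seed)
    chosen = None
    count = 0
    for item in stream:
        count += 1
        # The i-th object replaces the reservoir content with probability 1/i;
        # for i = 1 this always happens, so the first object is always stored.
        if rng.randrange(count) == 0:
            chosen = item
    return chosen, count
```

The running count returned alongside the sample is what allows $m$ to be obtained in the same pass as the sampled edge.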
Buriol et al. [18] have actually proposed a one-pass version of their three-pass algorithm, which combines the work of all three passes into one pass, as below. Say the edge $e$ appears in the stream, while $(a, b)$ (the uniformly sampled edge) and $v$ (the uniformly sampled vertex) are in the reservoir. If $e = (a, v)$, set the boolean variable $x = 1$, and if $e = (b, v)$, set the boolean variable $y = 1$. Once the stream ends, if $x = y = 1$, set $\beta = 1$, otherwise $\beta = 0$. $\beta = 1$ represents the fact that we have sampled a triangle $(a, b, v)$, where $(a, b)$ is the uniformly sampled edge and $v$ is the uniformly sampled vertex, both in the reservoir. However, $E[\beta] = T_3/(T_1 + 2T_2 + 3T_3)$, which is one-third (note the missing 3 in the numerator) of the $E[\beta]$ of the three-pass method. This is due to the fact that for the one-pass version, the triangle $\langle a, b, v \rangle$ is counted (i.e., $\beta = 1$) only if the edge $(a, b)$ appears on the stream before the edges $(b, v)$ and $(a, v)$ (the probability of this event is 1/3); on the other hand, the three-pass version counts the triangle for any ordering of the 3 edges. Besides this, the three-pass and one-pass methods are identical. So, similar to the three-pass method, the estimate of triangles is equal to $\frac{m(n - 2)}{s} \sum_{i=1}^{s} \beta_i$, if $s$ parallel copies of sampling are run together.
Jha et al. [15] have provided another one-pass streaming triangle counting algorithm, which is a stream variant of triple sampling. Say, for a graph $G$, $e_1, e_2, \ldots, e_m$ is a sequence of distinct edges. Then $\{G_t\}_{1 \le t \le m}$ are the graphs at time $t$, formed by the edge set $\{e_i \mid i \le t\}$; clearly $G_m = G$. With the arrival of an edge on the stream, the method performs an update on the estimate of the triangles in graph $G_t$. Once the stream ends, the estimated value is equal to the count of triangles in the entire graph, $G_m = G$. The main
idea of this method is to use two reservoir arrays $R_e$ and $R_w$ of size $s_e$ and $s_w$, respectively. For the streaming graph $G_t$, $R_e$ stores a uniform subset of $s_e$ edges from the graph $G_t$ and $R_w$ stores a uniform sample of $s_w$ triples from the graph $G_t$. With the arrival of the edge $e_t$, the method first counts the number of triples in $R_w$ that the edge $e_t$ completes. Then it updates both $R_e$ and $R_w$ to maintain uniform sampling. $R_e$ is updated by inserting $e_t$ in $R_e$ probabilistically (by following reservoir sampling's idea); if this insertion is successful, then $R_w$ is updated by inserting (again probabilistically by reservoir sampling) the new triples that are formed by the edge $e_t$ with the already existing edges in $R_e$. Also, for any state of $R_e$ and $R_w$, the statistics of total triples (tot_triples) formed by the edges in $R_e$ is computed. It is easy to see that if the triples in $R_w$ are sampled uniformly over all the triples, the fraction of triples that are closed in $R_w$ (denoted as $\rho$) approximates the ratio $t(G)/|\Pi|$. But, to obtain the triangle count from this ratio, we also need to know the total number of triples in $G$ ($|\Pi|$), which they estimate by using the Birthday Paradox. The expected number of triangles is then $[\rho\, t^2 / (s_e (s_e - 1))] \times \text{tot\_triples}$.
DISTRIBUTED AND PARALLEL
TRIANGLE COUNTING
We can use streaming methods to approximate the triangle count of huge graphs that do not fit in the main memory of conventional machines. However, if we strive for exact triangle counting on such large graphs, distributed computing provides a viable option. In this section, we will give an overview of parallel and distributed triangle counting methods. The parallel methods use multi-core machines and the distributed methods run on the MapReduce platform [31].
One of the earliest works on distributed triangle counting using the MapReduce framework was proposed by Suri and Vassilvitskii [17]. They proposed a MapReduce variant of an efficient node iterator algorithm, which is shown in Algorithm 3. This algorithm has two rounds: the first round generates all length-two paths in the graph from the edge list, in parallel. The second round counts how many of the length-two paths generated in the first round have a closing edge in the graph. To accomplish this, the second round takes the output of the first round (denoted as Type 1 input in Algorithm 3) along with the original edge list (denoted as Type 2 input in Algorithm 3) as inputs. Suri and Vassilvitskii [17] also proposed a graph partition-based MapReduce algorithm, which first partitions the graph and then runs an exact triangle counting method on each partition, in parallel. Later, Park and Chung [32] identified redundant computation in Suri et al.'s method and proposed another partitioning method called Triangle Type Partitioning. Pagh and Tsourakakis [26] proposed a MapReduce version of their edge sampling-based method; however, this method provides only an approximate count, as it is based on sampling.
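The two-round structure of the MapReduce node iterator described above can be imitated in a single process, with dictionaries standing in for the shuffle phase. This is only a sketch of the data flow under our own assumptions, not Algorithm 3 itself: the names `round_one`, `round_two`, `count_triangles_two_rounds`, and `CLOSING_EDGE` are hypothetical, the input is assumed to be a list of distinct edges without self-loops, paths are generated only from the lowest-ranked vertex (ranked by degree) so that each triangle is counted once, and the degrees are computed up front, which in a real MapReduce job would itself require a separate pass.

```python
from collections import defaultdict
from itertools import combinations

CLOSING_EDGE = object()  # marker for records that come from the edge list


def round_one(edges):
    """Round 1: emit length-two paths, keyed by their pair of endpoints."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1

    def rank(x):
        return (deg[x], repr(x))  # degree order, ties broken by repr()

    # Map: key every edge by its lower-ranked endpoint.
    by_low = defaultdict(list)
    for u, v in edges:
        low, high = (u, v) if rank(u) < rank(v) else (v, u)
        by_low[low].append(high)
    # Reduce: each pair of higher-ranked neighbors forms a length-two path.
    paths = []
    for center, highs in by_low.items():
        for a, b in combinations(highs, 2):
            paths.append((frozenset((a, b)), center))
    return paths


def round_two(paths, edges):
    """Round 2: count the paths whose endpoints are joined by an edge."""
    # Map: group path records and edge records by endpoint pair.
    groups = defaultdict(list)
    for endpoints, center in paths:
        groups[endpoints].append(center)
    for u, v in edges:
        groups[frozenset((u, v))].append(CLOSING_EDGE)
    # Reduce: if the closing edge is present, every path in the group is closed.
    count = 0
    for values in groups.values():
        if any(x is CLOSING_EDGE for x in values):
            count += sum(1 for x in values if x is not CLOSING_EDGE)
    return count


def count_triangles_two_rounds(edges):
    return round_two(round_one(edges), edges)
```

For example, `count_triangles_two_rounds([(1, 2), (2, 3), (1, 3), (3, 4)])` finds the single triangle on vertices 1, 2, and 3.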
Arifuzzaman et al. [33] proposed a distributed memory-based parallel algorithm for triangle counting using the message passing interface. This algorithm partitions the graph based on disjoint subsets of nodes (core nodes), and generates an induced subgraph from each subset of nodes and their neighborhood. Each induced subgraph is assigned to a machine and triangles are counted independently on each machine for the corresponding core nodes. Last, it combines the results from all the machines to get the global triangle count.
Kim et al. [34] proposed a disk-based framework for triangle counting using a multi-core CPU. They categorize the triangles into two types: internal triangles,
for which the adjacency lists of two connected nodes are in the main memory, and external triangles, for which only one adjacency list is in the main memory. This framework stores adjacency lists in a slotted page structure on disk and uses asynchronous reads to load the required pages into buffer memory. The buffer memory is split such that it contains pages (adjacency lists) corresponding to both types of triangles (internal and external) at the same time. The triangles are counted when the adjacency lists are in the buffer, and to avoid redundancy each page is loaded into the buffer exactly once. In this framework, the two types of triangles are counted by two separate threads and, for maximum utilization of the CPU, it also uses thread morphing if one of the threads completes its work and terminates. The framework also uses OpenMP to employ additional available threads for counting internal triangles and later uses thread morphing, if required.
Shun and Tangwongsan [35] proposed a multi-core parallel algorithm for shared memory machines. The proposed algorithm has two steps: in the first step, each node is ranked based on degree and a ranked adjacency list of each node is generated, which contains only nodes ranked higher than the current node; the second step counts triangles from the ranked adjacency list of each node. For the first step, after ranking each node, the generation of ranked adjacency lists is easily parallelizable, as this task is independent for each node. For the second step, an array is created to hold the values of the locally counted triangles; here the size of the array is the total size of the ranked adjacency lists of all nodes. Lastly, the actual triangle count is the sum of the values in the array. Rahman and Al Hasan [13] also proposed a multi-core parallel algorithm for triangle counting, which distributes the loop of the node/edge iterator algorithms across multiple cores.
EXPERIMENTAL COMPARISONS
Schank [21] has performed a thorough experimental comparison among different exact triangle counting methods. In this survey, we make a thorough comparison among various approximate triangle counting methods. For the comparison, we consider two sparsification-based methods: "DOULION" and "Colorful triangle counting." For both methods, the triangles in the sparse network are counted exactly by using the efficient edge iterator algorithm. So, we call these methods doulion_Edgeiter and color_Edgeiter, respectively. We also consider two triple sampling-based methods: "Direct Sampling" (Eq. (10)) and uniform sampling with importance weight adjustment, hereby named Uniform_Importance (Eq. (13)). The sampling version of edge iterator [13], hereby named Sampled_edgeIter, is also considered. We further consider two of the restricted access-based methods: random walk over nodes with importance weight adjustment, hereby named randWalk_importance (Eq. (14)), and vertexMCMC [12], i.e., MCMC walk over nodes. We implement all the above methods ourselves by using identical graph data structures and an identical edge existence query module. This ensures fairness of the comparison. We intentionally omitted some of the approximate triangle counting methods from this experiment after we found that their performance is substantially poorer than the performance of the methods we report here.
For the experiment, we use four large graphs collected from KONECT [the Koblenz Network Collection (http://konect.uni-koblenz.de/networks/)]. The first, "as-skitter," is a network of autonomous systems on the Internet, where autonomous systems are nodes and connections between them are edges. The "flickr," "livejournal," and "orkut" graphs are social networks, where each node is a user and an edge between two users shows friendship between them. The basic statistics of the datasets are shown in Table 2, where $|V|$, $|E|$, and $t(G)$ are the number of vertices, the number of edges, and the number of triangles in the graph, and time (ms) is the time taken in milliseconds by the edge iterator with hashing method, which is one of the fastest exact triangle counting methods.
Comparing approximate triangle counting methods is tricky, as many of these methods are sampling-based and hence have an error-runtime trade-off, i.e., if we take more samples, the error decreases but the runtime increases, and vice versa. Besides, the populations from which these methods sample are different; some sample triples, some sample triangles, and some sample edges. So it is not easy to simply compare the error of these methods using a unified sampling factor. So, we report both the error and the runtime of a method as a point on a graph. For each method, we take three points by running it for three different sampling factor values. For a given method, we connect the three points by a piece-wise linear curve. To obtain the data of this graph, the
TABLE 2 | Basic Statistics of the Datasets
Dataset        |V|      |E|        t(G)       Time (ms)
as-skitter     1.69M    11.09M     28.77M     38,989.82
flickr         1.72M    15.55M     548.17M    174,216.12
liveJournal    5.20M    48.71M     310.87M    231,541.28
orkut          3.07M    117.19M    627.58M    867,634.33
error is computed as the percentage error, which is equal to
$\frac{|\text{exact-count} - \text{approx-count}| \times 100}{\text{exact-count}}$,
and the runtime is reported in milliseconds.
For the sparsification-based methods and the edge
sampling-based methods, we use the sampling probability
$p \in \{0.001, 0.01, 0.1\}$. For both full access- and
restricted access-based triple sampling methods, the number
of samples |T| is selected from {0.001%, 0.01%, 0.1%} of |Π|,
where |Π| is the total number of triples. These values of p
and |T| help us to understand how time and error are related
for the approximation methods. For all the methods, we use a
serial version of the method, although they can easily be
parallelized to run faster. We run all our experiments on a
machine with an AMD 2.3 GHz processor, 128 GB RAM, and Red
Hat Enterprise Server Release 7.3 OS. For every data point,
the value is computed by running the corresponding method 10
times and then taking the average value of those runs.
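To make the role of |T| and |Π| concrete, the following sketch estimates the triangle count from uniformly sampled triples. It is a common form of uniform triple (wedge) sampling and may differ in detail from Eq. (10); the function name, the adjacency-dict input, and the rounding of the sample count are assumptions made for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def triple_sampling_estimate(adj, fraction, rng):
    """Estimate the triangle count from uniformly sampled triples.

    adj: dict mapping each vertex to the set of its neighbors.
    fraction: sample count as a fraction of |Pi| (0.00001 means 0.001%).
    """
    centers = [v for v in adj if len(adj[v]) >= 2]
    neighbors = {v: list(adj[v]) for v in centers}
    # A vertex of degree d is the center of d*(d-1)/2 triples.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    cumulative = list(accumulate(weights))
    total_triples = cumulative[-1] if cumulative else 0
    if total_triples == 0:
        return 0.0
    num_samples = max(1, round(fraction * total_triples))

    closed = 0
    for _ in range(num_samples):
        # Pick a center proportionally to its triple count,
        # then two distinct neighbors uniformly at random.
        idx = bisect_right(cumulative, rng.randrange(total_triples))
        a, b = rng.sample(neighbors[centers[idx]], 2)
        if b in adj[a]:
            closed += 1
    # Every triangle contains exactly three closed triples.
    return (closed / num_samples) * total_triples / 3
```

Because the center is drawn proportionally to its number of triples, every triple in Π is equally likely, so the closed fraction is an unbiased estimate of the transitivity and no importance weights are needed in this variant.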
In Figure 2, we show four charts, one for each of the
datasets. Each chart has seven piece-wise linear curves, each
representing one of the approximate triangle counting
methods. Each curve has three points, representing log(error)
versus log(runtime) of a method for three sampling rates
(p values). All the curves have a negative slope, showing the
inverse relationship between error and runtime, i.e., at the
lowest sampling rate they have the smallest runtime but the
largest error. The data point that is closest to the
lower-left corner of a chart is the best, as it has both
small error and small runtime.
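The construction of one plotted point can be sketched as below. The text does not say whether the averaging over 10 runs is applied to the estimates or to the errors; this sketch assumes the per-run percentage errors and runtimes are averaged, and the function names and the (estimate, runtime) input format are hypothetical.

```python
import math
from statistics import mean


def percentage_error(exact_count, approx_count):
    """Return |exact - approx| * 100 / exact."""
    if exact_count == 0:
        raise ValueError("exact_count must be non-zero")
    return abs(exact_count - approx_count) * 100.0 / exact_count


def log_log_point(exact_count, runs):
    """Return (log10 runtime, log10 %error) for one method and sampling rate.

    runs: list of (approx_count, runtime_ms) pairs from repeated executions.
    Assumption: errors and runtimes are averaged over the runs.
    """
    avg_error = mean(percentage_error(exact_count, a) for a, _ in runs)
    avg_runtime = mean(t for _, t in runs)
    # math.log10 raises ValueError if the average error is exactly zero.
    return math.log10(avg_runtime), math.log10(avg_error)
```

Three such points, one per sampling rate, joined by straight segments give one curve of a chart.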
If we observe the graphs carefully, we can conclude that both
sparsification-based methods need more time than the other
approximation methods to reach comparable accuracy. For lower
values of p, DOULION has very high error; the colorful
sampling method performs consistently better than DOULION in
that case, but takes more time. Among the other approximation
methods, almost all take a similar amount of time but have
different error values. The best approximation method is
Direct Sampling,
FIGURE 2 | Comparison of approximation methods. Panels: (a) results for as-skitter, (b) results for flickr, (c) results for liveJournal, (d) results for orkut. In each panel, the x-axis is Log10 (ms) and the y-axis is Log10 (%error). Curves: Direct_sampling, Uniform_importance, Doulion_edgeIter, Sampled_edgeIter, Color_edgeIter, RandWalk_importance, and VertexMCMC.
which always provides the lowest error. The Sampled_edgeIter
method also performs consistently well, and on some occasions
it is faster than the Direct Sampling method, but with a
higher error. The indirect sampling-based methods (MCMC and
importance weight-based sampling) are fast, but they
generally have a higher error than the direct sampling-based
method.
OTHER RELATED COUNTING TASKS
Although the triangle counting task has received enormous
attention, there are other counting tasks that count higher
order graphical structures beyond triangles. An obvious
extension of a triangle is a k-clique, a complete graph with
k vertices, and a related counting task is to count the
distinct k-cliques in a given graph for a user-chosen value
of k. However, counting k-cliques is much more difficult than
counting triangles, as the number of k-cliques increases
exponentially with the value of k. An efficient sequential
k-clique counting algorithm can be obtained by modifying the
well-known Bron–Kerbosch algorithm.³⁶ However, this algorithm
does not scale to large real-life networks. To address this
lack of scalability, Finocchi et al.³⁷ proposed a distributed
k-clique counting algorithm that runs on MapReduce.
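As a small illustration of the k-clique counting task, the sketch below counts k-cliques by extending cliques along a fixed vertex ordering. It is a simple ordered recursion, not the modified Bron–Kerbosch algorithm or the MapReduce method cited above; the function name and the adjacency-dict input (every neighbor must also be a key) are assumptions.

```python
def count_k_cliques(adj, k):
    """Count distinct k-cliques; adj maps each vertex to its neighbor set."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank vertices by degree so that each clique is built in increasing
    # rank order and is therefore counted exactly once.
    order = sorted(adj, key=lambda v: len(adj[v]))
    rank = {v: i for i, v in enumerate(order)}
    later = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    def extend(candidates, remaining):
        # candidates: higher-ranked vertices adjacent to every chosen vertex.
        if remaining == 0:
            return 1
        return sum(extend(candidates & later[v], remaining - 1)
                   for v in candidates)

    return sum(extend(later[v], k - 1) for v in adj)
```

For k = 3 this reduces to triangle counting; the running time grows quickly with k, which reflects the difficulty noted in the text.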
Graphlet counting is another related task, which has become
very popular in recent years because of its wide
applicability in different domains for various tasks,
including network classification,³⁸ biological network
comparisons,³⁹,⁴⁰ image classification,⁴¹ and building graph
kernels for chemoinformatics.⁴² For a given k, all possible
k-size graphical topologies are collectively referred to as
graphlets. For undirected graphs, there are 2 size-3
graphlets (open triples and closed triples), 6 size-4
graphlets, and 21 size-5 graphlets. The number of distinct
graphlets increases exponentially with the size of the
graphlet. The counting task is to obtain the number of
distinct induced occurrences of all graphlets (of a given
size) in a given network.
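For the smallest case, the induced counts of the two size-3 graphlets follow directly from vertex degrees and the triangle count. The sketch below is illustrative only; the function name and the adjacency-dict input are assumptions. It uses the fact that each triangle contains three triples, none of which is an induced open triple.

```python
def size3_graphlet_counts(adj):
    """Return (open_triples, closed_triples) as induced occurrence counts.

    adj: dict mapping each vertex to the set of its neighbors.
    """
    # A vertex of degree d is the center of d*(d-1)/2 triples.
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Summing common neighbors over ordered adjacent pairs sees each
    # triangle 6 times (3 edges, 2 directions each).
    triangles = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u]) // 6
    return triples - 3 * triangles, triangles
```

For a triangle with one pendant vertex attached, this returns 2 open triples and 1 closed triple.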
In the existing literature, several works solve the graphlet
counting problem exactly; examples include FANMOD,⁴³ RAGE,⁴⁴
GRAFT,³⁸ and ESCAPE.⁴⁵ The earliest among these, FANMOD and
RAGE, are very slow, mainly because they use an
enumeration-based approach. GRAFT is faster than these two,
as it only enumerates tree graphlets and then counts the
other graphlets by efficient edge existence checks. Hočevar
and Demšar⁴⁶ provided an efficient method, named ORCA, which
does not enumerate all graphlets, but counts a subset of
graphlets and calculates the other graphlet counts using a
combinatorial approach. However, ORCA is not highly scalable
when it needs to handle huge real-world graphs with millions
of nodes/edges. Recently, Ahmed et al.⁴⁷ provided a highly
efficient and scalable method, namely PGD, for graphlet
counting by utilizing graphlet transition. Graphlet
transition relates two graphlets through the addition or
removal of an edge, which helps to calculate the count of one
graphlet using the count of another, smaller graphlet. PGD is
scalable, but works only for graphlets of up to size four.
Pinar et al.⁴⁵ proposed a method (ESCAPE), which can provide
counts of size-5 graphlets very efficiently. Similar to PGD,
ESCAPE also calculates the counts of four- and five-sized
graphlets using the counts of a specific set of other (mostly
smaller) graphlets.
Similar to the case of triangle counting, approximate
counting methods have also been popular for graphlet
counting. Rand-ESU (available in the FANMOD library) is one
of the earliest approximate graphlet counting methods, but
its accuracy is poor. In recent years, Bhuiyan et al.⁴⁸
proposed a method called GUISE, which uses MCMC sampling to
obtain uniform samples of graphlets through a random walk.
GUISE samples graphlets of up to size 5, but Saha and Al
Hasan⁴⁹ have generalized the method so that it can sample
graphlets of any size. Wang et al.⁵⁰ proposed an improved and
more efficient method based on random walk. Rahman et al.⁵¹
proposed an edge sampling-based approximation method (GRAFT),
which aligns a sampled edge with a specific edge of a
graphlet and then enumerates all embeddings of the graphlet.
Jha et al.⁵² proposed a three-path sampling-based method for
approximate counting of size-4 graphlets, which has been
shown to be more efficient than GUISE and GRAFT. Recently,
Bressan et al.⁵³ proposed a color coding-based approach and
showed its superiority over MCMC-based methods.
CONCLUSIONS
Triangles play a very important role in network analysis. In
social networks, triangles represent transitivity, which is
important for understanding network evolution over time. In
biological networks, several motifs have been found to be
triangles representing various biological pathways. Due to
the importance of triangles, enumerating and counting them in
large networks are important tasks. Both enumeration and
counting of triangles have been studied for a long time, but
in recent years, there has
been a renewed interest in triangle counting methods
considering approximate counting, parallel and dis-
tributed implementation, and restricted and stream-
ing data access scenarios.
In this survey, we discuss the existing methods of triangle
counting, ranging from sequential to parallel, single-machine
to distributed, exact to approximate, and off-line to
streaming. We place more emphasis on the recent methods,
specifically on the methods for approximate triangle counting
by sampling. We also show an experimental comparison of the
approximate triangle counting methods, implemented with
uniform data structures. Our results show that triple
sampling-based methods are superior to other approximate
triangle counting methods, both in terms of accuracy and
runtime. Future works in this direction will consider
counting higher order graphical structures having more than
three vertices. Some works on counting higher order
structures have already emerged, but given that it is a very
active area of research, we expect that many more studies
will come in the near future.
ACKNOWLEDGMENT
This research is supported partly by NSF-1149851
grant and a Research Award from CareerBuilder.
REFERENCES
1. Watts DJ, Strogatz S. Collective dynamics of 'small-world' networks. Nature 1998, 393:440–442.
2. Luce RD, Perry AD. A method of matrix analysis of group structure. Psychometrika 1949, 14:95–116.
3. McPherson M, Smith-Lovin L, Cook JM. Birds of a feather: homophily in social networks. Annu Rev Soc 2001, 27:415–444.
4. Aggarwal C, Subbian K. Evolutionary network analysis: a survey. ACM Comput Surv 2014, 47:10:1–10:36. https://doi.org/10.1145/2601412.
5. Becchetti L, Boldi P, Castillo C, Gionis A. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proc. of 14th ACM SIGKDD, 2008, 16–24.
6. Eckmann JP, Moses E. Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc Natl Acad Sci U S A 2002, 99:5825–5829.
7. Bar-Yossef Z, Kumar R, Sivakumar D. Reductions in streaming algorithms, with an application to counting triangles in graphs. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '02, Philadelphia, PA, USA, 2002, 623–632. Society for Industrial and Applied Mathematics. ISBN: 0-89871-513-X. Available at: http://dl.acm.org/citation.cfm?id=545381.545464.
8. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435:814–818.
9. Itai A, Rodeh M. Finding a minimum circuit in a graph. In: Proceedings of the Ninth Annual ACM Symposium on Theory of Computing, STOC '77, 1977, 1–10.
10. Alon N, Yuster R, Zwick U. Finding and counting given length cycles. Algorithmica 1997, 17:209–223.
11. Tsourakakis CE. Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 IEEE 8th International Conference on Data Mining, 2008, 608–617.
12. Rahman M, Al Hasan M. Sampling triples from restricted networks using MCMC strategy. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, 3–7 November, 2014, 1519–1528. 10.1145/2661829.2662075.
13. Rahman M, Al Hasan M. Approximate triangle counting algorithms on multi-cores. In: Proceedings of the 2013 IEEE International Conference on Big Data, Santa Clara, CA, USA, 6–9 October, 2013, 127–133. 10.1109/BigData.2013.6691744
14. Tsourakakis CE, Kang U, Miller GL, Faloutsos C. Doulion: counting triangles in massive graphs with a coin. In: Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
15. Jha M, Seshadhri C, Pinar A. A space efficient streaming algorithm for triangle counting using the birthday paradox. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, New York, NY, USA, ACM, 2013, 589–597. ISBN: 978-1-4503-2174-7. 10.1145/2487575.2487678
16. Kolda TG, Pinar A, Seshadhri C. Triadic measures on graphs: the power of wedge sampling. In: SIAM Data Mining, SIAM, 2013, 10–18.
17. Suri S, Vassilvitskii S. Counting triangles and the curse of the last reducer. In: Proceedings of the 20th International Conference on World Wide Web, WWW '11, 2011, 607–614.
18. Buriol LS, Frahling G, Leonardi S, Marchetti-Spaccamela A, Sohler C. Counting triangles in data streams. In: Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on
Principles of Database Systems, PODS '06, New York, NY, USA. ACM, 2006, 253–262. ISBN: 1-59593-318-2. 10.1145/1142351.1142388
19. Barabasi A-L, Albert R. Emergence of scaling in random networks. Science 1999, 286:509–512.
20. Newman MEJ, Watts DJ, Strogatz SH. Random graph models of social networks. Proc Natl Acad Sci U S A 2002, 99(suppl 1):2566–2572.
21. Schank T. Algorithmic aspects of triangle-based network analysis. PhD Thesis, Department of Computer Science, University of Karlsruhe, 2007.
22. Le Gall F. Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, 2014.
23. Schank T, Wagner D. Finding, counting and listing all triangles in large graphs, an experimental study. In: Proceedings of the 4th International Conference on Experimental and Efficient Algorithms, WEA '05, 2005, 606–609.
24. Latapy M. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Comput Sci 2008, 407:458–473.
25. Tsourakakis CE, Kang U, Miller GL, Faloutsos C. Doulion: counting triangles in massive graphs with a coin. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, 2009, 837–846.
26. Pagh R, Tsourakakis CE. Colorful triangle counting and a mapreduce implementation. Inf Process Lett 2012, 112:277–281.
27. Etemadi R, Lu J, Tsin YH. Efficient estimation of triangles in very large graphs. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, 2016, 1251–1260.
28. Schank T, Wagner D. Approximating clustering-coefficient and transitivity. J Graph Algorithms Appl 2005, 9:265–275.
29. Al Hasan M. Chapter 5: Methods and applications of network sampling. In: Gupta A, Capponi A, eds. Optimization Challenges in Complex, Networked and Risky Systems. Catonsville, MD: INFORMS; 2016, 115–139. https://doi.org/10.1287/educ.2016.0147.
30. Vitter JS. Random sampling with a reservoir. ACM Trans Math Softw 1985, 11:37–57.
31. Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proc. of the 6th conference on Operating Systems Design and Implementation, Volume 6, 2004, 137–149.
32. Park H-M, Chung C-W. An efficient mapreduce algorithm for counting triangles in a very large graph. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, 2013, 539–548.
33. Arifuzzaman S, Khan M, Marathe M. Patric: a parallel algorithm for counting triangles in massive networks. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, 2013, 529–538.
34. Kim J, Han W-S, Lee S, Park K, Yu H. Opt: a new framework for overlapped and parallel triangulation in large-scale graphs. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, 2014, 637–648.
35. Shun J, Tangwongsan K. Multicore triangle computations without tuning. In: 2015 IEEE 31st International Conference on Data Engineering, April 2015, 149–160.
36. Bron C, Kerbosch J. Algorithm 457: finding all cliques of an undirected graph. Commun ACM 1973, 16:575–577. https://doi.org/10.1145/362342.362367.
37. Finocchi I, Finocchi M, Fusco EG. Clique counting in mapreduce: algorithms and experiments. J Exp Algorithmics 2015, 20:1.7:1–1.7:20. https://doi.org/10.1145/2794080.
38. Rahman M, Bhuiyan MA, Al Hasan M. GRAFT: an efficient graphlet counting method for large graph analysis. IEEE Trans Knowl Data Eng 2014, 26:2466–2478. https://doi.org/10.1109/TKDE.2013.2297929.
39. Hayes W, Sun K, Pržulj N. Graphlet-based measures are suitable for biological network comparison. Bioinformatics 2013, 29:483.
40. Pržulj N. Biological network comparison using graphlet degree distribution. Bioinformatics 2007, 23:e177.
41. Zhang L, Hong R, Gao Y, Ji R, Dai Q, Li X. Image categorization by learning a propagated graphlet path. IEEE Trans Neural Netw Learn Syst 2016, 27:674–685.
42. Kashima H, Saigo H, Hattori M, Tsuda K. Graph kernels for chemoinformatics. In: Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques. Hershey, PA: IGI Global; 2010, 1.
43. Wernicke S, Rasche F. Fanmod: a tool for fast network motif detection. Bioinformatics 2006, 22:1152–1153.
44. Marcus D, Shavitt Y. Rage – a rapid graphlet enumerator for large networks. Comput Netw 2012, 56:810–819.
45. Pinar A, Seshadhri C, Vishal V. Escape: efficiently counting all 5-vertex subgraphs. In: Proceedings of the 26th International Conference on World Wide Web, WWW '17, 2017, 1431–1440.
46. Hočevar T, Demšar J. A combinatorial approach to graphlet counting. Bioinformatics 2014, 30:559–565.
47. Ahmed NK, Neville J, Rossi RA, Duffield N. Efficient graphlet counting for large networks. In: Proceedings of the 2015 IEEE International Conference on Data
Mining (ICDM), ICDM '15, 2015, 1–10. ISBN: 978-1-4673-9504-5.
48. Bhuiyan MA, Rahman M, Rahman M, Al Hasan M. GUISE: uniform sampling of graphlets for large graph analysis. In: 2012 IEEE 12th International Conference on Data Mining, December, 2012, 91–100. 10.1109/ICDM.2012.87.
49. Saha TK, Al Hasan M. FS³: a sampling based method for top-k frequent subgraph mining. Stat Anal Data Min 2015, 8:245–261. https://doi.org/10.1002/sam.11277.
50. Wang P, Lui JCS, Ribeiro B, Towsley D, Zhao J, Guan X. Efficiently estimating motif statistics of large networks. ACM Trans Knowl Discov Data 2014, 9:8:1–8:27. ISSN: 1556-4681.
51. Rahman M, Bhuiyan M, Al Hasan M. Graft: an approximate graphlet counting algorithm for large graph analysis. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, 2012, 1467–1471.
52. Jha M, Seshadhri C, Pinar A. Path sampling: a fast and provable method for estimating 4-vertex subgraph counts. In: Proceedings of the 24th International Conference on World Wide Web, WWW '15, 2015, 495–505.
53. Bressan M, Chierichetti F, Kumar R, Leucci S, Panconesi A. Counting graphlets: space vs time. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, 2017, 557–566.
WIREs Data Mining and Knowledge Discovery Triangle counting in large networks
Vo l u m e 8 , M a r c h / A p ril 2018 © 2017 W i l e y P e r i o d i c a l s , Inc. 19 of 19
... One critical aim in the analysis of complex networks is that of identifying cliques, which is known for its NP-hard complexity [7,28]. In classical algorithms, a typical strategy in identifying cliques is to build upon lower dimensional cliques, such as triangle cliques formed by a combination of edges [1,14]. More contemporary algorithms also compute higher-order cliques (i.e. ...
... Throughout this process, we analyze the bottlenecks and key features of each classic algorithm, ultimately developing alternative methods that combine elements of traditional approaches [52,53] with the node neighbourhood strategy. More specifically, among the various algorithms for finding triangles [1,2,2,15,32,55], we drew inspiration from those that iterate over edges. While these algorithms are effective for detecting low-dimensional cliques, they are limited in scope. ...
... The idea of the algorithms relies on the extension of the iterated edges to higher-order simplices. In this case, the ordering of edges is (1, 2), (2, 3), (1, 5), (3,5), (1,3), (3,4), (4,5). One new edge can give rise to several cliques, for instance, edge (1, 3) at iteration 8, which provides clique expansions at iterations 9, 10 (two triangles -not visible) and 11 (one tetrahedron). ...
Preprint
Full-text available
Identifying cliques in dense networks remains a formidable challenge, even with significant advances in computational power and methodologies. To tackle this, numerous algorithms have been developed to optimize time and memory usage, implemented across diverse programming languages. Yet, the inherent NP-completeness of the problem continues to hinder performance on large-scale networks, often resulting in memory leaks and slow computations. In the present study, we critically evaluate classic algorithms to pinpoint computational bottlenecks and introduce novel set-theoretical approaches tailored for network clique computation. Our proposed algorithms are rigorously implemented and benchmarked against existing Python-based solutions, demonstrating superior performance. These findings underscore the potential of set-theoretical techniques to drive substantial performance gains in network analysis.
... The problem of counting global and local triangles in a graph has been extensively studied in the last decades [10,12,20,21,37,38], in many different settings [25][26][27][28]36]. Due to space constraints, we discuss here the works mostly related to ours, focusing on sampling approaches to estimate triangle counts in graph streams, and refer to the surveys [1,16] for a more in-depth presentation of other approaches. Sampling approaches fall into two main categories: fixed memory, where edges are sampled without exceeding a given memory budget, and fixed probability, where edges are sampled with a given fixed probability. ...
... We consider an undirected graph G = (V, E) with no self-loops and no multiple edges, where V and E are the set of nodes and the set of edges, respectively, with |V | = n and |E| = m. Edges are observed in arbitrary order through the graph stream Σ = {e (1) , ..., e (m) }. ...
... For t = 1, ..., m, we denote with G (t) = (V, E (t) ), where E (t) = {e (1) , . . . , e (t) }, the graph up to time t; note that G (m) = G. ...
Preprint
In this work, we present the first efficient and practical algorithm for estimating the number of triangles in a graph stream using predictions. Our algorithm combines waiting room sampling and reservoir sampling with a predictor for the heaviness of edges, that is, the number of triangles in which an edge is involved. As a result, our algorithm is fast, provides guarantees on the amount of memory used, and exploits the additional information provided by the predictor to produce highly accurate estimates. We also propose a simple and domain-independent predictor, based on the degree of nodes, that can be easily computed with one pass on a stream of edges when the stream is available beforehand. Our analytical results show that, when the predictor provides useful information on the heaviness of edges, it leads to estimates with reduced variance compared to the state-of-the-art, even when the predictions are far from perfect. Our experimental results show that, when analyzing a single graph stream, our algorithm is faster than the state-of-the-art for a given memory budget, while providing significantly more accurate estimates. Even more interestingly, when sequences of hundreds of graph streams are analyzed, our algorithm significantly outperforms the state-of-the-art using our simple degree-based predictor built by analyzing only the first graph of the sequence.
... Comprehensive surveys in the literature (Ribeiro et al., 2021;Al Hasan & Dave, 2018;Ortmann & Brandes, 2014) provide in-depth insights into subgraph counting and triangle (three-clique) counting algorithms in the literature. However, there needs to be more literature that provides a comprehensive review of algorithms for counting k-cliques for k is greater than three. ...
... Despite the availability of comprehensive surveys that provide in-depth insights into subgraph counting and triangle (three-clique) counting algorithms (Ribeiro et al., 2021;Al Hasan & Dave, 2018;Ortmann & Brandes, 2014), there is a notable gap in the literature concerning a thorough review of algorithms for counting k-cliques where k is greater than three. This survey aims to fill that gap, addressing the need for a comprehensive review of k-clique counting algorithms beyond triangles. ...
Article
Full-text available
Clique counting is a crucial task in graph mining, as the count of cliques provides different insights across various domains, social and biological network analysis, community detection, recommendation systems, and fraud detection. Counting cliques is algorithmically challenging due to combinatorial explosion, especially for large datasets and larger clique sizes. There are comprehensive surveys and reviews on algorithms for counting subgraphs and triangles (three-clique), but there is a notable lack of reviews addressing k-clique counting algorithms for k > 3. This paper addresses this gap by reviewing clique counting algorithms designed to overcome this challenge. Also, a systematic analysis and comparison of exact and approximation techniques are provided by highlighting their advantages, disadvantages, and suitability for different contexts. It also presents a taxonomy of clique counting methodologies, covering approximate and exact methods and parallelization strategies. The paper aims to enhance understanding of this specific domain and guide future research of k-clique counting in large-scale graphs.
... Counting triangle has been a well motivated problem with applications across various domains like optimizing query size in database join problems (Bar-Yossef et al., 2002;Atserias et al., 2013;Assadi et al., 2019), computing clustering coefficients, transitivity ratio (Aggarwal and Subbian, 2014;Luce and Perry, 1949;Watts and Strogatz, 1998;Leskovec et al., 2008), studying structures in web graphs (Eckmann and Moses, 2001;Danisch et al., 2018) etc. For a more thorough overview, see the surveys (Al Hasan and Dave, 2018;Tsourakakis et al., 2011). The parametrization based on arboricity is also of practical interest due to occurrence of low-arboricity graphs in various real world scenarios (Dory et al., 2022;Konrad et al., 2024;Onak et al., 2020;Goel and Gustedt, 2006;Danisch et al., 2018;Shin et al., 2018). ...
Preprint
Full-text available
Given a simple, unweighted, undirected graph G=(V,E) with V=n|V|=n and E=m|E|=m, and parameters 0<ε,δ<10 < \varepsilon, \delta <1, along with \texttt{Degree}, \texttt{Neighbour}, \texttt{Edge} and \texttt{RandomEdge} query access to G, we provide a query based randomized algorithm to generate an estimate T^\widehat{T} of the number of triangles T in G, such that T^[(1ε)T,(1+ε)T]\widehat{T} \in [(1-\varepsilon)T , (1+\varepsilon)T] with probability at least 1δ1-\delta. The query complexity of our algorithm is O~(mαlog(1/δ)/ε3T)\widetilde{O}\left({m \alpha \log(1/\delta)}/{\varepsilon^3 T}\right), where α\alpha is the arboricity of G. Our work can be seen as a continuation in the line of recent works [Eden et al., SIAM J Comp., 2017; Assadi et al., ITCS 2019; Eden et al. SODA 2020] that considered subgraph or triangle counting with or without the use of \texttt{RandomEdge} query. Of these works, Eden et al. [SODA 2020] considers the role of arboricity. Our work considers how \texttt{RandomEdge} query can leverage the notion of arboricity. Furthermore, continuing in the line of work of Assadi et al. [APPROX/RANDOM 2022], we also provide a lower bound of Ω~(mαlog(1/δ)/ε2T)\widetilde{\Omega}\left({m \alpha \log(1/\delta)}/{\varepsilon^2 T}\right) that matches the upper bound exactly on arboricity and the parameter δ\delta and almost on ε\varepsilon.
Article
In many real-world applications (e.g., email networks, social networks, and phone call networks), the relationships between entities can be modeled as a temporal graph, in which each edge is associated with a timestamp representing the interaction time. As a fundamental task in temporal graph analysis, triangle counting has received much attention, and several triangle models have been developed, including δ-temporal triangle, sliding-window triangle, and (δ 1,3 , δ 1,2 , δ 2,3 )-temporal triangle. In particular, the δ-temporal triangle, requiring the gap of timestamps of any two edges within it to be bounded by a threshold δ, has been demonstrated effective in many real applications, such as cohesiveness analysis, transitivity, clustering coefficient, and graph classification. In this paper, we study fast algorithms for counting δ-temporal triangles in a given query time window. We first propose an online algorithm, which enumerates all edges in the graph and for each edge, calculates how many δ-temporal triangles end with the edge. We further develop an efficient index-based solution, which maps δ-temporal triangles into points of the 2-dimensional space and further compactly organizes these points using hierarchical structures. Besides, we study the problem of binary δ-temporal triangle counting by considering the existence of δ-temporal triangle among three vertices. Experiments on large temporal graphs show that our online algorithm is up to 70× faster than the state-of-the-art algorithm, and our index-based algorithm is up to 10 ⁸ × faster than the online algorithm.
Chapter
Full-text available
Network data appears in various domains, including social, communication , and information sciences. Analysis of such data is crucial for making inferences and predictions about these networks, and moreover, for understanding the different processes that drive their evolution. However, a major bottleneck to perform such an analysis is the massive size of real-life networks, which makes modeling and analyzing these networks simply infeasible. Further, many networks, specifically, those that belong to social and communication domains are not visible to the public due to privacy concerns, and other networks, such as the Web, are only accessible via crawling. Therefore, to overcome the above challenges, researchers use network sampling overwhelmingly as a key statistical approach to select a sub-population of interest that can be studied thoroughly. In this tutorial, we aim to cover a diverse collection of methodologies and applications of network sampling. We will base the discussion of network sampling in terms of population of interest (vertices, edges, motifs), and sampling methodologies (such as Metropolis-Hastings, random walk, and importance sampling). We will also present a number of applications of these methods.
Conference Paper
Until a few years ago, the fastest known matrix multiplication algorithm, due to Coppersmith and Winograd (1990), ran in time $O(n^{2.3755})$. Recently, a surge of activity by Stothers, Vassilevska-Williams, and Le Gall has led to an improved algorithm running in time $O(n^{2.3729})$. These algorithms are obtained by analyzing higher and higher tensor powers of a certain identity of Coppersmith and Winograd. We show that this exact approach cannot result in an algorithm with running time $O(n^{2.3725})$, and identify a wide class of variants of this approach which cannot result in an algorithm with running time $O(n^{2.3078})$; in particular, this approach cannot prove the conjecture that for every $\varepsilon > 0$, two $n \times n$ matrices can be multiplied in time $O(n^{2+\varepsilon})$. We describe a new framework extending the original laser method, which is the method underlying the previously mentioned algorithms. Our framework accommodates the algorithms by Coppersmith and Winograd, Stothers, Vassilevska-Williams, and Le Gall. We obtain our main result by analyzing this framework. The framework also explains why taking tensor powers of the Coppersmith–Winograd identity results in faster algorithms.
Conference Paper
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for all 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have fewer than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts for graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for 5-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all 4-vertex patterns. This algorithm is typically ten times faster than state-of-the-art 4-vertex counters.
Conference Paper
Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural approaches based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that this approach is outperformed by a carefully engineered version of color coding (CC) [1], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC. Furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price however. While MC is very efficient in terms of space, CC's memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that a careful implementation of CC can push the limits of the state of the art, both in terms of the size of the input graph and of that of the graphlets.
Conference Paper
The number of triangles in a graph is an important metric for understanding the graph. It is also directly related to the clustering coefficient of a graph, which is one of the most important indicators for social networks. Counting the number of triangles is computationally expensive for very large graphs. Hence, estimation is necessary for large graphs, particularly for graphs that are hidden behind searchable interfaces, where the graphs are not available in their entirety. For instance, user networks in Twitter and Facebook are not available for third parties to explore their properties directly. This paper proposes a new method to estimate the number of triangles based on random edge sampling. It improves on traditional random edge sampling by probing the edges that have a higher probability of forming triangles. The method outperforms the traditional method consistently, and can be better by orders of magnitude when the graph is very large. The result is demonstrated on 20 graphs, including the largest graphs we could find. More importantly, we prove the improvement ratio and verify our result on all the datasets. The analytical results are achieved by simplifying the variances of the estimators based on the assumption that the graph is very large. We believe that such a big-data assumption can lead to interesting results not only in triangle estimation, but also in other sampling problems.
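To make the idea of estimating triangles from sampled edges concrete, the sketch below shows one common baseline estimator: sample edges uniformly at random, count the triangles each sampled edge participates in, and scale up. It is a generic illustration and not the probing method this abstract proposes; the function name `estimate_triangles` and its parameters are hypothetical, and the sketch assumes the full adjacency is available in memory, which the hidden-graph setting does not.

```python
import random


def estimate_triangles(edges, num_samples, seed=None):
    """Estimate the triangle count of an undirected simple graph by
    uniform edge sampling with replacement.

    Each triangle contains three edges, so the triangle count T equals
    (sum over edges of t_e) / 3, where t_e is the number of triangles
    through edge e. Sampling edges uniformly gives the unbiased estimate
    T ~ m * mean(t_e) / 3, with m the number of distinct edges.
    """
    rng = random.Random(seed)
    adj = {}
    edge_list = []
    seen = set()
    for u, v in edges:
        key = frozenset((u, v))
        if u == v or key in seen:
            continue  # skip self-loops and duplicate edges
        seen.add(key)
        edge_list.append((u, v))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    if not edge_list or num_samples <= 0:
        return 0.0

    total = 0
    for _ in range(num_samples):
        u, v = rng.choice(edge_list)
        # Triangles through (u, v) correspond to common neighbors of u and v.
        total += len(adj[u] & adj[v])
    return len(edge_list) * total / (3 * num_samples)
```

On a complete graph with four vertices, every edge lies in exactly two triangles, so the estimate is exact regardless of which edges are drawn. On real graphs the estimate varies with the sample, and the variance depends on how unevenly triangles are spread across edges.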
Conference Paper
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that scales to truly massive graphs employs clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling-based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
Conference Paper
From social science to biology, numerous applications rely on graphlets for intuitive and meaningful characterization of networks at both the global macro-level and the local micro-level. While graphlets have witnessed tremendous success and impact in a variety of domains, there has yet to be a fast and efficient approach for computing the frequencies of these subgraph patterns. Existing methods are not scalable to large networks with millions of nodes and edges, which impedes the application of graphlets to new problems that require large-scale network analysis. To address these problems, we propose a fast, efficient, and parallel algorithm for counting graphlets of size k = {3, 4} nodes that takes only a fraction of the time required by current methods. The proposed graphlet counting algorithms leverage a number of proven combinatorial arguments for different graphlets. For each edge, we count a few graphlets, and with these counts along with the combinatorial arguments, we obtain the exact counts of others in constant time. On a large collection of 300+ networks from a variety of domains, our graphlet counting strategies are on average 460× faster than current methods. This brings new opportunities to investigate the use of graphlets on much larger networks and newer applications, as we show in the experiments. To the best of our knowledge, this paper provides the largest graphlet computations to date as well as the largest systematic investigation, covering 300+ networks from a variety of domains.
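The per-edge idea described above (count one graphlet directly, then derive others by a combinatorial identity) can be sketched for the simplest case of 3-node graphlets. This is an illustrative reduction, not the paper's algorithm: the function name `edge_centric_counts` is hypothetical, and only triangles and open wedges (induced 2-paths) are covered. For an edge (u, v) with t common neighbors, the open wedges containing that edge number (deg(u) − 1 − t) + (deg(v) − 1 − t), which follows in constant time once t is known.

```python
def edge_centric_counts(edges):
    """Count triangles and open wedges (induced 2-paths) in an undirected
    simple graph using one pass over its edges.
    """
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    tri_sum = 0
    wedge_sum = 0
    seen = set()
    for u in adj:
        for v in adj[u]:
            key = frozenset((u, v))
            if key in seen:
                continue  # visit each undirected edge once
            seen.add(key)
            # Counted directly: triangles through (u, v).
            t = len(adj[u] & adj[v])
            tri_sum += t
            # Derived in constant time: neighbors of u (other than v) that
            # are not adjacent to v, plus the symmetric term for v.
            wedge_sum += (len(adj[u]) - 1 - t) + (len(adj[v]) - 1 - t)

    # Each triangle is seen from its 3 edges, each open wedge from its 2.
    return {"triangles": tri_sum // 3, "open_wedges": wedge_sum // 2}
```

As a check, a star with three leaves has no triangles and three open wedges, one for each pair of leaves.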
Article
The authors review graph kernels, which are among the state-of-the-art machine learning approaches to computational predictive modeling in chemoinformatics. The authors introduce a random walk graph kernel that defines a similarity between any two labeled graphs based on label sequences generated by random walks on the graphs. They introduce two applications of graph kernels: the prediction of properties of chemical compounds and the prediction of missing enzymes in metabolic networks. In the latter application, the authors propose using the random walk graph kernel to compare any two chemical reactions, and apply it to plant secondary metabolism.