Community Detection in Complex Networks Based
on DBSCAN* and a Martingale Process
Ilias Gialampoukidis, Theodora Tsikrika, Stefanos Vrochidis and Yiannis Kompatsiaris
Information Technologies Institute
Centre for Research and Technology Hellas
Email: {heliasgj, theodora.tsikrika, stefanos, ikom}@iti.gr
Abstract—Community detection is a valuable tool for analyzing complex networks. This work investigates the community detection problem based on the density-based algorithm DBSCAN*. This algorithm, however, requires a lower bound for the community size to be determined a priori, which is a challenging task. To this end, this work proposes the application of a Martingale process to DBSCAN* that progressively detects communities at various levels of granularity. The proposed DBSCAN*-Martingale community detection algorithm corresponds to an iterative process that progressively lowers the threshold of the size of the acceptable communities, while maintaining the communities detected for higher thresholds. Evaluation experiments are performed on four realistic benchmark networks, and the results indicate improvements in the effectiveness of the proposed DBSCAN*-Martingale community detection algorithm in terms of the Normalized Mutual Information and the RAND metrics against several state-of-the-art community detection approaches.
I. INTRODUCTION
Community detection in complex networks aims to identify groups of nodes that are more densely connected to each other than to the rest of the network [1] and thus probably share common properties and/or play similar roles within the network [2]. The detection of the community structure of networks is of great importance in many fields, including sociology and biology [3], as well as computer science [4], i.e. disciplines where systems are often represented as networks. More recently, there has been increasing interest in detecting communities on the Web [5] and social media [1] so as to both gain valuable insights into the particular characteristics and latent phenomena in such networks, and also to exploit the detected communities in various applications, such as in the detection of events in social media streams.
Detecting communities in complex networks is also known as a graph partition problem, given that networks are usually modelled as graphs. A graph can be split into communities in numerous ways, i.e. for each graph there are many possible community structures. In the simple case, a community structure is defined as a graph partition into a set of node sets.
Several community detection algorithms have been proposed (e.g. [2], [6], [7], [3], [8], [9], [10], [11]). The quality of their results is often evaluated by the use of modularity [4], particularly in the absence of appropriate ground-truth. Hence, several approaches use modularity optimization itself as a method for the detection of communities in complex networks [2]. As an alternative to the maximization of modularity, the minimization of the so-called codelength description, being the minimum Shannon information needed to describe a random walk on the network, has also played a key role in revealing community structure [11]. However, none of these approaches is able to identify noise, i.e. nodes that are not members of any community. To address this issue, density-based community detection approaches are more appropriate since they provide support for leaving spuriously connected nodes (i.e. noise) out of the detected community structure.
DBSCAN* [12], the graph analogue of the well-established DBSCAN [13] algorithm, is such a density-based approach that could be applied to community detection. Similarly to DBSCAN, it relies on two parameters, the density level ε and a lower bound MinPts for the number of nodes that may form a community. Both these parameters greatly affect the output of the algorithm, but their estimation is far from trivial. To address this issue, and in particular the estimation of the MinPts parameter, this work proposes an extension to DBSCAN* based on Doob’s Martingale [14], which involves the construction of a Martingale that progressively gains knowledge about the communities in the network based on an iterative application of DBSCAN* for several values of MinPts.
The main contributions of this work are three-fold: (i) the application of DBSCAN* to the community detection problem, (ii) the proposal of a Martingale process for community detection based on DBSCAN*, and (iii) the experimental evaluation of the proposed DBSCAN*-Martingale community detection algorithm against several state-of-the-art community detection approaches by using four realistic benchmark networks [15].
The proposed DBSCAN*-Martingale community detection
algorithm is presented in Section III and its experimental
evaluation is reported in Section IV. First, though, the state-
of-the-art in community detection is discussed next.
II. RELATED WORK
A large number of community detection algorithms have appeared in the literature (e.g. [2], [6], [7]), but only a few of them are large-scale algorithms that are directly applicable to large social media graphs, as reviewed in [1].
This is a draft version of the paper. The final version will appear at IEEEXplore: 978-1-5090-5246-2/16/$31.00 ©2016 IEEE
In: Proceedings of the 11th International Workshop on Semantic and Social Media Adaptation and Personalization
The Girvan–Newman community detection algorithm [3], [4] is a divisive hierarchical process based on the edge betweenness centrality measure, which may be quickly calculated [16]. The edge betweenness of an edge is the number of shortest paths that pass through it, and it identifies the edges that are more likely to connect different communities. The edge with the highest edge betweenness is removed and the remaining edges are assigned new edge betweenness scores. The process generates a dendrogram whose root node is the whole graph and whose leaves are the graph vertices. In order to extract the detected communities, the modularity score is computed at each dendrogram cut, and the cut that maximizes it is selected. The Girvan–Newman algorithm thus requires the maximization of a modularity function, as a stopping criterion, for the optimal extraction of communities. An alternative hierarchical approach for community detection has been proposed [17], which uses the modularity function as the objective function to optimize. Initially, all vertices are separate communities, and any two communities are merged if the modularity increases. The algorithm stops when the modularity no longer increases.
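For reference, the modularity that these hierarchical methods maximize is commonly written, following Newman and Girvan [4], as Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the number of edges, L_c the number of edges inside community c and d_c the sum of the degrees of its nodes. The sketch below is a minimal illustration of that formula for an undirected, unweighted graph; the function name and the edge-list representation are assumptions made here, not part of the paper.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community ID.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: sum of degrees of the nodes in c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two disjoint triangles: splitting them scores higher than merging them.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
split = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
merged = {v: 1 for v in range(6)}
print(modularity(two_triangles, split), modularity(two_triangles, merged))
```

A greedy method such as [17] would repeatedly evaluate a quantity like this and accept only merges that increase it.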
In the Label Propagation method [8], every node is initialized with a unique label and at every step each node adopts the label that most of its neighbors currently have. Hence, an iterative process is defined, in which the densely connected groups of nodes form a consensus on a unique label and communities are extracted.
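The label update described above can be sketched as follows. This is a simplified, deterministic variant written for illustration only: nodes are visited in sorted order and ties are broken by the smallest label, whereas the method of [8] uses randomization for both. The function name and the adjacency-dictionary input are assumptions.

```python
def label_propagation(adj, max_rounds=100):
    """Deterministic label propagation on an undirected graph.

    adj: dict mapping each node to an iterable of its neighbours.
    Returns a dict mapping each node to its final label.
    """
    labels = {v: v for v in adj}  # every node starts with a unique label
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adj):
            counts = {}
            for w in adj[v]:
                counts[labels[w]] = counts.get(labels[w], 0) + 1
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.values())
            # Assumption: ties are resolved by the smallest label.
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    return labels


triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
result = label_propagation(triangles)
```

On the two disjoint triangles, each triangle ends up sharing a single label, which is the consensus behaviour the paragraph describes.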
The Louvain method [9] is based on the maximization of the modularity and involves two phases that are repeated iteratively. In the first phase, each vertex forms a community and for each vertex i the gain of modularity is calculated for removing vertex i from its own community and placing it into the community of each neighbor j of i. The vertex i is moved to the community for which the gain in modularity becomes maximal. In case the modularity decreases or remains the same, vertex i does not change community. The first phase is completed when the modularity cannot be further increased. In the second phase, the detected communities form a new network in which the weight of the link between two new nodes is the sum of the weights of the links between nodes in the corresponding two communities. In this new network, self-loops are allowed, representing links between vertices of the same community. At the end of the second phase, the first phase is re-applied to the new network, until no more communities are merged and the modularity attains its maximum.
The Walktrap method [10] generates random short walks
on the graph by simulating transitions from one node to
another. Since short random walks tend to stay within the
same community, it is possible to detect communities using
such random walks.
The Infomap method [11], [18], [19] is an information-
theoretic approach for community detection. The inventors of
the Infomap method showed that the problem of finding a
community structure in networks is equivalent to solving a
coding problem. In general, the goal of a coding problem
is to minimize the information required for the transmission
of a message. Initially, Infomap employs the Huffman code
[20] in order to give a unique name (codeword) to every node in the network. In contrast to the Louvain method, which maximizes modularity, Infomap minimizes the Shannon information [20] required to describe the trajectory of a random walk on the network. The objective function, which expresses the description length of a random walk on the network (described by the corresponding sequence of codewords of the visited nodes), is called the map equation [11], [18], [19], and is minimized over all possible network partitions.
DBSCAN [13] is a density-based clustering algorithm, which is able to extract clusters without knowing the number of clusters, even in the case where there is noise in the spatial collection of points. The clustering is based on two parameters, ε and MinPts, which are determined by the desired density level and by a lower bound for the number of points in a cluster, respectively. The estimation of the density level, however, is not a trivial task and several approaches have been proposed to extract clusters, using DBSCAN, without determining the parameter ε, such as the DBSCAN-Martingale [21]. The graph analogue of DBSCAN is called DBSCAN* [12] and defines core objects on a graph, in a way similar to the core points of DBSCAN. The transition from density-based clustering of spatial databases to community detection in graphs through DBSCAN* does not involve border points, due to the “updated” definition of reachability.
III. DBSCAN*-MARTINGALE COMMUNITY DETECTION
A. Notation and Preliminaries on DBSCAN* and Martingales
Given a network G(N, E) with N nodes and E edges, density-based community detection algorithms partition the network into k communities, where $N_c$ of the N nodes belong to the detected communities, while the $N \setminus N_c$ nodes that were not assigned to any of the communities are labeled as “noise”. The output of such algorithms corresponds to an N-dimensional vector C. For each node $n_j$, $j = 1, 2, \ldots, N$, the j-th element of C, denoted as C[j], is assigned the ID in $\{1, 2, \ldots, k\}$ of the community to which the node $n_j$ belongs; if a node does not belong to any of the communities, the value 0 is assigned instead. As a result, the communities vector C is an N-dimensional vector with values in $\{0, 1, 2, \ldots, k\}$.
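As a small illustration of this notation, the sketch below builds the communities vector C from an explicit list of communities, using 0 for noise. The helper name and the use of 1-based node indices (to mirror j = 1, ..., N) are choices made here for the example.

```python
def communities_vector(n_nodes, communities):
    """Build the N-dimensional vector C with values in {0, 1, ..., k}.

    communities: list of k disjoint sets of 1-based node indices.
    Nodes that appear in no community keep the value 0 (noise).
    """
    C = [0] * n_nodes
    for community_id, members in enumerate(communities, start=1):
        for j in members:
            if C[j - 1] != 0:
                raise ValueError(f"node {j} belongs to two communities")
            C[j - 1] = community_id
    return C


# Seven nodes, two communities; nodes 4 and 7 are noise.
C = communities_vector(7, [{1, 2, 3}, {5, 6}])
print(C)  # [1, 1, 1, 0, 2, 2, 0]
```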
DBSCAN* relies on two parameters, the density level ε and the minimum number MinPts of nodes that can form a community. We denote the communities vector provided by DBSCAN* as $C_{DBSCAN}(\varepsilon, \mathit{MinPts})$. As this work considers that the parameter ε is fixed, the communities vector is denoted as $C_{DBSCAN}(\mathit{MinPts})$. High values of MinPts typically result in a $C_{DBSCAN}(\mathit{MinPts})$ vector of zeros, i.e. all nodes are marked as noise, since the algorithm fails to detect communities required to have at least MinPts nodes. On the other hand, low values of MinPts result in a single community and thus the partitioning is trivial.
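The dependence on MinPts can be made concrete with a toy density-based detector. The paper does not spell out the core-node rule it uses, so the sketch below rests on stated assumptions: with ε = 1, a node is treated as a core node when it has at least MinPts neighbours, communities are the connected components of the core nodes, and all other nodes are noise. All names are hypothetical and the authors' DBSCAN* may differ in detail.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Adjacency sets of an undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def density_communities(adj, min_pts):
    """Label connected groups of core nodes; non-core nodes get 0 (noise).

    Assumption: a node is a core node if it has >= min_pts neighbours.
    """
    core = {v for v in adj if len(adj[v]) >= min_pts}
    labels = {v: 0 for v in adj}
    next_id = 0
    for start in sorted(core):
        if labels[start] != 0:
            continue
        next_id += 1
        labels[start] = next_id
        queue = deque([start])
        while queue:  # breadth-first search restricted to core nodes
            v = queue.popleft()
            for w in adj[v]:
                if w in core and labels[w] == 0:
                    labels[w] = next_id
                    queue.append(w)
    return labels


# Two 4-cliques joined through the low-degree node 4.
clique_a = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
clique_b = [(5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = build_adjacency(range(9), clique_a + clique_b + [(3, 4), (4, 5)])
for min_pts in (5, 3, 2):
    print(min_pts, density_communities(adj, min_pts))
```

On this toy graph, MinPts = 5 marks every node as noise, MinPts = 3 separates the two cliques and leaves the bridging node as noise, and MinPts = 2 merges everything into a single community, mirroring the behaviour described above.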
The output of DBSCAN* strongly depends on the parameter MinPts. This is illustrated by the example depicted in Figure 1. Figure 1a shows the ground-truth communities as disconnected components for illustrative purposes. A high value of MinPts (MinPts > 13) results in no communities being detected (Figure 1b). For MinPts = 13, two communities are detected (Figure 1c), while for MinPts = 11, two additional communities are detected (Figure 1d). Lower values of MinPts result in the detection of further communities, but at the same time they merge communities that would have been detected as separate by higher values of MinPts.
Fig. 1. Community detection in a social network consisting of 650 nodes using DBSCAN* with ε = 1 and various values of MinPts: (a) ground truth, (b) MinPts > 13, (c) MinPts = 13, (d) MinPts = 11.
This indicates that a single value of MinPts may not suffice to detect all communities, which motivates an iterative process for detecting communities in an effective manner. In particular, starting from high values of MinPts, so that the larger communities are detected, and progressively decreasing MinPts, so that further, smaller communities are detected, would result in a set of communities that are detected based on different values of MinPts; this process would continue until a minimum acceptable threshold of community size is reached. To this end, we propose an extension of DBSCAN* based on Doob’s Martingale, which allows MinPts to be treated as a random variable and involves the construction of a Martingale process, which progressively approaches the $C_{DBSCAN}(\mathit{MinPts})$ vector that contains all communities.
A martingale is a stochastic process, i.e. a sequence of random variables $X_1, X_2, \ldots$, for which the expected future value of $X_{s+1}$, given all prior values $X_1, X_2, \ldots, X_s$, is equal to the present observed value $X_s$. A well-known martingale is Doob’s Martingale, in which our knowledge about a random variable is progressively obtained and which is defined as follows:
Definition 1: (Doob’s Martingale) [14] Let $X, Y_1, Y_2, \ldots$ be any random variables with finite expectation $E[|X|] < \infty$. Then, if $X_s$ is defined by the conditional expectation $X_s = E[X \mid Y_1, Y_2, \ldots, Y_s]$, the sequence of random variables $X_1, X_2, \ldots$ is a martingale.
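The martingale property in Definition 1 follows from the tower property of conditional expectation. The short derivation below is the standard textbook argument, added here for convenience rather than taken from the paper; the conditioning is on $Y_1, \ldots, Y_s$, i.e. on the information available at stage s.

```latex
\begin{aligned}
E[X_{s+1} \mid Y_1, \ldots, Y_s]
  &= E\big[\, E[X \mid Y_1, \ldots, Y_{s+1}] \;\big|\; Y_1, \ldots, Y_s \,\big] \\
  &= E[X \mid Y_1, \ldots, Y_s] \\
  &= X_s .
\end{aligned}
```

Integrability of each $X_s$ is inherited from $X$, since $E[|X_s|] \le E[|X|] < \infty$ by Jensen's inequality for conditional expectations.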
We shall introduce a probabilistic method that constructs a Martingale stochastic process for progressively detecting all communities based on DBSCAN* and a given density level ε. The martingale construction is based on Doob’s martingale (Definition 1), where knowledge is progressively gained about the result of a random variable.
B. Progressive Community Detection Based on a Martingale
In the context of a community detection problem, the random variable that needs to be known is the vector of communities’ IDs, which is a combination of S communities’ vectors $C_{DBSCAN}(\mathit{MinPts}_s)$, each generated for a different value $\mathit{MinPts}_s$, $s = 1, 2, \ldots, S$. For each application of DBSCAN*, the parameter ε is set to 1 so that only the immediate neighborhood of each node is considered. Neighborhoods of order greater than 2 tend to merge different communities, because all communities are mutually reachable through intermediate nodes much more easily than in the case where neighborhoods are considered to be of order 1.
First, we generate S random numbers $\mathit{MinPts}_s$, $s = 1, 2, \ldots, S$, uniformly in $[\mathit{MinPts}_{min}, \mathit{MinPts}_{max}]$, a range of thresholds for the minimum community size. The sample $\mathit{MinPts}_s$, $s = 1, 2, \ldots, S$, is sorted in decreasing order. Initially, there are no communities detected in the network. In the first iteration (s = 1), all communities detected by $C_{DBSCAN}(\mathit{MinPts}_1)$ are kept, corresponding to the community size threshold $\mathit{MinPts}_1$, i.e. the largest value in the sample. In the second iteration (s = 2), some of the communities detected by $C_{DBSCAN}(\mathit{MinPts}_2)$ are new and some of them were previously detected at iteration s = 1. In order to keep only the newly detected communities of the second iteration (s = 2), we keep only the groups of entries with the same cluster ID whose size is greater than or equal to $\mathit{MinPts}_2$ but lower than $\mathit{MinPts}_1$, and set the rest to 0.
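The sampling step can be sketched in a few lines. The paper does not state whether the thresholds are integers; the sketch assumes integer thresholds drawn with replacement, so repeated values are possible. The function name is hypothetical, and the values S = 5 and range [5, 30] are the settings reported in Section IV.

```python
import random


def sample_min_pts(s, min_pts_min, min_pts_max, rng=None):
    """Draw s integer thresholds uniformly from [min_pts_min, min_pts_max]
    and return them sorted in decreasing order."""
    rng = rng or random.Random()
    sample = [rng.randint(min_pts_min, min_pts_max) for _ in range(s)]
    return sorted(sample, reverse=True)


# Seeded generator so that the example is repeatable.
thresholds = sample_min_pts(5, 5, 30, rng=random.Random(0))
print(thresholds)
```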
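Formally, we define the sequence of communities $C^{(s)}$, $s = 1, 2, \ldots, S$, where $C^{(1)} = C_{DBSCAN}(\mathit{MinPts}_1)$ and:
$$C^{(s)}[j] := \begin{cases} 0, & \text{if } n_j \text{ belongs to a previously detected community} \\ C_{DBSCAN}(\mathit{MinPts}_s)[j], & \text{otherwise} \end{cases} \qquad (1)$$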
Finally, we relabel the IDs of the detected communities. Assuming that r new communities are detected at iteration s, we update the labels of $C^{(s)}$ starting from $1 + \max_j C^{(s-1)}[j]$ to $r + \max_j C^{(s-1)}[j]$. The sum of all vectors $C^{(s)}$ up to stage S is the final communities vector of our algorithm:
$$C = C^{(1)} + C^{(2)} + \cdots + C^{(S)} \qquad (2)$$
The sequence of vectors $X_s = C^{(1)} + C^{(2)} + \cdots + C^{(s)}$, $s = 1, 2, \ldots, S$, is Doob’s martingale for the sequence of random variables $Y_s = C_{DBSCAN}(\mathit{MinPts}_s)$, $s = 1, 2, \ldots, S$. Each random selection of $\mathit{MinPts}_s$, $s = 1, 2, \ldots, S$, provides one vector $C_{DBSCAN}(\mathit{MinPts}_s)$ of community IDs for all $s = 1, 2, \ldots, S$. As s increases, more vectors are combined and we gain knowledge about the final vector C of community IDs. The vector $C^{(1)} + C^{(2)} + \cdots + C^{(s)}$ is our “best prediction” for the final vector C of community IDs at stage s. The expected final vector of community IDs at stage s = S has extracted all available communities of various sizes.
Fig. 2. DBSCAN*-Martingale for S = 2 iterations. The two communities detected at the first iteration are re-discovered in the second iteration as a single community, but the update keeps them as separate, together with the newly discovered community of the second iteration. (Figure annotations: “New community detected for …”, “Update the labels of the communities”, “Update the vector …”.)
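The combination of Eq. (1), the relabelling step and Eq. (2) can be sketched as follows, taking the per-threshold vectors as given. This is an illustrative reading, not the authors' R implementation: Eq. (1) is applied node by node, and the additional size filter mentioned in the prose is not implemented. The function name is hypothetical.

```python
def combine_communities(vectors):
    """Combine community vectors, ordered by decreasing MinPts.

    vectors: list of equal-length lists; entry j of a vector is the
    community ID of node j for that threshold (0 = noise).
    Returns the final vector C of Eq. (2).
    """
    n = len(vectors[0])
    final = [0] * n
    for vec in vectors:
        if len(vec) != n:
            raise ValueError("all vectors must have the same length")
        next_id = max(final)  # new IDs continue after the largest used ID
        relabel = {}
        update = [0] * n      # C^(s) after Eq. (1) and relabelling
        for j in range(n):
            # Eq. (1): nodes of previously detected communities are zeroed.
            if final[j] != 0 or vec[j] == 0:
                continue
            if vec[j] not in relabel:
                next_id += 1
                relabel[vec[j]] = next_id
            update[j] = relabel[vec[j]]
        # Eq. (2): the supports are disjoint, so summation merges them.
        final = [f + u for f, u in zip(final, update)]
    return final


# Situation of Fig. 2: two communities found first, then re-found as one
# merged community together with one genuinely new community.
first = [1] * 5 + [2] * 5 + [0] * 5
second = [1] * 10 + [2] * 4 + [0]
print(combine_communities([first, second]))
```

In this toy case the two communities of the first iteration stay separate and the new community receives ID 3, which is the behaviour the figure caption describes.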
This DBSCAN*-Martingale process, which detects communities in a progressive manner and combines them in a single communities vector, is presented as pseudo-code in Algorithm 1 and is also illustrated in Figure 2 for two iterations and values $\mathit{MinPts}_1 = 5$ and $\mathit{MinPts}_2 = 4$, where $X^T$ denotes the transpose of vector X.
Algorithm 1: DBSCAN*-Martingale(ε, MinPts) return C
Generate a random sample of S values in $[\mathit{MinPts}_{min}, \mathit{MinPts}_{max}]$
Sort the generated sample $\mathit{MinPts}_s$, $s = 1, 2, \ldots, S$, in decreasing order
for s = 1 to S
    find $C_{DBSCAN}(\varepsilon, \mathit{MinPts}_s)$
    compute $C^{(s)}$ as in Eq. (1)
    update the community IDs
    update the vector C as in Eq. (2)
end for
return C
The DBSCAN*-Martingale may not assign all nodes to a community. To address this issue, an optional propagation step is applied, where each unassigned node is assigned to the community that is present in its ε-neighborhood. This propagation process is iteratively repeated until there are no unassigned nodes in the connected components of the detected communities. Figure 3 illustrates this process for the case of two communities detected by DBSCAN*-Martingale with ε = 1, where $t_0$ signifies the start of the propagation process following the end of the community detection algorithm. At each iteration $t_i$, $i > 0$, these two communities are expanded with their immediate neighbours (since ε = 1) and, after five iterations, both communities consist of all nodes in their connected component.
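The optional propagation step can be sketched as a sequence of synchronous rounds, one per iteration $t_i$. The paper does not say how to handle an unassigned node whose neighbours belong to several communities; the sketch assumes a majority vote with ties broken by the smallest community ID. Names are hypothetical.

```python
def propagate(adj, labels):
    """Assign unassigned nodes (label 0) to a neighbouring community.

    adj: dict mapping each node to an iterable of its neighbours (eps = 1).
    labels: dict mapping each node to a community ID, 0 meaning unassigned.
    Returns a new dict; nodes with no path to any community stay at 0.
    """
    labels = dict(labels)
    while True:
        updates = {}
        for v in adj:
            if labels[v] != 0:
                continue
            counts = {}
            for w in adj[v]:
                if labels[w] != 0:
                    counts[labels[w]] = counts.get(labels[w], 0) + 1
            if counts:
                best = max(counts.values())
                # Assumption: majority vote, smallest ID wins ties.
                updates[v] = min(c for c, k in counts.items() if k == best)
        if not updates:
            return labels
        labels.update(updates)  # apply the whole round at once


# Path 0-1-2-3-4-5 with communities seeded at both ends; node 6 is isolated.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4], 6: []}
seed = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 2, 6: 0}
print(propagate(path, seed))
```

Each round expands the communities by one hop, so the two seeds meet in the middle of the path while the isolated node remains unassigned.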
The DBSCAN*-Martingale requires S iterations of the DBSCAN* algorithm, which runs in $O(N \log N)$ if a tree-based spatial index is used and in $O(N^2)$ without tree-based spatial indexing [22]. Therefore, the DBSCAN*-Martingale runs in $O(S N \log N)$ for tree-based indexed datasets and in $O(S N^2)$ without tree-based indexing. The optional propagation step has worst-case complexity $O(N)$, since in the worst-case scenario the algorithm will examine all nodes for deciding whether to update their community ID or not. Our code is written in R¹ and uses the DBSCAN-Martingale implementation available on GitHub² for implementing the proposed DBSCAN*-Martingale.
IV. EVALUATION
A. Experimental Set-Up
Evaluation is performed using the community detection benchmark networks developed by Lancichinetti, Fortunato, and Radicchi (LFR) [15]. These LFR networks were developed with the goal of reflecting the structure of real networks and in particular of accounting for the heterogeneity in the distributions of node degrees and of community sizes. This work employs four such networks, namely LFR1, LFR2, LFR3 and LFR4, constructed under the realistic assumptions that (i) the network is scale-free and its degree distribution has a power-law behavior with power-law exponent $\tau_1$, (ii) the community sizes also obey a power-law distribution with exponent $\tau_2$ and (iii) the communities are mixed, i.e. links appear from a node in a community i to a node in a community j, where $i \neq j$. The ratio of links between different communities to the number of links within a community determines the mixing parameter µ. When µ = 0 there is no mixing, thus all communities are also disconnected components, and when µ = 1 there is no community structure.
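One way to check the amount of mixing in a labelled network is to measure the fraction of links that cross community boundaries. The sketch below computes this global fraction as a rough empirical counterpart of µ; it is an assumption made for illustration and not necessarily the exact quantity used by the LFR generator, which specifies mixing per node. The function name is hypothetical.

```python
def empirical_mixing(edges, community):
    """Fraction of edges whose endpoints lie in different communities.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its ground-truth community ID.
    """
    edges = list(edges)
    if not edges:
        raise ValueError("the graph has no edges")
    crossing = sum(1 for u, v in edges if community[u] != community[v])
    return crossing / len(edges)


triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
truth = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
print(empirical_mixing(triangles, truth))             # no mixing
print(empirical_mixing(triangles + [(2, 3)], truth))  # one crossing edge
```

With no crossing edges the value is 0 and the communities are disconnected components, in line with the µ = 0 case described above.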
We used four datasets of sizes 650, 3,182, 21,226 and 41,791 nodes with 10, 50, 200 and 50 communities, respectively. Their characteristics are as follows:
• LFR benchmark dataset 1 (LFR1): 10 communities, 650 vertices, minimum community size 20, community size power-law fit beta = 1.89 (p-value = 0.16 > 0.05), degree distribution power-law fit gamma = 3.54 (p-value = 0.29 > 0.05) and maximum degree = 13.
• LFR benchmark dataset 2 (LFR2): 50 communities, 3,182 vertices, minimum community size 15, community size power-law fit beta = 1.98 (p-value = 0.93 > 0.05), degree distribution power-law fit gamma = 3.63 (p-value = 0.99 > 0.05) and maximum degree = 28.
• LFR benchmark dataset 3 (LFR3): 200 communities, 21,226 vertices, minimum community size 10, community size power-law fit beta = 2.00 (p-value = 0.70 > 0.05), degree distribution power-law fit gamma = 3.33 (p-value = 0.13 > 0.05) and maximum degree = 52.
• LFR benchmark dataset 4 (LFR4): 50 communities, 41,791 vertices, minimum community size 10, community size power-law fit beta = 1.69 (p-value = 0.87 > 0.05), degree distribution power-law fit gamma = 3.49 (p-value = 0.98 > 0.05) and maximum degree = 124.
All these datasets have ground truth community structure, i.e. they provide annotated graph nodes based on the community they belong to.
Fig. 3. Iterative propagation of community membership to unassigned nodes until all nodes in the connected components of the communities detected by DBSCAN*-Martingale are assigned to a community. Panels (a) to (f) show iterations t0 to t5.
TABLE I. COMMUNITY DETECTION EVALUATION RESULTS.
Network size (nodes) | 650 | 650 | 3,182 | 3,182 | 21,226 | 21,226 | 41,791 | 41,791
Method | NMI | RAND | NMI | RAND | NMI | RAND | NMI | RAND
Edge Betweenness [4] | 0.7018 | 0.8793 | 0.8567 | 0.9601 | NA | NA | NA | NA
Fast Greedy [17] | 0.7038 | 0.8808 | 0.8543 | 0.9598 | 0.7046 | 0.8196 | 0.4177 | 0.6303
Label Propagation [8] | 0.5930 | 0.8553 | 0.7116 | 0.9490 | 0.5458 | 0.8144 | 0.2882 | 0.6255
Louvain [9] | 0.6947 | 0.8792 | 0.8589 | 0.9606 | 0.7077 | 0.8198 | 0.4200 | 0.6305
Walktrap [10] | 0.6904 | 0.8808 | 0.8653 | 0.9621 | 0.7081 | 0.8336 | 0.3842 | 0.6529
Infomap [18], [19] | 0.5852 | 0.8551 | 0.7180 | 0.9488 | 0.5569 | 0.8144 | 0.2954 | 0.6255
DBSCAN*-Martingale | 0.7898 | 0.9303 | 0.8665 | 0.9626 | 0.7234 | 0.8437 | 0.4526 | 0.6627
¹ https://www.r-project.org/
² https://github.com/MKLab-ITI/topic-detection
The proposed DBSCAN*-Martingale is evaluated against the well-established and parameter-free community detection algorithms presented in Section II and listed in Table I; to this end, their respective implementations in igraph (version 1.0.1, date: 2015-06-26) are used. Based on preliminary experiments, the range of MinPts values was set to [5, 30] and the number of iterations S to 5. The parameter ε was set to 1, as many community detection approaches consider only the immediate neighborhood of each node. In addition, the propagation process was applied for determining the community membership of some of the unassigned nodes, given that the LFR datasets provide ground truth for all nodes, i.e. no nodes are left unassigned. Finally, the most prominent evaluation measures in community detection were employed, namely Normalized Mutual Information [23] and RAND [24].
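For readers who want to reproduce the two measures, the sketch below gives a plain-Python version of the Rand index [24] and of the Normalized Mutual Information in the form $2 I(A;B) / (H(A) + H(B))$ used in [23]. It is an illustration under stated assumptions, not the igraph code used in the experiments: function names are hypothetical, the noise label 0 is simply treated as one more class, and the pairwise Rand computation is quadratic in the number of nodes.

```python
from collections import Counter
from itertools import combinations
from math import log


def rand_index(a, b):
    """Share of node pairs on which the two partitions agree."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)


def entropy(labels):
    """Shannon entropy (natural logarithm) of a label vector."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized Mutual Information: 2 I(A;B) / (H(A) + H(B))."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label vectors of equal length")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    mutual = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
                 for (x, y), c in Counter(zip(a, b)).items())
    denom = entropy(a) + entropy(b)
    return 1.0 if denom == 0 else 2 * mutual / denom


truth = [1, 1, 2, 2]
print(rand_index(truth, [2, 2, 1, 1]), nmi(truth, [2, 2, 1, 1]))
print(rand_index(truth, [1, 2, 1, 2]), nmi(truth, [1, 2, 1, 2]))
```

Both measures ignore the actual ID values, so a detected partition is not penalized for numbering its communities differently from the ground truth.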
B. Results
Table I presents the results of the evaluation experiments in each of the four datasets. All community detection approaches were applied to all datasets, apart from the Girvan–Newman (Edge Betweenness) approach [4], which is applicable only to small-scale datasets and thus was not applied to LFR3 and LFR4.
The proposed DBSCAN*-Martingale is the best performing community detection approach for both evaluation metrics across all datasets, indicating its quality and robustness across heterogeneous networks of different sizes. The most significant differences to the other approaches for both evaluation metrics are observed for the smallest LFR dataset. For LFR1, DBSCAN*-Martingale achieves improvements over the other community detection approaches ranging from 12% to 35% in terms of NMI and ranging from 5.6% to 8.8% in terms of RAND. In the larger datasets, the DBSCAN*-Martingale still performs better than all the other approaches, but the differences in effectiveness are smaller, particularly for the RAND evaluation metric.
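The percentage ranges quoted for the smallest dataset can be reproduced from the 650-node columns of Table I, assuming they are relative improvements of the form (proposed − baseline) / baseline. The short script below is only an arithmetic check under that assumption; the variable and function names are ours.

```python
# Scores for the 650-node dataset, copied from Table I.
nmi_baselines = {
    "Edge Betweenness": 0.7018, "Fast Greedy": 0.7038,
    "Label Propagation": 0.5930, "Louvain": 0.6947,
    "Walktrap": 0.6904, "Infomap": 0.5852,
}
rand_baselines = {
    "Edge Betweenness": 0.8793, "Fast Greedy": 0.8808,
    "Label Propagation": 0.8553, "Louvain": 0.8792,
    "Walktrap": 0.8808, "Infomap": 0.8551,
}
proposed_nmi, proposed_rand = 0.7898, 0.9303


def improvement_range(proposed, baselines):
    """Smallest and largest relative improvement, in percent."""
    gains = [100 * (proposed - b) / b for b in baselines.values()]
    return round(min(gains), 1), round(max(gains), 1)


print(improvement_range(proposed_nmi, nmi_baselines))    # (12.2, 35.0)
print(improvement_range(proposed_rand, rand_baselines))  # (5.6, 8.8)
```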
Interestingly, the second best performing community detection approach is Walktrap [10], with the exception of NMI for LFR1 and LFR4, where the Fast Greedy [17] and the Louvain [9] methods perform second best, respectively.
V. CONCLUSIONS
This work proposed a novel community detection approach based on the DBSCAN* density-based algorithm and a Martingale process that aims to progressively detect communities in complex networks at various levels of granularity. To this end, it applies an iterative process that progressively lowers the threshold of the size of the acceptable communities, while maintaining the communities detected for higher thresholds. The output of our proposed community detection approach is usually a .json file, which is then imported by other applications. Evaluation experiments over four benchmark datasets with diverse characteristics and sizes against several state-of-the-art community detection methods indicate the effectiveness and robustness of the proposed approach. Further work includes its application in large-scale social media networks where communities can be defined along various dimensions given the multitude of relationships that exist between users (i.e. the nodes in the network) and further optimizations for automatically determining the range of lower bound values to explore in the Martingale process based on the network characteristics. We expect that our method will achieve high performance especially in covert networks where communities are sparsely connected and not very mixed.
ACKNOWLEDGMENT
This work was partially supported by the European Commission by the projects MULTISENSOR (FP7-610411) and HOMER (FP7-312883).
REFERENCES
[1] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos, “Community detection in social media,” Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 515–554, 2012.
[2] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3, pp. 75–174, 2010.
[3] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
[4] M. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, 2004.
[5] Y. Dourisboure, F. Geraci, and M. Pellegrini, “Extraction and classification of dense communities in the web,” in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007, pp. 461–470.
[6] F. D. Malliaros and M. Vazirgiannis, “Clustering and community detection in directed networks: A survey,” Physics Reports, vol. 533, no. 4, pp. 95–142, 2013.
[7] S. Harenberg, G. Bello, L. Gjeltema, S. Ranshous, J. Harlalka, R. Seay, K. Padmanabhan, and N. Samatova, “Community detection in large-scale networks: a survey and empirical evaluation,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 6, no. 6, pp. 426–439, 2014.
[8] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E, vol. 76, no. 3, p. 036106, 2007.
[9] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[10] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” J. Graph Algorithms Appl., vol. 10, no. 2, pp. 191–218, 2006.
[11] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex networks reveal community structure,” Proceedings of the National Academy of Sciences, vol. 105, no. 4, pp. 1118–1123, 2008.
[12] R. J. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in Advances in Knowledge Discovery and Data Mining. Springer, 2013, pp. 160–172.
[13] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, no. 34, 1996, pp. 226–231.
[14] J. L. Doob, “Stochastic processes,” Wiley, New York, vol. 101, 1953.
[15] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E, vol. 78, no. 4, p. 046110, 2008.
[16] U. Brandes, “A faster algorithm for betweenness centrality,” Journal of Mathematical Sociology, vol. 25, no. 2, pp. 163–177, 2001.
[17] A. Clauset, M. E. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, p. 066111, 2004.
[18] M. Rosvall, D. Axelsson, and C. T. Bergstrom, “The map equation,” The European Physical Journal Special Topics, vol. 178, no. 1, pp. 13–23, 2009.
[19] L. Bohlin, D. Edler, A. Lancichinetti, and M. Rosvall, “Community detection and visualization of networks with the map equation framework,” in Measuring Scholarly Impact. Springer, 2014, pp. 3–34.
[20] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[21] I. Gialampoukidis, S. Vrochidis, and I. Kompatsiaris, “A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA,” in Machine Learning and Data Mining in Pattern Recognition. Springer, 2016, pp. 170–184.
[22] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: ordering points to identify the clustering structure,” in ACM SIGMOD Record, vol. 28, no. 2. ACM, 1999, pp. 49–60.
[23] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure identification,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 09, p. P09008, 2005.
[24] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
... (2) many scalable methods based on the seed set expansion process [2-4, 9, 25-28] may lack well-designed seeding strategies [10,11,29] and often rely on ad-hoc strategies; (3) some algorithms that claim to be local, as opposed to methods based on an optimization over the entire graph, in fact still optimize on the community level and thus do not guarantee complete locality; (4) the number of communities in the graph is often predetermined in certain algorithms, which might not be a good treatment, despite its claimed advantage [30] and the possible determination by the non-backtracking matrix [31]; (5) the overlapping communities revealed by some algorithms are in fact still exhaustive in their corresponding link communities [6], which should not be an implicit constraint imposed by algorithms; (6) in many cases, the revealed communities do not follow any order and instead are treated as of equal significance to the graph ('blended' [30]), which may deviate from realistic situations; (7) most algorithms assume that all nodes in the graph should belong to at least one community, without taking care of those isolated nodes that do not have any community membership [32-35]; (8) finally, a OPEN ACCESS RECEIVED ...
... Kloumann and Kleinberg [11] studied different seed set expansion algorithms through a comparative analysis, focusing on the determination of a good seed set. More recently, Gialampoukidis et al [29] proposed a core identification strategy, an algorithm based on the DBSCAN method [44,45] where two parameters are adopted: (1) ε defines the radius of the neighborhood of a node that is considered; (2) MinPts is the minimum number of neighbors of a node's ε-neighborhood; nodes are defined as cores if they have more than MinPts neighbors in their ε-neighborhood. Similarly, Bai et al [10] proposed an algorithm for overlapping community detection using the nodes that are density peaks as community cores, an idea borrowed from clustering analysis [60]. ...
Article
No community detection algorithm can be optimal for all possible networks, thus it is important to identify whether the algorithm is suitable for a given network. We propose a multi-step algorithmic solution scheme for overlapping community detection based on an advanced label propagation process, which imitates the community formation process on social networks. Our algorithm is parameter-free and is able to reveal the hierarchical order of communities in the graph. The unique property of our solution scheme is self-falsifiability; an automatic quality check of the results is conducted after the detection, and the fitness of the algorithm for the specific network is reported. Extensive experiments show that our algorithm is self-consistent, reliable on networks of a wide range of size and different sorts, and is more robust than existing algorithms on both sparse and large-scale social networks. Results further suggest that our solution scheme may uncover features of networks' intrinsic community structures.
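The core identification rule quoted in the excerpt above (a node is a core if it has more than MinPts neighbors in its ε-neighborhood) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the cited work: the graph is unweighted and stored as an adjacency dictionary, the ε-neighborhood is read as the set of nodes within ε hops, and the function names `eps_neighborhood` and `core_nodes` are hypothetical.

```python
from collections import deque


def eps_neighborhood(adj, source, eps):
    """Return the nodes within `eps` hops of `source`, excluding `source`.

    Assumption: distance is measured in hops on an unweighted graph.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == eps:
            continue  # do not expand beyond the eps radius
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[source]
    return set(dist)


def core_nodes(adj, eps, min_pts):
    """Nodes with more than `min_pts` neighbors in their eps-neighborhood."""
    return {v for v in adj if len(eps_neighborhood(adj, v, eps)) > min_pts}


# Toy graph (illustrative only): triangle a-b-c, pendant node d, isolated node e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

print(core_nodes(adj, eps=1, min_pts=1))
print(core_nodes(adj, eps=2, min_pts=2))
```

With ε = 1 and MinPts = 1 only the triangle nodes qualify as cores; widening the radius to ε = 2 with MinPts = 2 also admits the pendant node, while the isolated node is never a core.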
... Despite the success of different solution schemes on various application fronts, some weak points of existing community detection algorithms could be pinned down in practice, which we believe might be problematic in certain cases (see Appendix A for a detailed discussion). These weaknesses include: (1) many solution schemes are over-parameterized, and in some cases the tuning of parameters depends largely on unwarranted heuristics; (2) many scalable methods based on the seed set expansion process [1, 2,4,11,31,32,37,46] may lack well-designed seeding strategies [5,19,26] and often rely on ad-hoc strategies; (3) some algorithms that claim to be local, as opposed to methods based on an optimization over the entire graph, in fact still optimize on the community level and thus do not guarantee complete locality; (4) the number of communities in the graph is often predetermined in certain algorithms, which might not be a good treatment, despite its claimed advantage [17] and the possible determination by the non-backtracking matrix [27]; (5) the overlapping communities revealed by some algorithms are in fact still exhaustive in their corresponding link communities [3], which should not be an implicit constraint imposed by algorithms; (6) in many cases, the revealed communities do not follow any order and instead are treated as of equal significance to the graph ("blended" [17]), which may deviate from realistic situations; (7) most algorithms assume that all nodes in the graph should belong to at least one community, without taking care of those isolated nodes that do not have any community membership [18,23,54,55]; (8) finally, a notification of the quality of detection results is not incorporated in most algorithms, failing to indicate the inevitable limited applicability of the method. ...
... Kloumann and Kleinberg [26] studied different seed set expansion algorithms through a comparative analysis, focusing on the determination of a good seed set. More recently, Gialampoukidis et al. [19] proposed a core identification strategy, an algorithm based on the DBSCAN method [7,15] where two parameters are adopted: (1) ε defines the radius of the neighborhood of a node that is considered; (2) MinPts is the minimum number of neighbors of a node's ε-neighborhood; nodes are defined as cores if they have more than MinPts neighbors in their ε-neighborhood. Similarly, Bai et al. [5] proposed an algorithm for overlapping community detection using the nodes that are density peaks as community cores, an idea borrowed from clustering analysis [47]. ...
... Network community detection can be seen as a clustering task, highly used in data mining scenarios, but applied to networks (GUIDOTTI; COSCIA, 2017). In this way, traditional clustering methods, such as Density-based Spatial Clustering of Applications with Noise (DBSCAN) (ESTER et al., 1996), can be adapted to the community detection task (GIALAMPOUKIDIS et al., 2016). Figure 5 shows a network with 13 nodes and 3 communities (each one represented by blue, green or red nodes). ...
Thesis
Temporal networks (also known as dynamic networks) are often used to model connections that occur over time between parts of a system by using nodes and edges. In temporal networks, all nodes, edges, and times, are known and available to be used in the analysis. However, in several real-world applications, data are produced in a massive and continuous way, which is known as data stream. In this case, the volume of data may be so large that the storage may be impossible and mining tasks become more challenging. In streaming temporal networks, edges are continuously arriving in non-stationary distribution. In both temporal and streaming temporal networks, patterns related to node and edge activity are typically irregular in time, which makes the visualization of such networks helpful to gain insights about network structure and dynamics. Nevertheless, the non-stationary distribution of incoming data increases complexity and turns the streaming temporal network visualization even more challenging. Several visualization layouts have been proposed, but they all have limitations. The main challenge in this context is the amount of visual information, that increases depending on the network size and density, and causes visual clutter due to edge overlap, fine temporal resolution, and node proximity. In this thesis, we propose methods to enhance the visualization of streaming temporal networks through the manipulation of the three network dimensions, namely node, edge, and time. Specifically, we propose: (i) CNO, a visual scalable node ordering method; (ii) SEVis, a streaming edge sampling method; and (iii) a streaming method that adapts the temporal resolution according to local levels of node activity. We also present a comparative study considering the combination of these methods. We show through case studies with real-world networks that each of these methods greatly improves layout readability, thus leading to a fast and reliable decision making.
... Network community detection can be seen as a clustering task, highly used in data mining scenarios, but applied to complex networks (Guidotti and Coscia, 2017). In this way, traditional clustering methods, such as DBSCAN (Ester et al., 1996), can be adapted to the community detection task (Gialampoukidis et al., 2016;Linhares et al., 2020). Among the existent community detection algorithms, Louvain (Blondel et al., 2008) and Infomap (Rosvall and Bergstrom, 2008) represent two of the most recommended approaches due to their performances and low computational complexity (Fortunato and Hric, 2016). ...
Article
Information Visualisation strategies can be applied in a variety of domains. In the context of temporal networks, i.e., networks in which interactions between individuals occur throughout time, efforts have been conducted to develop visual approaches that allow finding interaction patterns, anomalies, and other behaviours not previously perceived in the data. This paper presents two case studies involving real-world education networks from a primary school and a high school. For this purpose, we used the Massive Sequence View (MSV) layout with the Community-based Node Ordering (CNO) method, two well established approaches for visual analysis of temporal networks. Our results show that the identified patterns involving students/students and students/ teachers represent important information to benefit and support decision making about school management and teaching strategies, especially those related to strategic group formation.
... The detection of network communities is an important but also complex computational task [13]. Network community detection is similar to network clustering [13] and thus traditional clustering algorithms such as Density-based Spatial Clustering of Applications with Noise (DBSCAN) [10] and Shared-Nearest-Neighbor (SNN) [17] can be adapted to detect communities in networks [16,38]. There are indeed several algorithms in the literature trying to solve this problem using different approaches [8,24,25,31]. ...
Article
Networks are often used to model the structure of interactions between parts of a system. One important characteristic of a network is the so-called network community structures that are groups of nodes more connected between themselves than with nodes from other groups. Such community structure is fundamental to better understand the organization of networks. Although there are several community detection algorithms in the literature, choosing the most appropriate for a specific task is not always trivial. This paper introduces a methodology to analyze the performance of community detection algorithms using network visualization. We assess the methodology using two widely adopted community detection algorithms: Infomap and Louvain. We apply both algorithms to four real-world networks with a variety of characteristics to demonstrate the usefulness and generality of the methodology. We discuss the performance of these algorithms and show how the user may use statistical and visual analytics to identify the most appropriate network community detection algorithm for a certain network analysis task.
... One can relate communities to clusters in the network, for example, representing groups of friends in social networks, proteins with the same function, or related diseases [9]. Indeed, there are several traditional clustering methods, such as Shared-Nearest-Neighbor (SNN) [27] and Density-based Spatial Clustering of Applications with Noise (DBSCAN) [28], adapted to community detection [29,30]. ...
Article
Temporal networks have been used to map the structural evolution of social, technological, and biological systems, among others. Due to the large amount of information on real-world temporal networks, increasing attention has been given to issues related to the visual scalability of network visualization layouts. However, visual clutter due to edge overlap remains the main challenge calling for efficient methods to improve the visual experience. In this paper, we propose a novel and scalable node reordering approach for temporal network visualization, named Community-based Node Ordering (CNO), combining static community detection with node reordering techniques to enhance the identification of visual patterns. The perception of trends, periodicity, anomalies, and other temporal patterns, is facilitated, resulting in faster decision making. Our method helps not only the study of network activity patterns within communities but also the analysis of relatively large networks by breaking down its structure in smaller parts. Using CNO, we further propose a taxonomy to categorize activity patterns within communities. We performed a number of experiments and quantitative analyses using two real-world networks with distinct characteristics and showed that the proposed layout and taxonomy speed up the identification of patterns that would otherwise be difficult to see.
... This value was taken to be half the mean nearest neighbour distance between all user activity observations. This statistic allows clustered centres of activity areas to be identified by using the epsilon threshold in the DBSCAN algorithm to label each UGC item as a member of a particular activity area cluster (Gialampoukidis et al. 2016). ...
Article
This paper presents a crowd sensing system (CSS) that captures geospatial social media topics and allows the review of results. Using Web-resources derived from social media platforms, the CSS uses a spatially-situated social network graph to harvest user-generated content from selected organizations and members of the public. This allows ‘passively’ contributed social media-based opinions, along with different variables, such as time, location, social interaction, service usage, and human activities to be examined and used to identify trending views and influential citizens. The data model and CSS are used for demonstration purposes to identify geotopics and community interests relevant to municipal affairs in the City of Toronto, Canada.
... Since DBSCAN, in its original implementation, does not predict that the dataset may have multiple density granularities, Gialampoukidis [29] developed DBSCAN*-Martingale. This algorithm adapts the density parameters in order to discover group members with different similarity levels. ...
Article
Social network communities are composed of people with common interests who influence or are influenced by themselves. In the scientific context, Scientific Social Networks are characterized as social networks that represent the social relations established by researchers. Identifying and exploring these relationships are fundamental activities to support scientific experiments. In this study, we aim to discuss the use of complex networks combined with semantic analysis in a network of scientific publications called DBLP. DBLP can be classified as big data, and its use for the analysis of connections and influences among researchers can be considered a context-aware approach. Therefore, in the present study, concepts of complex network analysis are applied to verify the level of influence among researchers, by analyzing the structure of the scientific social network under study and its communities. A bidirectional graph-based model was proposed in order to evaluate the influence of researchers, in addition to algorithms to analyze the network structure and identify scientific communities, using ontological terms and rules, considering the scientific context, and identifying new connections to promote scientific collaboration. For the identification of scientific communities, we proposed an overlapping community detection algorithm, named NetSCAN. A large scientific database (DBLP) together with digital libraries were used to evaluate the model and the algorithms in a historical research experiment. The results point to the viability and effectiveness of the proposed solution.
Article
Density-based clustering is an effective clustering approach that groups together dense patterns in low- and high-dimensional vectors, especially when the number of clusters is unknown. Such vectors are obtained, for example, when computer scientists represent unstructured data and then group them into clusters in an unsupervised way. Another facet of clustering similar artifacts is the detection of densely connected nodes in network structures, where communities of nodes are formed and need to be identified. To that end, we propose a new DBSCAN algorithm for estimating the number of clusters by optimizing a probabilistic process, namely DBSCAN-Martingale, which involves randomness in the selection of the density parameter. We minimize the number of iterations required to extract all clusters by the DBSCAN-Martingale process, by providing an analytic formula. Experiments on spatial, textual and visual clustering show that the proposed analytic formula provides a suitable indicator for the optimal number of required iterations to extract all clusters.
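The idea summarized in this abstract, running DBSCAN at several randomly chosen density levels and accumulating the clusters found, can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: the data are one-dimensional, `dbscan` is a plain textbook DBSCAN written here for self-containment, the density levels are processed in increasing order, and the retention rule (a cluster is kept only if none of its points was assigned at an earlier level) is an assumption. The names `dbscan_martingale` and `random_eps_levels` are hypothetical.

```python
import random


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN on 1-D points; labels are 0 (noise) or 1..k."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    nbrs = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
            for i in range(n)]
    labels = [0] * n
    cluster = 0
    for i in range(n):
        if labels[i] != 0 or len(nbrs[i]) < min_pts:
            continue  # already assigned, or not a core point
        cluster += 1
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if len(nbrs[p]) < min_pts:
                continue  # border point: joins the cluster, does not expand it
            for q in nbrs[p]:
                if labels[q] == 0:
                    labels[q] = cluster
                    stack.append(q)
    return labels


def random_eps_levels(eps_max, n_levels, seed=None):
    """Draw density levels uniformly from [0, eps_max] (assumed distribution)."""
    rng = random.Random(seed)
    return [rng.uniform(0, eps_max) for _ in range(n_levels)]


def dbscan_martingale(points, eps_levels, min_pts):
    """Accumulate clusters over increasing eps levels (assumed retention rule)."""
    final = [0] * len(points)
    k = 0
    for eps in sorted(eps_levels):
        labels = dbscan(points, eps, min_pts)
        for c in sorted(set(labels) - {0}):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # Keep a cluster only if all its points are still unassigned.
            if all(final[i] == 0 for i in members):
                k += 1
                for i in members:
                    final[i] = k
    return final, k


# Toy data: a dense group, a sparser group, and one outlier.
points = [0.0, 0.1, 0.2, 5.0, 6.0, 7.0, 20.0]
final, k = dbscan_martingale(points, eps_levels=[1.5, 0.15], min_pts=3)
print(final, k)
```

In the toy data, the small level recovers only the dense group and the larger level adds the sparser group without overwriting the first, so the two groups at different densities are both retained and the outlier stays labeled as noise. In practice the levels would come from a draw such as `random_eps_levels(2.0, 5, seed=0)` rather than being fixed by hand.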
Chapter
Nowadays there is an important need by journalists and media monitoring companies to cluster news in large amounts of web articles, in order to ensure fast access to their topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. The estimation of the correct number of topics is a challenging issue, due to the existence of “noise”, i.e. news articles which are irrelevant to all other topics. In this context, we introduce a novel density-based news clustering framework, in which the assignment of news articles to topics is done by the well-established Latent Dirichlet Allocation, but the estimation of the number of clusters is performed by our novel method, called “DBSCAN-Martingale”, which allows for extracting noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering, without knowing the number of clusters k, the framework of DBSCAN-Martingale provides the correct number of clusters and the highest Normalized Mutual Information.
Article
Community detection is a common problem in graph data analytics that consists of finding groups of densely connected nodes with few connections to nodes outside of the group. In particular, identifying communities in large-scale networks is an important task in many scientific domains. In this review, we evaluated eight state-of-the-art and five traditional algorithms for overlapping and disjoint community detection on large-scale real-world networks with known ground-truth communities. These 13 algorithms were empirically compared using goodness metrics that measure the structural properties of the identified communities, as well as performance metrics that evaluate these communities against the ground-truth. Our results show that these two types of metrics are not equivalent. That is, an algorithm may perform well in terms of goodness metrics, but poorly in terms of performance metrics, or vice versa.
Article
The proposed survey discusses the topic of community detection in the context of Social Media. Community detection constitutes a significant tool for the analysis of complex networks by enabling the study of mesoscopic structures that are often associated with organizational and functional characteristics of the underlying networks. Community detection has proven to be valuable in a series of domains, e.g. biology, social sciences, bibliometrics. However, despite the unprecedented scale, complexity and the dynamic nature of the networks derived from Social Media data, there has only been limited discussion of community detection in this context. More specifically, there is hardly any discussion on the performance characteristics of community detection methods as well as the exploitation of their results in the context of real-world web mining and information retrieval scenarios. To this end, this survey first frames the concept of community and the problem of community detection in the context of Social Media, and provides a compact classification of existing algorithms based on their methodological principles. The survey places special emphasis on the performance of existing methods in terms of computational complexity and memory requirements. It presents both a theoretical and an experimental comparative discussion of several popular methods. In addition, it discusses the possibility for incremental application of the methods and proposes five strategies for scaling community detection to real-world networks of huge scales. Finally, the survey deals with the interpretation and exploitation of community detection results in the context of intelligent web applications and services.
Chapter
Large networks contain plentiful information about the organization of a system. The challenge is to extract useful information buried in the structure of myriad nodes and links. Therefore, powerful tools for simplifying and highlighting important structures in networks are essential for comprehending their organization. Such tools are called community-detection methods and they are designed to identify strongly intraconnected modules that often correspond to important functional units. Here we describe one such method, known as the map equation, and its accompanying algorithms for finding, evaluating, and visualizing the modular organization of networks. The map equation framework is very flexible and can identify two-level, multi-level, and overlapping organization in weighted, directed, and multiplex networks with its search algorithm Infomap. Because the map equation framework operates on the flow induced by the links of a network, it naturally captures flow of ideas and citation flow, and is therefore well-suited for analysis of bibliometric networks.
Conference Paper
We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.
Article
Networks (or graphs) appear as dominant structures in diverse domains, including sociology, biology, neuroscience and computer science. In most of the aforementioned cases graphs are directed - in the sense that there is directionality on the edges, making the semantics of the edges non symmetric. An interesting feature that real networks present is the clustering or community structure property, under which the graph topology is organized into modules commonly called communities or clusters. The essence here is that nodes of the same community are highly similar while on the contrary, nodes across communities present low similarity. Revealing the underlying community structure of directed complex networks has become a crucial and interdisciplinary topic with a plethora of applications. Therefore, naturally there is a recent wealth of research production in the area of mining directed graphs - with clustering being the primary method and tool for community detection and evaluation. The goal of this paper is to offer an in-depth review of the methods presented so far for clustering directed networks along with the relevant necessary methodological background and also related applications. The survey commences by offering a concise review of the fundamental concepts and methodological base on which graph clustering algorithms capitalize on. Then we present the relevant work along two orthogonal classifications. The first one is mostly concerned with the methodological principles of the clustering algorithms, while the second one approaches the methods from the viewpoint regarding the properties of a good cluster in a directed network. Further, we present methods and metrics for evaluating graph clustering results, demonstrate interesting application domains and provide promising future research directions.