The Journal of Supercomputing (2019) 75:4894–4917
https://doi.org/10.1007/s11227-019-02765-1
Decentralized iterative approaches forcommunity
clustering inthenetworks
AmhmedBhih1 · PrincyJohnson1· MartinRandles1
Published online: 9 February 2019
© The Author(s) 2019
Abstract
In this era of big data, as data sizes scale up, the need for computing power is
increasing rapidly. However, most of the community detection algorithms in the literature
are classified as global algorithms, which require access to the entire information of the
network. These algorithms are designed to work on a single machine and cannot be directly
parallelized. Hence, such algorithms cannot find communities in large-scale networks on
stand-alone machines, because the required processing power far exceeds the capabilities
of a single machine. In this paper, a set of novel Decentralized Iterative Community
Clustering Approaches for extracting an efficient community structure from large networks
is proposed and evaluated using the LFR benchmark model. The approaches can identify the
community clusters of the entire network without global knowledge of the network topology
and work with a range of computer architecture platforms (e.g., clusters of PCs,
multi-core distributed memory servers, GPUs). Detecting and characterizing such community
structures is one of the fundamental topics in network systems’ analysis, and it has many
important applications in different branches of science, including computer science,
physics, mathematics and biology, ranging from visualization, exploratory analysis and
data mining to building prediction models.
Keywords Community detection · Connectivity-based graph clustering · Distributed algorithm
* Amhmed Bhih
a.a.bhih@2011.ljmu.ac.uk; Amhmed_bhih@hotmail.com
Princy Johnson
P.Johnson@ljmu.ac.uk
Martin Randles
M.J.Randles@ljmu.ac.uk
1 Department ofElectronics andElectrical Engineering/Computer Science, LJMU,
LiverpoolL33AF, UK
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1 Introduction
Many real-world complex systems can be represented as networks (also referred
to as graphs), with nodes representing functional units and links describing the
interactions between nodes.
Recently, it has become common to analyse interactions in the real world by looking at
the networks that underlie these interactions [1]. Real-world networks are not random
networks; they usually exhibit inhomogeneity and reveal a high level of order and
organization [2]. An interesting feature that real-world networks usually present is the
community structure property, under which the topology of the network is organized into
modules commonly called communities or clusters [3]. Detecting and characterizing such
community structures is one of the fundamental topics in network systems’ analysis.
Determining the communities in a network helps in understanding its structural makeup.
Thus, the outcome of this research work has valuable applications in several fields such
as biology, social science, physics, computer science and business science [4, 5].
In social networks, for example, clustering of communities can be beneficial for a
range of applications, including finding a common research area in collaboration networks
and finding a set of like-minded users for marketing and recommendations [6].
Community structure is important not only in social networks but also in various other
networks. For example, determining the community structure of the Internet can address
questions such as how to route data packets efficiently, how to reduce the time such
traffic takes and which path to a destination is fast and safe. It can go further, by
elucidating how computer viruses spread through the Internet and what mechanisms they
follow to hit organizations. In dark networks, community structure can reveal the hidden
relationships between individual terrorists [7]. Similarly, in the World Wide Web (WWW),
pages related to the same subject are typically organized into communities, so that
identifying these communities can help in identifying the category of the network as well
as in understanding its dynamic evolution and organization [8].
Thus, the problem of finding the community structure of networks has attracted a huge
amount of research work, and the range of proposed algorithms is rich and diverse.
However, most community detection algorithms have been designed to work on a single
machine with random access to the entire network, so they require access to the entire
network at all times [3, 9].
Driven by the recent emergence of big data, real-world networks have grown to the point
where clustering them with traditional methods and algorithms on a single machine is
almost impossible. The existing methods are limited by their computational requirements,
and most of them cannot be directly parallelized. Furthermore, in many cases the data set
is very large, does not fit into the main memory of a single machine and therefore needs
to be distributed among several machines [10].
Faced with the challenge of big data sets, many researchers have turned to parallel
clustering algorithms that relieve the bottleneck of traditional clustering methods on a
single machine. To cope with this scenario, a distributed and parallel computing model is
needed to process a large data set by scaling
the data set out to multiple machines across a cluster and processing it there. Several
parallel computing frameworks have emerged, of which MapReduce is one of the most
popular [11]. However, the traditional clustering algorithms are centralized (they need
global information) and cannot process data across multiple servers in parallel (or in a
distributed manner).
The main goal of this work is to design and implement novel techniques and
algorithms for the problem of clustering and community detection in large and
undirected networks. The proposed approaches all assume that the given network
must be divided into communities in such a way that every node
belongs to exactly one of the communities (non-overlapping communities).
The following summary provides a short overview of the key contributions of
this work:
1. A novel Decentralized Iterative Community Clustering Approach (DICCA) to extract an
efficient community structure for large networks is proposed. A major advantage of this
approach is that it eliminates the need for global knowledge of the network in order to
cluster it efficiently. This allows DICCA to be run in parallel, and the network data need
not be loaded into the memory of a single machine. Hence, the proposed approach can
cluster communities in large networks without the penalties involved. This cannot be done
with the majority of the existing community detection algorithms, as they implicitly
assume that the entire structure of the network is known and available. Another aspect of
the DICCA approach is that it reduces the problem size by aggregating the nodes in the
network, so that large-scale data sets can be clustered efficiently.
2. A Parallel Decentralized Iterative Community Clustering Approach (PDICCA) that
transforms the operations of the DICCA approach from a serial process into a parallelized
one is presented. PDICCA is a pipelined parallel implementation that maintains the overall
structure of the serial method (DICCA). The novelty of the design is that, although PDICCA
solves the same problem and keeps the same overall structure as the serial method, it
exploits distributed memory and extracts parallelism under the MapReduce framework. The
proposed algorithm does not require any global knowledge of the network topology, is
scalable and works with a range of computer architecture platforms (e.g., clusters of PCs,
multi-core distributed memory servers, GPUs), where the master and slave workers can
represent either different threads in a single machine or different machines in a
computing cluster. Another main contribution of this work is to take advantage of graph
partitioning when performing parallel community clustering, in order to speed up the
process by minimizing the communication between slave workers. Furthermore, a parallel
implementation of PDICCA based on the popular MapReduce model is proposed to accelerate
processing in large-scale networks.
The rest of this paper is organized as follows: Sect. 2 presents a brief overview of
the related literature on graph partitioning and community detection algorithms.
Section 3 gives a detailed description of the Decentralized Iterative Community Clustering
Approach for detecting communities. Section 4 centres on the design and implementation
of the parallel version of the DICCA approach; the principle and implementation of the
proposed PDICCA approach are detailed there. The mathematical model used to obtain optimal
parameter values for the proposed approaches is presented in Sect. 5. The data benchmarks
and experimental results are presented in Sects. 6 and 7, respectively. Finally,
discussion and future work are presented in Sect. 8.
2 Related work
2.1 Graph partitioning andcommunity detection
Community detection is an active area of network science research, and over the years a
wide variety of community detection algorithms have been proposed. In much of the
literature, community detection is also called graph partitioning [12, 13]. It is tempting
to suggest that community detection and graph partitioning address the same question: both
aim to identify groups of nodes in a network that are better connected to each other than
to the rest of the network. However, the two tasks can be distinguished by whether the
experimenter fixes the number and size of the groups or leaves them unspecified [14].
Graph partitioning is the problem of dividing a graph into a predefined number of clusters
of predefined size. It has been pursued particularly in computer science and related
fields, with applications in parallel computing and very-large-scale integration (VLSI)
design. In community detection, by contrast, which has been pursued by sociologists and
more recently by physicists and applied mathematicians, with applications especially to
social and biological networks, the number and size of the clusters are unspecified.
Furthermore, the goal of graph partitioning is usually to identify the best division of a
network regardless of whether a good division exists; if no good division exists, the
least bad one is returned as the solution. A community detection algorithm, on the other
hand, divides the network only when good divisions exist and leaves it undivided
otherwise [14].
Community detection algorithms can be classified in different ways, and depending on the
selected criteria, one algorithm can belong to more than one category. Among them, those
based on modularity maximization, such as the Fastgreedy algorithm [15] and the Louvain
algorithm [16], form the most prominent family.
The Fastgreedy algorithm is an agglomerative hierarchical clustering method proposed
by Newman [15]. The algorithm greedily maximizes the modularity function Q and starts by
assigning each node in the network to its own community. Then, at each stage, the pair of
clusters whose merger yields the greatest increase (or smallest decrease) in modularity is
merged, until only one cluster containing all the nodes in the network remains. The whole
procedure can be represented by a dendrogram (hierarchical tree) that illustrates the
order of the mergers. Cuts through the dendrogram at different levels give different
partitions into communities. The optimal community clustering is found by cutting the
dendrogram at the level of maximum Q.
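The modularity function Q is not written out in this section. As a minimal sketch, assuming the standard Newman definition for undirected weighted graphs, Q is the sum over communities of (internal edge weight / m) minus (community degree / 2m) squared, where m is the total edge weight. The function name `modularity` and the edge-list input format are illustrative choices, not taken from the cited algorithms.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected weighted graph.

    edges: iterable of (u, v, weight) tuples, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    total = 0.0                    # m: total edge weight of the graph
    degree = defaultdict(float)    # weighted degree of each node
    internal = defaultdict(float)  # edge weight lying inside each community
    for u, v, w in edges:
        total += w
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w

    # Sum of node degrees per community.
    comm_degree = defaultdict(float)
    for node, k in degree.items():
        comm_degree[community[node]] += k

    return sum(
        internal[c] / total - (comm_degree[c] / (2.0 * total)) ** 2
        for c in comm_degree
    )
```

A greedy agglomerative method of the kind described above would evaluate this quantity (or, more efficiently, its change) for candidate mergers and keep the dendrogram level at which it is largest.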
The Louvain algorithm is a hierarchical agglomerative optimization method proposed by
Blondel et al. [16] that attempts to optimize the modularity of a partition of the
network. The optimization is performed in two steps that are repeated iteratively. The
algorithm starts with each node in the network belonging to its own community. In the
first step, for each node in the network, the algorithm uses a local moving heuristic to
obtain an improved community structure: it evaluates the gain in modularity obtained by
moving the node from its own community to the community of each of its neighbours. The
node is then placed in the community for which the modularity gain is largest. If none of
these gains is positive, the node stays in its original community. This process is applied
repeatedly and sequentially to all the nodes until no further improvement can be achieved,
which concludes the first step. The second step consists of building a new network whose
nodes are the communities discovered in the first step. The weight of the link between two
communities is the total weight of the links between the nodes of these communities. Once
the second step is completed, the first step can be applied again to the new network. The
two steps are repeated until the modularity no longer increases, at which point a maximum
of modularity has been reached.
Another popular family of methods for finding communities in a network is based on
random walks. An example is the Walktrap (WT) algorithm proposed by Pons and Latapy [17].
Walktrap is based on the principle that random walks on a network tend to get “trapped”
in densely connected parts, which correspond to communities. Instead of modularity, the
authors use a node similarity measure based on short random walks to capture structural
similarities between nodes and to identify communities via hierarchical agglomeration. The
algorithm starts by assigning each node to its own community and computing the distance
between every pair of communities. The communities at minimum distance are merged, and the
process is iterated. After n − 1 steps, the algorithm finishes and yields a hierarchical
structure of communities called a dendrogram. The best partition is then taken to be the
one that maximizes modularity.
Information theoretic algorithms are another major type of community detection
algorithm; they use concepts from information theory to find community clusters in
networks. The Infomap algorithm, proposed by Rosvall and Bergstrom [18], is an example.
Infomap casts the problem of finding the optimal community clustering of a network as the
problem of finding the most compressed (shortest) description of random walks on the
network. It uses a random walk as a proxy for information flow in the network and
minimizes the map equation, which measures the description length of a random walker’s
path, over all possible network clusterings to reveal the community structure. To
represent the community structure, the algorithm uses a two-level nomenclature based on
Huffman coding: one level distinguishes the communities in the network and the other
distinguishes the nodes within a community. In practice, the random walker is likely to
stay inside a community for a long time; therefore, for a community with few
inter-community links, the second level alone is mostly enough to describe the walker’s
path, leading to a compact representation.
Recently, several studies [19–22] have proposed methods for finding the proper
clustering for specific applications. For example, in [20, 22], an intelligent clustering
method for energy-efficient cluster-based routing of data packets in a wireless sensor
network application has been proposed. However, the above-mentioned algorithms are
classified as global algorithms, which require access to the entire information of the
network and are designed to work on a single machine [3].
2.2 Clustering withoutglobal knowledge
There are other algorithms apart from DICCA [23] and PDICCA that achieve some degree of
locality within the graph by considering partial information instead of global
information. Examples include the Connectivity-based Decentralized Node Clustering scheme
(CDC) proposed by Ramaswamy et al. [24], the Distributed Diffusive Clustering algorithm
(DiDiC) proposed by Joachim and Henning [25] and Ja-be-Ja [10]. CDC is a distributed and
scalable algorithm for discovering clusters in peer-to-peer networks. However, the nodes
executing the CDC algorithm need to communicate with their direct neighbours and require
knowledge of all the neighbouring nodes. Similarly, although DiDiC uses distributed
diffusion to eliminate global operations, its communication takes place between
neighbouring graph nodes and thus requires knowledge of all the neighbouring nodes.
Ja-be-Ja is a decentralized local algorithm that uses local search for graph
partitioning; however, it is designed to find partitions of balanced size rather than
well-shaped partitions, and balanced sizes are usually not what the communities of
real-world networks exhibit.
3 Decentralized Iterative Community Clustering Approach forgraph
clustering (DICCA)
DICCA is an agglomerative clustering algorithm based on random walks and reachability,
carried out through message propagation between neighbours. It has two phases, local
clustering and network reduction, which are run iteratively. The local clustering phase
defines an originator node for each community cluster and associates each node in the
network with the best-fitting originator. The reduction phase rebuilds the network from
the communities found in the previous phase: each detected community becomes a node, and
the weight of an edge in the new network is the sum of the weights of the edges between
the two corresponding communities. The DICCA algorithm uses two parameters, the threshold
value and the time to live (TTL) [23]. The concept of the DICCA approach is presented in
Algorithm 1.
Each round of the iteration process starts by choosing a node at random to be an
originator. The originator node acts as a cluster head and advertises itself by sending a
message (Msg) to all its neighbours in the network. This message contains three fields:
Originator node ID (OnID), Message Weight (WMsg) and TTL. OnID is the node ID of the
originator of the message. WMsg is the weight carried by the message, which represents the
estimated probability of reaching a node in the network starting from the originator
node. TTL represents the maximum distance in
hops before a message (Msg) expires. It is worth noting that, in order to avoid the
originator being assigned to any other clusters, the WMsg is set to 1 at the originator.
Consider two nodes, the originator $O_i$ and its neighbouring node $V_i$. The model used
to compute the weight of the message sent from the originator $O_i$ to node $V_i$ depends
on the weight of the edges between $O_i$ and $V_i$, and is defined in Eq. (1) [23].
Every node in the network maintains the originator IDs and, for each originator, the
total weight of the messages it has received from that originator. This information is
referred to as the Total Message Weight. When the node $V_i$ receives a message Msg from a
neighbouring node, it first updates the Total Message Weight value and then checks whether
TTL > 0. If TTL > 0, it decrements the TTL of the message by one and forwards the message
to all its neighbours except the sender.
The weight of the new message $W_{Msg}(V_i, V_k)$ sent from node $V_i$ to its neighbouring
node $V_k$ is defined in Eq. (2) [23].
However, if TTL = 0 or WMsg has become insignificantly low compared to the predefined
threshold value, the node $V_k$ processes the message and stops forwarding it.
A node joins the closest originator $O_i$ if the total weight of the messages received
from that originator is greater than the specified threshold value. Otherwise, the node
remains an outlier and does not join any cluster.
This procedure is repeated iteratively, adding one more originator and updating the
communities and outlier nodes each time, until no outlier nodes remain. Some nodes may
receive messages generated by several different originator nodes. In that case, each node
attaches itself to the cluster led by the originator from which it has received the
highest total message weight.
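The local clustering phase described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the graph is a dict of dicts (`graph[u][v]` is the edge weight), the originator sends its first messages with the full TTL, messages that come back to the originator are dropped so that its own weight stays at 1, and `cutoff` is a hypothetical parameter standing in for "insignificantly low" message weights. The message weights follow Eqs. (1) and (2).

```python
from collections import defaultdict, deque


def propagate(graph, originator, ttl, cutoff=1e-3):
    """Total message weight each node receives from a single originator."""
    totals = defaultdict(float)
    totals[originator] = 1.0  # the originator keeps weight 1 for itself
    strength = sum(graph[originator].values())
    # Initial messages: (sender, receiver, weight, remaining TTL), Eq. (1).
    queue = deque(
        (originator, nbr, w / strength, ttl)
        for nbr, w in graph[originator].items()
    )
    while queue:
        sender, node, weight, hops = queue.popleft()
        if node == originator:
            continue  # assumption: messages returning to the originator are dropped
        totals[node] += weight  # update the Total Message Weight first
        if hops <= 0 or weight < cutoff:
            continue  # message expires here and is not forwarded
        strength = sum(graph[node].values())
        for nbr, w in graph[node].items():
            if nbr != sender:  # forward to all neighbours except the sender
                queue.append((node, nbr, weight * w / strength, hops - 1))  # Eq. (2)
    return totals


def assign_nodes(graph, originators, ttl, threshold):
    """Map each node to the originator with the highest total message weight.

    Nodes whose best total does not exceed the threshold are left out of the
    result; they are the outliers of the current round.
    """
    originator_set = set(originators)
    best = {o: (o, 1.0) for o in originators}  # originators lead their own cluster
    for o in originators:
        for node, total in propagate(graph, o, ttl).items():
            if node in originator_set:
                continue
            if total > threshold and total > best.get(node, (None, 0.0))[1]:
                best[node] = (o, total)
    return {node: o for node, (o, _) in best.items()}
```

In the full procedure, `assign_nodes` would be called again with one more originator, chosen among the remaining outliers, until every node has joined a cluster.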
The second phase of the algorithm uses the communities found in the first phase to build
a new network, in which each community from the previous phase is represented as a node.
Multiple edges between any two communities are collapsed into a single edge in the new
network, whose weight is the sum of the weights of the edges between them. The edges
within each community are represented as a self-loop on the corresponding node in the new
network [23].
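The reduction step can be illustrated with a short sketch. The function name `reduce_network` and the edge-list format are assumptions made for the example, and community labels are assumed to be mutually comparable (for instance, originator IDs) so that each undirected edge gets one canonical key.

```python
from collections import defaultdict


def reduce_network(edges, community):
    """Collapse every community into a single node.

    edges: iterable of (u, v, weight) tuples, each undirected edge listed once.
    community: dict mapping every node to its community label.
    Returns a dict {(c1, c2): weight} with c1 <= c2; keys of the form (c, c)
    are the self-loops that carry the intra-community weight.
    """
    reduced = defaultdict(float)
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        reduced[key] += w  # parallel edges are summed into one edge
    return dict(reduced)
```

Running the first phase on the returned edge set, with the communities as nodes, gives the next round of the iteration.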
Once the second phase is completed, the first phase process is repeated with the
new network. The two phases are iteratively applied until there is no more change
in the communities between two iterations, and consequently optimized community
clusters are obtained.
Although the exact computational complexity of DICCA is harder to formalize,
this algorithm behaves as O(m log(n m)2)), in which n is the total number of nodes
in the network and m the number of edges. Most of the effort, however, is spent in the
first phase of the algorithm.
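\[
W_{Msg}(O_i, V_i) = \frac{W(O_i, V_i)}{\sum_{V_j \in \mathrm{Nbr}(O_i)} W(O_i, V_j)} \tag{1}
\]
\[
W_{Msg}(V_i, V_k) = W_{Msg} \times \frac{W(V_i, V_k)}{\sum_{V_j \in \mathrm{Nbr}(V_i)} W(V_i, V_j)} \tag{2}
\]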
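As a quick numerical check of Eqs. (1) and (2), consider a hypothetical originator $O$ with three neighbours $V_1$, $V_2$ and $V_3$ connected by edges of weight 1, 1 and 2, and suppose $V_3$ has the neighbours $O$, $V_4$ and $V_5$ with edge weights 2, 3 and 1. These values are invented for illustration only.

```latex
\[
W_{Msg}(O, V_3) = \frac{W(O, V_3)}{W(O, V_1) + W(O, V_2) + W(O, V_3)}
                = \frac{2}{1 + 1 + 2} = 0.5
\]
\[
W_{Msg}(V_3, V_4) = W_{Msg}(O, V_3) \times \frac{W(V_3, V_4)}{W(V_3, O) + W(V_3, V_4) + W(V_3, V_5)}
                  = 0.5 \times \frac{3}{2 + 3 + 1} = 0.25
\]
```

The weight carried by a message therefore shrinks at every hop in proportion to the share of the forwarding node's edge weight that leads to the receiver.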
4 Description oftheParallel Decentralized Iterative Community
Clustering Approach (PDICCA)
The core idea of PDICCA is to divide the data set into blocks and then iteratively
repeat three phases: clustering, re-clustering and rebuilding. The clustering phase finds
local community clusters for each block independently and in parallel. In the
re-clustering phase, the local clusters extracted from the individual blocks are
aggregated to find the initial community clustering for the entire network. The rebuilding
phase builds a new, smaller network for each block of data based on the initial community
clustering. Each cycle through all three phases is referred to as an iteration. The three
phases are iterated until the community clustering list no longer changes from one
iteration to the next.
4.1 Framework ofthePDICCA approach
The PDICCA approach uses two kinds of workers: a master worker and slave-clustering
workers. The master worker creates the blocks as it reads the data set and passes them to
the slave-clustering workers. The master worker is also responsible for receiving and
aggregating the cluster assignment results from all the slave-clustering workers,
performing the required computation, assigning the overlapped nodes to the best community
and returning the final solution. Each slave-clustering worker, on the other hand,
identifies local communities by going through its own block of data and applying the first
phase of the DICCA approach. An overview of the PDICCA approach is shown in Fig. 1.
The slave-clustering workers run in parallel and store their community clustering lists
in local memory. However, since each slave-clustering worker holds only part of the data
and has no global knowledge of the network, different slave-clustering workers may assign
the same node to different communities. Therefore, when all the blocks have been clustered
and the local communities have been identified, the master worker loads the local
community clustering lists and aggregates them. Since the PDICCA approach is proposed to
find non-overlapping clusters, the clustering C of the N nodes should form a partition,
i.e., $N = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for any $i \neq j$. The
master worker is therefore responsible for finding the set of overlapping nodes. The list
of overlapping nodes is then sent back to the slave workers, which calculate the strength
of the clustering solutions for each overlapped node on the different machines. The
results are sent back to the master worker for the re-clustering phase. In the
re-clustering phase, the master worker selects, for each overlapped node, the solution
with the highest strength of clustering and updates the community clustering list. At the
end of the re-clustering phase, the network is partitioned into a number of communities.
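The master worker's part of the re-clustering phase can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, each worker is assumed to return a dict from node to community label, and the labels are assumed to be comparable across workers (for example, originator IDs). The strengths are taken as given here; they are what the slave workers report back.

```python
from collections import defaultdict


def find_overlapping(local_assignments):
    """Nodes that different workers placed in different communities.

    local_assignments: list with one dict (node -> community label) per worker.
    Returns a dict node -> set of conflicting labels.
    """
    seen = defaultdict(set)
    for assignment in local_assignments:
        for node, label in assignment.items():
            seen[node].add(label)
    return {node: labels for node, labels in seen.items() if len(labels) > 1}


def resolve_overlaps(overlapping, strength):
    """Keep, for each overlapped node, the label with the highest strength.

    strength: dict (node, label) -> clustering strength reported by the workers.
    """
    return {
        node: max(labels, key=lambda label: strength[(node, label)])
        for node, labels in overlapping.items()
    }
```

After `resolve_overlaps`, every node carries exactly one label, which restores the partition property stated above.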
The next step is the rebuild phase, in which each of the slave-clustering workers builds
a new network using the method presented in Sect. 3: the nodes in the new network are the
communities from the re-clustering phase.
The weight of the link between two nodes in this new network is the total weight
of the links between the nodes of the two corresponding communities in the
original network. The links between the nodes of the same community become
self-loops of the corresponding node in the new network. The iteration is then
repeated until a stable set of community clusters (fulfilling the convergence condition)
is obtained.
It is to be noted that each slave-clustering worker has its own private non-
shareable memory, and there are no communications between the workers in the
clustering phase. Thus, each slave-clustering worker operation is independent of
the others and each of the slave-clustering worker’s operations can be performed
in parallel.
To select among the candidate communities of an overlapped node, the clustering strength
of an overlapped node $V_m$ is formalized in the following definition:
Definition 1 (Cluster strength)
Consider a network G = (V, E) with n = |V| nodes and m = |E| edges. During the clustering
phase, the slave-clustering workers cluster the nodes into C clusters and may assign node
$V_m$ to different communities. To find the community that best fits $V_m$, the proposed
scheme carries out the following two steps.
First, the node $V_m$ obtains two pieces of information from each of its neighbours,
namely the degree of the neighbour node and the cluster to which it belongs, and then
calculates the neighbour attraction between $V_m$ and its neighbour $V_i$, which is
defined in Eq. (3).
Fig. 1 Framework of the PDICCA approach. In one iteration, the data records are split
among N slave workers, which find the local communities; the master worker aggregates the
clusterings and finds the overlapped nodes; the slave workers find the strength of
clustering for the overlapping nodes; the master worker finds the best solution for the
overlapped nodes; the slave workers update the community clustering list and rebuild the
network; and the master worker runs the convergence test, which either ends the process
or starts the next iteration.
\[
\text{Nbr Attraction}_{V_m}(V_i) = \frac{W(V_m, V_i)}{\sum_{V_k \in \mathrm{Nbr}(V_i)} W(V_i, V_k)} \tag{3}
\]
where $W(V_m, V_i)$ represents the weight of the edge between $V_m$ and $V_i$.
Then, for each of the clusters C to which $V_m$ has been assigned, the strength value of
$V_m$ is calculated by summing the attractions of $V_m$ towards its neighbours (Nbr
Attraction) within that cluster, as follows:
\[
\text{Cluster strength}(V_m, C_1) = \sum_{V_i \in C_1 \,\&\, V_i \in \mathrm{Nbr}(V_m)} \text{Nbr Attraction}_{V_m}(V_i) \tag{4}
\]
Fig. 2 Performance of the DICCA algorithm using different TTL values (x-axis: TTL, 1 to 4; y-axis: clustering accuracy, Q and NMI, 0 to 1; series: NMI and modularity for n = 500 and n = 1000)
Fig. 3 Comparison between computing time and the message complexities over different TTL values
5 Setting theoptimal values fortheparameters
As mentioned in Sects. 3 and 4, DICCA and PDICCA use two parameters to be
defined. The first parameter TTL is defined as the number of hops that a message
is permitted to travel before being discarded. The next parameter is threshold value
that determines the difficulty of merging communities.
5.1 Time tolive
Each message has a time to live (TTL) field that is initialized with some value T > 0,
which limits the number of times the message is forwarded. Care needs to be applied in
choosing an appropriate TTL value: with a small TTL value, the message may expire before
reaching all relevant nodes in the network, whereas with a large TTL more nodes than
needed are visited, which increases both the message load on the network and the running
time of the algorithm. To address this issue, this work proposes rebuilding the network
before starting a new iteration; since each node of the rebuilt network stands for a whole
community, a small TTL then spans a larger part of the original network. Furthermore,
networks from real-world applications are often small-world networks [26, 27]. In simple
terms, the small-world concept describes the fact that even if a network has many nodes,
any pair of nodes is connected by a path with a relatively small number of intermediate
steps [28]. For example, in social networks it is stated that people (and things and
places) in the world are just six or fewer interpersonal connections away from each
other [29]. This is known as the six degrees of separation theory.
To determine the effect of the TTL value on community clustering accuracy, TTL values from
1 to 4 were used in this evaluation. Figures 2 and 3 present the accuracy values for synthetic
networks with 500 and 1000 nodes and the message complexity, respectively. The results
demonstrate that DICCA yields good community clusters when the TTL is set to 3. Increasing
the TTL beyond this value has no significant impact on the quality of community detection but
may result in a very high communication load. Conversely, selecting a smaller TTL value
reduces the broadcast overhead but compromises the accuracy.
In this work, to achieve a good trade-off between high modularity and low message
complexity (running time), TTL is set to 3.
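The trade-off between reach and message load can be illustrated with a generic TTL-limited flood. The sketch below is not the DICCA forwarding rule: it assumes an unweighted graph, assumes that a node forwards a message only the first time it receives it, and uses a hypothetical function name `ttl_flood`. It only shows how the TTL bounds the number of hops and how the message count grows with it.

```python
from collections import deque


def ttl_flood(adj, origin, ttl):
    """Flood a message from `origin`; every forward uses up one hop of the TTL.

    Returns the set of nodes reached and the number of messages sent.
    Assumption: a node forwards only on first receipt (duplicate suppression).
    """
    reached = {origin}
    messages = 0
    queue = deque([(origin, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL expired: the message is discarded here
        for nbr in adj[node]:
            messages += 1
            if nbr not in reached:
                reached.add(nbr)
                queue.append((nbr, remaining - 1))
    return reached, messages


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
reach_small, msgs_small = ttl_flood(path, 0, 1)
reach_large, msgs_large = ttl_flood(path, 0, 3)
```

On this path graph a TTL of 1 reaches only the direct neighbour with a single message, while a TTL of 3 reaches three hops away at the cost of five messages.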
5.2 Threshold value
The threshold value is a parameter set at the beginning of the process in the range between
0 and 1. If the total weight of the messages received by a node from originator $O_i$ is equal
to or greater than the threshold value, then the node is able to join the cluster led by $O_i$.
This means that the higher the threshold value, the less chance there is for a node to be
merged into a community. For example, setting the threshold value close to zero will produce
a single community cluster containing all the nodes in the network. On the other hand, setting
the threshold value close to one will place each node in its own cluster. In other words, a low
threshold value produces a small number of large clusters, whereas a high value produces a
large number of small clusters. Therefore, the threshold value has an important effect on
clustering accuracy as well as on the size of the detected clusters. Tuning the threshold value
can thus be seen as a practical means of controlling the size and the number of communities.
Selecting a suitable threshold value is crucial and requires a priori knowledge of the network
structure. However, generating such a priori knowledge is usually time-consuming, since
networks are usually big and carry large amounts of information [29]. Hence, in this work a
mathematical model that automatically calculates the threshold value is proposed. The model
makes use of the density, size and layout structure of the network to find the optimal
threshold value.
Equations (5) to (11) below define the mathematical model for setting the threshold value for
an undirected network:
$$\text{Threshold value} = \mathrm{avg\_t} + (t - 1) \times \left( (1 - C) \times \mathrm{avg\_t} \right) \tag{5}$$
$$\mathrm{avg\_t} = \frac{\log(\log(n))}{\log(n)} \sum_{i=1}^{n} \left( \frac{1}{K_i} + \frac{K_i - 1}{K_i^2} + \frac{K_i - 2}{K_i^3} \right) \tag{6}$$
$$K_i = \sum_{j=1}^{n} A_{ij} \tag{7}$$
where i is the main node, j ranges over all other nodes, A is the adjacency matrix whose entry
$A_{ij}$ is 1 if nodes i and j are connected and 0 otherwise, n is the total number of nodes in
the network, t is the iteration number, $K_i$ is the degree of node i and C is the network
clustering coefficient, which is defined as:
$$C = \frac{1}{n} \sum_{i=1}^{n} \frac{2 L_i}{K_i \left[ K_i - 1 \right]} \tag{8}$$
where $L_i$ is the number of edges between the neighbours of node i [8].
A fully connected network is a simple undirected graph of n nodes in which every pair of
distinct nodes is connected by a unique edge. From graph theory, the network clustering
coefficient of a fully connected network is 1 and the degree of each node is:
$$K_i = n - 1 \tag{9}$$
Thus, the degrees of a complete network of n nodes sum to (each edge being counted once from
each of its two endpoints):
$$\sum_{i=1}^{n} K_i = n(n - 1) \tag{10}$$
Using Eq. (5) with C = 1, the threshold value for a complete network reduces to Eq. (11).
In Eq.(11), the first part
(
log (log (n))
log
(n)
)
is always less than 1. Whereas, the second
part of equation
n
i=1
1
K
i
+Ki1
K2
i
+Ki2
K3
i
represents the maximum weight of mes-
sages received by node i, when the TTL = 3. For a fully connected network and
using Eqs.(511) to set the threshold parameter, the proposed algorithm will pro-
duce one cluster containing all the nodes in the network. This is acceptable since
there is no meaningful subsets that are clusters in the fully connected network.
It is worthwhile mentioning that, in each iteration, the threshold value for a given
network is stepwise increased by (t − 1) × ((1 − C) × avg_t) as seen in Eq.(5), so that
it becomes progressively difficult for clusters that are not so densely connected to
join with each other. Only the strongly connected ones will be able to merge. Addi-
tionally, the maximum threshold value cannot be larger than 1 [23].
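The clustering coefficient of Eq. (8) and the stepwise schedule of Eq. (5) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the graph is stored as a dict of neighbour sets with integer node identifiers, nodes of degree below 2 are assumed to contribute zero to C, and `avg_t` is passed in as a given number rather than computed from Eq. (6). The function names are hypothetical.

```python
def clustering_coefficient(adj):
    """Eq. (8): average over all nodes of 2 * L_i / (K_i * (K_i - 1))."""
    total = 0.0
    for i, nbrs in adj.items():
        k = len(nbrs)  # Eq. (7): degree K_i
        if k < 2:
            continue  # assumption: nodes with fewer than 2 neighbours contribute 0
        # L_i: number of edges among the neighbours of i (each counted once)
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def threshold_at(avg_t, c, t):
    """Eq. (5) at iteration t = 1, 2, ..., capped at the maximum value of 1."""
    return min(1.0, avg_t + (t - 1) * ((1.0 - c) * avg_t))


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c_toy = clustering_coefficient(toy)

# Threshold over the first three iterations for an assumed avg_t of 0.2.
schedule = [threshold_at(0.2, c_toy, t) for t in (1, 2, 3)]
```

For a fully connected network C = 1, so the increment in Eq. (5) vanishes and the threshold stays at `avg_t` in every iteration, which matches the reduction to Eq. (11).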
6 Experimental data sets
To analyse the efficiency of the community detection algorithm over a range of network
sizes, and because real networks with ground-truth communities are scarce, the LFR benchmark
is used to generate synthetic data sets. The LFR benchmark model was proposed by
Lancichinetti et al. [30] to generate undirected and unweighted networks that closely resemble
real-world networks with community structure. The LFR model has become a popular choice for
assessing the performance of community detection algorithms, and it was subsequently
extended to generate weighted and/or directed networks, with the possibility of overlapping
communities.
Most real-life networks have been defined and modelled as undirected networks, either
unweighted or weighted. This paper focuses on this type of network with non-overlapping
communities. The LFR model is designed to capture the main characteristics of real networks,
e.g., the size of the network and a heterogeneous degree distribution. In the LFR benchmark,
the node degrees and the community sizes both follow power-law distributions, with exponents
γ and β, respectively. Real-world graphs have been observed to have such a power-law degree
distribution [28], with typical values 2 ≤ γ ≤ 3 and 1 ≤ β ≤ 2 [30].
The LFR model provides further parameters to control the network topology, including the
number of nodes, the maximum degree and the mixing parameter μ. The mixing parameter
μ ∈ [0, 1] controls the proportion of inter-cluster edges relative to intra-cluster edges in the
network. For small values of μ, few edges leave the communities, which indicates that the
network contains clear clusters. The larger the value of μ, the more challenging it is to detect
communities in the network. The code of the LFR model is made publicly available by its
authors [31].
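To make the role of the mixing parameter concrete, the sketch below measures the fraction of edges that cross community boundaries in a given partition. In the LFR benchmark μ is specified per node; the network-level fraction computed here is only a simple global analogue, and the function name and toy graph are assumptions made for illustration.

```python
def mixing_fraction(edges, community):
    """Fraction of edges whose two endpoints lie in different communities."""
    if not edges:
        return 0.0
    external = sum(1 for u, v in edges if community[u] != community[v])
    return external / len(edges)


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
fraction = mixing_fraction(edges, community)  # one external edge out of seven
```

A value near 0 corresponds to well-separated clusters, while values approaching 1 mean that most edges leave their community.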
$$\text{Threshold value} = \frac{\log(\log(n))}{\log(n)} \sum_{i=1}^{n} \left( \frac{1}{K_i} + \frac{K_i - 1}{K_i^2} + \frac{K_i - 2}{K_i^3} \right) \tag{11}$$
7 Analysis ofresults anddiscussion
7.1 Environment setup
A set of undirected networks is generated using the LFR benchmark. The default benchmark
values are used for the exponents of the degree distribution and the community size
distribution, viz. γ = 2 and β = 1. The average degree and the maximal degree are 25 and 50,
respectively. The mixing parameter is varied from 0.1 to 0.75 and the number of nodes is
varied from 500 to 5000.
DICCA and PDICCA are implemented in Matlab, and the experiments are performed on a system
with an Intel® Core™ i7-6700K CPU at 4.00 GHz and 16 GB of RAM, running Windows. Because
the approaches initialize the originator randomly, each result is averaged over 100 runs to
reduce the effect of randomness.
7.2 Accuracy measure for graph clustering
The true community structure (ground truth) is known for the benchmark networks.
Therefore, Normalized Mutual Information (NMI) [32] is used to evaluate the performance of
DICCA and PDICCA by comparing the partitions obtained in the experiments with the ground
truth of the LFR benchmark. The NMI metric quantifies the accuracy of the proposed methods
by evaluating the level of correspondence between detected and ground-truth communities. In
addition, the modularity measure introduced by Newman and Girvan [33] is used to evaluate
the effectiveness of the algorithms in terms of modularity optimization.
Denition 2 [Normalized Mutual Information (NMI)]
Normalized Mutual Information (NMI) is a similarity measure for comparing two partitions,
based on concepts from information theory. It was introduced in the community detection
domain by Danon et al. [32], and since then it has been widely used to evaluate the accuracy
of community detection algorithms.
For an n-node network with two partitions $X = \{X_1, X_2, X_3, \ldots, X_k\}$ and
$Y = \{Y_1, Y_2, Y_3, \ldots, Y_{k'}\}$, where X and Y represent the real communities and the
found communities, respectively, the normalized mutual information NMI(X, Y) of the two
divisions X and Y of a network is defined as follows [34]:
$$\mathrm{NMI}(X, Y) = \frac{-2 \sum_{K=1}^{k} \sum_{K'=1}^{k'} P(K, K') \log\left[ \frac{P(K, K')}{P(K)\,P(K')} \right]}{\sum_{K=1}^{k} P(K) \log[P(K)] + \sum_{K'=1}^{k'} P(K') \log[P(K')]} \tag{12}$$
where $P(K, K') = \frac{|X_K \cap Y_{K'}|}{n}$, $P(K) = \frac{|X_K|}{n}$ and $P(K') = \frac{|Y_{K'}|}{n}$.
If the partition found by the algorithm is identical to the real community structure, then
NMI takes its maximum value of 1. If the partition found is totally independent of the real
partition, then NMI = 0 [34].
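The NMI of Eq. (12) can be computed directly from two label vectors. The following Python sketch is an illustration, not the evaluation code used in the paper; it assumes that each partition is given as a list with one community label per node, and it assumes that two trivial single-community partitions are scored as identical. The function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels_x, labels_y):
    """Normalized mutual information of two partitions given as label lists."""
    n = len(labels_x)
    count_x = Counter(labels_x)                 # |X_K| for every community of X
    count_y = Counter(labels_y)                 # |Y_K'| for every community of Y
    count_xy = Counter(zip(labels_x, labels_y))  # |X_K intersect Y_K'|

    numerator = 0.0
    for (kx, ky), c in count_xy.items():
        p_joint = c / n
        numerator += p_joint * log(p_joint / ((count_x[kx] / n) * (count_y[ky] / n)))
    numerator *= -2.0

    denominator = sum((c / n) * log(c / n) for c in count_x.values())
    denominator += sum((c / n) * log(c / n) for c in count_y.values())
    if denominator == 0.0:
        return 1.0  # assumption: both partitions consist of a single community
    return numerator / denominator


truth = [0, 0, 1, 1]
same_up_to_renaming = nmi(truth, ["b", "b", "a", "a"])
independent = nmi(truth, [0, 1, 0, 1])
```

Because only the co-occurrence counts matter, renaming the community labels does not change the score.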
Denition 3 [Modularity (Q)]
Modularity (Q) is a prominent measure of the quality of a community structure, and it has
become a widely accepted quality measure for community detection. Modularity states that a
good cluster should have a bigger-than-expected number of connections between the nodes
within modules and a smaller-than-expected number of connections between nodes in different
modules. The higher the value of modularity, the stronger the community structure.
The general concept of modularity optimization algorithms is to detect the best community
structure by searching over possible divisions of a network for those that have high
modularity.
Formally, modularity is defined by Eq. (13) [3], where $A_{ij}$ is an element of the adjacency
matrix, $K_i$ is the degree of node i, |m| is the total number of edges in the network,
$\delta_{c_i c_j}$ is the Kronecker delta symbol, which is equal to 1 if $c_i = c_j$ and 0
otherwise, and $c_i$ is the label of the community to which node i is assigned.
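As a hedged illustration of Eq. (13), the sketch below evaluates modularity for an undirected, unweighted graph. It assumes the graph is stored as a dict of neighbour sets and the partition as a dict from node to community label; the double loop is quadratic in the number of nodes and is meant for clarity on small examples, not for large networks. The function name is hypothetical.

```python
def modularity(adj, community):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # 2|m| equals the sum of degrees
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue  # Kronecker delta is 0 for nodes in different communities
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
two_triangles = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_split = modularity(graph, two_triangles)
q_single = modularity(graph, {node: "A" for node in graph})
```

Splitting the toy graph into its two triangles gives Q = 5/14, whereas placing every node in one community gives Q = 0, in line with the statement that higher modularity indicates stronger community structure.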
7.3 Experimental results
In this section, the results from the experiments conducted using synthetic networks
are presented, analysed and discussed in detail.
7.3.1 Horizontal scalability in relation to the number of parallel cores
To demonstrate how well the proposed approaches handle data sets when more workers are
available, the number of nodes in the network is kept constant and the number of workers is
varied from 1 to 4. It is worth mentioning that, if the number of workers is 1, the algorithm
is simply DICCA. Figure 4 shows the results for different numbers of cores when the number of
nodes is constant, n ∈ {500, 1000}.
7.3.1.1 Quality From Fig.4, the approaches show a good scalability close to the opti-
mal value, which is indicated by average modularity and NMI values. In addition, it is
clear that using more than one worker to parallelize the algorithm does not adversely
affect the accuracy of the result. Consequently, the results prove that the algorithm is
effective and able to achieve very high-quality results in a parallel manner. More espe-
cially, PDICCA is capable of exploiting multi-core architecture efficiently.
$$Q = \frac{1}{2|m|} \sum_{ij} \left[ A_{ij} - \frac{K_i K_j}{2|m|} \right] \delta_{c_i c_j} \tag{13}$$
7.3.1.2 Message complexity Considering the number of exchanged messages for each worker,
Fig. 5 shows the number of messages exchanged at each iteration by each worker processor. As
can be observed, in each iteration each worker generates almost the same number of messages.
This is explained by the fact that the data has been partitioned equally among the workers, so
each worker has to process the same amount of data. Because the master must wait at each
iteration until all workers have completed their processes, splitting the data equally over
the workers can significantly reduce the expected time spent waiting for the slowest worker to
return its data.
For a more in-depth analysis, Fig. 6 shows the average percentage of exchanged messages in
each iteration. It can be observed from the figure that data exchange is much greater in the
first iteration, when each node is in its own cluster. After just 2 to 3 initial iterations,
most nodes have their cluster labels and the algorithm has merged the nodes belonging to the
same cluster into one node. Table 1 also shows that the communication cost between master and
slaves is negligible: it is far smaller than the cost of the information exchanged locally in
the slaves, which is costly and constitutes the main part of the time consumption of the
algorithm.
7.3.2 Clustering results for increasing network size
To examine how performance is influenced by network size, the number of nodes is increased
linearly from 500 to 5000 and the number of workers is kept constant at 1 and 3. All other
parameters and factors remain the same as in the previous evaluations.
7.3.2.1 Quality The modularity values of the solutions obtained by DICCA and PDICCA are
presented in Fig. 7. It can be observed from the figure that the performance of both DICCA and
PDICCA is consistently good and close to the optimal value, with NMI above 0.90 on average.
Fig. 4 NMI, Q-PDICCA and ground-truth Q scores (y-axis) as the number of workers (x-axis) changes, for number of nodes: a 500 and b 1000
7.3.2.2 Evaluating repeatability of the performance To further investigate the ability of
DICCA and PDICCA to produce consistent results across random data partitioning and
initialization, the standard deviation of the clustering results is measured: both DICCA and
PDICCA are run 100 times, each time with a different random data partitioning and algorithm
initialization. The standard deviations of both NMI and modularity for data sets with
different network sizes are displayed in Fig. 8; they are very small, and in some cases close
to zero.
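The repeatability protocol described here (many runs with different random seeds, then the spread of the scores) can be sketched as follows. The callable `run_once` is a hypothetical stand-in for one complete clustering run that returns a score such as NMI or Q, and `noisy_score` is a placeholder used only to make the example executable; it does not reproduce any result from the paper. The use of the population standard deviation is also an assumption.

```python
import random
from statistics import mean, pstdev


def repeatability(run_once, runs=100, seed=0):
    """Run `run_once` repeatedly and return the mean and standard deviation of its scores."""
    rng = random.Random(seed)  # one seeded generator drives all random choices
    scores = [run_once(rng) for _ in range(runs)]
    return mean(scores), pstdev(scores)


def noisy_score(rng):
    """Placeholder for a clustering run: a fixed score plus small random noise."""
    return 0.95 + rng.uniform(-0.01, 0.01)


avg_score, std_score = repeatability(noisy_score, runs=100, seed=42)
```

A standard deviation close to zero means that the random partitioning and initialization have little influence on the final score.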
7.3.2.3 Evaluation ofcomplexity oftheDICCA andPDICCA approaches To investi-
gate the relationship between the number of nodes and complexity of approach, the
total number of exchanged messages as a function of the network size is presented in
Fig.9. Since the DICCA and PDICCA require a large number of exchanged messages
between nodes, which is the most time-consuming part during execution, the perfor-
mance of DICCA and PDICCA highly depended on the total number of exchanged
messages. Therefore, the number of exchanged messages increases with the network
size. For example, the total number of messages exchanged by PDICCA for n {500;
5000} is {1,344,282; 15,633,691}, respectively.
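A quick calculation on the two message totals reported for PDICCA gives a feel for how the message count grows with network size. The sketch below uses only those two figures; with just two data points, the fitted exponent is a rough indication rather than a complexity result.

```python
from math import log

# Message totals reported for PDICCA at n = 500 and n = 5000.
messages = {500: 1_344_282, 5000: 15_633_691}

per_node = {n: total / n for n, total in messages.items()}  # messages per node
growth = messages[5000] / messages[500]                     # factor for a 10x larger network
exponent = log(growth) / log(5000 / 500)                    # slope on a log-log plot
```

The number of messages per node rises from roughly 2,689 to roughly 3,127, and the total grows by a factor of about 11.6 for a tenfold increase in nodes, which corresponds to a log-log slope of about 1.07, that is, slightly faster than linear growth over this range.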
Fig. 5 Number of messages exchanged in each iteration and by each worker, with the number of workers varied from 2 to 4, for a network of 1000 nodes (panels: n = 1000 with 1, 2, 3 and 4 workers; y-axis: number of messages, in millions)
7.3.3 Evaluation ofclustering performance using mixing parameter
To investigate the ability of DICCA and PDICCA to detect community clusters when the
community structure is weakened by a large increase in the number of inter-community links
(which occurs when μ of the LFR benchmark has high values), DICCA and PDICCA are evaluated
with the mixing parameter varied between 0.1 and 0.75, μ ∈ {0.1, 0.15, …, 0.75}, while
keeping the number of nodes constant,
Fig. 6 Average percentage of messages exchanged in each iteration, with the number of cores varied from 1 to 4 workers, for network size 1000. Panel values, in legend order (1st iteration, 2nd iteration, 3rd iteration, the rest): 1 worker: 91.65%, 5.82%, 1.19%, 1.34%; 2 workers: 85%, 11%, 2%, 2%; 3 workers: 72%, 19%, 5%, 4%; 4 workers: 52%, 28%, 10%, 10%
Table 1 Comparison of messages exchanged locally in hosts with messages exchanged between master and hosts

                 n = 500                                       n = 1000
No. of workers   % exchanged locally   % exchanged between     % exchanged locally   % exchanged between
                 among slaves          master and slaves       among slaves          master and slaves
1                100                   0                       100                   0
2                99.9767               0.0233                  99.9760               0.0240
3                99.9636               0.0364                  99.9631               0.0369
4                99.9599               0.0401                  99.9629               0.0371
Fig. 7 NMI, Q and ground-truth Q scores (y-axis, clustering accuracy) as the number of nodes (x-axis) changes (series: NMI-PDICCA, Q-PDICCA, NMI-DICCA, Q-DICCA, ground-truth Q)
Fig. 8 Standard deviation of final modularity/NMI (y-axis) with network size (x-axis, number of nodes) (series: Std(NMI) and Std(Q) for PDICCA and DICCA)
Fig. 9 Total number of exchanged messages (y-axis, in millions) as the number of nodes (x-axis) changes (series: PDICCA and DICCA)
n {1000}. Figure 10 shows the results obtained for both modularity and NMI accu-
racy as a function of the mixing parameter for network sizes 1000 nodes. As can be
clearly seen, the natural partitions of the network are always found (in principle) for
the mixing parameter value of up to 0.5, after which the method starts to fail where
the quality of DICCA and PDICCA was rather poor. The reason for this behaviour is
that a small value of mixing parameter indicates well-defined community structures
in the generated network. The reason is that most of the edges fall inside the com-
munities. This is in line with the definition of a community that each node should
have more connections within the community than with the rest of the graph [27].
On the other hand, networks with higher values of mixing parameter are more chal-
lenging to cluster accurately, as there are no clear divisions between communities in
these networks as most of the edges fall outside the communities. Furthermore, the
performance comparison of the proposed algorithms, the ground-truth network and
the fast greedy modularity optimization proposed by Clauset etal. [35] in terms of
modularity is shown in Fig.10. This comparison shows low values of Q for all the
algorithms considered here and the ground-truth network, when the mixing param-
eter value ≥ 0.5. This low value of Q indicates that the communities in the network
are indistinguishable due to the network structure rather than poor performance.
8 Conclusion
In this paper, a novel Decentralized Iterative Community Clustering Approach (DICCA) and its
parallel version (PDICCA) are presented to extract an efficient community structure from
large networks. An important property of the proposed approaches is their ability to identify
community clusters from an entire network without global knowledge of the network topology.
This ability means that
Fig. 10 Performance of the proposed algorithm using mixing parameter μ, n = 1000. a DICCA and b PDICCA (x-axis: mixing parameter from 0.1 to 0.75; y-axis: clustering accuracy, Q and NMI; series: NMI and Q of the proposed algorithm, ground-truth Q, NMI of fast greedy)
the entire network does not need to be loaded into the memory of one machine, and the
approaches can easily be adapted to run in parallel on as many processors as are available to
find community clusters in big networks. This cannot be done with the majority of existing
community detection algorithms, which implicitly assume that the entire structure of the
network is known and available.
DICCA and PDICCA are based on the random walk procedure and the reachability of nodes in the
network. In addition, the proposed approaches address the computational demands of dealing
with big data sets. They utilize the hardware capabilities of modern multi-core systems for
faster execution by processing multiple blocks in parallel. Furthermore, when scalability
issues occur as the data size grows beyond the processing power of a single machine, the
proposed distributed approach based on the MapReduce computing platform helps to address
them. Finally, the effectiveness and complexity of the approaches are tested and analysed
using synthetic networks with ground-truth communities. The experimental results of the
approaches are very promising.
Real-world networks often do not contain perfect communities, and in reality nodes may
belong to multiple communities simultaneously. Identifying such overlapping (also known as
fuzzy) communities is crucial for understanding the structure as well as the function of
real-world networks. A further direction is to extend the proposed approaches to detect such
fuzzy communities. In addition, only undirected networks have been considered in this work, so
extending the approaches to directed networks may be an interesting direction for further
research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use,
distribution, and reproduction in any medium, provided you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Chen J, Zaiane OR, Goebel R (2009) Detecting communities in large networks by iterative local
expansion. In: International Conference on Computational Aspects of Social Networks, 2009.
CASON’09, pp 105–112. IEEE
2. Mahata D, Patra C (2016) Detecting and analyzing invariant groups in complex networks. In:
Behera H, Mohapatra D (eds) Computational intelligence in data mining, vol 1. Springer, New
Delhi, pp 85–93
3. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3):75–174
4. Orman GK, Labatut V, Cherifi H (2011) On accuracy of community structure discovery algorithms.
ArXiv preprint arXiv:1112.4134
5. Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64
6. Khatoon M, Banu WA (2015) A survey on community detection methods in social networks. Int J
Educ Manag Eng 5(1):8
7. Warnke SD (2016) Partial information community detection in a multilayer network. Naval
Postgraduate School, Monterey
8. Costa LF, Rodrigues FA, Travieso G, Villas Boas PR (2007) Characterization of complex networks:
a survey of measurements. Adv Phys 56(1):167–242
9. Qi G-J, Aggarwal CC, Huang T (2012) Community detection with edge content in social media
networks. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp 534–545.
IEEE
10. Rahimian F, Payberah AH, Girdzijauskas S, Jelasity M, Haridi S (2013) Ja-be-ja: a distributed
algorithm for balanced graph partitioning. In: 2013 IEEE 7th International Conference on
Self-Adaptive and Self-Organizing Systems (SASO), pp 51–60. IEEE
11. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun
ACM 51(1):107–113
12. Wang M, Wang C, Yu JX, Zhang J (2015) Community detection in social networks: an in-depth
benchmarking study with a procedure-oriented framework. Proc VLDB Endow 8(10):998–1009
13. Aggarwal CC, Wang H (2010) A survey of clustering algorithms for graph data. In: Managing and
mining graph data, pp 275–301. Springer, Boston, MA
14. Newman M (2010) Networks: an introduction. Oxford University Press, Oxford
15. Newman ME (2004) Fast algorithm for detecting community structure in networks. Phys Rev E
69(6):066133
16. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in
large networks. J Stat Mech: Theory Exp 2008(10):P10008
17. Pons P, Latapy M (2006) Computing communities in large networks using random walks. J Graph
Algorithms Appl 10(2):191–218
18. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community
structure. Proc Natl Acad Sci 105(4):1118–1123
19. Ganapathy S, Kulothungan K, Yogesh P, Kannan A (2012) A novel weighted fuzzy C-means
clustering based on immune genetic algorithm for intrusion detection. Proc Eng 38:1750–1757
20. Munuswamy S, Saravanakumar JM, Sannasi G, Harichandran KN, Arputharaj K (2018) Virtual
force-based intelligent clustering for energy-efficient routing in mobile wireless sensor networks.
Turk J Electr Eng Comput Sci 26(3):1444–1452
21. Priya P, Ghosh D, Kannan A, Ganapathy S (2014) Behaviour analysis model for social networks
using genetic weighted fuzzy c-means clustering and neuro-fuzzy classifier. Int J Soft Comput
9(3):138–142
22. Thangaramya K, Logambigai R, SaiRamesh L, Kulothungan K, Ganapathy AKS (2017) An
energy efficient clustering approach using spectral graph theory in wireless sensor networks. In:
2017 Second International Conference on Recent Trends and Challenges in Computational Models
(ICRTCCM), pp 126–129. IEEE
23. Bhih A, Johnson P, Nguyen T, Randles M (2017) Decentralized Iterative Community Clustering
Approach (DICCA). In: 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and
Mobile Radio Communications (PIMRC), pp 1–7. IEEE
24. Ramaswamy L, Gedik B, Liu L (2005) A distributed approach to node clustering in decentralized
peer-to-peer networks. IEEE Trans Parallel Distrib Syst 16(9):814–829
25. Gehweiler J, Meyerhenke H (2010) A distributed diffusive heuristic for clustering a virtual P2P
supercomputer. In: 2010 IEEE International Symposium on Parallel and Distributed Processing,
Workshops and Ph.D. Forum (IPDPSW), pp 1–8. IEEE
26. Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature
393(6684):440–442
27. Silva TC, Zhao L (2016) Machine learning in complex networks. Springer, Berlin
28. Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256
29. Griffiths MD, Kuss DJ, Demetrovics Z (2014) Social networking addiction: an overview of
preliminary findings. In: Rosenberg KP, Feder LC (eds) Behavioral addictions. Elsevier, New York,
pp 119–141
30. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection
algorithms. Phys Rev E 78(4):046110
31. Fortunato S. Benchmark graphs for testing community detection algorithms.
www.santo.fortunato.googlepages.com/benchmark.tgz
32. Danon L, Diaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification.
J Stat Mech: Theory Exp 2005(09):P09008
33. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev
E 69(2):026113
34. Labatut V (2015) Generalised measures for the evaluation of community detection methods. Int J
Soc Netw Min 2(1):44–63
35. Clauset A, Newman ME, Moore C (2004) Finding community structure in very large networks.
Phys Rev E 70(6):066111
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.