Uncovering the overlapping community structure of complex networks in nature and society

Many complex systems in nature and society can be described in terms of networks capturing the intricate web of connections among the units they are made of. A key question is how to interpret the global organization of such networks as the coexistence of their structural subunits (communities) associated with more highly interconnected parts. Identifying these a priori unknown building blocks (such as functionally related proteins, industrial sectors and groups of people) is crucial to the understanding of the structural and functional properties of networks. The existing deterministic methods used for large networks find separated communities, whereas most of the actual networks are made of highly overlapping cohesive groups of nodes. Here we introduce an approach to analysing the main statistical features of the interwoven sets of overlapping communities that makes a step towards uncovering the modular structure of complex systems. After defining a set of new characteristic quantities for the statistics of communities, we apply an efficient technique for exploring overlapping communities on a large scale. We find that overlaps are significant, and the distributions we introduce reveal universal features of networks. Our studies of collaboration, word-association and protein interaction graphs show that the web of communities has non-trivial correlations and specific scaling properties.
arXiv:physics/0506133v1 [physics.soc-ph] 15 Jun 2005
Uncovering the overlapping community structure of complex networks in nature and society
Gergely Palla†‡, Imre Derényi, Illés Farkas, and Tamás Vicsek†‡
Biological Physics Research Group of HAS, Pázmány P. stny. 1A, H-1117 Budapest, Hungary,
Dept. of Biological Physics, Eötvös University, Pázmány P. stny. 1A, H-1117 Budapest, Hungary.
Many complex systems in nature and society can be described in terms of networks capturing
the intricate web of connections among the units they are made of [1, 2, 3, 4]. A question of interest
is how to interpret the global organisation of such networks as the coexistence of their structural
sub-units (communities) associated with more highly interconnected parts. Identifying these a pri-
ori unknown building blocks (functionally related proteins [5, 6], industrial sectors [7], groups of
people [8, 9], etc.) is crucial to the understanding of the structural and functional properties of
networks. The existing deterministic methods used for large networks find separated communi-
ties, while most of the actual networks are made of highly overlapping cohesive groups of nodes.
Here we introduce an approach to analyse the main statistical features of the interwoven sets of
overlapping communities making a step towards the uncovering of the modular structure of com-
plex systems. After defining a set of new characteristic quantities for the statistics of communities,
we apply an efficient technique to explore overlapping communities on a large scale. We find that
overlaps are significant, and the distributions we introduce reveal universal features of networks.
Our studies of collaboration, word association, and protein interaction graphs demonstrate that
the web of communities has non-trivial correlations and specific scaling properties.
Most real networks typically contain parts in which the nodes (units) are more highly connected to
each other than to the rest of the network. The sets of such nodes are usually called clusters, communities,
cohesive groups, or modules [8, 10, 11, 12, 13], having no widely accepted, unique definition. In spite
of this ambiguity the presence of communities in networks is a signature of the hierarchical nature of
complex systems [5, 14]. The existing methods for finding communities in large networks are useful if
the community structure is such that it can be interpreted in terms of separated sets of communities (see
Fig. 1b and Refs. [10, 15, 16, 17, 18]). However, most real networks are characterised by well defined
statistics of overlapping and nested communities. Such a statement can be demonstrated by the numerous
communities each of us belongs to, including those related to our scientific activities or personal life
(school, hobby, family) and so on, as illustrated in Fig. 1a. Furthermore, members of our communities
have their own communities, resulting in an extremely complicated web of the communities themselves.
This has long been appreciated by sociologists [19], but has never been studied systematically for large
networks. Another, biological example is that a large fraction of proteins belong to several protein
complexes simultaneously [20].
In general, each node i of a network can be characterised by a membership number $m_i$, which is
the number of communities the node belongs to. In turn, any two communities α and β can share
$s^{ov}_{\alpha,\beta}$ nodes, which we define as the overlap size between these communities. Naturally, the communities
also constitute a network with the overlaps being their links. The number of such links of community
[Figure 1 image, panels a) to c). Text labels in the image: Scientific Community, Schoolmates, Family, Hobby, Friends, Department of Biological Physics, Mathematicians, Biologists, Scientists, Physicists, "zoom", "all people" (includes colleagues, friends, schoolmates, family members, etc.).]
Figure 1: Illustration of the concept of overlapping communities. a) The black dot in the middle represents any one of the authors of this Letter, with several of his communities around. Zooming into the scientific community demonstrates the nested and overlapping structure of the communities, while depicting the cascades of communities starting from some members exemplifies the interwoven structure of the network of communities. b) Divisive and agglomerative methods grossly fail to identify the communities when overlaps are significant. c) An example of overlapping k-clique-communities at k = 4. The yellow community overlaps with the blue one in a single node, whereas it shares two nodes and a link with the green one. These overlapping regions are emphasised in red. Notice that any k-clique (complete subgraph of size k) can be reached only from the k-cliques of the same community through a series of adjacent k-cliques. Two k-cliques are adjacent if they share k − 1 nodes.
α can be called its community degree, $d^{com}_\alpha$. Finally, the size $s^{com}_\alpha$ of any community α can most
naturally be defined as the number of its nodes. To characterise the community structure of a large
network we introduce the distributions of these four basic quantities. In particular, we will focus on their
cumulative distribution functions denoted by $P(s^{com})$, $P(d^{com})$, $P(s^{ov})$, and $P(m)$, respectively. For
the overlap size, e.g., $P(s^{ov})$ means the proportion of those overlaps that are larger than $s^{ov}$. Further
relevant statistical features will be introduced later.
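As a minimal sketch of the four quantities just defined, the following Python computes membership numbers, overlap sizes, community degrees and community sizes from a list of communities given as node sets. The function names and the toy communities are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter
from itertools import combinations


def community_statistics(communities):
    """Return (membership, overlaps, degrees, sizes) for a list of node sets."""
    # Membership number m_i: how many communities contain node i.
    membership = Counter(node for c in communities for node in c)
    # Community size: number of nodes in each community.
    sizes = [len(c) for c in communities]
    overlaps = {}                       # (a, b) -> overlap size
    degrees = [0] * len(communities)    # community degree
    for a, b in combinations(range(len(communities)), 2):
        shared = len(communities[a] & communities[b])
        if shared > 0:
            overlaps[(a, b)] = shared
            degrees[a] += 1
            degrees[b] += 1
    return membership, overlaps, degrees, sizes


def cumulative(values, x):
    """Proportion of the values that are larger than x."""
    values = list(values)
    return sum(v > x for v in values) / len(values)


# Hypothetical toy input: four communities, three of which overlap.
communities = [{1, 2, 3, 4}, {4, 5, 6}, {3, 4, 7, 8}, {9, 10, 11}]
membership, overlaps, degrees, sizes = community_statistics(communities)
print(membership[4], overlaps, degrees, sizes)
print(cumulative(overlaps.values(), 1))
```

The pairwise loop is quadratic in the number of communities, which is fine for a sketch but would need an index from nodes to communities for large networks.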
The basic observation on which our community definition relies is that a typical community consists
of several complete (fully connected) subgraphs that tend to share many of their nodes. Thus, we define a
community, or more precisely, a k-clique-community as a union of all k-cliques (complete subgraphs of
size k) that can be reached from each other through a series of adjacent k-cliques (where adjacency means
sharing k − 1 nodes) [21, 23, 24]. This definition is aimed at representing the fact that it is an essential
feature of a community that its members can be reached through well connected subsets of nodes. There
are other parts of the whole network that are not reachable from a particular k-clique, but they potentially
contain further k-clique-communities. In turn, a single node can belong to several communities. All
these can be explored systematically and can result in a large number of overlapping communities (see
Fig. 1c for illustration). Note that in most cases relaxing this definition (e.g., by allowing incomplete
k-cliques) is practically equivalent to lowering the value of k. For finding meaningful communities, the
way they are identified is expected to satisfy several basic requirements: it cannot be too restrictive,
should be based on the density of links, is required to be local, should not yield any cut-node or cut-
link (whose removal would disjoin the community) and, of course, should allow overlaps. We employ
the community definition specified above, because none of the others in the literature satisfy all these
requirements simultaneously [21, 22].
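The definition of a k-clique-community can be illustrated directly, if inefficiently, by enumerating all k-cliques and merging those that share k − 1 nodes. This is a brute-force sketch for small graphs only; the function name and the example graph are assumptions made for illustration.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Union of k-cliques reachable through adjacent k-cliques (brute force)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every k-subset of nodes whose members are all pairwise linked is a k-clique.
    cliques = [frozenset(c) for c in combinations(list(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: adjacent cliques (k - 1 shared nodes) are merged.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return list(groups.values())


# Two triangles sharing a link, plus a third triangle attached at node 4 only.
edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (4, 6)]
found = sorted(sorted(c) for c in k_clique_communities(edges, 3))
print(found)
```

In this toy graph node 4 ends up in both communities, which is exactly the kind of overlap the definition is meant to allow.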
Although the numerical determination of the full set of k-clique-communities is a polynomial prob-
lem, we use an algorithm which is exponential, because it is significantly more efficient for the graphs
corresponding to actual data. This method is based on first locating all cliques (maximal complete
subgraphs) of the network and then identifying the communities by carrying out a standard component
analysis of the clique-clique overlap matrix [21]. For more details about the method and its speed see the
Supplementary Information.
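The two-step procedure described above (locate maximal cliques, then analyse the components of the clique-clique overlap matrix) might be sketched as follows. The clique search uses the Bron-Kerbosch algorithm with pivoting, which is a choice made here and not necessarily the authors' implementation. The linking rule, that two maximal cliques of size at least k belong together when they share at least k − 1 nodes, is our reading of the method and should be checked against the Supplementary Information.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; adj maps each node to its neighbour set."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        # Pivot on the node covering most candidates to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def communities_from_cliques(edges, k):
    """k-clique-communities via components of the clique-clique overlap matrix."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    n = len(cliques)
    # Overlap matrix: entry (i, j) is the number of nodes cliques i and j share.
    overlap = [[len(a & b) for b in cliques] for a in cliques]
    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:  # depth-first component search on the thresholded matrix
            i = stack.pop()
            members |= cliques[i]
            for j in range(n):
                if j not in seen and overlap[i][j] >= k - 1:
                    seen.add(j)
                    stack.append(j)
        communities.append(members)
    return communities


edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (4, 6)]
result = sorted(sorted(c) for c in communities_from_cliques(edges, 3))
print(result)
```

Because the overlap matrix is built once, communities for several values of k can be read off the same matrix by changing only the threshold.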
We use our method for binary networks (i.e., with undirected and unweighted links). An arbitrary
network can always be transformed into a binary one by ignoring any directionality in the links and
keeping only those that are stronger than a threshold weight w. Changing the threshold is like changing
the resolution (as in a microscope) with which the community structure is investigated: by increasing
w the communities start to shrink and fall apart. A very similar effect can be observed by changing the
value of k as well: increasing k makes the communities smaller and more disintegrated, but at the same
time, also more cohesive.
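The transformation into a binary network can be sketched in a few lines. The input format (triples of source, target, weight) and the function name are assumptions for illustration.

```python
def binarise(weighted_links, threshold):
    """Drop direction and keep only links strictly stronger than the threshold."""
    return {frozenset((u, v))
            for u, v, weight in weighted_links
            if u != v and weight > threshold}


# Hypothetical weighted, directed links.
links = [("a", "b", 0.9), ("b", "a", 0.2), ("b", "c", 0.4), ("c", "d", 0.8)]
coarse = binarise(links, 0.5)   # higher threshold: fewer links survive
fine = binarise(links, 0.3)     # lower threshold: the link b-c reappears
print(len(coarse), len(fine))
```

Raising the threshold can only remove links, so communities found at a higher threshold can only shrink or fall apart, as the text describes.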
When we are interested in the community structure around a particular node, it is advisable to scan
through some ranges of k and w, and monitor how its communities change. As an illustration, in Fig. 2
we depict the communities of three selected nodes of three large networks: (i) the social network
of scientific collaborators [25], (ii) the network of word associations [26] related to cognitive sciences,
and (iii) the molecular-biological network of protein-protein interactions [27]. These pictures can serve
as tests or validations of the efficiency of our algorithm. In particular, the communities of the author G.
Parisi (who is well known to have been making significant contributions in different fields of physics)
shown in Fig. 2a are associated with his fields of interest, as can be deduced from the titles of the papers
involved. The 4-clique-communities of the word “bright” (Fig. 2b) correspond to the various meanings
of this word. An important biological application is finding the communities of proteins, based on their
interactions. Indeed, most proteins in the communities shown in Figs. 2c and 3 can be associated with
either protein complexes or certain functions, as can be looked up by using the GO-TermFinder package
[28] and the online tools of the Saccharomyces Genome Database (SGD) [29]. For some proteins no
function is available yet. Thus, the fact that they show up in our approach as members of communities
can be interpreted as a prediction for their functions. One such example can be seen in the enlarged
portion of Fig. 3. For the protein Ycr072c, which is required for the viability of the cell and appears in
the dark green community on the right, SGD provides no biological process (function). By far the most
significant GO term for the biological process of this community is “ribosome biogenesis/assembly”.
Thus, we can infer that Ycr072c is likely to be involved in this process. Also, new cellular processes can
be predicted if yet unknown communities are found with our method.
These examples (and further ones included in the Supplementary Information) clearly demonstrate
the advantages of our approach over the existing divisive and agglomerative methods recently used for
large real networks. Divisive methods cut the network into smaller and smaller pieces, each node is
Figure 2: The community structure around a particular node in three different networks. The communities are colour coded, the overlapping nodes and links between them are emphasised in red, and the volume of the balls and the width of the links are proportional to the total number of communities they belong to. For each network the value of k has been set to 4. a) The communities of G. Parisi in the co-authorship network of the Los Alamos cond-mat archive (for threshold weight w = 0.75) can be associated with his fields of interest. b) The communities of the word “bright” in the South Florida Free Association norms list (for w = 0.025) represent the different meanings of this word. c) The communities of the protein ZDS1 in the DIP core list of the protein-protein interactions of S. cerevisiae can be associated with either protein complexes or certain functions.
forced to remain in only one community and be separated from its other communities, most of which
then necessarily fall apart and disappear. This happens, e.g., with the word “bright” when we apply the
method described in Ref. [16]: it tends to stay together mostly with the words of the community related
to “light”, while most of its other communities (e.g., those related to “colors”, see Fig. 2b) completely
disintegrate (“green” gets to the vegetables, “orange” to the fruits, etc.). Agglomerative methods do the
same, but in the reverse direction. For example, when we applied the agglomerative method of Ref. [18],
at some point “bright”, as a single word, joined a “community” of 890 other words. In addition, such
methods inevitably lead to a tree-like hierarchical rendering of the communities, while our approach
allows the construction of an unconstrained network of communities.
The networks chosen above have been constructed in the following ways. In the co-authorship network
of the Los Alamos e-print archives [25] each article contributes the value 1/(n − 1) to the weight
of the link between every pair of its n authors. In the South Florida Free Association norms list [26]
the weight of a directed link from one word to another indicates the frequency with which the people in the
survey associated the end point of the link with its start point. For our purposes these directed links have
been replaced by undirected ones with a weight equal to the sum of the weights of the corresponding
two oppositely directed links. In the DIP core list of the protein-protein interactions of S. cerevisiae [27]
each interaction represents an unweighted link between the interacting proteins. These networks are very
large, consisting of 30739, 10617, and 2609 nodes and 136065, 63788, and 6355 links, respectively.
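The two weighting rules above can be sketched as follows: each article adds 1/(n − 1) to every pair of its n authors, and opposite directed links are summed into one undirected weight. The function names and toy inputs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_weights(papers):
    """Each paper with n authors adds 1/(n - 1) to every pair of its authors."""
    weights = defaultdict(float)
    for authors in papers:
        authors = set(authors)
        n = len(authors)
        if n < 2:
            continue  # single-author papers create no links
        for pair in combinations(authors, 2):
            weights[frozenset(pair)] += 1.0 / (n - 1)
    return dict(weights)


def symmetrise(directed_weights):
    """Replace oppositely directed links by one undirected link with summed weight."""
    weights = defaultdict(float)
    for (source, target), weight in directed_weights.items():
        weights[frozenset((source, target))] += weight
    return dict(weights)


# Hypothetical toy data.
papers = [["A", "B"], ["A", "B", "C"]]
collab = coauthorship_weights(papers)
assoc = symmetrise({("bright", "light"): 0.3, ("light", "bright"): 0.1,
                    ("bright", "sun"): 0.05})
print(collab[frozenset(("A", "B"))], assoc[frozenset(("bright", "light"))])
```

With this rule, the weights contributed by one article to the links of any one of its authors sum to 1, regardless of how many co-authors the article has.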
Although different values of k and w might be optimal for the local community structure around
different nodes, we should set some global criterion to fix their values if we want to analyse the statistical
properties of the community structure of the entire network. The criterion we use is based on finding
a community structure as highly structured as possible. In the related percolation phenomena [24] a
giant component appears when the number of links is increased above some critical point. Therefore,
to approach this critical point from below, for each selected value of k (typically between 3 and 6) we
lower the threshold w until the largest community becomes twice as big as the second largest one. In
this way we ensure that we find as many communities as possible, without the negative effect of having a
giant community that would smear out the details of the community structure by merging many smaller
communities. We denote by f the fraction of links stronger than w, and use only those values of k for
which f is not too small (not smaller than 0.5). This has led us to k = 6 and 5 with f = 0.93 and
0.75, respectively, for the collaboration network, and k = 4 with f = 0.67 for the word association
network. For the former network both sets of parameters result in very similar communities (see the
Supplementary Information). Since for unweighted networks no threshold weight can be set, for these
we simply select the smallest value of k for which no giant community appears. In the case of the protein
interaction network this gives k = 4, resulting in 82 communities. Due to this relatively low number, we
can depict the entire network of protein communities as presented in Fig. 3.
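The threshold criterion might be sketched as a scan over the distinct link weights, from strongest to weakest, stopping as soon as the largest community is at least twice the size of the second largest. To keep the example short, communities are found here with a stand-in: for k = 2 the k-clique-communities coincide with the connected components. The function names, the "at least twice" reading of the criterion, and the toy weights are all assumptions.

```python
def component_sizes(links):
    """Stand-in community finder (k = 2): sizes of connected components, descending."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in links:
        parent[find(u)] = find(v)
    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def choose_threshold(weighted_links, sizes_of=component_sizes):
    """Lower the threshold until the largest community is twice the second largest.

    weighted_links maps (u, v) to a weight. Returns (threshold, fraction of
    links kept), or None if the criterion is never met.
    """
    total = len(weighted_links)
    for w in sorted(set(weighted_links.values()), reverse=True):
        kept = [pair for pair, weight in weighted_links.items() if weight > w]
        sizes = sizes_of(kept)
        if len(sizes) >= 2 and sizes[0] >= 2 * sizes[1]:
            return w, len(kept) / total
    return None


# Hypothetical weighted links.
weights = {("a", "b"): 0.9, ("c", "d"): 0.8, ("e", "f"): 0.7,
           ("b", "c"): 0.3, ("d", "x"): 0.2}
print(choose_threshold(weights))
```

A different community finder (for example one for k = 4) can be passed through the `sizes_of` argument, provided it returns community sizes in descending order.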
The four distributions characterising the global community structure of these networks are displayed
in Fig. 4. Although the scaling of the size of non-overlapping communities has already been demonstrated
for social networks [17, 18], it is striking to observe how this aspect of large real networks is
preserved even when a more complete picture (allowing overlaps) is investigated. In Fig. 4a the power
law dependence $P(s^{com}) \sim (s^{com})^{-\tau}$ with an exponent ranging between τ = 1 and 1.6 is well
pronounced and is valid nearly over the entire range of community sizes.
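One rough way to read an exponent τ off such data is a straight-line fit to the cumulative distribution on log-log axes. The sketch below uses ordinary least squares, which is a simple illustration and not necessarily how the exponents quoted in the text were obtained. The function names are hypothetical.

```python
import math


def cumulative_distribution(values):
    """Points (x, P(x)) where P(x) is the proportion of values larger than x."""
    values = list(values)
    n = len(values)
    points = []
    for x in sorted(set(values)):
        p = sum(v > x for v in values) / n
        if p > 0:  # the largest value has P = 0 and cannot be log-transformed
            points.append((x, p))
    return points


def fit_power_law(points):
    """Least-squares slope of log P against log x; returns tau for P ~ x^(-tau)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(p) for _, p in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic check: points lying exactly on a power law with tau = 1.3.
exact = [(x, x ** -1.3) for x in (1, 2, 4, 8, 16)]
tau = fit_power_law(exact)
print(round(tau, 6))
```

Least-squares fits on log-log axes are known to be sensitive to the sparse tail of the data, so the result should be treated as a rough estimate.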
It is well known [2, 3, 4] that the nodes of large real networks have a power law degree distribution.
Will the same kind of distribution hold when we move to the next level of organisation and
consider the degrees of the communities? Remarkably, we find that it is not the case. The community
degrees (Fig. 4b) have a unique distribution, consisting of two distinct parts: an exponential decay
$P(d^{com}) \sim \exp(-d^{com}/d^{com}_0)$ with a characteristic community degree $d^{com}_0$ (which is of the order of
$\langle d^{com} \rangle$ shown in Table 1), followed by a power law tail $\sim (d^{com})^{-\tau}$. This new kind of behaviour is
consistent with the community size distribution assuming that on average each node of a community has
a contribution δ to the community degree. The tail of the community degree distribution is, therefore,
simply proportional to that of the community size distribution. At the first part of $P(d^{com})$, on the other
hand, a characteristic scale $d^{com}_0 \sim k\delta$ appears, because the majority of the communities have a size of
the order of k (see Fig. 4a) and their distribution around $d^{com}_0$ dominates this part of the curve. Thus,
the degree to which $P(d^{com})$ deviates from a simple scaling depends on k or, in other words, on the
[Figure 3 image. Community labels in the image: Set3c complex; Chromatin silencing; Pheromone response (cellular fusion); Cell polarity, budding; Protein phosphatase type 2A complex (part); CK2 complex and transcription regulation; 43S complex and protein metabolism; Ribosome biogenesis/assembly; DNA packaging, chromatin assembly; Cytokinesis (septin ring). Node labels in the magnified part: Tpd3, Sif2, Hst1, Snt1, Hos2, Cph1, Zds1, Set3, Hos4, Mdn1, Hcr1, Sui1, Ckb2, Cdc68, Abf1, Cka1, Arp4, Hht1, Sir4, Sir3, Htb1, Sir1, Zds2, Bob1, Ste20, Cdc24, Bem1, Far1, Cdc42, Gic2, Gic1, Cla4, Cdc12, Cdc11, Rga1, Kcc4, Cdc10, Cdc3, Shs1, Gin4, Bni5, Sda1, Nop2, Erb1, Has1, Dbp10, Rpg1, Tif35, Sua7, Tif6, Hta1, Nop12, Ycr072c, Arx1, Cic1, Rrp12, Nop4, Cdc55, Pph22, Pph21, Rts3, Rrp14, Nsa2, Ckb1, Cka2, Prt1, Tif34, Tif5, Nog2, Hhf1, Brx1, Mak21, Mak5, Nug1, Bud20, Mak11, Rpf2, Rlp7, Nop7, Puf6, Nop15, Ytm1, Nop6.]
Figure 3: Network of the 82 communities in the DIP core list of the protein-protein interactions of S. cerevisiae for k = 4. The area of the circles and the width of the links are proportional to the size of the corresponding communities ($s^{com}_\alpha$) and to the size of the overlaps ($s^{ov}_{\alpha,\beta}$), respectively. The coloured communities are cut out and magnified to reveal their internal structure. In this magnified picture the nodes and links of the original network have the same colour as their communities, those that are shared by more than one community are emphasised in red, and the grey links are not part of these communities. The area of the circles and the width of the links are proportional to the total number of communities they belong to.
prescribed minimum cohesiveness of the communities.
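The argument of the preceding paragraph can be written out as a short worked step. It assumes, as the text does, that each node of a community contributes on average δ to the community degree, so that degree and size are roughly proportional; the approximation signs mark where that assumption enters.

```latex
d^{com} \approx \delta\, s^{com}
\quad\Longrightarrow\quad
P(d^{com}) \approx P\!\left(s^{com} = d^{com}/\delta\right)
\sim \left(\frac{d^{com}}{\delta}\right)^{-\tau}
\propto \left(d^{com}\right)^{-\tau},
\qquad
d^{com}_0 \sim k\,\delta .
```

The last relation follows from the same proportionality applied to the most common community size, which is of the order of k.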
The extent to which different communities overlap is also a relevant property of a network. Although
the range of overlap sizes is limited, the behaviour of the cumulative overlap size distribution $P(s^{ov})$,
shown in Fig. 4c, is close to a power law for each network, with a rather large exponent. We can conclude
that there is no characteristic overlap size in the networks. Finally, in Fig. 4d we display the cumulative
distribution of the membership number, P(m). These plots demonstrate that a node may belong to a
number of communities. In the collaboration and the word association network there seems to be no
characteristic value for the membership number: the data are close to a power law dependence, with
a large exponent. In the protein interaction network, however, the largest membership number is only
[Figure 4 image, panels a) to d): log-log plots of the cumulative distributions P against $s^{com} - (k - 1)$, $d^{com}$, $s^{ov}$ and m. Legend: word assoc., co-authorship, prot. interact.]
Figure 4: Statistics of the k-clique-communities for three large networks. These are the co-authorship network of the Los Alamos cond-mat archive (triangles, k = 6, f = 0.93), the word association network of the South Florida Free Association norms (squares, k = 4, f = 0.67), and the protein interaction network of the yeast S. cerevisiae from the DIP database (circles, k = 4). (a) The cumulative distribution function of the community size follows a power law with exponents between 1 (upper line) and 1.6 (lower line). (b) The cumulative distribution of the community degree starts exponentially and then crosses over to a power law (with the same exponent as for the community size distribution). Plot (c) is the cumulative distribution of the overlap size and (d) is that of the membership number.
4, which is consistent with the similarly short distribution of its community degree. To show that the
communities we find are not an artifact of our method, we have also determined the
above distributions for “randomised” graphs with parameters (size, degree sequence, k and f) being the
same as those of our three examples, but with links stochastically redistributed among the nodes. We
have found that indeed the distributions are extremely truncated, signifying a complete lack of the rich
community structure determined for the original data.
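A common way to redistribute links while keeping the degree sequence is to swap the end points of randomly chosen pairs of links. The sketch below uses that technique; the text does not state which randomisation procedure was actually used, so this is an assumption, as are the function names and the number of swap attempts.

```python
import random
from collections import Counter


def degree_sequence(links):
    """Degree of every node, as a Counter."""
    return Counter(node for link in links for node in link)


def randomise_links(links, swaps_per_link=10, seed=0):
    """Degree-preserving randomisation by repeated end-point swaps."""
    rng = random.Random(seed)
    edges = [tuple(link) for link in links]
    present = {frozenset(e) for e in edges}
    for _ in range(swaps_per_link * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second link
        if len({a, b, c, d}) < 4:
            continue  # the swap would create a self-loop
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in present or new_2 in present:
            continue  # the swap would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new_1, new_2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


# Hypothetical example: a ring of eight nodes with two chords.
ring = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
shuffled = randomise_links(ring, seed=1)
print(degree_sequence(shuffled) == degree_sequence(ring))
```

Each accepted swap replaces links a-b and c-d by a-d and c-b, so every node keeps its degree while the pattern of connections is scrambled.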
In Table 1 we have collected a few interesting statistical properties of the network of communities.
It should be pointed out that the average clustering coefficients $\langle C^{com} \rangle$ are relatively high, indicating
that two communities overlapping with a given community are likely to overlap with each other as well,
mostly because they all share the same overlapping region. The high fraction of shared nodes is yet
another indication of the importance of overlaps between the communities.
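The quantities collected in Table 1 can be sketched for a list of communities given as node sets. The clustering coefficient used here is the usual local one, averaged over communities, and the shared fraction is taken per community and then averaged. Both are our reading of the table caption and may differ in detail from the authors' definitions.

```python
from itertools import combinations


def community_network_statistics(communities):
    """Return (N_com, mean degree, mean clustering, mean shared fraction)."""
    n = len(communities)
    neighbours = [set() for _ in range(n)]
    for a, b in combinations(range(n), 2):
        if communities[a] & communities[b]:  # an overlap is a link
            neighbours[a].add(b)
            neighbours[b].add(a)
    mean_degree = sum(len(nb) for nb in neighbours) / n

    def clustering(i):
        nb = neighbours[i]
        if len(nb) < 2:
            return 0.0
        linked = sum(1 for u, v in combinations(nb, 2) if v in neighbours[u])
        return 2.0 * linked / (len(nb) * (len(nb) - 1))

    def shared_fraction(i):
        others = set().union(*(communities[j] for j in neighbours[i]))
        return len(communities[i] & others) / len(communities[i])

    mean_clustering = sum(clustering(i) for i in range(n)) / n
    mean_shared = sum(shared_fraction(i) for i in range(n)) / n
    return n, mean_degree, mean_clustering, mean_shared


# Hypothetical toy input: three mutually overlapping communities and one isolated.
communities = [{1, 2, 3, 4}, {4, 5, 6}, {3, 4, 7, 8}, {9, 10, 11}]
stats = community_network_statistics(communities)
print(stats)
```

In the toy input the three overlapping communities all contain node 4, so they form a triangle in the network of communities, which illustrates why a common overlapping region pushes the clustering coefficient up.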
The specific scaling of the community degree distribution is a novel signature of the hierarchical
nature of the systems we study. We find that if we consider the network of communities instead of the
                   $N^{com}$   $\langle d^{com} \rangle$   $\langle C^{com} \rangle$   $\langle r \rangle$
co-authorship      2450        12.10                       0.44                        58%
word assoc.        670         11.33                       0.56                        72%
prot. interact.    82          1.54                        0.17                        26%
Table 1: Statistical properties of the network of communities. $N^{com}$ is the number of communities, $\langle d^{com} \rangle$ is the average community degree, $\langle C^{com} \rangle$ is the average clustering coefficient of the network of communities, and $\langle r \rangle$ represents the average fraction of shared nodes in the communities.
nodes themselves, we still observe a degree distribution with a fat tail, but a characteristic scale appears,
below which the distribution is exponential. This is consistent with our understanding of a complex
system having different levels of organisation with units specific to each level. In the present case the
principle of organisation (scaling) is preserved (with some specific modifications) when going to the next
level in good agreement with the recent finding of the self-similarity of many complex networks [30].
With recent technological advances, huge sets of data are accumulating at a tremendous pace in vari-
ous fields of human activity (including telecommunication, the internet, stock markets) and in many areas
of life and social sciences (biomolecular assays, genetic maps, groups of web users, etc.). Understanding
both the universal and specific features of the networks associated with these data has become a timely
and important task. The knowledge of the community structure enables the prediction of some essential
features of the systems under investigation. For example, since with our approach it is possible to “zoom”
onto a single unit in a network and uncover its communities (and the communities connected to these,
and so on), we provide a tool to interpret the local organisation of large networks and can predict how
the modular structure of the network changes if a unit is removed (e.g., in a gene knock out experiment).
A unique feature of our method is that we can simultaneously look at the network at a higher level of
organisation and locate the communities that play a key role within the web of communities. Among the
many possible applications is a more sophisticated approach to the spreading of infections (e.g., real or
computer viruses) or information in highly modular complex systems.
References
[1] Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442
(1998).
[2] Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512
(1999).
[3] Albert, R. & Barabási, A.-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97
(2002).
[4] Mendes, J. F. F. & Dorogovtsev, S. N. Evolution of Networks: From Biological Nets to the Internet
and WWW (Oxford University Press, Oxford, 2003).
[5] Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z., & Barabási, A.-L. Hierarchical organization
of modularity in metabolic networks. Science 297, 1551–1555 (2002).
[6] Spirin, V. & Mirny, L. A. Protein complexes and functional modules in molecular networks. Proc.
Natl. Acad. Sci. USA 100, 12123–12128 (2003).
[7] Onnela, J.-P., Chakraborti, A., Kaski, K., Kertész, J., & Kanto, A. Dynamics of market correlations:
Taxonomy and portfolio analysis. Phys. Rev. E 68, 056110 (2003).
[8] Scott, J. Social Network Analysis: A Handbook, 2nd ed. (Sage Publications, London, 2000).
[9] Watts, D. J., Dodds, P. S., & Newman, M. E. J. Identity and search in social networks. Science 296,
1302–1305 (2002).
[10] Shiffrin, R. M. & Börner, K. Mapping knowledge domains. Proc. Natl. Acad. Sci. USA 101, 5183–5185
Suppl. 1 (2004).
[11] Everitt, B. S. Cluster Analysis, 3rd ed. (Edward Arnold, London, 1993).
[12] Knudsen, S. A Guide to Analysis of DNA Microarray Data, 2nd ed. (Wiley-Liss, 2004).
[13] Newman, M. E. J. Detecting community structure in networks. Eur. Phys. J. B, 38, 321–330 (2004).
[14] Vicsek, T. The bigger picture. Nature 418, 131 (2002).
[15] Blatt, M., Wiseman, S., & Domany, E. Super-paramagnetic clustering of data. Phys. Rev. Lett. 76,
3251–3254 (1996).
[16] Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Proc. Natl.
Acad. Sci. USA 99, 7821–7826 (2002).
[17] Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., & Parisi, D. Defining and identifying commu-
nities in networks. Proc. Natl. Acad. Sci. USA 101, 2658–2663 (2004).
[18] Newman, M. E. J. Fast algorithm for detecting community structure in networks. Phys. Rev. E 69,
066133 (2004).
[19] Faust, K. Using Correspondence Analysis for Joint Displays of Affiliation Networks. Models and
Methods in Social Network Analysis (Eds Carrington, P., Scott, J., & Wasserman, S.) Ch. 7 (Cam-
bridge University Press, New York, 2005).
[20] Gavin, A. C. et al. Functional organization of the yeast proteome by systematic analysis of protein
complexes. Nature 415, 141–147 (2002).
[21] Everett, M. G. & Borgatti, S. P. Analyzing clique overlap. Connections 21, 49–61 (1998).
[22] Kosub, S. Local density. Network Analysis, LNCS 3418 (Eds Brandes, U. & Erlebach, T.) pp. 112–142
(Springer-Verlag, Berlin Heidelberg, 2005).
[23] Batagelj, V. & Zaversnik, M. Short cycles connectivity. arXiv cs.DS/0308011 (2003).
[24] Derényi, I., Palla, G., & Vicsek, T. Clique percolation in random networks. Phys. Rev. Lett.
(submitted).
[25] Warner, S. E-prints and the Open Archives Initiative. Library Hi Tech 21, 151–158 (2003).
[26] Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. The University of South Florida word association,
rhyme, and word fragment norms. http://www.usf.edu/FreeAssociation/.
[27] Xenarios, I. et al. DIP: the Database of Interacting Proteins. Nucl. Ac. Res. 28, 289–291 (2000).
[28] Boyle, E. I. et al. GO::TermFinder–open source software for accessing Gene Ontology information
and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinfor-
matics 20, 3710–3715 (2004).
[29] Cherry, J. M. et al. Genetic and physical maps of Saccharomyces cerevisiae. Nature 387, 67–73
Suppl. (1997). http://www.yeastgenome.org/.
[30] Song, C., Havlin, S., & Makse, H. A. Self-similarity of complex networks. Nature 433, 392–395
(2005).
Supplementary Information accompanies the paper on www.nature.com/nature.
Acknowledgements We thank A.-L. Barabási and P. Pollner for useful discussions. We acknowledge
the help of B. Kovács and G. Szabó in connection with visualisation and software support. This research
was supported by the Hungarian Research Grant Foundation (OTKA).
Competing interests statement The authors declare that they have no competing financial interests.
Correspondence and requests for materials should be addressed to T.V. (vicsek@angel.elte.hu).
10
... These approaches define communities as subgraphs that have more edges per unit than the rest of the network. One well-known example is the Clique Percolation Method (CPM) introduced by Palla et al. [7], in which communities are created by combining nodesharing k-cliques or ultimately linked subgraphs. This method guarantees that the communities that are identified have substantial internal connectedness and capture overlapping community structures. ...
... The connections among community members are very dense, while the demands between communities are significantly less. With this characteristic, researchers have been able to propose various measures and methods for community detection, especially algorithms that can allow overlapping communities [6,7]. ...
... The density criterion to detect communities is put forward by [7]. They defined the density of the edges inside the community and that of the candidates. ...
Article
The selection of the initial centers of the communities is also significant in iteration-based methods for finding the communities in the networks. This is the reason why, if the initial centers of the communities are not chosen correctly, the errors and the time required for the application of the algorithm in the detection of the communities will be higher. Hence, selecting more significant nodes as starting points of communities can be the appropriate solution. Various techniques can be employed to achieve the selection of more significant nodes. In this thesis, the algorithm under discussion employs density and modularity criteria in the identification of communities in complex networks. This algorithm initially defines the number of nodes or the distinctive members of the community, in which these nodes have higher density levels and all the other nodes in their neighborhood have lower density levels. Next, the local communities are defined as the nodes that are in some way connected to the core nodes. Finally, the final communities are defined with the assistance of the merging algorithm, which is based on increasing modularity. In this algorithm, increasing modularity is used as a criterion for joining local communities together. Modularity is a criterion that indicates how the graph is like a modular or an organized community. When modularity becomes higher, local communities merge to form the final community. This means that it is possible to apply the presented algorithm and to use both density and modularity criteria to detect communities in complex networks. When the core nodes and local communities are first detected and then merged based on the increasing value of modularity, the resultant communities are more accurate. The results of the conducted experiments prove that the method applied in the Karate Club network clustering is equal to 0. 6913 for the NMI criterion and a value of 0. 733 for the accuracy criterion.
... The most straightforward examples of cohesive subgraphs are cliques [14], [15] and their variants, such as quasi-cliques [16] and defective cliques [17], [18]. Additionally, there are many concepts of cohesive subgraphs based on cliques, including k-clique communities [19], k-clique densest subgraphs [20], and nucleus decompositions [21]. Listing k-cliques. ...
... Algorithms for listing all k-cliques are used to detect communities within real-world networks [19], [22], [23]. A k-clique community [19] is a union of k-cliques adjacent to each other, and algorithms for listing k-cliques are used to detect k-clique communities (see Figure 1). Palla et al. [19] proposed a method to analyze statistical features of real-world networks based on k-clique communities. ...
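The notion of a k-clique community described in this excerpt can be illustrated with a minimal sketch. It assumes the adjacency rule of Palla et al., in which two k-cliques are adjacent when they share k-1 nodes, and it enumerates cliques by brute force, so it is only suitable for toy graphs. The function name `k_clique_communities` and the example edge list are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Return k-clique communities as a list of frozensets of nodes."""
    # Build an undirected adjacency map.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Brute-force enumeration of k-cliques (toy graphs only).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques: merge two cliques when they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the nodes of all cliques in one component.
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return [frozenset(g) for g in groups.values()]


# Hypothetical toy graph: three triangles, two of which share an edge.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities = k_clique_communities(edges, 3)
```

On this toy graph the triangles {1,2,3} and {2,3,4} share two nodes and merge, while {4,5,6} shares only node 4 with them, so node 4 ends up in two communities. This is the kind of overlap the k-clique definition permits.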
Preprint
Listing k-cliques plays a fundamental role in various data mining tasks, such as community detection and mining of cohesive substructures. Existing algorithms for the k-clique listing problem are built upon a general framework, which finds k-cliques by recursively finding (k-1)-cliques within subgraphs induced by the out-neighbors of each vertex. However, this framework has inherent inefficiency of finding smaller cliques within certain subgraphs repeatedly. In this paper, we propose an algorithm DIST for the k-clique listing problem. In contrast to existing works, the main idea in our approach is to compute each clique in the given graph only once and store it into a data structure called Induced Subgraph Trie, which allows us to retrieve the cliques efficiently. Furthermore, we propose a method to prune search space based on a novel concept called soft embedding of an l-tree, which further improves the running time. We show the superiority of our approach in terms of time and space usage through comprehensive experiments conducted on real-world networks; DIST outperforms the state-of-the-art algorithm by up to two orders of magnitude in both single-threaded and parallel experiments.
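The general framework mentioned in this abstract, which finds k-cliques by recursively finding (k-1)-cliques among the out-neighbors of each vertex, can be sketched as follows. This is not the DIST algorithm. It is a minimal illustration that assumes edges are oriented from the smaller to the larger node label (real implementations typically use a degeneracy or degree ordering), and the name `list_k_cliques` is hypothetical.

```python
def list_k_cliques(edges, k):
    """List every k-clique exactly once as a tuple of nodes in increasing order."""
    # Orient each edge from the lower to the higher node label.
    out = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        out.setdefault(a, set()).add(b)
        out.setdefault(b, set())

    result = []

    def extend(clique, candidates, remaining):
        # `candidates` holds the common out-neighbors of all nodes in `clique`.
        if remaining == 0:
            result.append(tuple(clique))
            return
        for v in sorted(candidates):
            # Recurse into the subgraph induced by the out-neighbors of v.
            extend(clique + [v], candidates & out[v], remaining - 1)

    extend([], set(out), k)
    return result


# Hypothetical toy graph used only for illustration.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
triangles = list_k_cliques(edges, 3)
```

Because every edge points in one direction only, each clique is reached along exactly one path of the recursion. The inefficiency the abstract refers to still shows up, however: the same smaller cliques can be recomputed inside different out-neighborhoods.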
... Among them, the cluster percolation algorithm, proposed by Palla et al. [26], is an overlapping community discovery algorithm applicable to large-scale networks and can improve detection efficiency. However, the algorithm lacks stability in sparse networks. ...
... We used classical overlapping community detection algorithms and current state-of-the-art overlapping community detection algorithms for comparison and analysis. The overlapping community detection algorithms considered are as follows: the CPM algorithm [26], the LinkComm algorithm [51], the UMSTMO algorithm [52] and the LA_IS algorithm [53]. A detailed description of the above algorithms is as follows: The CPM algorithm finds connected subgraphs of k-factions (k-faction communities) by finding extremely complete subgraphs (factions) in the network. ...
Article
Overlapping community detection algorithms have attracted a lot of attention for their ability to better reflect the diversity and complex relationships in real-world social networks and complex networks. However, overlapping community algorithms are vulnerable to expose private information of users to unscrupulous elements. Furthermore, existing overlapping community-based hiding models define the entire overlapping region as a danger zone, which is inefficient and unrealistic as it cannot accurately identify critical and vulnerable users. In addition, existing hiding algorithms change the topology of the original network through their edge addition and deletion strategies, which can disrupt the network trends. Therefore, in this paper, we propose an overlapping community hiding algorithm based on multi-criteria learning decision analysis and a network growth model, called CoHide. The algorithm is divided into two components: firstly, the High-risk Seed node set Extraction (HrSE) algorithm is used to efficiently and accurately extract the set of high-risk nodes vulnerable to attacks; then the Network Growth-based Community Hiding (NGCH) algorithm is used to achieve a quicker community hiding process. In addition, we define community tendency indicators and propose OL-Permanence, a community hiding indicator applicable to overlapping networks, based on the Permanence indicator to evaluate the hiding effect. The effectiveness of the proposed CoHide algorithm is verified by experiments based on three public datasets and one real Twitter dataset.
... Network science [44] has emerged as a field providing a powerful framework of tools and methodologies to study complex networks. Its applications and importance span multiple real-world domains, to name a few (but not exclusive), community detection in random graphs [8], social networks [27], protein-protein interactions [40], the detection of co-authorship collaboration networks [39], and economic indicators [51]. One critical aim in the analysis of complex networks is that of identifying cliques, which is known for its NP-hard complexity [7,28]. ...
Preprint
Identifying cliques in dense networks remains a formidable challenge, even with significant advances in computational power and methodologies. To tackle this, numerous algorithms have been developed to optimize time and memory usage, implemented across diverse programming languages. Yet, the inherent NP-completeness of the problem continues to hinder performance on large-scale networks, often resulting in memory leaks and slow computations. In the present study, we critically evaluate classic algorithms to pinpoint computational bottlenecks and introduce novel set-theoretical approaches tailored for network clique computation. Our proposed algorithms are rigorously implemented and benchmarked against existing Python-based solutions, demonstrating superior performance. These findings underscore the potential of set-theoretical techniques to drive substantial performance gains in network analysis.
... Network theory is a very fruitful branch of mathematics [31]. A network, or graph, can recollect interactions between different components of complex systems, such as genes within a cell [32], words in a semantic space [33], species in an ecosystem [34], etc. This abstraction allows us to apply powerful computational and analytical methods derived from statistical mechanics [35], the study of graph structure [36][37][38][39], dynamical systems [40], etc.; and thus extract conclusions that might apply broadly across fields, to any networked system. ...
Preprint
Syntax is an aspect of human language responsible for the hierarchical ordering of linguistic structures. Syntax can be summarized by dependency trees with words as nodes and edges reflecting syntactic subordination. By merging trees from several sentences, we obtain syntax graphs or networks, which display distinct shapes depending on whether the language capability is well-formed, still developing, or pathological. Such graphs make syntactic capacity quantifiable at a systemic level, revealing emerging patterns and universalities. What is the structure of syntax networks during ontogeny in typically developing (TD) children? Do cognitively challenged children develop language through alternative routes? Here we quantify and portray the typical development of syntax networks in Dutch, and find that children affected by Down syndrome, hearing impairment, and specific language impairment initially seem to follow the typical developmental path but eventually halt, culminating in a different linguistic phenotype. Our expanded data set (with almost 50 times more data than earlier studies) and increased mathematical dimensions to quantify network shape enable us to: (i) confirm and refine a proposed sharp transition in language development, (ii) correlate specific network traits with syntax maturation, and (iii) quantify the aspects that fall short in atypical development---suggesting potential diagnostic tools. We also find grounds to hypothesize a gap in syntax maturation, separating challenged children who nevertheless reach the latest stage from others systematically stuck, regardless of their condition. Our quantitative analysis enables a rigorous visualization of linguistic development trajectories, an old (yet mostly qualitative) theme in linguistics. Similar works should allow to test and propose specific hypotheses on solid grounds, as we do here. 
Future efforts should generalize to other languages and/or clinical conditions, seeking patterns that might point at universalities in language development.
... As illustrated in Figure 1(b), most early research on community detection has focused on disjoint clusters, where each node belongs to a single community, and there is no overlap between communities [7,8,15,40,45,47,53]. However, nodes often participate in multiple communities in many real-world applications (as depicted in Figure 1(c)), sparking a growing interest in detecting overlapping communities [23,38,50]. Overlapping community detection typically entails higher computational costs and time overhead than disjoint community detection. ...
Preprint
Community detection is a critical task in graph theory, social network analysis, and bioinformatics, where communities are defined as clusters of densely interconnected nodes. However, detecting communities in large-scale networks with millions of nodes and billions of edges remains challenging due to the inefficiency and unreliability of existing methods. Moreover, many current approaches are limited to specific graph types, such as unweighted or undirected graphs, reducing their broader applicability. To address these issues, we propose a novel heuristic community detection algorithm, termed CoDeSEG, which identifies communities by minimizing the two-dimensional (2D) structural entropy of the network within a potential game framework. In the game, nodes decide to stay in their current community or move to another based on a strategy that maximizes the 2D structural entropy utility function. Additionally, we introduce a structural entropy-based node overlapping heuristic for detecting overlapping communities, with a near-linear time complexity. Experimental results on real-world networks demonstrate that CoDeSEG is the fastest method available and achieves state-of-the-art performance in overlapping normalized mutual information (ONMI) and F1 score.
Preprint
Counting small subgraphs, referred to as motifs, in large graphs is a fundamental task in graph analysis, extensively studied across various contexts and computational models. In the sublinear-time regime, the relaxed problem of approximate counting has been explored within two prominent query frameworks: the standard model, which permits degree, neighbor, and pair queries, and the strictly more powerful augmented model, which additionally allows for uniform edge sampling. Currently, in the standard model, (optimal) results have been established only for approximately counting edges, stars, and cliques, all of which have a radius of one. This contrasts sharply with the state of affairs in the augmented model, where algorithmic results (some of which are optimal) are known for any input motif, leading to a disparity which we term the "scope gap" between the two models. In this work, we make significant progress in bridging this gap. Our approach draws inspiration from recent advancements in the augmented model and utilizes a framework centered on counting by uniform sampling, thus allowing us to establish new results in the standard model and simplify previous results. In particular, our first, and main, contribution is a new algorithm in the standard model for approximately counting any Hamiltonian motif in sublinear time. Our second contribution is a variant of our algorithm that enables nearly uniform sampling of these motifs, a capability previously limited in the standard model to edges and cliques. Our third contribution is to introduce even simpler algorithms for stars and cliques by exploiting their radius-one property. As a result, we simplify all previously known algorithms in the standard model for stars (Gonen, Ron, Shavitt (SODA 2010)), triangles (Eden, Levi, Ron, Seshadhri (FOCS 2015)) and cliques (Eden, Ron, Seshadhri (STOC 2018)).
Article
The brain comprises a complex network of interacting regions. To understand the roles and mechanisms of this intricate network, it is crucial to elucidate its structural features related to cognitive functions. Recent empirical evidence suggests that both feedforward and feedback signals are necessary for conscious perception, emphasizing the importance of subnetworks with bidirectional interactions. However, the link between such subnetworks and conscious perception remains unclear due to the complexity of brain networks. In this study, we propose a framework for extracting subnetworks with strong bidirectional interactions—termed the “cores” of a network—from brain activity. We applied this framework to resting-state and task-based human fMRI data from participants of both sexes to identify regions forming strongly bidirectional cores. We then explored the association of these cores with conscious perception and cognitive functions. We found that the extracted central cores predominantly included cerebral cortical regions rather than subcortical regions. Additionally, regarding their relation to conscious perception, we demonstrated that the cores tend to include regions previously reported to be affected by electrical stimulation that altered conscious perception, although the results are not statistically robust due to the small sample size. Furthermore, in relation to cognitive functions, based on a meta-analysis and comparison of the core structure with a cortical functional connectivity gradient, we found that the central cores were related to unimodal sensorimotor functions. The proposed framework provides novel insights into the roles of network cores with strong bidirectional interactions in conscious perception and unimodal sensorimotor functions. Significance Statement To understand the brain’s network, we need to decipher its structural features linked to cognitive functions. 
Recent studies suggest the importance of subnetworks with bidirectional interactions for conscious perception, but their exact relationship remains unclear due to the brain’s complexity. Here we propose a framework for extracting subnetworks with strong bidirectional interactions, or network “cores.” We applied it to fMRI data and explored the association of the cores with conscious perception and cognitive functions. The central cores predominantly included cortical regions rather than subcortical ones, and tended to comprise previously reported regions wherein electrical stimulation altered perception, suggesting the potential importance of bidirectional cores for conscious perception. Additionally, further analysis revealed the relationship of the cores to unimodal sensorimotor functions.
Article
Crystal structure prediction (CSP) is an evolving field aimed at discerning crystal structures with minimal prior information. Despite the success of various CSP algorithms, their practical applicability remains circumscribed, particularly for large and complex systems. Here, to address this challenge, we show an evolutionary structure generator within the MAGUS (Machine Learning and Graph Theory Assisted Universal Structure Searcher) framework, inspired by the symmetry principle. This generator extracts both global and local features of explored crystal structures using group and graph theory. By integrating an on-the-fly space group miner and fragment reorganizer, augmented by symmetry-kept mutation, our approach generates higher-quality initial structures, reducing the computational costs of CSP tasks. Benchmarking tests show up to fourfold performance improvements. The method also proves valid in complex phosphorus allotrope systems. Furthermore, we apply our approach to the diamond–silicon (111)-(7 × 7) surface system, identifying up to 42 metastable structures within an 18 meV Å⁻² energy range, demonstrating the efficacy of our approach in navigating challenging search spaces.
Article
There are a large number of techniques that try and determine areas within a network in which individuals are more closely linked to each other than outsiders. However, once these cohesive subgraphs have been identified researchers are often left with a long list of overlapping subgroups and have no means of assessing the structure or importance of these groups. In this paper we examine techniques for describing and reducing the amount of overlap so that the analyst can better understand the complex underlying clique structure.
Chapter
Normalization; Dye Bias, Spatial Bias, Print Tip Bias; Expression Indices; Detection of Outliers; Fold Change; Significance; Mixed Cell Populations; Summary
Article
Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.
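The two mechanisms named in this abstract, growth and preferential attachment, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' exact model: the network starts from a small complete graph, each new vertex adds `m` edges, and targets are drawn with probability proportional to their current degree. The function name, parameters, and seed are hypothetical.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches to m distinct existing nodes."""
    rng = random.Random(seed)
    # Start from a complete graph on m + 1 nodes so every node has nonzero degree.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Each node appears in `stubs` once per incident edge, so uniform sampling
    # from `stubs` picks a node with probability proportional to its degree.
    stubs = [x for e in edges for x in e]
    for new in range(m + 1, n):  # (i) growth: add one vertex at a time
        targets = set()
        while len(targets) < m:  # (ii) preferential attachment
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


graph_edges = preferential_attachment(n=100, m=2)
```

Keeping a list with one entry per edge endpoint avoids recomputing degrees at every step, since sampling uniformly from that list is equivalent to sampling nodes in proportion to their degree.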
Article
Many networked systems, including physical, biological, social, and technological networks, appear to contain "communities": groups of nodes within which connections are dense, but between which they are sparser. The ability to find such communities in an automated fashion could be of considerable use. Communities in a web graph for instance might correspond to sets of web sites dealing with related topics, while communities in a biochemical network or an electronic circuit might correspond to functional units of some kind. We present a number of new methods for community discovery, including methods based on "betweenness" measures and methods based on modularity optimization. We also give examples of applications of these methods to both computer-generated and real-world network data, and show how our techniques can be used to shed light on the sometimes dauntingly complex structure of networked systems.
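The modularity that such optimization methods maximize can be computed directly for a given partition. The sketch below assumes the standard Newman-Girvan definition for an undirected, unweighted graph with non-overlapping communities, $Q = \sum_c \left[ e_c/m - (d_c/2m)^2 \right]$, where $e_c$ is the number of edges inside community $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges. The function name and the toy graph are hypothetical.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    # Map every node to the index of its (single) community.
    label = {v: i for i, c in enumerate(communities) for v in c}
    inside = [0] * len(communities)  # e_c: edges with both ends in community c
    degree = [0] * len(communities)  # d_c: summed degree of nodes in community c
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(inside, degree))


# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q_split = modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}])
q_single = modularity(toy_edges, [{1, 2, 3, 4, 5, 6}])
```

Splitting the toy graph into its two triangles gives a positive score, whereas placing all nodes in one community gives zero, which is why modularity can serve as a criterion for choosing among candidate partitions.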
Article
The Open Archives Initiative (OAI) was created as a practical way to promote interoperability between e-print repositories. Although the scope of the OAI has been broadened, e-print repositories still represent a significant fraction of OAI data providers. This article presents a brief survey of OAI e-print repositories, and of services using metadata harvested from e-print repositories using the OAI protocol for metadata harvesting (OAI-PMH). It then discusses several situations where metadata harvesting may be used to further improve the utility of e-print archives as a component of the scholarly communication infrastructure.