ORIGINAL ARTICLE
Homophilic network decomposition: a community-centric analysis of online social services
Giulio Rossetti (1, 2), Luca Pappalardo (2), Riivo Kikas (3), Dino Pedreschi (2), Fosca Giannotti (1), Marlon Dumas (3)
Received: 18 December 2015 / Revised: 15 October 2016 / Accepted: 19 October 2016 / Published online: 31 October 2016
The Author(s) 2016. This article is published with open access at Springerlink.com
Abstract In this paper we formulate the homophilic network decomposition problem: Is it possible to identify a network partition whose structure is able to characterize the degree of homophily of its nodes? The aim of our work is to understand the relations between the homophily of individuals and the topological features expressed by specific network substructures. We apply several community detection algorithms to three large-scale online social networks (Skype, LastFM and Google+) and advocate the need to identify the right algorithm for each specific network in order to extract a homophilic network decomposition. Our results show clear relations between the topological features of communities and the degree of homophily of their nodes in three online social scenarios: product engagement in the Skype network, number of listened songs on LastFM and homogeneous level of education among users of Google+.
1 Introduction
As the social media space grows, more and more people interact and share experiences through a plethora of different online services, producing a huge amount of personal data every day. Companies providing social media platforms are interested in exploiting these Big Data to understand "user engagement," i.e., the way individuals use the products provided via the platform. In particular, predictive analytics allows these companies to exploit historical user engagement data, in conjunction with social network data, to predict the future product usage (engagement) of individuals in the network. Traditional approaches to predictive analytics focus on individuals: they try to describe and predict the level of engagement of a single individual. Focusing on individuals, however, introduces many challenging issues. First, the number of individuals to process is enormous and hence hardly manageable. Think about online giants like Skype or Facebook: in these contexts, providing an up-to-date description and prediction of user engagement for billions of users is not practically feasible. Second, addressing each single individual is in many cases redundant, since neighbors in networks tend to behave in a similar way and to share specific features (age, location, language, interests), i.e., they show a certain degree of social homophily (McPherson et al. 2001; Himelboim et al. 2013). Indeed, the analysis of user engagement can be seen as an instance of a more general problem: homophilic network decomposition, which consists in finding a partition of the network that guarantees a high degree of homophily in the subgroups of the network.
Giulio Rossetti: giulio.rossetti@isti.cnr.it
Luca Pappalardo: lpappalardo@di.unipi.it
Riivo Kikas: kikas@ut.ee
Dino Pedreschi: dino.pedreschi@di.unipi.it
Fosca Giannotti: fosca.giannotti@isti.cnr.it
Marlon Dumas: marlon.dumas@ut.ee
1 KDDLab, ISTI-CNR, Via G. Moruzzi, 1, 56124 Pisa, Italy
2 KDDLab, University of Pisa, Largo B. Pontecorvo, 3, 56127 Pisa, Italy
3 University of Tartu, Tartu, Estonia
Soc. Netw. Anal. Min. (2016) 6:103
DOI 10.1007/s13278-016-0411-4

Restricting the analysis to single users inevitably underestimates the importance of social homophily, whereas online social services are usually designed to foster social interactions between individuals. It is hence fundamental to widen the scope of the analysis to incorporate the social surroundings of users and thus capture the homophily that characterizes real social networks. We propose to move the focus from individuals to groups, i.e., to analyze and describe the level of homophily of social communities. User-centric approaches fail because they do not take into account the individuals' social surroundings; at the other extreme, analyzing homophily on the network as a whole does not make sense. The group-centric approach focuses on social communities as a trade-off between the micro- and the macro-level of network granularity (Fig. 1).
Moving the interest from individuals to communities brings many advantages. First, we reduce the space of analysis by several orders of magnitude, shrinking the number of objects to process and speeding up the analytical tasks. Second, targeting communities allows us to capture the homophily inherent to the social network: we can "compress" into one object all the densely connected components of a social group. Finally, groups are complex objects from which we can extract a wide set of features for the analysis.
In this paper we investigate the potential of a group-centric approach in describing user homophily. Using different community detection algorithms, we compute social communities from three large-scale online social networks (Skype, LastFM and Google+) and extract salient features from each community. We then build a repertoire of classifiers to predict the level of homophily in the communities, both in terms of product engagement and similarity of attributes. We find two main results. First, group-centric approaches outperform user-centric ones when we use algorithms producing overlapping micro-communities. In contrast, when we adopt partitioning algorithms that maximize modularity and produce macro-communities, performance is worse than that of classical user-centric strategies. Second, the group-centric approach is useful when dealing with networks where social interactions are a crucial part of the online service, such as the Skype social network, while it fails when the social network is just a marginal part of the service, as for LastFM. Our work shows that the choice of a proper community detection algorithm for the specific network analyzed is crucial to partition the network into homophilic groups of users. Moreover, varying the online social services analyzed (and the related semantics), we observe that the obtained communities are proxies for the homophily in the network.
The rest of the paper is structured as follows. Section 2 defines the problem of homophilic network decomposition, which is the basis for addressing predictive tasks at the level of groups of individuals. Section 3 introduces the datasets and experimental setup used to test different methods to address the problem of homophilic network decomposition. Section 4 presents the experimental results, while Sect. 5 discusses the implications of these results. Finally, Sect. 6 discusses related work and Sect. 7 provides a summary of the contribution, suggesting directions for future work.
2 Problem definition
Online social services enable people to share interests, interact and generate content. The users of these services naturally tend to cluster around similar attributes (e.g., age, location and tastes), a property called social homophily (McPherson et al. 2001). To identify homophilic behaviors we need to identify the right observation granularity: Which subgraph size maximizes the similarity of users w.r.t. a given attribute? Are specific online social networks more homophilic than others? These questions are instantiations of the more general problem of homophilic network decomposition:
Definition 1 (Homophilic network decomposition) Given a social graph $G = (V, E)$ and a set $L$ of node labels, a homophilic network decomposition is a collection of subgraphs of $G$, i.e., $H = \{G_1, \dots, G_n\}$ where $G_1 = (V_1, E_1), \dots, G_n = (V_n, E_n)$, such that $\forall i \in 1..n:\ V_i \subseteq V \wedge E_i \subseteq E$, and in each subgraph $G_i$ there is a dominant label, i.e., $\forall G_i\ \exists l \in L$ such that $\frac{|\{v \in V_i \mid L(v) = l\}|}{|V_i|} > s$. In this context, $s$ is the dominance threshold, meaning that the proportion of nodes in $G_i$ that have the dominant label is at least $s$.
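The dominance condition of Definition 1 can be illustrated with a minimal Python sketch. It assumes each subgraph is given only by its node set and that labels are stored in a dictionary; the function names and the toy data are hypothetical, and the strict inequality follows the formula in the definition.

```python
from collections import Counter


def dominant_label(nodes, labels):
    """Return (label, share) for the most frequent label among `nodes`."""
    counts = Counter(labels[v] for v in nodes)
    label, count = counts.most_common(1)[0]
    return label, count / len(nodes)


def is_homophilic_decomposition(subgraphs, labels, s):
    """True if every node set has a dominant label whose share exceeds s."""
    return all(dominant_label(nodes, labels)[1] > s for nodes in subgraphs)


# Toy example: node labels are countries, subgraphs are given as node sets.
labels = {"a": "IT", "b": "IT", "c": "IT", "d": "EE", "e": "EE", "f": "IT"}
subgraphs = [{"a", "b", "c", "d"}, {"d", "e", "f"}]
print(is_homophilic_decomposition(subgraphs, labels, s=0.6))
```

Note that the node sets may overlap (node "d" appears in both), as the definition asks for a collection of subgraphs rather than a strict partition.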
One key question in homophilic network decomposition
is how to break down the network in a way that is topo-
logically meaningful and preserves the desired homophily
property inside each group.
In this work we address the problem of homophilic network decomposition in three different online social networks: the full Skype contact graph, a nationwide Google+ snapshot and a sample of UK users of the LastFM social network. All these networks have peculiar structures, node attributes and semantics: for each network we select a target feature (not directly related to network topology) and treat the identification of the best partition across a set of candidates as a classification problem. Our aim is to measure the ability of the topology of a community to estimate the homophily of the nodes in the community: Do users in dense LastFM communities listen to more music? Do users in big Google+ communities have the same education level? Are Skype users in nation-homophilic communities frequent video callers?

Fig. 1 Interpolation between the local and the global level through network partitions of different sizes

Our experiments address these questions, and the obtained results are used to discuss the differences among the analyzed networks and the role their semantics play in the quality of the classification results.
3 Experimental settings
In this section we define our experimental settings: in Sect.
3.1 we describe the three online social network datasets we
analyze, and in Sect. 3.2 we present the community dis-
covery algorithms we use to partition the networks. Finally
in Sect. 3.3 we introduce the topological features used to
train the classifiers that discriminate high from low
homophilic communities.
3.1 Datasets description
We analyze three large-scale datasets of popular online platforms: Skype, LastFM and Google+.
3.1.1 The Skype dataset
The first dataset is provided by Skype and includes anonymized data of Skype users as of October 2011. Each user (identified by a hashed identifier) is associated with an account creation date and with the country and city of account creation. The dataset also includes undirected connections between users: a connection exists between two users if and only if they belong to each other's contact list. Connections are established as follows: if a user $u$ wants to add another user $v$ to her contact list, $u$ sends $v$ a contact request. The connection is established at the moment $v$ approves the request (and is not established if the contact request is not approved). In the dataset, each connection is labeled with a timestamp corresponding to the contact request approval. The dataset also includes data about the usage of two Skype products: video calling and chatting. Product usage is aggregated monthly. Specifically, for each product, each user and each month, we are given the number of days in the month on which the user used the product in question. The product usage data do not provide information about individual interactions between users, such as the participants in an interaction or its content, length or time. The frequency of product usage is not recorded at a finer granularity than monthly. In this paper, we focus on analyzing the most recent available snapshot of the network. Accordingly, we focus on the subset of the dataset containing only users who used one of the two products during at least two of the last three months covered in the dataset. Our analyses are then executed on a filtered dataset composed of several tens of millions of users and connections.
3.1.2 The LastFM dataset
LastFM is a popular online social network platform where people can share their music tastes and discover new music based on what they like. Once a user subscribes to an account, she can either start listening to the LastFM personalized radio or send data about her own offline listenings. For each song, a user can express her preferences and add tags (e.g., the genre of the song). Lastly, a user can add friends (undirected connections; the friendship request must be confirmed) and search for other users with similar musical tastes. A user can see her friends' activities on her homepage. Using the LastFM APIs (footnote 1) we downloaded a sample of the UK user graph, starting from a set of seed nodes and following a breadth-first approach. We decided to explore the graph up to the fifth degree of separation from our seeds. For each user, we retrieved: (a) her connections and (b) for each week in the time window from January 2010 to December 2011, the number of single listenings of a given artist (e.g., in the first week of April 2010, user 1324 listened to 66 songs by the artist Bon Jovi). The number of listenings gives an estimate of the engagement of the user with the LastFM service. Each song has a tag representing its music genre (rock, metal, jazz, punk, etc.). After the crawl and cleaning stages, we build a social network where every node is a user and each edge represents a friendship between two users on the platform. The resulting network has 75,969 nodes connected by 389,639 edges.
3.1.3 The Google+ dataset
Google+ is an interest-based social network owned and operated by Google. Each user in Google+ has a publicly visible account and can create links with other users by adding them to social circles. In this paper we use a social network of US users of the Google+ service, crawled by the authors of Gong et al. (2012). Each user also has attached semantic information about education level, i.e., node labels identifying the schools attended by the user. The network contains 33,381 nodes and 110,142 edges.
1 http://www.last.fm/api.
3.2 Community detection algorithms
Among the many different community detection algo-
rithms proposed so far we identify two archetypal classes:
the algorithms that maximize community density and the
ones that maximize modularity. The former class ensures a
high density of links inside communities, while the latter
class imposes that the density of links inside a community
is higher than the density of links which connect a com-
munity to external nodes. The degree of overlap is another
property that discriminates between community discovery
(henceforth, CD) algorithms. Classical approaches produce
a partition of the network, i.e., an individual can be
involved in at most one community. Overlapping approa-
ches consider instead the multidimensional nature of social
networks allowing the individuals to belong to many dif-
ferent communities.
We use four different algorithms to extract social communities from the Skype network: LOUVAIN, HDEMON, EGO-NETWORK and BFS. These algorithms cover several combinations of overlap and density/modularity optimization.
LOUVAIN (Blondel et al. 2008) is a fast and scalable algorithm based on greedy modularity optimization, performed in two steps. First, the method looks for "small" communities by optimizing modularity locally. Second, it aggregates nodes belonging to the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained, producing a hierarchy of communities. LOUVAIN produces a complete non-overlapping partitioning of the graph. It has been shown that modularity-based approaches suffer from a resolution limit, and therefore LOUVAIN is unable to detect medium-size communities (Fortunato and Barthélemy 2007). This produces communities with high average density, due to the identification of a predominant set of very small communities (usually composed of 2–3 nodes) and a few huge communities. The LOUVAIN algorithm, which is parameter-free, produces a hierarchy of seven levels when applied to the Skype dataset.
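To make the quantity that LOUVAIN optimizes concrete, the following Python sketch computes the modularity of a given non-overlapping partition of an unweighted, undirected graph. It is not the LOUVAIN algorithm itself, only the objective function in its standard form; the function name and the toy graph are illustrative assumptions.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities of (m_c / m) - (d_c / (2 m))^2, where m is the
    number of edges, m_c the edges inside community c and d_c the sum of the
    degrees of its nodes.
    """
    m = len(edges)
    node_to_comm = {v: i for i, comm in enumerate(communities) for v in comm}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        cu, cv = node_to_comm[u], node_to_comm[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(edges, [{0, 1, 2}, {3, 4, 5}]))    # split at the bridge
print(modularity(edges, [{0, 1, 2, 3, 4, 5}]))      # everything in one community
```

Splitting the toy graph at the bridge gives a higher modularity than keeping all nodes in one community, which is the kind of comparison the greedy local step performs.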
HDEMON (Coscia et al. 2014) is based on a recursive aggregation of denser areas extracted from ego-networks. Its definition makes it possible to compute communities with high internal density and tunable overlap. In its first hierarchical level, HDEMON extracts ego-networks and partitions them into denser areas using label propagation. The communities computed at a given hierarchical level are subsequently used as meta-nodes to build a new network in the next hierarchical level, where the edges between meta-nodes are weighted using the Jaccard coefficient of the meta-nodes' contents. This procedure stops when disconnected meta-nodes, identifying the components of the original network, are obtained. The algorithm has two parameters: (1) the minimum community size $l$ and (2) the minimum Jaccard $w$ among meta-nodes required to create an edge that connects them. We apply HDEMON to the Skype dataset fixing $l = 3$ (the minimum community is a triangle) and using two different values of the $w$ parameter: $w = 0.25$, which produces the HDEMON25 community set, and $w = 0.5$, which produces the HDEMON50 community set. For each community set we consider only the first 5 levels of the produced community hierarchy (footnote 2). For LastFM and Google+ we only use the first hierarchical level produced by HDEMON, because of the reduced sizes of the datasets.
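The construction of the meta-node network from one hierarchical level to the next can be sketched as follows. This is only an illustration of the Jaccard-weighted edge rule described above, not the HDEMON implementation; the function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two non-empty node sets."""
    return len(a & b) / len(a | b)


def meta_edges(communities, w):
    """Weighted edges between meta-nodes whose Jaccard is at least w."""
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(communities), 2):
        weight = jaccard(a, b)
        if weight >= w:
            edges[(i, j)] = weight
    return edges


communities = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6, 7}]
print(meta_edges(communities, w=0.25))  # only communities 0 and 1 are linked
print(meta_edges(communities, w=0.5))   # a stricter threshold links nothing
```

A higher threshold leaves more meta-nodes disconnected, which is consistent with the hierarchy stopping earlier.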
EGO-NETWORK is a naive algorithm that models the communities as the set of induced subgraphs obtained by considering each node together with its neighbors. This approach provides the highest overlap among the four considered approaches: each node $u$ belongs to exactly $|C(u)| + 1$ communities, where $C(u)$ identifies its neighbor set. We apply a node sampling strategy and consider only a fraction of the ego-networks for the analysis. We set the sampling ratio to 0.2, i.e., we randomly extract a number of users equal to 20 % of the population. We choose 0.20 because it produces a community overlap similar to the one produced by HDEMON. For each random user we extract the corresponding ego-network, keeping only unique ones (two users can have equal ego-networks if they share all their contacts).
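A minimal Python sketch of this sampling procedure is given below. It assumes the graph is stored as a dictionary mapping each node to its set of neighbors and represents each ego-network by its node set; the function names, the `seed` argument and the toy graph are illustrative.

```python
import random


def ego_network(adj, u):
    """Node set of the ego-network of u: u together with its neighbors."""
    return frozenset(adj[u] | {u})


def sample_ego_networks(adj, ratio, seed=0):
    """Sample a fraction `ratio` of the nodes and keep their unique ego-networks."""
    rng = random.Random(seed)
    k = int(ratio * len(adj))
    roots = rng.sample(sorted(adj), k)
    # A set of frozensets drops duplicated ego-networks automatically.
    return {ego_network(adj, u) for u in roots}


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sample_ego_networks(adj, ratio=1.0))
```

In the toy graph, nodes 0 and 1 have identical ego-networks, so sampling all four nodes yields three unique communities.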
The BFS algorithm extracts random connected components from the graph. It randomly samples a fraction of the nodes of the network and, for each of them, draws a number csize from a power-law distribution of community sizes. Similarly to EGO-NETWORK, we choose a sampling ratio of 0.20. As parameters for the power-law distribution of community sizes we choose the exponent $b = 1.8$ and the cutoff $s = 10{,}000$, which are the values we observe for HDEMON25 on the Skype dataset (footnote 3). Starting from a root node, the algorithm explores other nodes performing a breadth-first search and stops when csize nodes are discovered.
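The BFS baseline can be sketched in Python as follows. The graph is assumed to be a dictionary of neighbor sets; the lower bound `min_size` of the size distribution is not stated in the text and is a hypothetical parameter, as are the function names and the `seed` argument.

```python
import random
from collections import deque


def power_law_size(rng, exponent, cutoff, min_size=3):
    """Draw a size k in [min_size, cutoff] with probability proportional to k^-exponent."""
    sizes = range(min_size, cutoff + 1)
    weights = [k ** -exponent for k in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]


def bfs_community(adj, root, csize):
    """Breadth-first search from root, stopping once csize nodes are discovered."""
    visited = {root}
    queue = deque([root])
    while queue and len(visited) < csize:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in visited:
                visited.add(v)
                if len(visited) == csize:
                    break
                queue.append(v)
    return visited


def sample_bfs_communities(adj, ratio, exponent=1.8, cutoff=10_000, seed=0):
    """Sample a fraction of root nodes and grow one BFS community per root."""
    rng = random.Random(seed)
    roots = rng.sample(sorted(adj), int(ratio * len(adj)))
    return [bfs_community(adj, r, power_law_size(rng, exponent, cutoff)) for r in roots]


# Path graph 0-1-2-3-4-5.
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 5} for i in range(6)}
print(bfs_community(adj, 2, 3))
print(sample_bfs_communities(adj, ratio=0.5, cutoff=5))
```

If the connected component of the root has fewer than csize nodes, the search simply returns the whole component.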
Both HDEMON and LOUVAIN generate different community sets at different granularities, according to the parameters. For the Skype network, due to its size, we choose to analyze the two levels of the HDEMON hierarchy having the highest average community density and the community sets at levels 0 and 6 for LOUVAIN, which correspond, respectively, to the first greedy iteration and to the iteration having the maximum modularity. Conversely, for the analysis of LastFM and Google+ we consider only the first hierarchical level produced by HDEMON (which we will refer to as DEMON) and the last level of LOUVAIN, which guarantees the maximum modularity. Also, on LastFM and Google+ we do not apply BFS due to their reduced size.

2 We report the results of HDEMON for $w = 0.25$ and $w = 0.50$ only. For $w < 0.25$ (i.e., low Jaccard in merge) there is an increase in network density which produces a small number of huge communities, similarly to LOUVAIN. For $w > 0.50$ (i.e., high Jaccard in merge) we obtain an incomplete node coverage, i.e., most of the nodes in the network are not assigned to a community.
3 We observe similar values of $b$ and $s$ on HDEMON25 and HDEMON50 on the Skype, LastFM and Google+ datasets.
3.3 Community feature extraction
From the community sets produced by the four algorithms we extract a set of structural features (see Table 1), which convey information about the topology of a social community $C = (V_C, E_C)$, where $V_C$ and $E_C$ are the sets of nodes and edges in the community, respectively. The number of nodes $N$ and edges $M$ provides information about the community size. The community density $D = \frac{2M}{N(N-1)}$, i.e., the ratio between the actual links and all the possible links, indicates the level of interaction within the social group. The clustering coefficient (Watts and Strogatz 1998) indicates how strong the presence of triangles within the community is, measuring an "all-my-friends-know-each-other" property. The degree assortativity $A_{deg}$ indicates the preference of nodes to attach to others that have the same degree (Newman 2003). Other structural features concern the level of hubbiness of a community, such as the average/maximum degree computed considering either all the network links or the community links only. The diameter $d = \max_{v \in V} \mathrm{ecc}(v)$ and the radius $r = \min_{v \in V} \mathrm{ecc}(v)$ are, respectively, the maximum and the minimum eccentricity of any node, where the eccentricity $\mathrm{ecc}(v)$ is the greatest geodesic distance between a node $v$ and any other node in the community. They represent the linear size of a community. Finally, other structural features are considered, such as the number of neighborhood nodes (nodes in the global network connected to nodes in the community), the number of edges leaving the community, the number of triangles and the number of connected triples.
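As an illustration of the density, diameter and radius definitions, the sketch below computes them exactly for a small connected community stored as a dictionary of neighbor sets. Table 1 lists approximate diameter and radius, so the exact all-pairs BFS used here is a simplification; the function names are illustrative.

```python
from collections import deque


def density(n_nodes, n_edges):
    """D = 2M / (N (N - 1)): actual links over all possible links."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def eccentricity(adj, v):
    """Greatest geodesic distance from v to any other node (connected graph assumed)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter_and_radius(adj):
    """Maximum and minimum eccentricity over all nodes of the community."""
    ecc = [eccentricity(adj, v) for v in adj]
    return max(ecc), min(ecc)


# Path community 0-1-2-3: N = 4 nodes, M = 3 edges.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(density(4, 3))             # 0.5
print(diameter_and_radius(adj))  # (3, 2)
```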
Moreover, for the Skype dataset we introduce two additional feature sets: community formation features and geographical features (see Table 2). The community formation features convey information regarding the temporal appearance of nodes within the community, such as the time of subscription to Skype of the first user to subscribe; the average and the standard deviation of the inter-arrival times of users; and the inter-arrival time between the first node to subscribe and the last node to adopt Skype. Geographical features provide information about the geographical diversity of a community or, in other words, its cosmopolitan nature. The number of different countries represented gives a first estimate of the international nature of the community. The country entropy estimates the national diversity through the Shannon entropy: $E = -\sum_{c \in C} p(c) \log p(c)$, where $C$ is the set of countries represented in the community and $p(c)$ is the probability of the country $c$ being represented in the community. We also compute the city entropy and the number of different cities represented in the community. Moreover, for the users for whom we know the city name (those associated with cities with more than 5000 Skype users), we compute their geographical distance using the coordinates of the city centers. Once all the available distances are computed, we consider the average and the maximum geographical distance of each community.
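The country (or city) entropy can be computed with a few lines of Python. The sketch assumes each community is given as the list of its members' country codes and uses the natural logarithm; the base of the logarithm is not specified in the text, so this choice is an assumption.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy E = -sum_c p(c) log p(c) of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


print(label_entropy(["EE", "EE", "IT", "IT"]))  # two equally represented countries
print(label_entropy(["EE", "EE", "EE", "EE"]))  # a single country: no diversity
```

A community whose members all come from the same country has zero entropy, while the value grows as members are spread more evenly over more countries.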
Finally, for each network we define the target features we want to predict using the topological (and formation/geographical) features. For Skype, the target features indicate the mean level of Skype activity performed by the community members. For this dataset we extract two target features: (1) chat, the mean number of days on which they used instant messaging (chat), and (2) video, the mean number of days on which they used video conferencing. Conversely, the LastFM target feature indicates the mean level of user listening activity (i.e., the average number of listenings among the users of each community), while in Google+ it captures the homogeneity of the users w.r.t. the education level (computed through node label entropy).
Table 1 Description of the structural features extracted from the communities

Structural features
Symbol        Description                      Symbol        Description
N             Number of nodes                  M             Number of edges
D             Density                          CC            Global clustering
CC_avg        Average clustering               A_deg         Degree assortativity
deg^C_max     Max degree (community links)     deg^C_avg     Avg degree (community links)
deg^all_max   Max degree (all links)           deg^all_avg   Avg degree (all links)
T             Closed triads                    T_open        Open triads
O_v           Neighborhood nodes               O_e           Outgoing edges
E_dist        Num. edges with distance         d             Approx. diameter
r             Approx. radius                   g             Conductance
4 Analytical results
In this section we construct the classification models to estimate the degree of homophily from the community features. In Sect. 4.1 we start with the Skype contact graph, describing a specific instantiation of the analyzed problem, namely Social Engagement. In this scenario we are interested in using topological, geographical and temporal network features to estimate the average engagement of each community with two Skype products, video and chat. In Sect. 4.2 we analyze the LastFM graph and shift our attention to a different formulation of our original problem: Service Engagement. Here, we want to estimate the average community level of music listening, i.e., how much users in the same community use the LastFM scrobbler on average (estimated by the gross number of their listenings). Finally, in Sect. 4.3 we address the problem of estimating the degree of homophily w.r.t. the education level of Google+ users within communities.
4.1 Skype: user engagement
We use the topological, geographical and temporal fea-
tures described above to classify the level of engagement
of social communities with respect to the chat and video
activity features. To this purpose, we build a supervised
classifier that assigns communities to two possible cate-
gories: high level of engagement or low level of
engagement. We address two different scenarios: (1) a
balanced class scenario where the two classes have the
same percentage of population and (2) an unbalanced
class scenario, where we consider an uneven population
distribution.
4.1.1 Balanced scenario
We consider two classes of user engagement for each of the two activity features (chat and video): low engagement and high engagement. To transform the two continuous activity features into discrete variables we split the range of values at the median of their distribution. This produces, for each variable to predict, two equally populated classes: (1) low engagement, ranging in the interval [0, median], and (2) high engagement, ranging in the interval [median, 31] (footnote 4). To perform classification we use stochastic gradient descent (SGD), and we use the area under the ROC curve (AUC) to evaluate its performance. The ROC curve illustrates the performance of a binary classifier and is created by plotting the true positive rate (tpr, also called sensitivity) versus the false positive rate (fpr, also called fallout or 1-specificity) at various threshold settings. The overall accuracy is instead the proportion of true results (both true positives and true negatives) in the population. Moreover, in a preliminary testing phase the classification step was also repeated using a random forest model built upon C4.5: given the similar performance observed, the more intuitive interpretation of the obtained results and the lower execution time, we decided to show only the results obtained by SGD.
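The median split and the AUC can be illustrated with a short Python sketch. Assigning values equal to the median to the low class is an assumption (the two intervals in the text share the median as an endpoint), and the AUC is computed through its equivalent pairwise-ranking formulation rather than by tracing the ROC curve; the function names and toy values are illustrative.

```python
from statistics import median


def binarize_by_median(values):
    """Label each value 1 (high engagement) if above the median, else 0 (low)."""
    m = median(values)
    return [1 if v > m else 0 for v in values]


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Mean number of active days per month for four hypothetical communities.
activity = [2, 5, 9, 14]
labels = binarize_by_median(activity)      # [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]              # hypothetical classifier scores
print(labels, auc(labels, scores))         # three of four pairs ranked correctly
```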
We train the SGD classifier with a logistic error function (Tsuruoka et al. 2009; Zhang 2004), exploiting the implementation provided by the sklearn Python library (footnote 5). We execute 5 iterations, shuffling the data before each of them, and impose the elastic-net penalty with $\alpha = 0.0001$ and l1-ratio = 0.05. The adoption of the elastic-net penalty results in some feature weights being set to zero, thus eliminating less important features.
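The training procedure can be sketched from scratch in Python as follows. This is not the sklearn implementation used in the paper: the constant learning rate `lr` is an assumption, and the L1 part of the penalty is applied as a plain subgradient step, which shrinks weights but, unlike truncated-gradient schemes, does not generally set them exactly to zero. The function names and the toy data are illustrative.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic_elasticnet(X, y, alpha=0.0001, l1_ratio=0.05, epochs=5, lr=0.1, seed=0):
    """SGD on the logistic loss with an elastic-net penalty, shuffling before each epoch."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            err = sigmoid(z) - y[i]  # gradient of the logistic loss w.r.t. z
            for j in range(n_features):
                sign = (w[j] > 0) - (w[j] < 0)
                # Elastic net: alpha * (l1_ratio * |w| + (1 - l1_ratio) / 2 * w^2)
                penalty = alpha * (l1_ratio * sign + (1 - l1_ratio) * w[j])
                w[j] -= lr * (err * X[i][j] + penalty)
            b -= lr * err
    return w, b


def predict_proba(w, b, x):
    """Estimated probability of the high-engagement class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# One hypothetical standardized feature; high values mean high engagement.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic_elasticnet(X, y)
print(predict_proba(w, b, [2.0]) > 0.5, predict_proba(w, b, [-2.0]) < 0.5)
```

The default arguments mirror the settings reported above (5 iterations, $\alpha = 0.0001$, l1-ratio = 0.05).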
We apply fivefold cross-validation for learning and testing. Table 3 shows the AUC produced by the SGD method on the features extracted from the community sets produced by the four algorithms (for HDEMON and LOUVAIN only the two best performing community sets are reported). HDEMON produces the best performance, both in terms of AUC and overall accuracy, for both activity features. LOUVAIN, conversely, performs poorly and is outperformed by the more trivial BFS and EGO-NETWORK algorithms. This result suggests that modularity optimization approaches, like LOUVAIN, are not effective when categorizing group-based user engagement because of their resolution limit, which causes the creation of huge communities (Fortunato and Barthélemy 2007). As the level of the LOUVAIN hierarchy increases, and hence the modularity increases, both the AUC and the overall accuracy decrease. In the experiments, indeed, the first LOUVAIN hierarchical level outperforms the last level, even though the latter has the highest modularity. Figure 2 shows the features to which the SGD method assigns a weight higher than 0.2 or lower than -0.2 (i.e., the most discriminative features for the classification process). HDEMON distributes the weights in a less skewed way, while the other algorithms tend to give high importance to a limited subset of the extracted features. Moreover, only a few LOUVAIN features have a weight higher than 0.2 or lower than -0.2 (see Fig. 3d), confirming that a modularity approach produces communities with weak predictive power with respect to user engagement. An interesting phenomenon also emerges: independently of the chosen community discovery approach, the most relevant class of features for the classification process seems to be the topological one (i.e., the sum of the absolute values of the SGD weights for the features belonging to this class is always greater than the same sum for community formation and geographical features combined). In particular, degree, density, community size and clustering-related measures often appear among the most heavily weighted features.

Table 2 Description of the community formation features and geographical features extracted from the communities (only for the Skype dataset)

Community formation features
T_f           First user arrival time
IT_avg        Avg user inter-arrival time
IT_std        Std of user inter-arrival time
IT_l,f        Last–first inter-arrival time

Geographical features
N_s           Number of countries
E_s           Country entropy
S_max         Percentage of most represented country
N_t           Number of cities
E_t           City entropy
dist_avg      Avg geographical distance
dist_max      Max geographical distance

4 The maximum is 31 because it refers to the mean number of days per month in which that activity was performed.
5 http://scikit-learn.org/stable/index.html.
Figure 4 shows the relationships between the average community size, the average community density and the AUC value produced by the SGD method on the community sets which reach the best performances in the balanced scenario. The best performance is obtained for the HDEMON community sets, which constitute a compromise between the micro- and the macro-level of network granularity.
Table 3 Skype: AUC and accuracy (within brackets) produced by the SGD method in the balanced scenario, for video and chat features

| Feature | Algorithm | Lv. | AUC (accuracy) |
|---|---|---|---|
| Video | HDEMON25 | 1 | .74 (.67) |
| Video | HDEMON50 | 0 | .71 (.68) |
| Video | LOUVAIN | 0 | .65 (.60) |
| Video | LOUVAIN | 6 | .63 (.59) |
| Video | EGO-NETS | - | .70 (.64) |
| Video | BFS | - | .67 (.62) |
| Chat | HDEMON25 | 2 | .84 (.77) |
| Chat | HDEMON50 | 1 | .81 (.73) |
| Chat | LOUVAIN | 0 | .69 (.64) |
| Chat | LOUVAIN | 6 | .65 (.60) |
| Chat | EGO-NETS | - | .75 (.75) |
| Chat | BFS | - | .81 (.72) |

In bold the best model
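For readers less familiar with the AUC values reported in Table 3, the following sketch computes the area under the ROC curve through its rank-based interpretation: the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy data are assumptions of this example; in practice a library routine such as the one provided by scikit-learn would normally be used.

```python
def auc(y_true, scores):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative).

    Counts, over all positive/negative pairs, how often the positive item
    is scored higher than the negative one (ties count 0.5).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: three of the four positive/negative pairs are ordered
    # correctly by the scores.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The quadratic pair enumeration keeps the sketch short; it is adequate for small community sets but would be replaced by a sort-based computation on large ones.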
Fig. 2 Skype: weights of the features (absolute value > 0.2) produced by the SGD method for each community set for the chat feature in the balanced scenario. a HDEMON chat. b EGO-NETWORK chat. c BFS chat. d LOUVAIN chat
When the average size of the communities is too low, as for the ego-network level, we lose information about the surroundings of nodes and do not capture the inner homophily hidden in the social context. On the other hand, when communities become too large, as in the case of communities produced by LOUVAIN, we mix together different social contexts, losing definition. Communities expressing a good trade-off between size and density, as in the case of the HDEMON algorithm, effectively reach the best performance in the problem of estimating user engagement.
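The size/density trade-off discussed above relies on two simple per-community quantities. A minimal sketch of how the average community size and the average internal density of a community set could be obtained is given below; it assumes an undirected graph without self-loops whose edges are listed once per pair, and the function names are chosen for this example only.

```python
def community_density(nodes, edges):
    """Internal density of a community: 2m / (n (n - 1)).

    `edges` is an iterable of undirected (u, v) pairs, each listed once;
    only edges with both endpoints inside the community are counted.
    """
    members = set(nodes)
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(
        1 for u, v in edges if u in members and v in members and u != v
    )
    return 2.0 * internal / (n * (n - 1))


def average_size_and_density(communities, edges):
    """Average size and average density over a (possibly overlapping) set."""
    edges = list(edges)
    sizes = [len(set(c)) for c in communities]
    densities = [community_density(c, edges) for c in communities]
    return sum(sizes) / len(sizes), sum(densities) / len(densities)


if __name__ == "__main__":
    toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    # Two small overlapping communities versus a single large one.
    print(average_size_and_density([[1, 2, 3], [3, 4]], toy_edges))
    print(average_size_and_density([[1, 2, 3, 4]], toy_edges))
```

On this toy graph the two small overlapping communities are fully connected, while merging them into a single community lowers the density, which mirrors the micro versus macro granularity contrast described in the text.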
4.1.2 Unbalanced scenario
We also address an unbalanced scenario where we use the 75th percentile as the threshold for the low engagement class, which thus contains 75 % of the observations, and put the remaining 25 % of the observations in the high engagement class. Table 4 describes the results produced by the SGD method in the unbalanced scenario, using the same features and community discovery approaches discussed before. The baseline method for the unbalanced scenario is the majority classifier: it reaches an AUC of 0.75 by assigning each item to the majority class (the low engagement class). We observe that, regardless of the community set used, the SGD method (as well as random forest) is not able to improve significantly on the baseline classifier for video. Conversely, the results obtained for the chat feature by SGD outperform the baseline when we adopt the HDEMON, EGO-NETWORKS and BFS community sets, reaching an AUC of 0.83.
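The construction of the class labels in the balanced (50th percentile) and unbalanced (75th percentile) scenarios can be sketched as follows. The nearest-rank percentile definition and the function names are assumptions of this example; with tied values at the threshold the resulting class proportions may deviate from the nominal ones.

```python
import math


def percentile_threshold(values, q):
    """Nearest-rank q-th percentile (0 < q <= 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def split_by_percentile(values, q):
    """Label each value 1 (high engagement) if above the q-th percentile,
    0 (low engagement) otherwise."""
    threshold = percentile_threshold(values, q)
    return [1 if v > threshold else 0 for v in values]


if __name__ == "__main__":
    # Hypothetical engagement values, one per community.
    engagement = list(range(1, 101))
    balanced = split_by_percentile(engagement, 50)
    unbalanced = split_by_percentile(engagement, 75)
    print(sum(balanced) / len(balanced))      # share of high engagement items
    print(sum(unbalanced) / len(unbalanced))  # share of high engagement items
```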
Fig. 3 Skype: weights of the features produced by the SGD method for each community set for the video feature in the unbalanced scenario. a HDEMON video. b EGO-NETWORK video. c BFS video. d LOUVAIN video

In order to provide additional insights into the models built with the different CD algorithms, we also compute the precision and recall measures with respect to the minority class (see Table 5). Looking at these measures enables us to understand the advantage of using SGD to correctly identify instances of the less predictable class. Moreover, we can observe how choosing the 75th percentile led to a very difficult classification setup: the instances belonging to the minority class often represent outliers, leaving very few examples from which the classifier can learn the model. Here the baseline is the minority classifier, which reaches a precision of 25 % by assigning each community to the minority class (the high engagement one). We observe that the SGD method outperforms the baseline classifier on all the community sets (reaching values in the range [.33, .57]). HDEMON and EGO-NETWORKS are the community sets which lead to the best precision, on the video feature and the chat feature, respectively.
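The precision baseline of 25 % follows directly from the class proportions: a classifier that assigns every item to the minority class is correct exactly as often as the minority class occurs, and it retrieves every minority instance. The short sketch below verifies this on synthetic labels; the function name and the data are illustrative assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall with respect to the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Synthetic 75th percentile split: 25 % high engagement (1), 75 % low (0).
    y_true = [1] * 25 + [0] * 75
    # Minority classifier: every item is assigned to the high engagement class.
    y_minority = [1] * 100
    print(precision_recall(y_true, y_minority))
```

Any model whose precision on the minority class exceeds .25 therefore improves on this baseline, at the price of a recall lower than 1.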
In order to measure the effectiveness of SGD we report the lift chart, which shows the ratio between the results obtained with the built model and the ones obtained by a random classifier. The charts in Fig. 5 are visual aids for measuring SGD's performance on the community sets: the greater the area between the lift curve and the baseline, the better the model. We observe that HDEMON performs better than the competitors for the video features. For the chat features, the community sets produced by the three naive algorithms win against the other two CD algorithms. For all the three activity features, LOUVAIN reaches the worst performance, as in the balanced scenario.
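A lift curve of the kind described above can be sketched as follows: items are ranked by decreasing model score, and for each top fraction of the ranking the share of positives found is divided by the share a random classifier would find (the overall positive rate). This is one common definition of lift and is assumed here, since the exact variant plotted in Fig. 5 is not specified; function name, step count and data are illustrative.

```python
def lift_curve(y_true, scores, steps=10):
    """Return (fraction of items inspected, lift) pairs.

    `y_true` contains 0/1 labels with at least one positive; `scores` are
    the model scores, where higher means more likely positive.
    """
    ranked = [
        label
        for _, label in sorted(
            zip(scores, y_true), key=lambda pair: pair[0], reverse=True
        )
    ]
    n = len(ranked)
    base_rate = sum(ranked) / n  # positive rate of a random selection
    curve = []
    for k in range(1, steps + 1):
        cut = max(1, round(n * k / steps))
        hit_rate = sum(ranked[:cut]) / cut  # positive rate in the top `cut`
        curve.append((cut / n, hit_rate / base_rate))
    return curve


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0, 0, 0, 0]
    model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    for fraction, lift in lift_curve(labels, model_scores, steps=4):
        print(fraction, lift)
```

A random classifier has lift 1 at every fraction, so the area between the curve and the constant line at 1 summarizes how much the model improves on random selection.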
Fig. 4 Skype: AUC versus avg. density and AUC versus avg. size for video and chat in the balanced scenario. a AUC versus density: video. b AUC versus density: chat. c AUC versus size: video. d AUC versus size: chat

As done for the balanced scenario, in Fig. 3 we report the features having weight greater than 0.2 or lower than -0.2. In contrast with the results presented in the previous section, where topological features always show the highest relative importance for the classification process, in this scenario we observe how community formation and geographical features are the ones which ensure greater descriptive power. As previously observed, the minority class identified by a 75th percentile split is mostly composed of particular, rare, community instances. This obviously affects the relative importance of temporal and geographical information: the results suggest that the more a community is active, the more significant are its geographical and temporal bounds. Finally, in Fig. 6 we show the relationships between the average community size, the average community density and the AUC value produced by the SGD method on the community sets which reach the best performances in the unbalanced scenario. We can observe how, in this setting, the algorithms producing communities with small average sizes and high density are the ones that ensure the construction of SGD models reaching higher AUC. In particular, HDEMON in both its instantiations outperforms the other approaches.
4.1.3 Skype community characterization
From our analysis a well-defined trend emerges: among the compared methodologies, in both the balanced and unbalanced scenarios, HDEMON is the best in bounding homophily, producing communities that guarantee useful insights into the product engagement level. For this reason, starting from the communities extracted by this bottom-up overlapping approach, we computed the Pearson correlation of all the defined features against the final class label (high/low engagement). As shown in Fig. 7a, when splitting the video engagement using the 50th percentile we are able to identify as highly active communities the ones having high country entropy $E_s$ as well as high geographical distance among their users $dist_{avg}$ and whose formation is recent (i.e., whose first user has joined the network recently, $T_f$, as well as the last one, $IT_{l,f}$). Moreover, video active communities tend to be composed of users having on average low degree, as shown by $deg^{all}_{avg}$ and $deg^{C}_{max}$. Conversely, looking at Fig. 7b we can notice that communities which exhibit high chat engagement can be described by persistent structures (i.e., social groups for which the inter-arrival time $IT_{l,f}$ from the first to the last user is high), composed of users showing almost the same connectivity (in particular having high degree) and sparse social connections (low clustering coefficient CC, low density D and high radius). Moreover, we calculate the same correlations for the 75th percentile split: in contrast with the new results for the chat engagement (Fig. 7d), which do not differ significantly from the ones discussed for the balanced scenario, in this setting the highly active video communities show new peculiarities. In Fig. 7c we observe how the level of engagement inversely correlates with the community radius (and diameter) and directly correlates with density. This variation describes highly active video communities as a specific and homogeneous subclass of small and dense network structures whose users live in different countries (high geographical entropy $E_s$).
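The correlation analysis above pairs each community feature with the binary class label. A minimal sketch of the Pearson correlation coefficient is given below; when one of the two variables is a 0/1 label, the coefficient coincides with the point-biserial correlation. The function name and the toy feature values are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences.

    Returns 0.0 when one of the sequences is constant (undefined case).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: one value per community.
    high_engagement = [0, 0, 1, 1]            # class label (1 = high)
    country_entropy = [0.2, 0.4, 1.1, 1.5]    # grows with the label
    avg_degree = [9.0, 7.5, 3.0, 2.0]         # shrinks with the label
    print(pearson(country_entropy, high_engagement))
    print(pearson(avg_degree, high_engagement))
```

A positive coefficient marks a feature that tends to be higher in highly active communities, a negative one a feature that tends to be lower, which is how the panels of Fig. 7 are read in the text.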
Table 4 Skype: AUC and accuracy (within brackets) produced by the SGD method in the unbalanced scenario, for the video and chat features

| Feature | Algorithm | Lv. | AUC (accuracy) |
|---|---|---|---|
| Video | HDEMON25 | 1 | .76 (.68) |
| Video | HDEMON50 | 0 | .73 (.65) |
| Video | LOUVAIN | 0 | .64 (.59) |
| Video | LOUVAIN | 6 | .61 (.58) |
| Video | EGO-NETS | - | .71 (.63) |
| Video | BFS | - | .68 (.61) |
| Video | Baseline | - | .75 |
| Chat | HDEMON25 | 2 | .82 (.78) |
| Chat | HDEMON50 | 3 | .80 (.76) |
| Chat | LOUVAIN | 0 | .68 (.70) |
| Chat | LOUVAIN | 6 | .67 (.66) |
| Chat | EGO-NETS | - | .83 (.79) |
| Chat | BFS | - | .82 (.77) |
| Chat | Baseline | - | .75 |

In bold the best model. The baseline method is the majority classifier, which reaches an AUC of 0.75 by assigning each item to the majority class (the low engagement class)

Table 5 Skype: precision and recall (within brackets) produced by the SGD model for the video and chat features in the unbalanced scenario

| Feature | Algorithm | Lv. | Precision (recall) |
|---|---|---|---|
| Video | HDEMON25 | 2 | .42 (.72) |
| Video | HDEMON50 | 1 | .39 (.70) |
| Video | LOUVAIN | 0 | .33 (.69) |
| Video | LOUVAIN | 6 | .33 (.67) |
| Video | EGO-NETS | - | .37 (.68) |
| Video | BFS | - | .35 (.71) |
| Video | Baseline | - | .25 |
| Chat | HDEMON25 | 2 | .54 (.69) |
| Chat | HDEMON50 | 3 | .50 (.67) |
| Chat | LOUVAIN | 0 | .40 (.41) |
| Chat | LOUVAIN | 6 | .44 (.33) |
| Chat | EGO-NETS | - | .57 (.68) |
| Chat | BFS | - | .52 (.71) |
| Chat | Baseline | - | .25 |

In bold the best model. Having used the 75th percentile to discriminate the class labels, the precision baseline w.r.t. the positive class is .25

Fig. 5 Skype: lift plot for video and chat in the unbalanced scenario. a Video. b Chat

Fig. 6 Skype: AUC versus avg. density and AUC versus avg. size for video and chat in the unbalanced scenario. a AUC versus density: video. b AUC versus density: chat. c AUC versus size: video. d AUC versus size: chat

4.2 LastFM: service engagement
For the LastFM scenario we want to understand whether the topological features of the social network can explain whether a community is predictive of the engagement with the service, measured by the total number of listenings of the users in the community. To do that we transform the problem into a binary classification task by assigning each community to one of two classes: low volume of listenings or high volume of listenings. As for the Skype network, we address two different scenarios: (1) a balanced class scenario where the two classes have the same percentage of population (50th percentile split) and (2) an unbalanced class scenario (75th percentile split) where we consider an uneven class distribution.
4.2.1 Balanced scenario
The results reported in Table 6 highlight how, in contrast with Skype, LOUVAIN produces the best performance in predicting the volume of listenings (both in AUC and accuracy). This trend is also evident from Fig. 8: LOUVAIN shows lower average density and lower average size than the other algorithms, albeit obtaining the highest AUC. The EGO-NETS approach produces the worst performance, highlighting how, in a balanced scenario, the community-based approach improves the prediction of the engagement.
Fig. 7 Skype: most relevant Pearson correlations between community feature values and target class (high/low activity) for HDEMON. Panels a, b show the indexes for the balanced class scenario, while c, d show those for the 75th percentile split

Table 6 LastFM: AUC and accuracy (within brackets) produced by the best classifier in the balanced scenario, for the average total listenings feature

| Algorithm | AUC (accuracy) | Classifier |
|---|---|---|
| DEMON | .59 (.63) | Logistic regression |
| LOUVAIN | .71 (.72) | Decision tree |
| EGO-NETS | .55 (.57) | Logistic regression |

In bold the best model

4.2.2 Unbalanced scenario
In the unbalanced scenario the low volume of listenings class covers 75 % of the dataset. Tables 7 and 8 show two main results. On the one hand, HDEMON produces the best performance reaching an AUC = .78 (Table 7), a considerable improvement with respect to the baseline classifier (.25). Figure 9 shows that HDEMON communities are the ones whose topological attributes better discriminate between the high volume and low volume listenings classes. On the other hand, the EGO-NETS algorithm produces the best precision on the minority class (Table 8). In any case all the algorithms outperform the baseline precision on the minority class (0.25), even though they show a rather low recall (while the baseline by definition has recall = 1).
4.3 Google+: community homogeneity
In this scenario we investigate the ability of topological features to explain whether a community is composed of users having a homogeneous level of education. As done before, we see the problem as a binary classification task, i.e., each community is assigned to one of two classes: (1) homogeneous or (2) heterogeneous education level. The target feature is built by computing the node label entropy $e_i$ for each community $c_i$: if $e_i \to 0$ community users have the same education level; conversely, if $e_i \to 1$ they show heterogeneous education levels. The chosen target feature distributes almost equally on all the partitions made, following a normal distribution. We address two different scenarios: (1) a balanced class scenario where the two classes have the same percentage of population (50th percentile split) and (2) an unbalanced class scenario (75th percentile split), where we consider an uneven class assignment (raising the threshold level for homogeneous communities).
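A node label entropy bounded in [0, 1], as described above, can be obtained by normalizing the Shannon entropy of the education labels observed in a community by its maximum possible value. The sketch below assumes this normalization (division by the logarithm of the number of admissible education levels), since the text only states the two limiting cases; the function name and the label values are hypothetical.

```python
import math
from collections import Counter


def node_label_entropy(labels, n_levels):
    """Normalized Shannon entropy of the node labels of one community.

    `n_levels` is the number of admissible label values (assumed known);
    the result is 0 when all users share one label and 1 when the labels
    are spread uniformly over all `n_levels` values.
    """
    if n_levels < 2 or not labels:
        return 0.0
    n = len(labels)
    counts = Counter(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return abs(entropy) / math.log(n_levels)


if __name__ == "__main__":
    # Hypothetical education levels of the users of two communities.
    print(node_label_entropy(["phd", "phd", "phd", "phd"], n_levels=4))
    print(node_label_entropy(["hs", "bsc", "msc", "phd"], n_levels=4))
```

Thresholding this value at the chosen percentile then yields the homogeneous/heterogeneous class label of each community.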
Fig. 8 LastFM: AUC versus avg. density and AUC versus avg. size in the balanced scenario. a AUC versus density: LastFM. b AUC versus size: LastFM

Table 7 LastFM: AUC and accuracy (within brackets) produced by the best classifier in the unbalanced scenario, for the average total listening feature

| Algorithm | AUC (accuracy) | Classifier |
|---|---|---|
| DEMON | .60 (.78) | Logistic regression |
| LOUVAIN | .55 (.36) | Logistic regression |
| EGO-NETS | .55 (.83) | Random forest |
| Baseline | .25 (.25) | |

In bold the best model. The baseline method is the majority classifier, which reaches an AUC of 0.75 by assigning each item to the majority class (the low engagement class)

Table 8 LastFM: precision and recall (within brackets) produced by the best classifier for the average total listenings feature in the unbalanced scenario

| Algorithm | Precision (recall) | Classifier |
|---|---|---|
| DEMON | .78 (.03) | Logistic regression |
| LOUVAIN | .33 (.30) | Decision tree |
| EGO-NETS | .83 (.004) | Random forest |
| Baseline | .25 (1.0) | |

In bold the best model. Having used the 75th percentile to discriminate the class labels, the precision baseline w.r.t. the positive class is .25

4.3.1 Balanced scenario
As done for LastFM, since the dataset has moderate size we applied an ensemble of classification approaches and report the results obtained by the best performer. The results reported in Table 9 highlight how, contrary to what was observed on Skype, LOUVAIN guarantees the best performance (both in AUC and accuracy). This trend is evident in Fig. 10: LOUVAIN seems to better capture the degree of homophily because, due to the scale problem that affects modularity-based approaches, it outputs huge communities (whose entropy tends to 1) and tiny communities (whose entropy tends to 0).
The reduced quality of prediction obtained by HDEMON and EGO-NETWORK highlights the complexity of the problem: EGO-NETWORKS guarantee smaller and denser communities, but fail in recovering all the positive instances (low recall on the homogeneous class, 0.41); HDEMON reaches a higher recall but, due to the higher average sizes of the identified communities, lacks precision (0.52).
4.3.2 Unbalanced scenario
We applied the same strategy to address a more complex scenario: in this setting the homogeneous level of education is assigned only to communities having node label entropy in the range [0, 0.25]. We are searching for the most homogeneous communities.
Tables 10 and 11 show that the best classification is reached when the HDEMON communities are used. As expected, LOUVAIN performance decreases when focusing on the minority class (which contains small- and medium-sized communities). Table 11 gives a very clear picture of the complexity of the problem itself: all the proposed community discovery algorithms outperform the baseline precision on the minority class; however, their recall is quite low (while the baseline, by definition, has recall = 1). Again, Fig. 11 shows that HDEMON communities are the best in discriminating between homogeneous and heterogeneous users' education levels.

Fig. 9 LastFM: AUC versus avg. density and AUC versus avg. size in the unbalanced scenario. a AUC versus density: LastFM. b AUC versus size: LastFM

Table 9 Google+: AUC and accuracy (within brackets) produced by the best classifier (SGD) applied to the Google+ topological features in the balanced scenario

| Algorithm | AUC (accuracy) | Classifier |
|---|---|---|
| DEMON | .67 (.71) | SGD |
| LOUVAIN | .74 (.84) | SGD |
| EGO-NETS | .61 (.65) | SGD |
| Baseline | .50 (.50) | |

Fig. 10 Google+: AUC versus avg. density and AUC versus avg. size in the balanced scenario. a AUC versus density: Google+. b AUC versus size: Google+
5 Discussion
After the analysis of three different datasets we make some general observations on the obtained results. Two important aspects need to be evaluated while addressing the homophilic network decomposition problem on a given dataset:
• The social network semantics (i.e., which kind of relation is defined by edges? Are the links among nodes a viable proxy for real social connections?);
• The nature of the target features.
In our applicative scenarios we instantiate the general problem on online scenarios having different peculiarities w.r.t. both these aspects. The Skype dataset, our primary playground, can be considered a trustworthy social proxy: each edge represents a connection between two users that know each other. Moreover, the usage information of video and chat, although individual, can be seen as a proxy for the communication among connected users. In such a scenario we can assume that users within a community intrinsically cooperate to reach a certain level of activity w.r.t. a specific product/service. In LastFM, even if we are still analyzing a social structure, the target attribute relates to an average individual activity. While in Skype the usage of chat/video within a community is likely to involve all the users within the community, in LastFM the usage of the platform is defined by individual actions. Finally, on Google+ the target regards personal information, which represents one of the reasons behind the presence of some network connections (e.g., if the users studied together) but is not necessarily the glue that keeps communities together.
Table 10 Google+: AUC and accuracy (within brackets) produced by the best classifier (decision tree) applied to the Google+ topological features in the unbalanced scenario

| Algorithm | AUC (accuracy) | Classifier |
|---|---|---|
| DEMON | .69 (.70) | Decision tree |
| LOUVAIN | .61 (.50) | Decision tree |
| EGO-NETS | .63 (.50) | Decision tree |
| Baseline | .75 | |

In bold the best model

Table 11 Google+: precision and recall (within brackets) produced by the best classifier (decision tree) applied to the Google+ topological features in the unbalanced scenario

| Algorithm | Precision (recall) | Classifier |
|---|---|---|
| DEMON | .70 (.22) | Decision tree |
| LOUVAIN | .50 (.03) | Decision tree |
| EGO-NETS | .50 (.04) | Decision tree |
| Baseline | .25 | |

In bold the best model

Fig. 11 Google+: AUC versus avg. density and AUC versus avg. size in the unbalanced scenario. a AUC versus density: Google+. b AUC versus size: Google+
These differences among the considered scenarios are the main reasons for the different outcomes the proposed approach produces. For example, while in Skype the community algorithms producing small and dense communities (e.g., HDEMON) guarantee the best solutions to our problem, in LastFM and Google+ the modularity-based algorithms tend to outperform the others. These results, which at first sight can appear conflicting, are instead clear evidence that the network semantics and the definition of target features have a great impact on the problem solution. Moreover, as shown in Sect. 3, we exploit geographical features in order to improve the level of homophily across the nodes within communities: we include such information both implicitly (in LastFM and Google+ the selected users all have the same nationality) and, where available, explicitly (as for Skype). In the Skype scenario we observe that geographical proximity entropy information can be used to explain differently each specific target feature to predict: even though in social networks it is easy to observe several homophilic phenomena on top of the same structure, it is possible to identify different partitions able to guarantee high homogeneity w.r.t. specific attributes.
6 Related works
In this work we address the problem of predicting the degree of homophily of communities from their network topology. Homophily (McPherson et al. 2001) is a widely studied property that permeates different social networks: in recent studies, homophily has been leveraged to boost classical graph mining tasks such as link prediction (Elkabani and Khachfeh 2015; Yuan et al. 2014; Rossetti et al. 2015) and community discovery (Zardi et al. 2014), to build recommendation systems (Carullo et al. 2015; Zhao et al. 2014; Wang et al. 2013) and to study diffusion of (mis)information (Bessi et al. 2015).
6.1 Activity prediction and social targeting
In the Skype and LastFM scenarios we define clear examples of how the general issue we defined can be instantiated in very specific contexts. User/product engagement analysis is one of the most valuable fields of research for companies that need to promote their services to targeted audiences: in recent years, many works addressed the issue of predicting users' future activities based on their past social behavior, thanks to the fertile ground provided by social media like Facebook and Twitter. For example, Zhu et al. (2013) conduct experiments on the social media Renren using a social customer relationship management (Social CRM) model, obtaining superior performance when compared with traditional supervised learning methods. Other works focus in particular on the prediction of churn, i.e., the loss of customers. Oentaryo et al. (2012) propose a churn prediction approach based on collective classification (CC), evaluating it using real data provided by the myGamma social networking site. They demonstrate that using CC on structural network features produces better predictions than conventional classification on user profile features. Richter et al. (2010) analyze a large call graph to predict the churn rate of its customers. They define the churn probability of a customer as a function of its local influence with its immediate social circle and the churn probability of the entire social circle as obtained from a predictive model.
A different category of works focuses on online advertisement and market targeting on social networks. Bhatt et al. (2010) address the problem of online advertising by analyzing user behavior and social connectivity on online social networks. Studying the adoption of a paid product by members of the Instant Messenger (IM) network, they first observe that the adoption is more likely if the product has been widely adopted by the individual's friends. They then build predictive models to identify individuals most suited for marketing campaigns, showing that predictive models for direct and social neighborhood marketing outperform several widely accepted marketing heuristics. Domingos and Richardson (2001) propose to evaluate a user's network value in addition to their intrinsic value and its effectiveness in viral marketing, while Hartline et al. (2008) propose a strategy wherein a carefully chosen set of users is influenced with free distribution of the product and the remaining buyers are exploited for revenue maximization. Bagherjeiran and Parekh (2008) present a machine learning approach which combines user behavioral features and social features to estimate the probability that a user clicks on a display ad.
6.2 Community detection in social networks
One challenging problem in network science is the discovery of communities within the structure of complex networks. Two surveys by Fortunato (2010) and Coscia et al. (2012) explore the most popular community detection techniques and try to classify algorithms given the typology of the extracted communities. One of the most adopted definitions of community is based on the modularity concept (Newman and Girvan 2004; Clauset et al. 2004), a quality function of a partition which scores high values for partitions whose internal cluster density is higher than the external density. The seminal algorithm proposed by Girvan and Newman (2002) and Newman and Girvan (2004) iteratively removes links based on the value of their betweenness, i.e., the number of shortest paths that pass through the link. The procedure of link removal ends when the modularity of the resulting partition reaches a maximum. The method introduced by Clauset et al. (2004) is essentially a fast implementation of a previous technique proposed by Newman and Girvan (2004). A fast and efficient greedy algorithm, LOUVAIN, has been successfully applied to the analysis of a huge subset of the WWW (Blondel et al. 2008). Modularity is not the only key concept that has been used for community detection: an alternative approach is the application of information theory techniques, as for example in INFOMAP (Rosvall and Bergstrom 2008). An interesting property for community discovery is the ability to detect overlapping substructures, allowing nodes to be part of more than one community. A wide set of algorithms has been developed around this property, such as CFINDER (Palla et al. 2005) and DEMON (Coscia et al. 2014).
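To make the modularity concept concrete, the sketch below evaluates the standard Newman-Girvan modularity of a non-overlapping partition, written per community as $Q = \sum_c \left[ L_c / m - (d_c / 2m)^2 \right]$, where $m$ is the number of edges, $L_c$ the number of edges inside community $c$ and $d_c$ the sum of the degrees of its nodes. The code assumes an unweighted, undirected graph without self-loops and with each edge listed once; function and variable names are chosen for this example and do not refer to any specific library.

```python
def modularity(edges, partition):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs, each undirected edge listed once;
    `partition` maps every node to the identifier of its community.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # L_c: number of edges inside each community
    degree_sum = {}  # d_c: sum of node degrees in each community
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        # Each edge contributes one unit of degree to both endpoints.
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2.0 * m)) ** 2
        for c, d in degree_sum.items()
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    one_group = {node: "all" for node in range(6)}
    print(modularity(toy_edges, two_groups))
    print(modularity(toy_edges, one_group))
```

On the toy graph, splitting the two triangles scores higher than keeping all nodes in a single community, which is the behavior that modularity-maximizing methods such as LOUVAIN exploit.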
7 Conclusions
In this work we formulated the problem of homophilic network decomposition. After the formulation of the general problem we instantiated it on different scenarios: user/service engagement analysis and attribute homogeneity evaluation. We first produced several community sets from the global Skype network by applying different community detection algorithms on the data. We then extracted from each community topological, geographical and temporal features and learned classification models to predict the level of usage for the video and chat products (Skype), the average level of listening of users (LastFM) and the homogeneity of the education level in a community (Google+). On the Skype network, our results showed that algorithms producing overlapping micro-communities like HDEMON reach the best performances. Conversely, modularity-based approaches like LOUVAIN do not guarantee good performance and are often outperformed by naive algorithms such as EGO-NETS and BFS. Subsequently we applied the same analytical framework to LastFM and Google+. In contrast with the results observed on Skype, in these scenarios LOUVAIN is the best approach in capturing homophilic behavior. These counterintuitive results are due to the different nature of the analyzed services and target features: while the user engagement in Skype is strictly related to the users within a community (and the final aim of the network itself), the service engagement and education level are only averages of individual peculiarities (thus more difficult to relate to community structures).
Our results could be further improved by two properties which are not present in the analyzed datasets: the strength of the ties between the users and the dynamics of user profiles and network links. On one side, tie strength quantifies the degree of interaction between two individuals, making it possible to understand to what extent the level of interactions inside a community is a proxy for users' homogeneity w.r.t. a specific feature. On the other side, temporal information about the appearance/vanishing of links as well as the geographical location of users allows us to investigate how network and community structures change in time, thus avoiding over/underestimation of the real sociality as observed in a static network scenario.
Acknowledgments This research is supported by Microsoft/Skype and ERDF via the Software Technology and Applications Competence Centre (STACC). This work is supported by the European Community's H2020 Program under the scheme "INFRAIA-1-2014-2015: Research Infrastructures," Grant agreement #654024 "SoBigData: Social Mining & Big Data Ecosystem," http://www.sobigdata.eu.
Funding This work was partially funded by the European Community's H2020 Program under the funding scheme "FETPROACT-1-2014: Global Systems Science (GSS)," Grant agreement # 641191 CIMPLEX "Bringing CItizens, Models and Data together in Participatory, Interactive SociaL EXploratories," https://www.cimplex-project.eu.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
Bagherjeiran A, Parekh R (2008) Combining behavioral and social
network data for online advertising. In: ICDM workshops
Bessi A, Petroni F, Vicario MD, Zollo F, Anagnostopoulos A, Scala
A, Caldarelli G, Quattrociocchi W (2015) Viral misinformation:
the role of homophily and polarization. In: Proceedings of the
24th international conference on world wide web companion,
WWW 2015, Florence, Italy, May 18–22, 2015—companion
volume
Bhatt R, Chaoji V, Parekh R (2010) Predicting product adoption in
large-scale social networks. In: CIKM
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast
unfolding of communities in large networks. J Stat Mech Theory
Exp 2008(10):P10008
Carullo G, Castiglione A, Santis AD, Palmieri F (2015) A triadic
closure and homophily-based recommendation system for online
social networks. World Wide Web 18(6): 1579–1601 (Online).
doi:10.1007/s11280-015-0333-5
Clauset A, Newman MEJ, Moore C (2004) Finding community
structure in very large networks. Phys Rev E
Coscia M, Giannotti F, Pedreschi D (2012) A classification for
community discovery methods in complex networks. In: CoRR
Coscia M, Rossetti G, Giannotti F, Pedreschi D (2014) Uncovering
hierarchical and overlapping communities with a local-first
approach. In: TKDD
Domingos P, Richardson M (2001) Mining the network value of
customers. In: SIGKDD
Elkabani I, Khachfeh RAA (2015) Homophily-based link prediction in
the facebook online social network: a rough sets approach. J Intell
Syst 24(4):491–503 (Online).
http://www.degruyter.com/view/j/jisys.2015.24.issue-4/jisys-2014-0031/jisys-2014-0031.xml
Fortunato S (2010) Community detection in graphs. Phys Rep
486(3):75–174
Fortunato S, Barthélemy M (2007) Resolution limit in community
detection. In: PNAS
Girvan M, Newman MEJ (2002) Community structure in social and
biological networks. In: PNAS
Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D
(2012) Evolution of social-attribute networks: measurements,
modeling, and implications using Google+. CoRR abs/1209.0835
(Online). http://arxiv.org/abs/1209.0835
Hartline JD, Mirrokni VS, Sundararajan M (2008) Optimal marketing
strategies over social networks. In: WWW
Himelboim I, McCreery S, Smith M (2013) Birds of a feather tweet
together: integrating network and content analyses to examine
cross-ideology exposure on Twitter. J Comput Med Commun
18(2):40–60
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather:
homophily in social networks. In: Annual review of sociology
Newman MEJ (2003) Mixing patterns in networks. Phys Rev E
67:026126
Newman MEJ, Girvan M (2004) Finding and evaluating community
structure in networks. Phys Rev E 69(2):026113
Oentaryo RJ, Lim E-P, Lo D, Zhu F, Prasetyo PK (2012) Collective
churn prediction in social network. In: ASONAM
Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the
overlapping community structure of complex networks in nature
and society. Nature 435(7043):814–818
Richter Y, Yom-Tov E, Slonim N (2010) Predicting customer churn
in mobile networks through analysis of social groups. In: SDM
Rossetti G, Guidotti R, Pennacchioli D, Pedreschi D, Giannotti F
(2015) Interaction prediction in dynamic networks exploiting
community discovery. In: International conference on advances
in social network analysis and mining, IEEE, pp 553–558
(Online). http://dl.acm.org/citation.cfm?doid=2808797.2809401
Rosvall M, Bergstrom CT (2008) Maps of random walks on complex
networks reveal community structure. In: PNAS
Tsuruoka Y, Tsujii J, Ananiadou S (2009) Stochastic gradient descent
training for l1-regularized log-linear models with cumulative
penalty. In: ACL/IJCNLP
Wang Y, Zang H, Faloutsos M (2013) Inferring cellular user
demographic information using homophily on call graphs. In:
2013 Proceedings IEEE INFOCOM workshops, Turin, Italy,
14–19 Apr 2013, pp 211–216 (Online).
doi:10.1109/INFCOMW.2013.6562897
Watts D, Strogatz S (1998) Collective dynamics of ‘small-world’
networks. Nature 393:440–442
Yuan G, Murukannaiah PK, Zhang Z, Singh MP (2014) Exploiting
sentiment homophily for link prediction. In: Eighth ACM
conference on recommender systems, RecSys ’14, Foster City,
Silicon Valley, CA, 06–10 Oct 2014, pp 17–24 (Online).
doi:10.1145/2645710.2645734
Zardi H, Romdhane LB, Guessoum Z (2014) A multi-agent
homophily-based approach for community detection in social
networks. In: 26th IEEE international conference on tools with
artificial intelligence, ICTAI 2014, Limassol, Cyprus, 10–12 Nov
2014, pp 501–505 (Online). doi:10.1109/ICTAI.2014.81
Zhang T (2004) Solving large scale linear prediction problems using
stochastic gradient descent algorithms. In: ICML
Zhao T, Hu J, He P, Fan H, Lyu MR, King I (2014) Exploiting
homophily-based implicit social network to improve recommendation
performance. In: 2014 International joint conference on
neural networks, IJCNN 2014, Beijing, China, 6–11 July 2014,
pp 2539–2547 (Online). doi:10.1109/IJCNN.2014.6889743
Zhu Y, Zhong E, Pan SJ, Wang X, Zhou MQY (2013) Predicting user
activity level in social networks. In: CIKM