Community-centric analysis of user engagement in Skype social network

Abstract

Traditional approaches to user engagement analysis focus on individual users. In this paper we address user engagement analysis at the level of groups of users (social communities). From the entire Skype social network we extract communities by means of representative community detection methods, each providing node partitions with their own peculiarities. We then examine user engagement in the extracted communities, highlighting clear relations between the topological and geographic features of communities and their mean user engagement. In particular, we show that user engagement can to a great extent be predicted from such features. Moreover, the analysis clearly shows that the choice of community definition and granularity deeply affects the predictive performance.
Giulio Rossetti∗†, Luca Pappalardo∗†, Riivo Kikas, Dino Pedreschi, Fosca Giannotti and Marlon Dumas
University of Pisa, Italy. Email: {giulio.rossetti,lpappalardo,pedre}@di.unipi.it
ISTI-CNR, Pisa, Italy. Email: {name.surname}@isti.cnr.it
University of Tartu, Estonia. Email: {riivokik, marlon.dumas}@ut.ee
I. INTRODUCTION
As the social media space grows, more and more people
interact and share experiences through a plethora of different
online services, producing a huge amount of personal data
every day. Companies providing social media services
are interested in exploiting these Big Data to understand
“user engagement”, i.e. the way individuals use the products
provided. Traditional approaches to predictive analytics focus
on individuals: they try to describe and predict the level
of engagement of a single individual, with the purpose of
suggesting proper products/services and favoring the diffusion
of the system over a larger population. Focusing on individuals,
however, raises several challenging issues. First, the number
of individuals to process is enormous, and hence hardly
manageable. Second, addressing each single individual is in many
cases redundant, since neighbors in networks tend to behave in
a similar way, showing a certain degree of homophily [11], [10].
Third, it inevitably underestimates the surrounding
social context. It is hence fundamental to widen the analysis
to incorporate the social surroundings of users in order to
capture the homophily which characterizes real social networks.
We propose to move the focus from individuals to groups,
analyzing and describing the engagement of social communities.
Moving the interest from individuals to communities brings
many advantages. First, we reduce by several orders of
magnitude the space of analysis, shrinking the number of objects to
process and speeding up the analytical tasks. Second, targeting
communities allows for capturing the homophily inherent to
the social network: we can “compress” into one object all
the densely connected components of a social group. Finally,
groups are more complex objects from which we can extract a
wide set of features for the analysis. To approach this problem,
we extract social communities from the global Skype network
and compute relevant structural and geographical features for
each of them. We then build a classifier to predict how
heavily the video and instant messaging products provided by
Skype are used within a social community. We find that
group-centric approaches outperform user-centric ones when we
use algorithms producing overlapping micro-communities. In
contrast, modularity-based algorithms perform worse than
classical user-centric strategies. Hence, we show that the
choice of a proper community detection algorithm is crucial
to reach high performance in engagement prediction.
II. RELATED WORKS
a) Activity prediction and social targeting: In recent
years, many works addressed the issue of predicting users’
future activities based on their past social behavior. Zhu et al.
[20] conduct experiments on the social media Renren using a
Social Customer Relationship Management model, obtaining
superior performance when compared with traditional super-
vised learning methods. Other works focus in particular on the
prediction of churn, i.e. the loss of customers. Oentaryo et al.
[14] propose a churn prediction approach based on collective
classification (CC), evaluating it using real data provided
by the myGamma social networking site. They demonstrate
that using CC on structural network features produces better
predictions than conventional classification on user profile
features. Richter et al. [16] analyze a large call graph to
predict the churn rate of its customers. They define the churn
probability of a customer as a function of its local influence
within its immediate social circle, and the churn probability of the
entire social circle as obtained from a predictive model. A
different category of works focuses on online advertisement and
market targeting on social networks. [2] addresses the problem
of online advertising by analyzing user behavior and social
connectivity on online social networks. Studying the adoption
of a paid product by members of the Instant Messenger net-
work, they first observe that the adoption is more likely if the
product has been widely adopted by the individual’s friends.
They then build predictive models to identify individuals most
suited for marketing campaigns, showing that predictive mod-
els for direct and social neighborhood marketing outperform
several widely accepted marketing heuristics. [7] propose to
evaluate a user’s network value in addition to their intrinsic
value and its effectiveness in viral marketing, while [9] propose
a strategy wherein a carefully chosen set of users is influenced
with free distribution of the product and the remaining buyers
are exploited for revenue maximization. Authors of [1] present
a machine learning approach which combines user behavioral
features and social features to estimate the probability that a
user clicks on a display ad.
b) Community detection in social networks: One critical
task of social network analysis involves the identification of
groups and communities within complex social tissues. A
survey [6] explores the most popular community detection
techniques and tries to classify algorithms by the typology
of the extracted communities. One of the most widely adopted
community definitions is based on the concept of modularity [13],
[4], a quality function of a partition which scores high values
for partitions whose internal cluster density is higher than
the external density. A fast and efficient modularity-based
greedy algorithm, LOUVAIN, has been successfully applied
to the analysis of a huge subset of the WWW [3]. However,
modularity is not the only key concept that has been used for
community detection: an alternative approach is the application
of information theory techniques, as for example in INFOMAP
[17]. An interesting property for community discovery is the
ability to detect overlapping sub-structures, allowing nodes to
be part of more than one community. A wide set of algorithms
build on this property, such as CFINDER [15] and
DEMON [5].
III. MODEL CONSTRUCTION
Data: We analyze a dataset of users and connections in
Skype as of October 2011. The dataset includes anonymized
data of Skype users. Each user (identified by hashed ID) is
associated with their account creation date and country and
city of account creation. The dataset also includes connections
between users. Connections are undirected: a connection exists
between two users if and only if they belong to each other’s
contact list. Moreover, each connection is labeled with a
timestamp corresponding to the contact request approval.
In addition to non-identifiable user profile data and network
data, the dataset includes data about usage of two Skype prod-
ucts: video calling and chatting. Product usage is aggregated
monthly. Specifically, for each product, for each user and for
each month, we are given the number of days in the month
when the user used the product. The product usage data do
not provide information about individual interactions between
users, such as participants in an interaction, content, length, or
time of the interaction. We analyze the most recent available
snapshot of the network. Hence, we focus on the subset of
users who used one of the two products, during at least two
of the last three months covered in the dataset. Our analyses
are then carried out on a filtered dataset composed of several
tens of millions of users and connections.
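The filtering rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the data layout (a dictionary mapping a hashed user ID to per-product lists of monthly day counts) and the function name `active_users` are hypothetical, and a month is counted as active when either product was used on at least one day.

```python
from typing import Dict, List

# Hypothetical layout: usage[user][product] is a list of monthly day counts,
# ordered from the oldest to the most recent month in the dataset.
Usage = Dict[str, Dict[str, List[int]]]


def active_users(usage: Usage, window: int = 3, min_months: int = 2) -> List[str]:
    """Return users active in at least `min_months` of the last `window` months."""
    kept = []
    for user, products in usage.items():
        active_months = 0
        for offset in range(1, window + 1):
            # A month is active if any product was used on at least one day.
            if any(len(days) >= offset and days[-offset] > 0
                   for days in products.values()):
                active_months += 1
        if active_months >= min_months:
            kept.append(user)
    return kept


# Invented toy data, for illustration only.
usage = {
    "u1": {"video": [0, 2, 0, 1], "chat": [0, 0, 0, 0]},  # 2 of the last 3 months
    "u2": {"video": [5, 0, 0, 0], "chat": [0, 0, 0, 3]},  # 1 of the last 3 months
    "u3": {"video": [0, 0, 0, 0], "chat": [1, 4, 6, 2]},  # 3 of the last 3 months
}
print(active_users(usage))
```

On the toy data only `u1` and `u3` pass the filter.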
Community Detection: The degree of overlap among communities
is one of the properties that can be used to characterize
community detection algorithms. Classical approaches
produce a crisp partition of the network, i.e. an individual
belongs to at most one community, while overlapping approaches,
which take into account the multidimensional nature of social
networks, allow individuals to belong to many different communities.
To observe the impact of overlap on our analysis we use four
different algorithms to extract social communities from the
Skype network (in increasing degree of overlap): LOUVAIN,
BFS, HDEMON and EGO-NETWORK.
LOUVAIN [3] is a scalable algorithm based on a greedy
modularity approach. It produces a complete non-overlapping
partitioning of the graph. It has been shown that modularity-based
approaches suffer from a resolution limit, and therefore
LOUVAIN is unable to detect medium-size communities [8]. This
produces communities with high average density, due to the
identification of a predominant set of very small communities
(usually composed of 2-3 nodes) and a few huge communities.
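To make the quantity that LOUVAIN greedily optimizes concrete, the sketch below computes the standard Newman-Girvan modularity of a crisp partition, Q = Σ_c [L_c/m − (d_c/2m)²], where m is the number of edges, L_c the number of edges inside community c and d_c the sum of the degrees of its nodes. It only scores a given partition and is not an implementation of LOUVAIN itself; the function name and the toy graph are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               community_of: Dict[Hashable, int]) -> float:
    """Newman-Girvan modularity of a crisp partition of an undirected graph."""
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # sum of node degrees per community
    m = 0
    for u, v in edges:
        m += 1
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    if m == 0:
        return 0.0
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {n: 0 for n in range(6)}
print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Splitting the toy graph into its two triangles scores higher than keeping it as one block, which is the behavior the quality function is designed to reward.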
HDEMON [5] is based on a recursive hierarchical aggregation
of denser areas extracted from ego-networks. Its definition
allows the computation of communities with high internal density
and tunable overlap. In its first hierarchical level, HDEMON
extracts ego-networks and partitions them into
denser areas using Label Propagation. The algorithm has two
parameters: (i) the minimum community size µ; and (ii) the
minimum Jaccard similarity ψ among meta-nodes required to
create an edge that connects them while building the community
hierarchy. We apply HDEMON to the Skype dataset fixing µ = 3 (the
minimum community is a triangle) and using two different
values of the ψ parameter: ψ = 0.25, which produced the
HDEMON25 community set, and ψ = 0.5, which produced
the HDEMON50 community set.
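The role of the two parameters can be illustrated with a small sketch. It is not the HDEMON implementation: it only shows how a size threshold µ and a Jaccard threshold ψ could decide which pairs of communities (meta-nodes) get linked at the next hierarchical level. The function names, the quadratic pairwise scan, the use of "at least ψ" as the linking rule and the toy communities are all assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two node sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def meta_edges(communities: Sequence[FrozenSet[int]],
               psi: float, mu: int = 3) -> List[Tuple[int, int]]:
    """Link pairs of communities of size >= mu whose Jaccard is >= psi.

    Returned indices refer to the list of communities that survive the
    size filter, in their original order.
    """
    kept = [c for c in communities if len(c) >= mu]
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(kept), 2)
            if jaccard(a, b) >= psi]


# Invented toy communities.
communities = [
    frozenset({1, 2, 3, 4}),
    frozenset({3, 4, 5, 6}),
    frozenset({1, 2, 3}),
    frozenset({9, 10}),       # dropped: smaller than mu = 3
]
print(meta_edges(communities, psi=0.25))  # [(0, 1), (0, 2)]
print(meta_edges(communities, psi=0.5))   # [(0, 2)]
```

A lower ψ links more meta-nodes, so more communities are merged at the next level; this is consistent with HDEMON25 yielding fewer and larger communities than HDEMON50 in Table I.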
EGO-NETWORK is a naive algorithm that models the communities
as the set of induced subgraphs obtained by considering
each node together with its neighbors. This approach provides the
highest overlap among the four considered approaches: each
node u belongs to exactly |Γ(u)| + 1 communities, where
Γ(u) denotes its neighbor set. We apply a node sampling
strategy and consider only a fraction of the ego-networks for the
analysis. We set the sampling ratio to 0.2, i.e. we randomly extract
a number of users equal to 20% of the population. For
each sampled user we extract the corresponding ego-network,
keeping only the unique ones.
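A minimal sketch of this sampling procedure is given below, assuming the graph is available as an adjacency dictionary. The function name, the `seed` argument and the toy graph are illustrative assumptions; only the node sets of the ego-networks are returned, not the induced subgraphs.

```python
import random
from typing import Dict, FrozenSet, Set


def sampled_ego_networks(adj: Dict[int, Set[int]],
                         ratio: float = 0.2,
                         seed: int = 0) -> Set[FrozenSet[int]]:
    """Node sets of the ego-networks of a random sample of nodes.

    Duplicated ego-networks (identical node sets) are kept only once.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    sample = rng.sample(nodes, round(ratio * len(nodes)))
    return {frozenset({u} | adj[u]) for u in sample}


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
egos = sampled_ego_networks(adj, ratio=1.0)
print(len(egos))  # 3: nodes 0 and 1 share the same ego-network
```

With `ratio=1.0` every node is used as a root, and node 2 appears in all three distinct ego-networks of the toy graph.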
The BFS algorithm extracts random connected subgraphs
from the graph. It randomly samples a fraction of the nodes
of the network and, for each of them, draws a number csize
from a power law distribution modeling community
sizes. Starting from a root node, the algorithm explores other
nodes by performing a breadth-first search and stops when
csize nodes have been discovered.
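The extraction step can be sketched as a size-bounded breadth-first search. The exponent and minimum size of the power law are not given in the text, so the `alpha` and `min_size` defaults of the hypothetical helper `sample_csize` are placeholders; visiting neighbors in sorted order is also an assumption, made only to keep the example deterministic.

```python
import random
from collections import deque
from typing import Dict, Set


def bfs_community(adj: Dict[int, Set[int]], root: int, csize: int) -> Set[int]:
    """Collect nodes in breadth-first order from `root`, stopping at csize."""
    seen = {root}
    queue = deque([root])
    while queue and len(seen) < csize:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
                if len(seen) == csize:
                    break
    return seen


def sample_csize(rng: random.Random, alpha: float = 2.0, min_size: int = 3) -> int:
    """Draw a community size from a Pareto (power law) distribution.

    alpha and min_size are illustrative values, not taken from the paper.
    """
    return int(min_size * rng.paretovariate(alpha))


# Toy graph: a path 0 - 1 - 2 - 3 - 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(bfs_community(adj, root=0, csize=3))   # {0, 1, 2}
print(bfs_community(adj, root=2, csize=3))   # {1, 2, 3}
print(sample_csize(random.Random(0)) >= 3)   # True
```

If the connected component of the root holds fewer than csize nodes, the search simply returns the whole component.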
Each algorithm, according to the specified parameters, produces
different community sets when applied to the Skype
dataset. In Table I we report, for each community set and
hierarchy level (Lv.) used in the following analysis: (i) the
number of communities (#C); (ii) the induced node coverage
w.r.t. the whole graph; (iii) the average number of communities
per node (σ, i.e. the mean degree of overlap); and (iv) the average
community size (Avg. size). LOUVAIN is a partitioning algorithm
and guarantees the complete coverage of the nodes.
HDEMON covers around 76% of the nodes because, by imposing
the parameter µ = 3, we exclude communities with two
nodes only. BFS and EGO-NETWORK are executed on a 20%
sample of the nodes, on which they cover 90% and 69%
COMMUNITY STATISTICS
| Algorithm | Lv. | #C | Coverage (%) | σ | Avg. size |
|---|---|---|---|---|---|
| HDEMON25 | 2 | 3.3e+07 | 76 | 13.2 | 27.9 |
| HDEMON50 | 2 | 8.2e+07 | 76 | 10.3 | 8.9 |
| LOUVAIN | 0 | 8.7e+06 | 100 | 1.0 | 10.7 |
| LOUVAIN | 6 | 9.8e+05 | 100 | 1.0 | 94.6 |
| EGO-NETS | - | 1.5e+07 | 69¹ | 3.7 | 15.6 |
| BFS | - | 1.8e+07 | 90¹ | 13.3 | 60.8 |
TABLE I: Characteristics of the community sets produced by
the algorithms on the Skype dataset.
STRUCTURAL FEATURES
| Symbol | Description |
|---|---|
| N | number of nodes |
| M | number of edges |
| D | density |
| CC | global clustering |
| CC_avg | average clustering |
| A_deg | degree assortativity |
| deg^C_max | max degree (community links) |
| deg^C_avg | avg degree (community links) |
| deg^all_max | max degree (all links) |
| deg^all_avg | avg degree (all links) |
| T | closed triads |
| T_open | open triads |
| O_v | neighborhood nodes |
| O_e | outgoing edges |
| E_dist | num. edges with distance |
| d | approx. diameter |
| r | approx. radius |
| g | conductance |
COMMUNITY FORMATION FEATURES
| Symbol | Description |
|---|---|
| T_f | first user arrival time |
| IT_avg | avg user inter-arrival time |
| IT_std | std of user inter-arrival time |
| IT_{l,f} | last-first inter-arrival time |
GEOGRAPHIC FEATURES
| Symbol | Description |
|---|---|
| N_s | number of countries |
| E_s | country entropy |
| S_max | percentage of most represented country |
| N_t | number of cities |
| E_t | city entropy |
| dist_avg | avg geographic distance |
| dist_max | max geographic distance |
ACTIVITY FEATURES
| Symbol | Description |
|---|---|
| Video | mean number of days of video |
| Chat | mean number of days of chat |
TABLE II: Description of the features extracted from the
communities.
respectively. For LOUVAIN, we consider hierarchical
levels 0 and 6 only, which correspond to the first greedy
iteration and to the iteration with the maximum modularity.
¹ For EGO-NETS and BFS the coverage is computed starting from a 20%
sample of the total users.
A. Community Features
From the community sets produced by the four algorithms
we extract a wide set of features, belonging to four main
categories: structural, geographical, formation and activity
features (see Table II). Structural features convey information
about the topology of a social community. We analyze
community size and density, clustering coefficient, diameter
and radius, as well as other relevant topological measures.
Moreover, we take into account as a proxy for homophily the
degree assortativity A_deg, which indicates the preference of
nodes to attach to others that have the same degree [12].
Other structural features regard the level of hubbiness of a
community, such as the average/maximum degree computed
considering either all the network links or the community links
only. The community formation features convey information
regarding the temporal appearance of nodes within the community,
such as: the time of subscription to Skype of the first
user to subscribe; the average and the standard deviation of the
inter-arrival times of users; and the inter-arrival time between the
first node to subscribe and the last node to adopt Skype.
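The community formation features can be sketched as follows, assuming each member's account creation time is available as a number (for instance, days since a reference date). The text does not specify how inter-arrival times are computed, so taking the gaps between consecutive arrivals and using the population standard deviation are assumptions; the dictionary keys mirror the symbols of Table II.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def formation_features(join_times: Sequence[float]) -> Dict[str, float]:
    """Formation features of one community from its members' join times."""
    times = sorted(join_times)
    # Gaps between consecutive arrivals (assumed definition of inter-arrival).
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "T_f": times[0],                         # first user arrival time
        "IT_avg": mean(gaps) if gaps else 0.0,   # avg inter-arrival time
        "IT_std": pstdev(gaps) if gaps else 0.0, # std of inter-arrival time
        "IT_lf": times[-1] - times[0],           # last-first inter-arrival time
    }


# Invented join times (e.g. days since a reference date).
features = formation_features([10, 0, 4, 6])
print(features["T_f"], features["IT_lf"], round(features["IT_avg"], 3))
```

For the toy community the first arrival is at time 0, the last-first gap is 10 and the mean gap between consecutive arrivals is 10/3.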
Geographic features provide information about the geographic
diversity of a community. The number of different countries
represented gives a first estimate of the international nature
of the community. The country entropy estimates the national
diversity through the Shannon entropy. We also compute the
city entropy and the number of different cities represented in
the community. Moreover, for the users whose city name
is known (those associated with cities with more than 5,000
Skype users), we compute their geographic distance using
the coordinates of the city centers. Once all the available
distances have been computed, we consider the average and the
maximum geographic distance of each community. Finally,
the activity features indicate the mean level of Skype activity
of the community members. We extract two activity
features: (i) chat, the mean number of days on which they used
instant messaging; and (ii) video, the mean number
of days on which they used video conferencing. The distributions of
the chat feature for HDEMON, BFS and EGO-NETWORKS
are peaked, while those of the chat feature
(for LOUVAIN) and of the video feature (for all algorithms)
are exponential. In all cases, the separation
between high-engagement and low-engagement communities
is less clear for higher thresholds. For the video feature, the
median ranges from 3 to 3.75 (across algorithms) while the
75th percentile ranges from 6 to 7. For the chat feature, the
median ranges from 5 to 5.9, while the 75th percentile ranges
from 13.9 to 15.4.
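The geographic features described above can be sketched as follows. Several details are assumptions, since the text does not fix them: the entropy is computed in base 2, distances are great-circle (haversine) distances on a sphere of radius 6371 km, and the average and maximum are taken over all pairs of members with known coordinates.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def shannon_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of a list of country or city labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def haversine_km(p: Coord, q: Coord) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def distance_features(coords: Sequence[Coord]) -> Tuple[float, float]:
    """(dist_avg, dist_max) over all pairs of known city centers."""
    d = [haversine_km(p, q) for p, q in combinations(coords, 2)]
    return (sum(d) / len(d), max(d)) if d else (0.0, 0.0)


# Invented example: two members in one country, two in another.
print(shannon_entropy(["IT", "IT", "EE", "EE"]))        # 1.0
print(round(haversine_km((0.0, 0.0), (0.0, 180.0))))    # half the equator, in km
```

A community whose members all share one country has entropy 0, while an even split between two countries gives 1 bit.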
IV. MODEL EVALUATION
We use the features described above to classify the level of
engagement of social communities with respect to the chat and
video activity features. To this purpose, we build a supervised
classifier that assigns communities to two possible categories:
high level of engagement or low level of engagement. We
address two different scenarios: (i) a balanced class scenario
where the two classes have the same percentage of population;
and (ii) an unbalanced class scenario, where we consider an
uneven population distribution.
Balanced scenario: In order to transform the video and chat
activity features into discrete variables we split their range of
values at the median of their distribution. This produces,
for each variable to predict, two equally populated classes: (i)
low engagement, ranging in the interval [0, median]; and
(ii) high engagement, ranging in the interval (median, 31].²
² The maximum is 31 because the feature is the mean number of days per
month in which the activity was performed.
To perform classification we use a Stochastic Gradient Descent
(SGD) classifier, and we evaluate its performance through the AUC
(area under the ROC curve) and the overall accuracy, i.e. the
proportion of true results (both true positives and true negatives)
in the population. We train the SGD classifier with a logistic loss
function [18], [19]. We execute 5 iterations, shuffling the data
before each one of them, and impose an elastic-net
penalty with α = 0.0001 and l1-ratio = 0.05. The adoption of the
elastic-net penalty results in some feature weights being set to zero,
thus eliminating less important features. We apply five-fold
cross-validation for learning and testing.
VIDEO: AUC AND ACCURACY
| Algorithm | Lv. | AUC (Accuracy) |
|---|---|---|
| HDEMON25 | 1 | .74 (.67) |
| HDEMON50 | 0 | .71 (.68) |
| LOUVAIN | 0 | .65 (.60) |
| LOUVAIN | 6 | .63 (.59) |
| EGO-NETS | - | .70 (.64) |
| BFS | - | .67 (.62) |
CHAT: AUC AND ACCURACY
| Algorithm | Lv. | AUC (Accuracy) |
|---|---|---|
| HDEMON25 | 2 | .84 (.77) |
| HDEMON50 | 1 | .81 (.73) |
| LOUVAIN | 0 | .69 (.64) |
| LOUVAIN | 6 | .65 (.60) |
| EGO-NETS | - | .75 (.75) |
| BFS | - | .81 (.72) |
TABLE III: AUC and Accuracy (within brackets) in the
balanced scenario, for Video and Chat.
Table III shows the AUC and accuracy produced by the SGD
method on the features extracted
from the community sets produced by the four algorithms
(for HDEMON and LOUVAIN only the two best performing
community sets are reported). HDEMON produces the best
performance, both in terms of AUC and overall accuracy, for
both activity features. LOUVAIN, conversely, performs
poorly and is outperformed by BFS and
EGO-NETWORKS. This result suggests that
modularity optimization approaches, like LOUVAIN, are not
effective when categorizing group-based user engagement, due
to their resolution limit which causes the creation of huge
communities [8]. As the level of the LOUVAIN hierarchy
increases, and hence the modularity increases, both the AUC and
overall accuracy decrease. In the experiments, indeed, the first
LOUVAIN hierarchical level outperforms the last level, even
though the latter has the highest modularity. Figure 1 shows
the features whose SGD weight is
higher than 0.2 or lower than -0.2 (i.e. the most discriminative
features for the classification process). HDEMON distributes
the weights in a less skewed way, while the other algorithms
give high importance to a limited subset of the extracted
features. Moreover, only a few LOUVAIN features have a
weight higher than 0.2 or lower than -0.2 (see Figure 1, d),
confirming that a modularity approach produces communities
with weak predictive power with respect to user engagement.
Fig. 1: Weights of the features produced by the SGD method for the HDEMON and LOUVAIN community sets, for the Chat feature in
the balanced scenario (a-b) and the Video feature in the unbalanced scenario (c-d). Panels: (a) HDEMON Chat; (b) LOUVAIN Chat; (c) HDEMON Video; (d) LOUVAIN Video.
Fig. 2: AUC vs. Avg. Density and AUC vs. Avg. Size for the Video feature: balanced scenario (a-b), unbalanced scenario (c-d). Panels: (a) AUC vs. Density; (b) AUC vs. Size; (c) AUC vs. Density; (d) AUC vs. Size.
Moreover, an interesting phenomenon emerges: independently
of the chosen community discovery approach, the most
relevant class of features for the classification process is the
topological class. In particular, degree, density, community
size and clustering-related measures often appear among the
most heavily weighted features. Figure 2(a-b) shows the relationships
between the average community size, the average community
density and the AUC value produced by the SGD method
on the community sets which reach the best performance
in the balanced scenario for the Video feature (Chat behaves
similarly). The best performance is obtained for the HDEMON
community sets, which constitute a compromise between the
micro and the macro level of network granularity. When the
average size of the communities is too low, as at the ego-network
level, we lose information about the surroundings of
nodes and do not capture the inner homophily hidden in the
social context. On the other hand, when communities become
too large, as in the case of LOUVAIN, we mix together
different social contexts, losing definition. Communities
expressing a good trade-off between size and density, as in the
case of the HDEMON algorithm, reach the best performance
in the problem of estimating user engagement.
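The balanced-scenario setup described above (median split, SGD on a logistic loss with an elastic-net penalty, data shuffling before each of the 5 passes, evaluation through AUC) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the constant learning rate `eta`, the `seed`, the function names and the toy data are hypothetical, five-fold cross-validation is omitted, and the L1 part is handled with a plain subgradient, which shrinks weights but, unlike the cumulative-penalty scheme of [18], does not set them exactly to zero.

```python
import math
import random
from statistics import median


def median_labels(values):
    """Label 1 (high engagement) above the median, 0 (low) otherwise."""
    m = median(values)
    return [1 if v > m else 0 for v in values]


def train_sgd(X, y, alpha=0.0001, l1_ratio=0.05, eta=0.1, epochs=5, seed=0):
    """SGD on the logistic loss with an elastic-net penalty.

    eta (constant learning rate) and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # shuffle the data before every pass
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            z = max(-30.0, min(30.0, z))           # avoid overflow in exp
            g = 1.0 / (1.0 + math.exp(-z)) - y[i]  # d(loss)/dz
            for j, xj in enumerate(X[i]):
                sign = (w[j] > 0) - (w[j] < 0)
                penalty = alpha * ((1 - l1_ratio) * w[j] + l1_ratio * sign)
                w[j] -= eta * (g * xj + penalty)
            b -= eta * g
    return w, b


def decision_scores(w, b, X):
    """Linear decision values; higher means more likely high engagement."""
    return [b + sum(wj * xj for wj, xj in zip(w, x)) for x in X]


def auc(scores, y):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: one standardized feature per community.
chat_days = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
X = [[-1.0], [-0.8], [-0.6], [0.6], [0.8], [1.0]]
y = median_labels(chat_days)
w, b = train_sgd(X, y)
print(y, auc(decision_scores(w, b, X), y))
```

On this separable toy example the learned weight is positive, so the scores rank every high-engagement community above every low-engagement one.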
Unbalanced scenario: We also address an unbalanced scenario
where we use the 75th percentile to discriminate the low
engagement class, which thus contains 75% of the observations.
Table IV describes the results produced by the SGD
method in the unbalanced scenario, using the same features
and community discovery approaches discussed before. The
baseline method for the unbalanced scenario is the majority
classifier: it reaches an AUC of 0.75 by assigning each item
to the majority class (the low engagement class). We observe
that, regardless of the community set used, the SGD method is
not able to improve significantly on the baseline classifier for
Video. Conversely, the results obtained for the Chat feature
by SGD outperform the baseline when we adopt the HDEMON,
EGO-NETWORKS and BFS community sets, reaching an AUC
of 0.83.
VIDEO: AUC AND ACCURACY
| Algorithm | Lv. | AUC (Accuracy) |
|---|---|---|
| HDEMON25 | 1 | .76 (.68) |
| HDEMON50 | 0 | .73 (.65) |
| LOUVAIN | 0 | .64 (.59) |
| LOUVAIN | 6 | .61 (.58) |
| EGO-NETS | - | .71 (.63) |
| BFS | - | .68 (.61) |
| baseline | - | .75 |
CHAT: AUC AND ACCURACY
| Algorithm | Lv. | AUC (Accuracy) |
|---|---|---|
| HDEMON25 | 2 | .82 (.78) |
| HDEMON50 | 3 | .80 (.76) |
| LOUVAIN | 0 | .68 (.70) |
| LOUVAIN | 6 | .67 (.66) |
| EGO-NETS | - | .83 (.79) |
| BFS | - | .82 (.77) |
| baseline | - | .75 |
TABLE IV: AUC and Accuracy (within brackets) produced by
the SGD method in the unbalanced scenario, for the Video and
Chat features.
In order to provide additional insights into the models
built with the different CD algorithms, we
compute the precision and recall measures with respect to the
minority class (see Table V). Looking at these measures enables
us to understand the advantage of using SGD to
correctly identify instances of the less predictable class. In
this more challenging setting, the baseline is the minority
classifier, which reaches a precision of 25% by assigning each
community to the minority class (the high engagement
one). We observe that the SGD method outperforms the
baseline classifier on all the community sets (reaching values in
the range [.33, .57]). HDEMON and EGO-NETWORKS are the
community sets which lead to the best precision, on the Video
feature and the Chat feature respectively.
VIDEO: PRECISION - RECALL
| Algorithm | Lv. | Precision (Recall) |
|---|---|---|
| HDEMON25 | 2 | .42 (.72) |
| HDEMON50 | 1 | .39 (.70) |
| LOUVAIN | 0 | .33 (.69) |
| LOUVAIN | 6 | .33 (.67) |
| EGO-NETS | - | .37 (.68) |
| BFS | - | .35 (.71) |
| baseline | - | .25 |
CHAT: PRECISION - RECALL
| Algorithm | Lv. | Precision (Recall) |
|---|---|---|
| HDEMON25 | 2 | .54 (.69) |
| HDEMON50 | 3 | .50 (.67) |
| LOUVAIN | 0 | .40 (.41) |
| LOUVAIN | 6 | .44 (.33) |
| EGO-NETS | - | .57 (.68) |
| BFS | - | .52 (.71) |
| baseline | - | .25 |
TABLE V: Precision and Recall (within brackets) produced
by the SGD model for the Video and Chat features in the
unbalanced scenario.
In order to measure
the effectiveness of SGD we report the Lift chart, which shows
the ratio between the results obtained with the built model
and the ones obtained by a random classifier. The charts in
Figure 3 are visual aids for measuring SGD’s performance
on the community sets: the greater the area between the lift
curve and the baseline, the better the model. We observe that
HDEMON performs better than the competitors for the video
feature. For the chat feature, the community sets produced
by the three naive algorithms win against the other two CD
algorithms. For both activity features LOUVAIN reaches
the worst performance, as in the balanced scenario. As done for
the balanced scenario, in Figure 1(c-d) we report for each CD
algorithm the features having weight greater than 0.2 or lower than -0.2.
Fig. 3: Unbalanced scenario: Lift plot for each product. Panels: (a) Video; (b) Chat.
Fig. 4: Most relevant Pearson correlations between community feature values and target class (high/low activity) for HDEMON.
Panels (a-b) show the indexes for the balanced class scenario, panels (c-d) those for the 75th percentile split. Panels: (a) Video median; (b) Chat median; (c) Video 75th percentile; (d) Chat 75th percentile.
In contrast to the results presented for the balanced scenario,
where topological features always show the highest relative
importance for the classification process, in this scenario we
observe that community formation and geographical features
have greater descriptive power. As previously observed, the
minority class identified by a 75th percentile split is mostly
composed of particular, rare community instances, which affects
the relative importance of temporal and geographical
information: the results suggest that the more active a community
is, the more significant its geographical and temporal
bounds are. Finally, in Figure 2(c-d) we show the relationships
between the average community size, the average community
density and the AUC value produced by the SGD method on
the community sets which reach the best performance in the
unbalanced scenario. We can observe how, in this setting, the
algorithms yielding communities with small average size
and high density are the ones that allow the construction of
SGD models reaching higher AUC. In particular, HDEMON in
both its instantiations outperforms the other approaches.
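The minority-class measures used in the unbalanced scenario (precision, recall, the trivial baselines and the lift) can be sketched as follows. The function names, the toy labels and the toy scores are illustrative assumptions; lift is computed here as the positive rate among the top-scored fraction of communities divided by the overall positive rate, i.e. the rate a random ranking would achieve on average.

```python
from typing import Sequence, Tuple


def precision_recall(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall with respect to the positive (minority) class."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)


def lift_at(scores: Sequence[float], truth: Sequence[int], fraction: float) -> float:
    """Positive rate in the top-scored fraction over the overall positive rate."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, truth), key=lambda st: st[0], reverse=True)
    top_rate = sum(t for _, t in ranked[:k]) / k
    base_rate = sum(truth) / len(truth)
    return top_rate / base_rate


# Invented toy data: 25% of the communities are high engagement (label 1).
truth = [1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.1, 0.3, 0.8, 0.4, 0.05, 0.6]

print(accuracy([0] * 8, truth))          # majority classifier: 0.75
print(precision_recall([1] * 8, truth))  # minority classifier: (0.25, 1.0)
print(lift_at(scores, truth, 0.25))      # 4.0
```

The toy numbers reproduce the two trivial reference points of a 75/25 split: predicting always "low" is right 75% of the time, and predicting always "high" has 25% precision.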
V. COMMUNITY CHARACTERIZATION
A well-defined trend emerges from our analysis: among the
compared methodologies, HDEMON is able, both in balanced
and unbalanced scenarios, to better bound homophily and
thus to extract communities that provide useful insights
into the product engagement level. For this reason, starting
from the communities extracted by this bottom-up overlapping
approach, we compute the Pearson correlation of all
the defined features against the final class label (high/low
engagement). As shown in Figure 4(a), when splitting the
Video engagement at the 50th percentile we are able to
identify as highly active the communities having high
country entropy E_s as well as high geographic distance among
their users dist_avg, and whose formation is recent (i.e. whose
first user has joined the network recently, T_f, as well as
the last one, IT_{l,f}). Moreover, Video-active communities
are composed of users having on average low degree, as
shown by deg^all_avg and deg^C_max. Conversely, looking at Figure
4(b) we notice that communities which exhibit high Chat
engagement can be described as persistent structures (i.e.
social groups for which the inter-arrival time IT_{l,f} from the
first to the last user is high), composed of users showing
almost the same connectivity (in particular having high degree)
and sparse social connections (low clustering coefficient CC,
low density D and high radius). Moreover, we compute the
same correlations for the 75th percentile split: in contrast with
the new results for the Chat engagement (Figure 4(d)), which do
not differ significantly from the ones discussed for the balanced
scenario, in this setting the highly active Video communities
show new peculiarities. In Figure 4(c) we observe how the
level of engagement negatively correlates with the community
radius (and diameter) and positively correlates with the density.
These variations describe highly active Video communities as
a specific and homogeneous subclass of small
and dense network structures composed of users who live in
different countries (high country entropy E_s).
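The correlation analysis above can be sketched as a Pearson correlation between each feature column and the binary class label (1 for high, 0 for low engagement). The feature values below are invented for illustration and do not come from the Skype data; the feature names follow Table II.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(features: Dict[str, List[float]],
                  labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Features sorted by the absolute value of their correlation with the label."""
    scored = [(name, pearson(values, labels)) for name, values in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Invented toy values for three features of six communities.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "E_s": [0.0, 0.1, 0.2, 0.9, 1.0, 1.1],  # rises with the label
    "r":   [5.0, 4.0, 5.0, 2.0, 1.0, 2.0],  # falls with the label
    "N":   [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],  # constant, uninformative
}
for name, r in rank_features(features, labels):
    print(name, round(r, 2))
```

A positive coefficient means that the feature tends to be larger in high-engagement communities, a negative one that it tends to be smaller.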
VI. CONCLUSIONS
In this work we addressed the issue of predicting user
engagement in online social networks. In contrast with
traditional user-centric approaches, we focus on social
communities in order to exploit the inherent homophily
characteristic of social networks. Our results show that, both
in balanced and unbalanced classification scenarios, algorithms
producing overlapping micro-communities, like HDEMON,
reach the best performance. Conversely, modularity-based
approaches like LOUVAIN do not guarantee good performance
and are outperformed by simple clustering strategies such
as EGO-NETS and BFS. We also provide a description
of low/high engaged communities identified by HDEMON
through the analysis of the correlations between their activity
level and the values of their features.
Acknowledgments. This research is supported by Microsoft/Skype
and ERDF via the Software Technology and Applications
Competence Centre (STACC) and partially funded by the
European Community’s H2020 Program under the funding
scheme “FETPROACT-1-2014: Global Systems Science (GSS)”,
grant agreement #641191 CIMPLEX³.
REFERENCES
[1] A. Bagherjeiran and R. Parekh, “Combining behavioral and social
network data for online advertising.” in ICDM Workshops, 2008.
[2] R. Bhatt, V. Chaoji, and R. Parekh, “Predicting product adoption in
large-scale social networks.” in CIKM, 2010.
[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast
unfolding of communities in large networks,Journal of Statistical
Mechanics: Theory and Experiment, 2008.
[4] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community
structure in very large networks,Phys. Rev. E, 2004.
[5] M. Coscia, G. Rossetti, F. Giannotti, and D. Pedreschi, “Uncovering
hierarchical and overlapping communities with a local-first approach.
TKDD, 2014.
[6] M. Coscia, F. Giannotti, and D. Pedreschi, “A classification for com-
munity discovery methods in complex networks,CoRR, 2012.
[7] P. Domingos and M. Richardson, “Mining the network value of cus-
tomers,” in SIGKDD, 2001.
[8] S. Fortunato and M. Barthélemy, “Resolution limit in community
detection,” PNAS, 2007.
[9] J. D. Hartline, V. S. Mirrokni, and M. Sundararajan, “Optimal marketing
strategies over social networks,” in WWW, 2008.
[10] I. Himelboim, S. McCreery, and M. Smith, “Birds of a feather tweet
together: Integrating network and content analyses to examine
cross-ideology exposure on Twitter,” Journal of Computer-Mediated
Communication, 2013.
[11] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather:
Homophily in social networks,” Annual Review of Sociology, 2001.
[12] M. E. J. Newman, “Mixing patterns in networks,” Phys. Rev. E, vol. 67,
p. 026126, 2003.
[13] M. E. J. Newman and M. Girvan, “Finding and evaluating community
structure in networks,” Phys. Rev. E, 2004.
[14] R. J. Oentaryo, E.-P. Lim, D. Lo, F. Zhu, and P. K. Prasetyo, “Collective
churn prediction in social network,” in ASONAM, 2012.
[15] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the
overlapping community structure of complex networks in nature and society,”
Nature, 2005.
[16] Y. Richter, E. Yom-Tov, and N. Slonim, “Predicting customer churn in
mobile networks through analysis of social groups,” in SDM, 2010.
[17] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex
networks reveal community structure,” PNAS, 2008.
[18] Y. Tsuruoka, J. Tsujii, and S. Ananiadou, “Stochastic gradient descent
training for L1-regularized log-linear models with cumulative penalty,”
in ACL/IJCNLP, 2009.
[19] T. Zhang, “Solving large scale linear prediction problems using
stochastic gradient descent algorithms,” in ICML, 2004.
[20] Y. Zhu, E. Zhong, S. J. Pan, X. Wang, M. Zhou, and Q. Yang,
“Predicting user activity level in social networks,” in CIKM, 2013.
³ “Bringing CItizens, Models and Data together in Participatory, Interactive
SociaL EXploratories”, https://www.cimplex-project.eu