Procedia Manufacturing 3 (2015) 3868-3875
Available online at www.sciencedirect.com (ScienceDirect)
doi: 10.1016/j.promfg.2015.07.896

6th International Conference on Applied Human Factors and Ergonomics (AHFE 2015) and the Affiliated Conferences, AHFE 2015
Extending generative models of large scale networks
Corey Lofdahl^a, Eli Stickgold^a, Bruce Skarin^b, Ian Stewart^a
^a Charles River Analytics, Inc., 625 Mt Auburn Street, Cambridge, MA 02138, USA
^b Private Contractor, 25 Herricks Ln, Millbury, MA 01527, USA
Abstract
Since the launch of Facebook in 2004 and Twitter in 2006, the amount of publicly available social network data has grown in both scale and complexity. This growth presents significant challenges to conventional network analysis methods that rely primarily on structure. In this paper, we describe a generative model that extends structure-based connection preference methods to include preferences based on agent similarity or homophily. We also discuss novel methods for extracting model parameters from existing large scale networks (e.g., Twitter) to improve model accuracy. We demonstrate the validity of our proposed extensions and parameter extraction methods by comparing model-generated networks with and without the extensions to real-life networks based on metrics for both structure and homophily. Finally, we discuss the potential implications of including homophily in models of social networks and information propagation.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of AHFE Conference.

Keywords: Social media; Social networks; Network analysis; Homophily; Agent-based modeling; Generative models; Anonymous networks
1. Introduction
The emergence of social media services such as Facebook and Twitter has led to the proliferation of massive, highly connected social networks [1, 2]. Understanding modern information flows requires understanding the way these networks grow, evolve, and decay over time [3]. By their nature, however, large portions of these networks are invisible: a strong link may exist between two real-world friends who rarely interact directly on Twitter but nevertheless have a strong impact on each other. This creates ethical risks and technical challenges for direct study, even though hidden users are often qualitatively similar to visible users [4].
Early analysis attempts addressed this problem by using anonymized datasets, which left the structural information intact but removed the personally identifying information. Unfortunately, it was quickly demonstrated that even sparse anonymized networks are vulnerable to de-anonymization attacks that require very little information [5]. Given these
risks and challenges, simulation models, and in particular agent-based models, can help provide a better analogue for
study and research. More realistic models also have the potential to improve our ability to predict information
propagation in modern social networks that can only be partially observed.
In this paper, we describe an agent-based model that leverages social science theory and existing network
modeling methods to produce synthetic directed, weighted social networks. This model is then used in a case study
focused on Twitter networks, which provides both model parameters and a ground-truth for conducting network
comparisons. Such a model may offer insight to analysts seeking to understand patterns in link-formation between
users, particularly in cases of news-related information diffusion.
We begin by explaining the need for features based on social science and how they affect the process by which agents form and update connections. In particular, we focus on the principle of homophily [6], which has been found to have strong implications for the spread and interpretation of information within a network [7]. We then describe our model for reproducing homophily in synthetic networks. We follow the model with a discussion of the novel methods we used for extracting parameters from existing network data. To demonstrate its success, we compare our model with the new features to an existing Twitter network and a baseline synthetic network, and then discuss the implications for further research and application. Our tests indicate that, in the context of replicating online social networks, the addition of these new features provides measurable improvements over a statistical network generation method.
2. Structure and homophily
There are already a variety of methods for generating networks of different structural archetypes such as small-world, random, and scale-free. Common approaches include the Watts-Strogatz, Erdős-Rényi, and Barabási-Albert models [8]. These have proven to be accurate analogs for the structure of some real-world networks on the basis of metrics such as degree distribution, diameter, and clustering coefficient.
Yet for some networks, in addition to structural features, there are also social and functional features that give rise to additional archetypes such as the polarized crowd, community cluster, customer support, and broadcast networks described in [9]. In cases such as these, a different approach is required to represent the causes of social and functional features in network models. In addition, standard statistical models such as the Erdős-Rényi random graph algorithm are often inadequate for replicating the structure of human social networks [10] and have largely been supplanted by more accurate models.
Research into more robust methods of modeling social networks includes latent space models, Hidden Markov
Models, and actor-oriented models [11]. One advantage of actor-oriented models is the ability to include behavior-
based theory instead of relying on a purely statistical approach. Actor-oriented models also enable extensions to
modeling other behaviors such as information propagation and network adaptation and evolution.
As a behavior related to social networks, the principle of homophily provides a useful heuristic for determining the likelihood of link formation between two given actors. Homophily asserts that similar people interact more frequently than dissimilar people [6]. This does not imply that homophily is exactly equivalent to similarity, but rather that similarity between people can help explain patterns of interaction. In some cases, such as language, the assertion of homophily holds strongly [12], while in other cases there is a wider degree of variation. Homophily has also been found to have powerful implications for the information people receive and the attitudes they form [7]. For these reasons, homophily is an important factor in generating realistic analogues of existing social networks.
3. Methodology: homophily in agent-based modeling
Agent-based modeling is an ideal formalism for a behavior-based approach to the generation of social networks because it can flexibly represent diverse attributes and behaviors. In an agent-based model, individual actors can incorporate a dynamic range of demographic information, social attributes, and user behaviors for use in determining how links are formed between actors.
Demographic types that might be represented include categorical (e.g., language), ordinal (e.g., activity level), and scalar (e.g., latitude/longitude). Each demographic can be parameterized with a distribution for the number of members as well as by ranking its importance to link formation. Table 1 provides several example demographics that were used in our research.
Table 1. Example demographic attributes.

Name     | Type        | Example Entities/Range         | Use            | Importance
Language | Categorical | English, Spanish, French, etc. | Both           | High
Gender   | Categorical | Male, Female                   | Both           | Moderate
Activity | Ordinal     | High, Medium, Low              | Link formation | Low
3.1. Modified preference function
To recreate the principle of homophily, each agent uses a modified version of the preference function commonly used in structural approaches to network generation. The algorithm used here takes a weighted average of the similarity between two given agents based on the relative importance of each demographic [3]. For each demographic $D_p$ of categorical type, the similarity for any two agents $i$ and $j$ is assigned as:

$$D_p(i,j) = \begin{cases} 1 & \text{if } i_p = j_p \\ 0 & \text{if } i_p \neq j_p \end{cases}$$

For ordinal and scalar demographic types, the similarity between two agents is simply the normalized Manhattan distance, where $\delta_p$ represents the difference between the maximum and minimum values of the range that can be assigned to a given demographic $D_p$:

$$D_p(i,j) = 1 - \frac{|i_p - j_p|}{\delta_p} \qquad (1)$$
The cumulative similarity $S$ between any two actors $i$ and $j$ is then the weighted arithmetic mean over $n$ demographics, where $w_p$ (range: 1-10) is the weight assigned in relation to each demographic's importance to link formation, which is based on the modularity measured in a given real-world network:

$$S_{i,j} = \frac{\sum_{p=1}^{n} w_p D_p(i,j)}{\sum_{p=1}^{n} w_p} \qquad (2)$$
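To make the similarity calculation concrete, here is a minimal Python sketch of equations (1) and (2). The agent dictionaries, weights, and ranges are illustrative assumptions for this sketch, not the authors' AnyLogic implementation.

```python
# Minimal sketch of the demographic similarity terms (equations 1-2).
# The agent dictionaries, weights, and ranges below are assumptions.

def categorical_similarity(a, b):
    """D_p for a categorical demographic: 1 if the values match, else 0."""
    return 1.0 if a == b else 0.0

def ordinal_similarity(a, b, value_range):
    """D_p for an ordinal/scalar demographic: 1 - |a - b| / delta_p."""
    delta = value_range[1] - value_range[0]
    return 1.0 - abs(a - b) / delta

def cumulative_similarity(agent_i, agent_j, weights, ranges):
    """Weighted arithmetic mean of per-demographic similarities (equation 2)."""
    num = den = 0.0
    for demo, w in weights.items():
        if demo in ranges:   # ordinal or scalar demographic
            d = ordinal_similarity(agent_i[demo], agent_j[demo], ranges[demo])
        else:                # categorical demographic
            d = categorical_similarity(agent_i[demo], agent_j[demo])
        num += w * d
        den += w
    return num / den

# Example: language matches, activity differs by one level on a 0-2 scale.
i = {"language": "en", "activity": 2}
j = {"language": "en", "activity": 1}
weights = {"language": 10, "activity": 2}   # importance weights, range 1-10
ranges = {"activity": (0, 2)}
print(cumulative_similarity(i, j, weights, ranges))  # (10*1 + 2*0.5) / 12, about 0.92
```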
For the structural preference functions, we selected triadic closure (a.k.a. friend-of-a-friend) and degree preference. The triadic closure preference $T$ between two actors $i$ and $j$ is formulated as

$$T_{i,j} = \frac{|i \cap j|}{\min(i,j)} \qquad (3)$$

where $\min(i,j)$ represents the minimum number of edges in either actor's network. Using the minimum ensures that agents with a smaller number of edges are not penalized against agents with a large number. Degree preference $G$ is then assigned by the following:

$$G_{i,j} = \frac{\deg_j}{\max(\deg_n)} \qquad (4)$$

With this formulation, the degree preference of $i$ is based on the out-degree of $j$ relative to the maximum out-degree within the entire network. If $j$ is the agent with the most followers, it will receive a degree preference of 1.
The cumulative structural preference $R$ is then returned as a weighted average, with the weights $w$ (range: 0.001-1) set as part of the optimization routine discussed in Section 3.3:

$$R_{i,j} = \frac{w_T T_{i,j} + w_G G_{i,j}}{w_T + w_G} \qquad (5)$$
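As an illustration of the structural terms, the following sketch computes equations (3) through (5), assuming each agent's neighborhood is available as a Python set of ids; this is an interpretation of the formulas above rather than the authors' implementation.

```python
# Illustrative computation of the structural preference terms (equations 3-5).
# Neighborhood sets and degree counts are assumed to be maintained elsewhere.

def triadic_closure(neighbors_i, neighbors_j):
    """T_ij: shared neighbors divided by the smaller neighborhood (equation 3)."""
    smaller = min(len(neighbors_i), len(neighbors_j))
    if smaller == 0:
        return 0.0
    return len(neighbors_i & neighbors_j) / smaller

def degree_preference(out_degree_j, max_out_degree):
    """G_ij: out-degree of j relative to the network maximum (equation 4)."""
    return out_degree_j / max_out_degree if max_out_degree else 0.0

def structural_preference(t_ij, g_ij, w_fof, w_deg=0.5):
    """R_ij: weighted average of T and G (equation 5); w_deg is fixed at 0.5."""
    return (w_fof * t_ij + w_deg * g_ij) / (w_fof + w_deg)
```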
3871
Corey Lofdahl et al. / Procedia Manufacturing 3 ( 2015 ) 3868 – 3875
Finally, the total preference $C$ for agent $i$ to follow actor $j$ is given by the weighted sum of the two cumulative measures, with the weights $w$ (range: 0.001-1) also set by the optimization routine discussed in Section 3.3:

$$C_{i,j} = w_R R_{i,j} + w_S S_{i,j} \qquad (6)$$
When an agent is selected to form a connection, this preference function provides the distribution of probabilities needed to make a random selection for link formation. As discussed in detail in Section 4, the networks generated with this approach continue to match real-world measures of structure (e.g., average path length and degree distribution) with the added ability of matching a measure of similarity as it relates to homophily.
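A small sketch of how equation (6) could drive that random selection step; the candidate bookkeeping and the epsilon guard against all-zero weights are assumptions added for illustration.

```python
# Turning the total preference C (equation 6) into a probabilistic link choice.
import random

def total_preference(r_ij, s_ij, w_structure, w_demo):
    """C_ij = w_R * R_ij + w_S * S_ij (equation 6)."""
    return w_structure * r_ij + w_demo * s_ij

def select_follow_target(candidates, structural, similarity, w_structure, w_demo):
    """Randomly pick one candidate, weighted by its total preference score."""
    scores = [total_preference(structural[c], similarity[c], w_structure, w_demo)
              + 1e-12                # guard against an all-zero weight vector
              for c in candidates]
    return random.choices(candidates, weights=scores, k=1)[0]
```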
3.2. Extracting ground-truth data and model parameters
In order to make use of these extensions, we needed a way to determine appropriate values for each parameter.
Because these values depend on the network being modeled, we developed a technique for extracting values from a
representative empirical network that could be used to determine accurate values for any of a number of possible
use-cases. To test our technique, we constructed a social network from a corpus of tweets mined for a week over a
region of Saharan Africa, ranging from southeast Nigeria to northeast Libya. We chose this area for its high user
diversity as compared to more homogenous regions, as well as the region's relevance to international news.
Following data collection, the team converted the raw tweets into a directed network using the @-symbol as a
proxy for an edge between users. For example, the tweet “@userA hi there!” from user B would result in a directed
edge from B to A. This process yielded a network with 24,974 nodes and 31,575 edges.
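The @-mention construction described above amounts to a simple pass over the tweet corpus. A minimal sketch follows; the tweet tuple format and the use of networkx are assumptions, not the pipeline the team actually used.

```python
# Build a directed @-mention network from (author, text) pairs.
import re
import networkx as nx

MENTION = re.compile(r"@(\w+)")

def build_mention_network(tweets):
    """tweets: iterable of (author_screen_name, tweet_text) pairs."""
    g = nx.DiGraph()
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            # "@userA hi there!" from user B adds the directed edge B -> A.
            g.add_edge(author, mentioned)
    return g

g = build_mention_network([("userB", "@userA hi there!")])
print(list(g.edges()))  # [('userB', 'userA')]
```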
The team chose this tweet-based approach to social network construction over a typical snowball-sampling
approach that selects one prominent network actor, collects all of their followers, and repeats the process on these
new actors. We made this decision for two reasons:
1. Scalability: it is easier to build a large network from tweets than to query Twitter recursively for an ever-growing network.
2. Information diffusion: the use of the @-symbol implies diffusion of information from one user to another, which
is a key area of interest for this study.
In addition to this basic structural information, we queried Twitter for user-specific data to provide node attributes for the network. In contrast to similar social media studies [7], we only extracted self-reported attributes available directly through the API rather than attempting to speculate on inaccessible attributes such as gender. This decision limits the overall amount of user data available but, more importantly, ensures that the demographics for our model's agents are grounded as much as possible in reality rather than in prior expectations.
Once the networks were built, a set of static and dynamic metrics was extracted to quantify basic network trends. The key to choosing the metrics was generalizability across potential models, which limits us to statistics that can differentiate a range of known network types. For instance, the ratio of edges to nodes is higher in small-world networks than in scale-free (hierarchical) and random networks. Following similar model studies [13], the following metrics were extracted as initialization parameters for the model (a brief extraction sketch follows the list):
• Actor pool size: the total number of nodes in the network, intended as a cap on the model's growth.
• Nodes per hour: the number of new nodes that appear in the network in a given hour.
• Edges per hour: the number of new edges formed in the network in an hour.
• Edge adding rate: the mean (μ) and standard deviation (σ) of every node's edge-adding rate, i.e., edges/node/hour.
• Activity frequency: the distribution of all users' frequency of posting statuses, i.e., statuses/day. We discretized this distribution as a set of three clusters representing low, medium, and high activity, which roughly correspond to previous studies classifying Twitter users by "role" [7].
• Language split: the percentages of the languages spoken by users in the network. All languages with percentages below 5% were classified as "Other" to minimize data sparseness.
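The rate parameters in the list above can be estimated from a timestamped edge list. The sketch below shows one way to do so with pandas; the column names and the hourly bucketing are assumptions rather than the authors' extraction code.

```python
# Estimate nodes/hour, edges/hour, and the per-node edge-adding rate (mean, std)
# from a DataFrame with columns src, dst, timestamp (timestamp as datetime64).
import pandas as pd

def hourly_rates(edges):
    edges = edges.sort_values("timestamp")
    hours = edges["timestamp"].dt.floor("h")
    edges_per_hour = edges.groupby(hours).size()

    # A node "appears" at the hour of its first observed edge.
    first_seen = pd.concat([
        edges[["src", "timestamp"]].rename(columns={"src": "node"}),
        edges[["dst", "timestamp"]].rename(columns={"dst": "node"}),
    ]).groupby("node")["timestamp"].min().dt.floor("h")
    nodes_per_hour = first_seen.value_counts().sort_index()

    # Per-node edge-adding rate: edges originated divided by hours observed.
    span_hours = max((edges["timestamp"].max() - edges["timestamp"].min())
                     / pd.Timedelta(hours=1), 1.0)
    per_node_rate = edges.groupby("src").size() / span_hours

    return {
        "edges_per_hour": edges_per_hour.mean(),
        "nodes_per_hour": nodes_per_hour.mean(),
        "edge_rate_mean": per_node_rate.mean(),
        "edge_rate_std": per_node_rate.std(),
    }
```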
After extracting these parameters, we also extracted a number of structural statistics to set a baseline against
which to compare the synthetic networks’ properties (see Sections 4.1 and 4.2).
3.3. Network generation
The synthetic networks are generated by an agent-based model implemented in the AnyLogic development environment. Model execution begins with the formation of a small seed network of nodes and edges that applies the preference function to all possible agent pairings. Additional nodes and edges are then added at the hourly rate determined from the real-world network. While this rate is not important in determining the final state of a network, it does replicate another dimension of the real-world data (i.e., links over time). After initialization, the preference function is applied only to a selected subgroup (e.g., language, activity). The subgroup chosen is based on the weight assigned to the subgroup in Section 3.1.
When nodes are added, each is assigned demographics using the distributions extracted from the real-world network, which in this study include activity frequency and language split. To form a new edge, an agent is first randomly selected based on its relative preference to connect to (follow) other agents:

$$P_i = 1 - \frac{\deg_i}{\max(\deg_n)} \qquad (7)$$
Note that this function represents a kind of inverse of the degree preference, which is to say that low-degree agents do more following than high-degree agents. Agents chosen to form links then select a sample of other agents from one of the demographic groups of which they are members. The demographic group (activity frequency or language) is determined randomly based on the relative importance of the group. In this study, language was assigned high importance while activity frequency was assigned low importance. Each agent in the sample is then assigned a preference for connection based on the equations described in Section 3.1. Finally, the agent uses the assigned preferences to randomly select another agent to follow (i.e., the edge is directed to itself).
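Putting the pieces together, one edge-formation step might look like the sketch below. The bookkeeping structures (out-degree counts, demographic group membership, group importance weights) and the fallback to the full agent pool are illustrative assumptions, not the AnyLogic model itself.

```python
# One edge-formation step: pick a follower (equation 7), sample candidates from
# one of its demographic groups, then choose a target by total preference.
import random

def pick_source(agents, out_degree):
    """Select the following agent with weight 1 - deg_i / max(deg) (equation 7)."""
    max_deg = max(out_degree.values()) or 1
    weights = [1.0 - out_degree[a] / max_deg + 1e-9 for a in agents]  # avoid all-zero
    return random.choices(agents, weights=weights, k=1)[0]

def form_edge(agents, out_degree, group_members, group_importance, preference):
    """group_members[name][agent] lists agents sharing that agent's group value;
    preference(i, j) returns the total preference C_ij (equation 6)."""
    source = pick_source(agents, out_degree)
    # Choose which demographic group to sample from, weighted by importance
    # (language high, activity frequency low in this study).
    names = list(group_importance)
    group = random.choices(names, weights=[group_importance[n] for n in names], k=1)[0]
    candidates = [a for a in group_members[group].get(source, agents) if a != source]
    scores = [preference(source, c) + 1e-12 for c in candidates]
    target = random.choices(candidates, weights=scores, k=1)[0]
    return source, target   # directed edge: source follows target
```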
The remaining four parameters to be set in the agent-based model are determined by AnyLogic’s built-in
optimization engine. The first two are used to determine the initial number of nodes and edges for creating the seed
network, initial nodes and initial edges. The final parameters are the weights described earlier: in equation 5, the fof
weight (favoring attachment to nodes which are “friends of friends”); in equation 6, the structure weight (preference
for attaching to nodes with higher structural characteristics). Note that since the degree weight (the weight for
favoring attachment to nodes with a higher degree) in equation 5 is proportional to the fof weight, it is fixed at 0.5 so
that the optimizer can evaluate both relatively higher and lower values. The same is true for the demographic weight
(preference for attaching to nodes with similar demographic values) with respect to the structure weight in equation
6.
The objective function used to improve the fit of the simulated network across generations is simply a normalized distance across a range of the network comparison measures, including the emergent metrics described in Section 4.2. The optimizer utilizes a form of scatter-search [14] that explores the parameter space by mixing successful parameter sets to form new combinations, similar to genetic mutation. Scatter-search has the advantage of generating relatively small but diverse sets of new parameter combinations, which reduces runtime and memory requirements while still allowing wide exploration of the parameter space. This strategy can find local minima more quickly than purely stochastic variation, so it achieves better parameter convergence after an initial period of exploration.
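A minimal sketch of such a normalized-distance objective over a dictionary of comparison metrics; the metric names and the normalization by the ground-truth value are assumptions about one reasonable formulation, not the exact function used in the optimization.

```python
# Normalized distance of simulated metrics from ground-truth metrics (lower is better).
import math

def objective(simulated, target):
    return math.sqrt(sum(((simulated[k] - target[k]) / target[k]) ** 2
                         for k in target))

# Example with hypothetical metric values.
print(objective({"avg_path_length": 6.70, "giant_component": 0.791},
                {"avg_path_length": 6.74, "giant_component": 0.743}))
```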
4. Results
Our results demonstrate the success of incorporating homophily into the network generation process. We find that the model accounting for homophily in user demographics produces the most realistic network in terms of in- and out-degree (Section 4.1). In addition, our model yields realistic emergent properties, such as a low average path length (Section 4.2).
For our preliminary results, we used a set of three networks of nearly identical scale: (1) the ground-truth (GT) network from our Africa tweets; (2) a synthetic scale-free network generated using the B model of the Barabási-Albert (BA) algorithm [15]; and (3) a network generated using the structural preference and homophily model outlined in Section 3 (S+H). We chose the BA algorithm rather than similar network-generation algorithms as our baseline because it is the most flexible in matching our ground-truth network's basic structure (nodes and edges) without excessive parameter experimentation. Each of the networks is summarized in Table 2. The reader should note that all networks have a relatively low average degree, which is a product of each network's power-law degree distribution, explained in Section 4.1.
Table 2. Basic network statistics.

Statistic  | GT     | BA     | S+H
Node count | 24,974 | 24,974 | 24,937
Edge count | 31,575 | 32,136 | 31,805
4.1. Degree distribution comparisons
To begin, we compared the out-degree distribution of all networks in order to verify the expected scale-free nature of human social networks [11]. To clarify, the in-degree of a node n is equal to the number of edges incident on n, while out-degree is equal to the number of edges leaving n. After calculating the raw degree counts and normalizing, we arrive at the distribution of out-degree frequencies in Fig. 1(a).
All three networks appear to follow the predicted power-law distribution, although with highly variable curves: for instance, the GT network has the sharpest curve. However, the out-degree frequencies for the network generated by the S+H model are consistently closer to the ground-truth frequencies than those of the BA network. This trend is quantified in Fig. 1(b), which charts the absolute deviation of each model's degree distribution from the ground-truth distribution (e.g., the 1-degree difference equals the GT 1-degree frequency minus the BA 1-degree frequency).
Note the lower deviation of the S+H in-degree across all categories. S+H differs from GT by more than 0.05 only
in the 0-degree category, as opposed to the BA network’s four categories over 0.05. The high deviations in the BA
network across all categories imply that it follows a different power-law distribution than GT network does. In
addition, the S+H network performs much better than the BA network in the 11+ category, indicating that it
accurately replicates the GT network’s heavier degree tail.
We can further quantify the relative success of S+H by calculating the mean differences, as well as the Pearson correlation coefficient and total Euclidean distance, from the ground-truth distribution to the distributions of the generated networks. Again, we found the best-fitting model to be the one incorporating homophily, as it maximizes similarity and minimizes distance. The fitness statistics are summarized in Table 3, with the bold numbers highlighting the S+H network's close fit with the GT network (low difference and high correlation).
Fig. 1. (a) Out-degree distributions (normalized out-degree frequency vs. out-degree, 0 through 11+) for the Ground-truth, Barabási-Albert, and Structure+Homophily networks; (b) each model's absolute difference from the GT out-degree distribution.
Table 3. Out-degree distribution comparisons.

Fitness statistic             | BA           | S+H
In-degree differences (μ ± σ) | 6.91 ± 11.3% | 1.21 ± 2.01%
Euclidean distance            | 0.445        | 0.0787
R²                            | 0.488        | 0.995
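A sketch of how fitness statistics like those in Table 3 can be computed from binned degree distributions; the 0-10 plus 11+ binning mirrors Fig. 1, and the use of the squared Pearson correlation for R² is an assumption about the authors' calculation.

```python
# Compare a model network's out-degree distribution to the ground truth.
import numpy as np
from scipy.stats import pearsonr

def binned_out_degree_distribution(graph, max_bin=11):
    """Normalized frequencies for out-degrees 0..10, with 11+ pooled (networkx graph)."""
    counts = np.zeros(max_bin + 1)
    for _, d in graph.out_degree():
        counts[min(d, max_bin)] += 1
    return counts / counts.sum()

def fitness_statistics(ground_truth, model):
    diff = np.abs(ground_truth - model)
    return {
        "mean_abs_diff": diff.mean(),
        "std_abs_diff": diff.std(),
        "euclidean": float(np.linalg.norm(ground_truth - model)),
        "r_squared": pearsonr(ground_truth, model)[0] ** 2,
    }
```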
We also found a trend favoring the S+H model's in-degree distribution, but we omit the findings for brevity.
4.2. Emergent network properties
Moving beyond low-level degree measures, we show that the homophily-dependent model yields emergent properties similar to the ground-truth data. Following the approach of other studies that quantify network similarity [13], we selected three metrics to showcase the similarity of the model's emergent properties: triadic closure, average path length, and giant component size. Triadic closure is calculated with equation 3 described in Section 3.1, measured because of its correlation with short-range network cohesion [11]. Average path length is equal to the mean of all shortest paths that connect the network's nodes, and we measured it because of its relation to clique formation within a connected network [13]. The average path length only includes paths between nodes within the giant component, which is the largest connected subgraph within the network and often includes the majority of nodes and edges. Lastly, we measured giant component size because it provides a summary of large-scale network cohesion [16], which can help predict the breadth of information diffusion.
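For reference, these three emergent metrics can be approximated with networkx as in the sketch below; nx.transitivity is used here as a stand-in for the paper's equation-3 triadic closure measure, so the resulting numbers would not match Table 4 exactly.

```python
# Emergent metrics on the undirected projection of a generated network.
import networkx as nx

def emergent_metrics(digraph):
    g = digraph.to_undirected()
    giant_nodes = max(nx.connected_components(g), key=len)
    giant = g.subgraph(giant_nodes)
    return {
        "triadic_closure": nx.transitivity(g),  # global clustering, a proxy for eq. 3
        "avg_path_length": nx.average_shortest_path_length(giant),  # giant component only
        "giant_component_share": len(giant_nodes) / g.number_of_nodes(),
    }
```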
Table 4 outlines these key structural metrics. The bold numbers indicate the closest fit to the GT statistics.
Table 4. Emergent properties.

Emergent property                          | GT     | BA     | S+H
Triadic closure (undirected)               | 9.0E-2 | 8.0E-5 | 1.0E-3
Average path length (undirected)           | 6.74   | 8.68   | 6.70
Giant component size (% of nodes included) | 74.3%  | 83.2%  | 79.1%
These metrics are more abstract to interpret but still demonstrate the success of our model in replicating ground-truth data. First, the GT network has a somewhat higher rate of triadic closure, which is explained by the apparently high influence of mutual connections on link formation. Although low, the S+H triadic closure still matched GT better than the BA network did, indicating that homophily encourages triangles more than structural preference alone. Secondly, the somewhat low average path length of the GT network was matched almost exactly by the S+H network and far surpassed by the BA model. This suggests that the GT network tends toward denser clusters, a structure that yields shorter path lengths. Furthermore, the success of the S+H model implies that this clustering can be explained in part by homophily rather than structural preference alone. By contrast, the higher path length in the BA network is likely related to that model's tendency to form fragmented tree-like components rather than dense clusters. Thirdly, the giant component size was best matched by the S+H network and is a product of the network's balancing of preferential attachment with homophily. For instance, a model incorporating only preferential attachment (BA) tends to yield one core network rather than multiple networks [16], resulting in a larger giant component. While the structural preference function (equation 5) alone may have led to a similar outcome, the addition of homophily in the S+H model dampened that effect and reduced the giant component to a more realistic size.
Overall, these three emergent structural metrics demonstrate the success of our network generation process and particularly the importance of homophily in yielding high-level patterns, in addition to degree distributions.
5. Conclusion
In this paper, we described novel techniques for improving generative agent-based models of modern social
media networks and demonstrated their effectiveness at improving synthetic network realism by comparing
produced networks to publicly available data, in this case via the social media service Twitter. The continued
improvement of this style of generative model, rooted in both network metrics and social science, is necessary for
long-term understanding of information propagation, especially in partially-visible or wholly dark networks.
Future research in this area will extend and improve these models by incorporating additional social science-
based agent-level parameters, including other traits not yet recognized. Further testing could also examine the ability
of these models to correctly predict the structure of partially-visible or dark networks by examining the propagation
of memetic information through partially-visible networks, such as Twitter, in comparison to the same kind of
propagation in synthetic networks. Following studies that suggest the predictability of private user traits [17], future
models will likely be able to model “dark” agents in social networks with little extra modification. This sort of
testing will provide support for agent-based generative models as a means of uncovering the dynamics of network
formation.
Acknowledgements
This work was performed under DARPA contract number W31P4Q-12-C-0235. The authors thank Dr. Rand
Waltzman for his significant technical support and eager engagement on this project. This work was funded in its
entirety by the Information Innovation Office (I2O). The views expressed are those of the authors and do not reflect
the official policy or position of the Department of Defense or the U.S. Government. We also acknowledge the
contribution of Professor Frank Witmer who helped with the initial modeling efforts.
References
[1] A. Java, X. Song, T. Finin and B. Tseng, “Why we twitter: understanding microblogging usage and communities,” in Proceedings of the 9th
WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, San Jose, CA, 2007.
[2] B. Viswanath, A. Mislove, M. Cha and K. Gummadi, “On the evolution of user interaction in Facebook,” in Proceedings of the 2nd ACM
workshop on Online social networks, Barcelona, Spain, 2009.
[3] M. Pasta, Z. Jan, F. Zaidi and C. Rozenblat, “Demographic and structural characteristics to rationalize link formation in online social
networks,” in 2013 International Conference on Signal-Image Technology & Internet-Based Systems, Marrakech, Morocco, 2013.
[4] M. Madden, “Privacy management on social media sites,” Pew Research Center, Washington, D.C., 2012.
[5] A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets.,” in IEEE Symposium on Security and Privacy,
Oakland, CA, 2008.
[6] M. McPherson, L. Smith-Lovin and J. Cook, “Birds of a feather: Homophily in social networks,” Annual Review of Sociology, pp. 415-444,
2001.
[7] M. De Choudhury, “Tie formation on Twitter: Homophily and structure of egocentric networks,” in 2011 IEEE third international conference
on social computing, Boston, 2011.
[8] M. Magnani and L. Rossi, “Formation of multiple networks,” in Social Computing, Behavioral-Cultural Modeling and Prediction,
Washington, D.C., 2013.
[9] M. Smith, L. Rainie, B. Shneiderman and I. Himelboim, “Mapping twitter topic networks: From polarized crowds to community clusters,”
Pew Research Internet Project, Washington, D.C., 2014.
[10] A. Mislove, M. Marcon, K. Gummadi, P. Druschel and B. Bhattacharjee, “Measurement and analysis of online social networks,” in
Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, New York, NY, 2007.
[11] T. Snijders, “Statistical models for social networks,” Annual Review of Sociology, vol. 37, pp. 131-153, 2011.
[12] S. Hale, “Global connectivity & multilinguals in the Twitter networks,” in Proceedings of the 32nd annual ACM conference on human
factors in computing systems, Toronto, 2014.
[13] A. Ali, H. Alvari, A. Hajibaheri, K. Lakkaraju and G. Sukthankar, “Synthetic generators for cloning social network data,” in ASE
BigData/SocialInformatics/PASSAT/BioMedCom 2014 Conference, Cambridge, MA, 2015.
[14] H. Eskandari, E. Mahmoodi, H. Fallah and C. Geiger, “Performance analysis of commercial simulation-based optimization packages:
OptQuest and Witness Optimizer,” in Proceedings of the 2011 Winter Simulation Conference, Phoenix, AZ, 2011.
[15] A. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, no. 286, pp. 509-512, 1999.
[16] M. Newman, “Assortative mixing in networks,” Physical review letters, vol. 89, no. 20, 2002.
[17] S. Peddinti, K. Ross and J. Cappos, “On the internet, nobody knows you're a dog: a Twitter case study of anonymity in social networks,” in
Proceedings of the second edition of the ACM conference on Online social networks, Dublin, Ireland, 2014.