Endogenous Social Networks from Large-Scale Agent-Based Models

Eric Tatara (1,2), Nicholson Collier (1,2), Jonathan Ozik (1,2), and Charles Macal (1,2)

(1) Global Security Sciences Division, Argonne National Laboratory, Argonne, IL, USA
{tatara,ncollier,jozik,macal}@anl.gov

(2) Computation Institute, University of Chicago, Chicago, IL, USA
Abstract—We present a parallel computational method for generating endogenous social networks from large-scale simulation data produced by the Chicago Social Interaction Model (chiSIM). The model simulates the population of the entire city of Chicago, which includes approximately 2.9 million discrete individuals. The generated person collocation networks contain more than 10^6 person nodes and more than 10^9 collocation edges. Analyzing the structure of such networks is challenging at urban population scales due to their size. A parallel logging implementation is described that records person activity data from the agent-based model. The social network generation method is demonstrated on a distributed compute cluster. The person collocation network analysis and visualization obtained via the parallel methodology provide previously unreported characterizations of simulated large-scale urban social network structure.
Keywords-social networks; agent-based modeling
I. INTRODUCTION
Ongoing advances in high performance computing have
recently made possible realistic city-scale simulation models
of urban populations. These models typically use population
census data to generate realistic synthetic human populations
of agents whose demographics and daily activities closely
mirror an actual population [1]. Such models simulate the
movements and behaviors of millions of individual interacting
agents at a fine level of granularity in space and time. Social
interaction networks that connect populations of individuals
through contact links are central to these models [2, 3]. The
links in such networks can represent many kinds of social
relationships, such as collocation in space and time; familial
or friendship networks, or household social structure;
communication pathways and influence; and many others [4,
5]. The combination of a large simulated population with real
demographic data provides an opportunity to simulate
endogenous and emergent network structures previously
unobservable due to the enormous amount of empirical data
that would be required to reproduce them. The simulated
networks may provide further insights on the social processes
modeled.
Analysis of the model network structures and their
influence on simulation results can be challenging when
applied to urban scale population data due to the sheer size of
the networks [6]. Recent urban-scale simulation models
typically apply aggregate metrics and statistics such as disease
incidence to characterize the state of the population over time
[7, 8]. Examining the detailed model-generated social contact
network topology, however, may provide insight into social
interaction structures that are not captured via aggregate
methods. The structure of the interaction network may be
observed and analyzed by traditional graph theoretic
approaches that characterize network structure or via more
novel approaches such as community detection algorithms
that can capture emergent macro level characteristics of the
network. The generated networks are sizable and present
computational challenges for developing efficient analytical
processes.
In this paper, we describe the computational challenges
and methods developed for the analysis of networks generated
by a large-scale and fine-grained agent-based model of a
large urban area, the Chicago Social Interaction Model
(chiSIM).
II. CHISIM: THE CHICAGO SOCIAL INTERACTION MODEL
The chiSIM model simulates the population of the entire
city of Chicago, which includes approximately 2.9 million
discrete individuals (represented as agents) and 1.2 million
places based on census data. The locations in chiSIM are
specifically characterized as geospatial since they correspond
to real locations in the Chicago area. chiSIM simulates
individual agents within fine-grained spatial location
compartments associated with daily activities and places, such
as homes, schools, and workplaces, and can even specify sub-
compartments such as classrooms. A daily schedule for each
person specifies the activity and associated location with one-
hour time resolution. At each simulation time step (1 hour)
each agent decides their next activity for that hour and the
associated location. Agents move from location to location
and interact with other agents at the new location. chiSIM is
an extension of an infectious disease transmission model that
was generalized to model any kind of social interaction [9].
The chiSIM model input data for the entire Chicago area
population consists of multiple files for activities, persons, and
locations totaling almost 800MB. chiSIM is implemented
using Repast HPC and has been highly parallelized so that a
single one-year simulation run for the entire Chicago
population at one-hour intervals requires only several minutes
of wall time on a modest size cluster (128 processes). Places
are distributed among compute processes, and agents are free
to move between processes, according to their decisions made
within the model, i.e., when their activity changes. A spatially
partitioned set of locations is developed that assigns locations
to compute processes with the objective of minimizing person
agent movement between processes [9].
Simulation data is typically logged at the agent level in an
agent-based model. The log records each agent’s set of
activities and state at each simulation time step as the
simulation progresses and is used to reconstruct agent-
interaction patterns in post-simulation analyses; however, this
adds both significant computational overhead and data storage
requirements. For example, the log can be used to reconstruct
all the agents that an agent had contact with over the course of
an epidemic simulation, and used to trace back to patient zero,
the agent who initiated the disease outbreak. The log
generated by chiSIM exceeds several terabytes for a typical
one-year simulation duration.
chiSIM uses an event-based logging approach that only
logs changes in person agent state, such as when the activity,
location, or, in the described application, disease state
changes. Considering that agent activity states change only
several times per day, the use of event-based logging reduces
both computational and storage costs dramatically. The
simulation event log contains the complete information
required to create a person collocation network with arbitrary
time granularity, e.g., hourly, daily, weekly or monthly
aggregates. Simulation log data are post-processed using
network synthesis and analysis in an R MPI implementation
suitable for distributed compute clusters.
III. LOGGING IMPLEMENTATION
The event-based logging implementation records a log
entry each time a person agent changes activities. The basic
log entry data contains the start and stop times of the activity
and unique identification numbers for the person, activity and
location, which are stored as 4-byte unsigned integers. This
log schema stores the required information in the smallest
possible data type, and individual integer values can range from 0 to 2^32-1, which provides adequate numerical precision to encode unique activities, people, and locations for very large scale simulations. This log format is also much smaller than
simply logging the associated activity, location, or agent state
descriptions as a string format. Log entries can be extended by
the addition of other integer entries to support the logging of
agent properties such as a disease state. The unique ID
numbers recorded in the log data can be cross-referenced to
the model input data for persons, activities and locations for
the purpose of looking up the string description for entries and
for filtering simulation results via queries on the input data,
e.g., to create a subset of results for persons matching certain
demographic criteria.
A static logger instance is created for each process in a
distributed cluster simulation such that each process logger is
responsible for logging activity changes that occur only in that
process. This architecture explicitly parallelizes the logging
framework across process CPUs, memory and disk I/O. Each
logger stores entries in memory in a cache that is implemented
as a 2D integer array. The log cache size is variable although
a nominal size of 10,000 log entries is used to store log data
in memory before it is written to disk. The cache size can be
adjusted depending on the available system memory and
simulation requirements. A smaller cache will reduce memory
usage but will result in more individual write operations,
which can be computationally expensive. In contrast, a larger
cache will require more memory but will run faster because
fewer write operations are required.
Log output to disk is implemented using the serial HDF5
library [10] and operates with chunked data so that the entire
set of cached log entries is written all at once after the cache
is full. HDF was chosen as the data format due to the fast write
performance and compact size of the binary output files along
with fast index-based read performance which is helpful when
loading the files later for analysis. Furthermore, HDF5 is
widely available on computing clusters for logging and is
supported by many computational analysis packages
and libraries.
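For illustration, the following minimal R sketch mirrors one cache flush to a chunked HDF5 dataset using the rhdf5 package referenced in Section IV; the file name, the "log" dataset name, and the column order are assumptions, and the actual logger is implemented in C++ against the serial HDF5 library rather than in R.

library(rhdf5)

# Each log entry is five 4-byte integers (20 bytes): person ID,
# activity ID, location ID, and activity start and stop times.
# R integers are signed 32-bit, an approximation of the unsigned schema.
cache_size <- 10000L
log_file <- "process_0.h5"  # one log file per process; name illustrative

h5createFile(log_file)
# Chunk the dataset by one full cache so each flush is a single write.
h5createDataset(log_file, "log", dims = c(cache_size, 5),
                storage.mode = "integer", chunk = c(cache_size, 5))

# Simulate a full in-memory cache and write it to disk in one operation.
cache <- matrix(sample.int(.Machine$integer.max, cache_size * 5,
                           replace = TRUE), ncol = 5)
h5write(cache, log_file, "log")
h5closeAll()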
To provide context for the size of the log data created for
the chiSIM model, consider 2.9 million individuals changing
activities 5 times per day on average. Each log entry is 20
bytes in size, so the total log data for all individuals for a
simulated week is approximately 2 GB. Simulations for a one
year period typically produce combined simulation output in
the range of 100-200 GB, depending on the variability of the
daily activity schedule. By running chiSIM on a
computational cluster, however, the logging CPU overhead
along with disk write bandwidth is distributed across
processes and is minimal compared to the execution of the
model behaviors. For example, on a 64 process cluster the size
of a single process log file for one simulation week becomes
approximately 30 MB and sizes for a whole simulation year
would be around 1.5 GB. This scenario generates 64 log files
which can then be easily loaded by any type of analysis
package capable of loading HDF files in an iterative or batch
fashion.
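As a consistency check on these figures: 2.9 x 10^6 persons x 5 entries/day x 7 days x 20 bytes/entry is approximately 2.0 x 10^9 bytes, or 2 GB per simulated week; the same rate over 365 days gives roughly 106 GB, in line with the reported 100-200 GB range, and dividing the weekly total across 64 processes yields the approximately 30 MB per-process file size.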
IV. COLLOCATION NETWORK SYNTHESIS
Generating a network that connects individuals via
physical collocation at the same simulation location and time
requires processing the simulation log data for the time period
of interest, e.g., all contacts between persons over a week at
all locations. The simulation log data is compact but does not
directly lend itself to network generation because there are
many recorded instances of activities happening at different
times and places. Several straightforward data transformations
are used to create the collocation network from the log data.
A collocation network can be represented by a symmetric
adjacency matrix A whose elements are the weights between
persons in the network and the dimensions of A are p x p,
where p is the total number of persons in the simulation. A
nonzero value at an index i,j of A represents a connection
between individuals i and j at some time and the magnitude of
the value represents the number of time units these individuals
were collocated. Most values of A will in fact be zero when a
simulation is highly spatially segregated. Additionally, since
A is symmetric, it can be stored as a sparse triangular matrix
which provides significant memory and processing time
savings compared to using a full, dense matrix. It is convenient to calculate the sparse matrix A_l = x·x^T, where x is a sparse p x t collocation matrix that contains entries for all times t when persons exist at a specific location l. The weighted adjacency matrix A for the entire model population across all places is the sum of the adjacency matrices for each location, A = Σ_l A_l. The sparse collocation matrix x is created by additively processing log entries in a simulation output file and filling in values of 1 for the times a person is doing an activity at the location. Each row in x corresponds to a unique person ID for all p persons and uses the same index values as the adjacency matrix. The elements of x are simply binary values that indicate when each person row index was present for each column time index. The multiplication x·x^T sums all of the times each person collocates with every other person and stores this in A_l.
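A minimal sketch of this computation for a single location, using the R Matrix package described below; the dimensions and entries are illustrative:

library(Matrix)

# Sparse p x t collocation matrix for one location: a 1 at (i, k)
# means person i was present at this location during hour k.
p <- 6L; t <- 24L
x <- sparseMatrix(i = c(1, 2, 2, 3, 5),
                  j = c(3, 3, 4, 3, 10),
                  x = 1, dims = c(p, t))

# A_l = x.x^T: entry (i, j) counts the hours persons i and j were
# collocated here; tcrossprod() computes x %*% t(x) efficiently.
A_l <- tcrossprod(x)

# A_l is symmetric, so its upper triangle captures the full network;
# the diagonal counts each person's own presence and can be discarded.
A_l <- triu(A_l)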
A. Parallel Implementation
The mathematical description of the network generation process presented above is straightforward; however, real-world parallel performance depends heavily on implementation. R was selected as the computing environment to perform both network synthesis and analysis due to the availability of useful third party libraries and parallel capabilities on clusters. The implementation and the specific libraries used are described here.
1) Data loading: R is first started in a single root process
to serially load and minimally process simulation output files
before dispatching processing tasks to worker processes.
Simulation HDF5 log files are read using the rhdf5 package
from the Bioconductor project [11, 12], which provides the log
data as a data frame. The data frame schema mirrors
the structure of the simulation log file such that each row is a
log entry representing an activity change, with columns for
start and end times, and IDs for person, activity, and location.
The log data frame is subsequently converted to a data table
using the R data.table package [13] which provides the
capacity to perform extremely fast binary searches on the
table for the purpose of data sub-setting.
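In sketch form, assuming the log schema of Section III and an illustrative "log" dataset name:

library(rhdf5)
library(data.table)

# Read one process log file into a matrix and convert to a data table.
raw <- h5read("process_0.h5", "log")
log_dt <- as.data.table(raw)
setnames(log_dt, c("person", "activity", "place", "start", "stop"))

# Keying the table lets subsequent subsets use fast binary search.
setkey(log_dt, place, start)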
2) Collocation matrices creation: Once the log data are
in a data table, the process of creating the collocation matrices
requires first sub-setting the table into time slices, e.g., one
week, based on the start and stop times of the log entries.
This can be accomplished either by reading an entire
simulation log file, or a portion of a file if it is too large to fit
in memory. The list of place IDs that occur in the time slice is
obtained by retrieving all unique place ID values. These place
IDs are partitioned for processing in parallel such that each
worker process will receive a subset of activities based on the
places for which they will calculate collocation matrices. The
sub-setting step is extremely fast (seconds) even when performed
serially on tables with millions of rows, due to the
data.table implementation.
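Continuing the sketch, the time slicing and place partitioning might look as follows; the week boundaries are illustrative:

# Entries overlapping the fourth simulated week (hours 504-671 at a
# 1-hour time step).
t0 <- 504L; t1 <- 672L
slice <- log_dt[start < t1 & stop >= t0]

# All place IDs occurring in the slice, split into one chunk of places
# per worker process.
n_workers <- 64L
places <- unique(slice$place)
chunks <- split(places, cut(seq_along(places), n_workers, labels = FALSE))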
The SNOW R package [14] is used by the main process to
manage the worker processes that create the collocation
matrices from log entries. Workers can be initialized on a
single workstation using SNOW to create a “socket” type
cluster that will nominally result in a set of workers equal to
the number of available CPUs. For larger clusters the use of
an MPI backend through the Rmpi library [15] allows for
parallelization across a much larger number of processes.
Each worker iterates over the set of place IDs it has been
assigned and retrieves log entries corresponding to each ID.
From each set of entries, a sparse p x t collocation matrix x is
created for all p persons and all t times in the time slice using
the R Matrix package [16]. The collocation matrix for each
location is saved in a list and returned to the root process.
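The dispatch can be sketched as follows; the worker function is a hypothetical reconstruction of the behavior described above, not the paper's exact code, and person IDs are assumed to be 1-based row indices:

library(snow)

# Socket cluster on one workstation; on larger clusters this becomes
# makeCluster(n_workers, type = "MPI") via the Rmpi backend.
cl <- makeCluster(n_workers, type = "SOCK")
clusterEvalQ(cl, library(Matrix))

# Build one sparse p x t collocation matrix per assigned place ID.
make_colloc <- function(place_ids, slice, p, t0, t1) {
  lapply(place_ids, function(pl) {
    e <- slice[slice$place == pl, ]
    # Expand each log entry into the hours it covers within the slice.
    hrs <- mapply(function(s, st) seq(max(s, t0), min(st, t1 - 1L)),
                  e$start, e$stop, SIMPLIFY = FALSE)
    sparseMatrix(i = rep(e$person, lengths(hrs)),
                 j = unlist(hrs) - t0 + 1L,
                 x = 1, dims = c(p, t1 - t0))
  })
}

p <- 2927761L  # total number of persons in the simulation
colloc_lists <- clusterApply(cl, chunks, make_colloc,
                             slice = slice, p = p, t0 = t0, t1 = t1)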
3) Collocation matrix list partitioning: The lists of
collocation matrices returned from the workers are combined
into a single list for the purpose of evenly partitioning the list
according to the number of nonzero elements in each
collocation matrix. This step is crucial to achieve even load
balancing across workers that will generate the adjacency
matrices from the collocation matrices. The number of
nonzero elements in a collocation matrix reflects the number of
collocated persons at that location. Without this balancing
step, some workers would sit idle while others would be
working for extended periods of time due to the variance in
the number of collocated persons at different locations, which
can range from a single individual to tens of thousands of
individuals. The partitioning creates a set of lists, one for each
of the worker processes such that the collocation matrices are
evenly distributed among workers based on the number of
nonzero elements in each matrix.
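One simple realization of this balancing is a greedy longest-processing-time assignment, sketched below; this is an assumed scheme consistent with the description, not necessarily the exact one used:

# Flatten the per-worker results into one list of collocation matrices.
all_mats <- unlist(colloc_lists, recursive = FALSE)

# Greedy assignment: hand the largest remaining matrix (by nonzero
# count, the @x slot of a sparse matrix) to the least-loaded worker.
nnzs <- vapply(all_mats, function(m) length(m@x), integer(1))
bins <- vector("list", n_workers)
loads <- numeric(n_workers)
for (k in order(nnzs, decreasing = TRUE)) {
  w <- which.min(loads)
  bins[[w]] <- c(bins[[w]], all_mats[k])
  loads[w] <- loads[w] + nnzs[k]
}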
4) Adjacency matrices creation: The load-balanced lists of collocation matrices are provided to the workers to create the p x p adjacency matrix A_l = x·x^T from each collocation matrix x in the list. Since the collocation matrices are sparse,
the resulting symmetric adjacency matrices are also quite
sparse. Furthermore, since the collocation network graphs
are not directional, the adjacency matrix can be stored as an
upper triangular matrix that fully captures the collocation
time weights of the network edges. Each worker finally sums
the set of adjacency matrices it has created and returns a
single adjacency matrix to the root process which further
reduces the worker adjacency matrices to a single adjacency
matrix.
At this point, the resulting sparse triangular p x p
adjacency matrix fully defines the collocation network
structure with the nonzero elements representing the amount
of time each person was collocated with each other person
during the selected time slice, e.g., one week. The network
generation process is applied to the log files sequentially,
producing a set of adjacency matrices, one per log file and
time interval. To generate the complete network across multiple
log files, the adjacency matrices are simply summed. The final
step uses the iGraph R package [17] to create a network graph
object from the adjacency matrix for analysis and
visualization.
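Continuing the illustrative sketch, the reduction and final conversion steps are:

# Each worker computes A_l = x.x^T per matrix and sums its local set.
worker_adj <- clusterApply(cl, bins, function(mats) {
  Reduce(`+`, lapply(mats, function(x) triu(tcrossprod(x))))
})
stopCluster(cl)

# Root reduction across workers; summing again across log files and
# time slices yields the complete network adjacency matrix.
A <- Reduce(`+`, worker_adj)

# Convert to a weighted, undirected igraph object for analysis.
library(igraph)
g <- graph_from_adjacency_matrix(A, mode = "upper",
                                 weighted = TRUE, diag = FALSE)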
V. NETWORK ANALYSIS
The collocation network synthesis is applied to a chiSIM
simulation run consisting of approximately 2.9 million
individuals, representing the entire Chicago population. The
simulation is executed in parallel on a cluster of 16 compute
nodes each with 16 processes (256 processes total) on the
Argonne Blues cluster consisting of Sandy Bridge Xeon E5-
2670 2.6GHz processors. The entire simulated time duration
is four weeks with a time step of 1 hour and requires minimal
wall time (about 1 minute) on the cluster. The simulation
generates a set of 256 output activity log files of
approximately 100MB each. A simulated 1-year time duration
would create 256 x 1.2GB log files.
The collocation network synthesis R script is executed on
the resulting log files to process only the fourth week of log
data in batches of 16 files at a time and each batch is run on a
64-process cluster on Blues, for a total cluster resource use of
1024 processes. Since the processing of log files is
independent across batches, each batch of 16 can be run as
separate batch jobs on Blues, potentially all running
simultaneously. Batching the processing this way also
facilitates execution on the cluster queue, since several
smaller jobs of 64 processes are generally processed more
quickly in the queue than one large job of 1024 processes.
Each batch job takes approximately 30 minutes to
complete. These may all complete simultaneously if the
cluster queue is free, but the completion of all batch jobs
typically requires 1-1.5 hours to process a single week of
simulation log data. The final aggregation step sums the
resulting adjacency matrices to produce a complete network
sparse adjacency matrix. The complete sparse triangular
adjacency matrix represents a network consisting of 2,927,761
vertices (persons) and 830,328,649 edges (collocations) and
requires approximately 10GB of memory to store when
loaded in R.
A. Visualization
Network visualization is useful in characterizing local
structural properties of the collocation network. Such
networks may typically contain striking local dense clusters of
individuals that are very highly connected along with
individuals that share connections to multiple clusters and act
as network bridges between the dense clusters.

Figure 1. Sample collocation network for randomly sampled individual and all individuals within two degrees of separation. Lines represent collocation between person nodes and node color represents vertex degree with darker nodes having a greater degree than lighter nodes.

It is not practical nor likely useful to visualize a network with 10^6 nodes and 10^9 edges; however, visualization of portions of the
entire collocation network can be performed. Local network
structures can be observed by selecting individuals and finding all adjacent vertices to create set V_1 and then all adjacent vertices to V_1 to create set V_2. The union V = V_1 ∪ V_2 contains all vertices within a graph radius of two from the
original selected individual. A single workstation was used to
serially load and process the complete adjacency matrix. The
R iGraph package provides functions to find adjacent vertices
and generate sub-graphs from the original graph such that all
edges between nodes in the set V are preserved.
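In sketch form, this extraction maps directly onto igraph's ego-graph functions:

library(igraph)

# Randomly select a person and extract the subgraph induced by all
# vertices within two hops; edges among the kept vertices are preserved.
v <- sample(V(g), 1)
sub <- make_ego_graph(g, order = 2, nodes = v)[[1]]

# Export for layout and rendering in Gephi; GraphML is one format that
# both igraph and Gephi support.
write_graph(sub, "ego_network.graphml", format = "graphml")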
Figure 1 shows a network generated from the process
described above using a randomly selected individual from
the complete collocation network. The subgraph generated
from the vertex set V contains 2,529 nodes and 391,104 edges
and was exported from R using iGraph. The graph
visualization was generated using Gephi [18] with the “Force
Atlas 2” layout which is useful in spatializing Small-World
[4] and scale-free networks [19]. The positioning of nodes is
force-directed such that clusters of highly connected nodes are
positioned closer, as are nodes with greater edge weights. The
graph nodes are colored according to their degree: those with more neighbors are darker in color than nodes with fewer neighbors.

Figure 2. Sample collocation network for randomly sampled individual and all individuals within two degrees of separation. Lines represent collocation between person nodes and node color represents vertex degree with darker nodes having a greater degree than lighter nodes.
Figure 2
shows a second example from another randomly
sampled person and all person nodes within a radius of two.
This subgraph contains 1,097 nodes and 41,372 edges. This
network demonstrates a much less dense configuration of
connections, showing many disparate clusters more diffusely
connected to each other when compared to those in Figure 1.
B. Quantitative metrics
The complete collocation network for all 2.9 million
persons was analyzed via network statistics computed on a
single workstation using the R iGraph package. The vertex
degree distribution for the entire network is shown on a log-
log scale plot in Figure 3. Each point in the figure is the vertex
degree distribution fraction, scaled by the total number of
persons. Vertex degree values between 1 and 7 are each represented by just over 10^5 persons, followed by a rapid drop in the population for larger vertex degree. Figure 3 also includes a dashed red line representing the vertex connectivity distribution of a scale-free (power-law) network, which is P(k) ~ k^(-α), where α = 1.5 is the power law exponent. Networks that display scale-free vertex connectivity have a degree distribution that follows a line with values of α typically between 1 and 3 [5, 19].
While the vertex degree graph appears linear in log-log
scale for some vertex degree ranges, it is not linear over
multiple orders of magnitude and therefore does not strictly fit
a power-law distribution. A truncated power law distribution of the form P(k) ~ k^(-α)e^(-k/k_c) is sometimes used to fit degree distributions such as power laws with rolled-off tails [20], where α is the power law exponent and k_c is a cut-off value for the vertex degree. Figure 3 includes a truncated power law form with α = 1.25 and k_c = 10, shown as a solid green line. The truncated form does appear to better fit the tail end of the vertex degree distribution; however, the other nonlinearities in the real population data are not captured. Finally, an exponential form P(k) ~ e^(-k/k_c) is plotted in Figure 3 as a
dash-dot black line. Again, the exponential form does capture
the tail roll off better but is still unable to capture the more
complex characteristics of the degree distribution.
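The comparison in Figure 3 can be reproduced in sketch form with igraph and base R; the vertical scaling constant C below is an illustrative normalization, not a fitted value:

# Empirical vertex degree counts (zero-degree vertices are dropped
# for the log-log axes).
deg <- degree(g)
tab <- table(deg[deg > 0])
k <- as.numeric(names(tab))
freq <- as.numeric(tab)

plot(k, freq, log = "xy", pch = 20, col = "blue",
     xlab = "vertex degree k", ylab = "number of persons")

# Candidate forms with the constants quoted in the text.
C <- max(freq)
lines(k, C * k^(-1.5), lty = 2, col = "red")                   # power law
lines(k, C * k^(-1.25) * exp(-k / 10), lty = 1, col = "green") # truncated
lines(k, C * exp(-k / 10), lty = 4, col = "black")             # exponential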
The local clustering coefficient, or transitivity, is
calculated for each person vertex in the collocation network
and describes the local connectedness of each vertex’s
neighbors via the ratio of connected edge triangles and triples
centered on the vertex [21]. The histogram of the local vertex
clustering coefficient for the entire person collocation
network is shown in Figure 4. Many of the person nodes have
a clustering coefficient of 1 which indicates a high degree of
local clustering in the collocation network. Large clustering
coefficients are typically found in scale-free and Small-
World networks compared to random graphs [19, 22].
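The per-vertex computation is a one-liner with igraph; vertices with fewer than two neighbors have an undefined coefficient and are dropped before plotting:

# Local clustering coefficient for every person vertex.
cc <- transitivity(g, type = "local")
hist(cc[!is.nan(cc)], breaks = 50,
     xlab = "local clustering coefficient", main = "")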
To better explore how the vertex degree distribution
varies with population demographics, the entire simulated
population was divided according to age groups. Vertex
degree distribution graphs for individual age group
demographics are shown in Figure 5. These figures represent
the within-group network connectedness such that only
collocation connections between persons within each age
group are considered and edges between age groups are
removed.
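A sketch of the within-group extraction, assuming an age_group vertex attribute has been joined onto the graph from the person input data (the attribute name and level are assumptions):

# Induced subgraph for one age group: edges to vertices outside the
# group are dropped, leaving only within-group collocations.
g_kids <- induced_subgraph(g, V(g)[age_group == "0-14"])
deg_kids <- degree(g_kids)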
The 0-14 year age group demographic demonstrates the
largest deviation from power-law scaling in the vertex degree
distribution, and is nearly flat over two orders of magnitude
of vertex degree. This can likely be attributed to the fact that
children would primarily collocate with other children in
schools that would place constraints on the number of
connections due to school and class sizes.
The other age group demographics display somewhat
linear (in log-log scale) vertex degree distributions. The 15-
18 year old group also displays some flattening of the degree
distribution which again is likely due to school activities.
Groups of outlying points are visible in the 19-44 and the 65+
age groups which could be attributed to collocation at places
such as universities, prisons, or retirement communities and
hospitals.
Figure 3. Log-log plot of collocation network vertex degree frequency
distribution for the entire Chicago population for a single week (blue
points). The first 50 vertex degree points are connected with a solid blue
line for clarity. The red dashed line represents the power-law probability,
the green solid line is a truncated power law probability, and the black
dash-dot line is an exponential probability.
Figure 4. Histogram of the local vertex clustering coefficient for the collocation network of the entire Chicago population for a single week.
VI. CONCLUSIONS
A novel parallel method to generate and analyze large-
scale person collocation networks from simulated agent-
based models was demonstrated. The endogenous collocation
network structure arises from the simulation of the daily
activity schedules for every person in the city of Chicago.
While the daily schedules for each person are a priori inputs
to the ABM, the actual network structure is an emergent
property of the activity data that can only be realized by
executing the full-scale simulation.
The network analysis approach described uses the free
and open-source R programming environment and libraries
that provide cluster-based parallel execution via MPI.
Furthermore, this analysis approach is more accessible to data
analysts who may be familiar with R but not with distributed
parallel programming. The use of a compute cluster to run
the analysis in parallel was essential because a single
workstation would not be feasible given the required run
time and memory footprint.
Analysis of the Chicago population collocation networks
via vertex degree distributions provides a useful metric to
characterize the social structure of the community. The
vertex degree distribution for the entire population shows
complex features that are not captured by power-law or
exponential distributions which are commonly used to fit
similarly structured networks. By disaggregating the
population according to age demographics, the vertex degree
distribution can be computed for each group. Some of the
group degree distributions follow the shape of the complete
network distribution, while some differ significantly,
suggesting different structural configurations of the
collocation network depending on age.
Aside from the direct examination of urban social
network data via degree distribution statistics, a second
potential benefit of the analysis is the ability to generally
characterize such networks. Simulation models that use
networks as inputs, rather than produce them as emergent
outputs, would benefit. The capacity to properly characterize
the structure of the social network model input with real data,
rather than assume its structure, has the potential to
strengthen the applicability of the model.
An existing challenge, of course, is properly
characterizing the types of social interaction models used by
a simulation [6, 23]. Various methods exist for generating
random scale-free networks that may be superficially similar
in structure to those displayed by the chiSIM model [24].
Figure 5. Vertex degree distribution for collocation networks of the entire Chicago population for a single week for each age group.

Random synthetic networks could be a starting point for a
realistic social interaction network model, but would need to
be tailored to capture the more complex structure in the
vertex degree distribution graphs presented in this paper. The
notion of using generated random scale-free or power-law
networks to represent social networks in theoretical
epidemiology simulation models also needs to be examined
in light of the differences between those networks and the
empirically-based networks presented here and in [1].
Furthermore, it is likely that an accurate characterization
of the real population social network will require that
synthetically generated networks also match the vertex
degree distributions for population sub-groups such as age or
location type, e.g., work or school. Further exploration of
this approach to generate realistic social network structures
will need to identify additional network statistics and their
relative contributions to the features of the network.
ACKNOWLEDGMENT
This work is supported by the National Science
Foundation (NSF) RAPID Award DEB-1516428 and the U.S.
Department of Energy under contract number DE-AC02-
06CH11357. This research used computing resources
provided on Blues, a high-performance computing cluster
operated by the Laboratory Computing Resource Center at
Argonne National Laboratory, which is a DOE Office of
Science User Facility.
REFERENCES
[1] S. Eubank, H. Guclu, V. S. Anil Kumar, M. V. Marathe, A.
Srinivasan, Z. Toroczkai, et al., "Modelling disease outbreaks
in realistic urban social networks," Nature, vol. 429, pp. 180-
184, 2004.
[2] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A.
Vespignani, "Epidemic processes in complex networks,"
Reviews of Modern Physics, vol. 87, pp. 925-979, 2015.
[3] N. C. Grassly and C. Fraser, "Mathematical models of
infectious disease transmission," Nature Reviews
Microbiology, vol. 6, pp. 477-487, 2008.
[4] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, pp. 440-442, 1998.
[5] S. H. Strogatz, "Exploring complex networks," Nature, vol.
410, pp. 268-276, 2001.
[6] S. Riley, K. Eames, V. Isham, D. Mollison, and P. Trapman,
"Five challenges for spatial epidemic models," Epidemics, vol.
10, pp. 68-71, 2015.
[7] C. M. Macal, M. J. North, N. Collier, V. M. Dukic, D. T.
Wegener, M. Z. David, et al., "Modeling the transmission of
community-associated methicillin-resistant Staphylococcus
aureus: a dynamic agent-based simulation," Journal of
Translational Medicine, vol. 12, p. 124, 2014.
[8] W. Yang, W. Zhang, D. Kargbo, R. Yang, Y. Chen, Z. Chen,
et al., "Transmission network of the 2014–2015 Ebola
epidemic in Sierra Leone," Journal of The Royal Society
Interface, vol. 12, 2015.
[9] N. Collier, J. Ozik, and C. M. Macal, "Large-Scale Agent-
Based Modeling with Repast HPC: A Case Study in
Parallelizing an Agent-Based Model," in Euro-Par 2015:
Parallel Processing Workshops: Euro-Par 2015 International
Workshops, Vienna, Austria, August 24-25, 2015, Revised
Selected Papers, S. Hunold, A. Costan, D. Giménez, A. Iosup,
L. Ricci, M. E. Gómez Requena, et al., Eds., ed Cham: Springer
International Publishing, 2015, pp. 454-465.
[10] The HDF Group. Available: https://www.hdfgroup.org/
[11] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M.
Dettling, S. Dudoit, et al., "Bioconductor: open software
development for computational biology and bioinformatics,"
Genome Biology, vol. 5, pp. R80-R80, 2004.
[12] Bioconductor. Available: https://www.bioconductor.org/
[13] M. Dowle, A. Srinivasan, J. Gorecki, T. Short, S. Lianoglou,
and E. Antonyan. data.table: Extension of 'data.frame'.
Available: https://cran.r-project.org/web/packages/data.table/index.html
[14] L. Tierney, A. J. Rossini, N. Li, and H. Sevickova. snow:
Simple Network of Workstations. Available: https://cran.r-project.org/web/packages/snow/index.html
[15] H. Yu. Rmpi: Interface (Wrapper) to MPI (Message-Passing
Interface). Available:
http://www.stats.uwo.ca/faculty/yu/Rmpi/
[16] D. Bates and M. Maechler. Matrix: Sparse and Dense Matrix
Classes and Methods. Available: https://cran.r-project.org/web/packages/Matrix/
[17] G. Csardi. igraph: Network Analysis and Visualization.
Available: https://cran.r-project.org/web/packages/igraph/index.html
[18] M. Bastian, S. Heymann, and M. Jacomy, "Gephi: An Open
Source Software for Exploring and Manipulating Networks,"
presented at the International AAAI Conference on Web and
Social Media, 2009.
[19] A.-L. Barabási, R. Albert, and H. Jeong, "Mean-field theory for
scale-free random networks," Physica A, vol. 272, pp. 173-187,
1999.
[20] M. E. J. Newman, "The structure of scientific collaboration
networks," Proceedings of the National Academy of Sciences,
vol. 98, pp. 404-409, 2001.
[21] S. Wasserman and K. Faust, Social Network Analysis: Methods
and Applications. Cambridge: Cambridge University Press,
1994.
[22] L. A. N. Amaral, A. Scala, M. Barthélémy, and H. E. Stanley,
"Classes of small-world networks," Proceedings of the
National Academy of Sciences, vol. 97, pp. 11149-11152, 2000.
[23] S. Riley, "Large-Scale Spatial-Transmission Models of
Infectious Disease," Science, vol. 316, pp. 1298-1301, 2007.
[24] C. Dangalchev, "Generation models for scale-free networks,"
Physica A: Statistical Mechanics and its Applications, vol. 338,
pp. 659-671, 2004.