JANUS: A hypothesis-driven Bayesian approach for understanding edge formation in attributed multigraphs

Applied Network Science
Espín-Noboa et al. Applied Network Science (2017) 2:16
DOI 10.1007/s41109-017-0036-1
RESEARCH Open Access
JANUS: A hypothesis-driven Bayesian
approach for understanding edge formation in
attributed multigraphs
Lisette Espín-Noboa1,2* , Florian Lemmerich1,2, Markus Strohmaier1,2 and Philipp Singer1,2
*Correspondence:
Lisette.Espin@gesis.org
This article extends a previous
workshop publication (Espín-Noboa
et al. 2016). The main novelties in
this manuscript include the
extension to dyad-attributed
networks (such as multiplex
networks), additional experimental
results, and a comparison of our
approach to alternative methods.
1GESIS - Leibniz Institute for the
Social Sciences, Unter
Sachsenhausen 6-8, 50667 Cologne,
Germany
2University of Koblenz-Landau,
Universitätstraße 1, 56070 Koblenz,
Germany
Abstract
Understanding edge formation represents a key question in network analysis. Various
approaches have been postulated across disciplines ranging from network growth
models to statistical (regression) methods. In this work, we extend this existing arsenal
of methods with JANUS, a hypothesis-driven Bayesian approach that allows researchers to
intuitively compare hypotheses about edge formation in multigraphs. We model the
multiplicity of edges using a simple categorical model and propose to express
hypotheses as priors encoding our belief about parameters. Using Bayesian model
comparison techniques, we compare the relative plausibility of hypotheses which
might be motivated by previous theories about edge formation based on popularity or
similarity. We demonstrate the utility of our approach on synthetic and empirical data.
JANUS is relevant for researchers interested in studying mechanisms explaining edge
formation in networks from both empirical and methodological perspectives.
Keywords: Edge formation, Bayesian inference, Attributed multigraphs, Multiplex,
HypTrails
Introduction
Understanding edge formation in networks is a key interest of our research commu-
nity. For example, social scientists are frequently interested in studying relations between
entities within social networks, e.g., how social friendship ties form between actors and
explain them based on attributes such as a person’s gender, race, political affiliation or
age in the network (Sampson 1968). Similarly, the complex networks community suggests
a set of generative network models aiming at explaining the formation of edges focus-
ing on the two core principles of popularity and similarity (Papadopoulos et al. 2012).
Thus, a series of approaches to study edge formation have emerged including statistical
(regression) tools (Krackhardt 1988; Snijders et al. 1995) and model-based approaches
(Snijders 2011; Papadopoulos et al. 2012; Karrer and Newman 2011) specifically estab-
lished in the physics and complex networks communities. Other disciplines such as the
computer sciences, biomedical sciences or political sciences use these tools to answer
empirical questions; e.g., co-authorship networks (Martin et al. 2013), wireless networks
of biomedical sensors (Schwiebert et al. 2001), or community structures of political blogs
(Adamic and Glance 2005).
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Problem illustration Consider for example the network depicted in Fig. 1. Here, nodes
represent authors, and (multiple) edges between them refer to co-authored scientific
articles. Node attributes provide additional information on the authors, e.g., their home
country and gender. In this setting, an exemplary research question could be: “Can
co-authorship be better explained by a mechanism that assumes more collaborations
between authors from the same country or by a mechanism that assumes more collabora-
tions between authors with the same gender?”. These and similar questions motivate the
main objective of this work, which is to provide a Bayesian approach for understanding
how edges emerge in networks based on some characteristics of the nodes or dyads.
While several methods for tackling such questions have been proposed, they come
with certain limitations. For example, statistical regression methods based on QAP
(Hubert and Schultz 1976) or mixed-effects models (Shah and Sinha 1989) do not scale
to large-scale data and results are difficult to interpret. For network growth models
(Papadopoulos et al. 2012), it is necessary to find the appropriate model for a given
hypothesis about edge formation and thus, it is often not trivial to intuitively compare
competing hypotheses. Consequently, we want to extend the methodological toolbox for
studying edge formation in networks by proposing a first step towards a hypothesis-
driven generative Bayesian framework.
Approach and methods We focus on understanding edge formation in attributed multigraphs.
We are interested in modeling and understanding the multiplicity of edges based
on additional network information, i.e., given attributes for the nodes or dyads in the network.
Our approach follows a generative storyline. First, we define the model that can
characterize the edge formation of interest. We focus on the simple categorical model,
from which edges are independently drawn. Motivated by previous work on sequential
data (Singer et al. 2015), the core idea of our approach is to specify generative
hypotheses about how edges emerge in a network. These hypotheses might be motivated
by previous theories such as popularity or similarity (Papadopoulos et al. 2012); e.g.,
for Fig. 1 we could hypothesize that authors are more likely to collaborate with each
other if they are from the same country. Technically, we elicit these types of hypotheses
as beliefs in parameters of the underlying categorical model and encode and integrate
them as priors into the Bayesian framework. Using Bayes factors with marginal likelihood
estimations allows us to compare the relative plausibility of expressed hypotheses as they
are specifically sensitive to the priors. The final output is a ranking of hypotheses based
on their plausibility given the data.
Fig. 1 Example: This example illustrates an unweighted attributed multigraph. (a) Shows a multigraph where
nodes represent academic researchers, and edges scientific articles in which they have collaborated
together. (b) Shows the adjacency matrix of the graph, where every cell represents the total number of edges
between two nodes. (c) Decodes some attribute values per node. For instance, node D shows information
about an Austrian researcher who started his academic career in 2001. One main objective of JANUS is to
compare the plausibility of mechanisms derived from attributes for explaining the formation of edges in the
graph. For example, here, a hypothesis that researchers have more collaborations if they are from the same
country might be more plausible than one that postulates that the multiplicity of edges can be explained
based on the relative popularity of authors
Contributions The main contributions of this work are:
1. We present a first step towards a Bayesian approach for comparing generative
hypotheses about edge formation in networks.
2. We provide simple categorical models based on local and global scenarios allowing
the comparison of hypotheses for multigraphs.
3. We show that JANUS can be easily extended to dyad-attributed multigraphs when
multiplex networks are provided.
4. We demonstrate the applicability and plausibility of JANUS based on experiments
on synthetic and empirical data, as well as by comparing it to the state-of-the-art
QAP.
5. We make an implementation of this approach openly available on the Web
(Espín-Noboa 2016).
Structure This paper is structured as follows: First, we start with an overview of
some existing research on modeling and understanding edge formation in networks in
Section “Related work”. We present some background knowledge required in this work
in Section “Background” to then explain step-by-step JANUS in Section “Approach”.
Next, we show JANUS in action and the interpretation of results, by running four dif-
ferent experiments on synthetic and empirical data in Section “Experiments”. In Section
“Discussion” we suggest a fair comparison of JANUS with the Quadratic Assignment
Procedure (QAP) for testing hypotheses on dyadic data. We also highlight some impor-
tant caveats for further improvements. Finally, we conclude in Section “Conclusions” by
summarizing the contributions of our work.
Related work
We provide a broad overview of research on modeling and understanding edge formation
in networks; i.e., edge formation models and hypothesis testing on networks.
Edge formation models A variety of models explaining underlying mechanisms of
network formation have been proposed. Here, we focus on models explaining linkage
between dyads beyond structure by incorporating node attribute information. Promi-
nently, the stochastic blockmodel (Karrer and Newman 2011) aims at producing and
explaining communities by accounting for node correlation based on attributes. The
attributed graph (Pfeiffer III et al. 2014) models network structure and node attributes by
learning the attribute correlations in the observed network. Furthermore, the multiplica-
tive attributed graph (Kim and Leskovec 2011) takes into account attribute information
from nodes to model network structure. This model defines the probability of an edge as
the product of individual attribute link formation affinities. Exponential random graph
models (Robins et al. 2007) (also called the p* class of models) represent graph distributions
with an exponential linear model that uses feature-structure counts such as
reciprocity, k-stars and k-paths. In this line of research, p1 models (Holland and Leinhardt
1981) consider expansiveness (sender) and popularity (receiver) as fixed effects associ-
ated with unique nodes in the network (Goldenberg et al. 2010) in contrast to the p2
models (Robins et al. 2007) which account for random effects and assume dyadic indepen-
dence conditionally to node-level attributes. While many of these works focus on binary
relationships, (Xiang et al. 2010) proposes an unsupervised model to estimate continuous-
valued relationship strength for links from interaction activity and user similarity in social
networks. Recently, the work in (Kleineberg et al. 2016) has shown that connections in
one layer of a multiplex can be accurately predicted by utilizing the hyperbolic distances
between nodes from another layer in a hidden geometric space.
Hypothesis testing on networks Previous works have implemented different tech-
niques to test hypotheses about network structure. For instance, the work in (Moreno
and Neville 2013) proposes an algorithm to determine whether two observed networks
are significantly different. Another branch of research has specifically focused on dyadic
relationships utilizing regression methods accounting for interdependencies in network
data. Here, we find Multiple Regression Quadratic Assignment Procedure (MRQAP)
(Krackhardt 1988) and its predecessor QAP (Hubert and Schultz 1976) which permute
nodes in such a way that the network structure is kept intact; this allows testing for the
significance of effects. Mixed-effects models (Shah and Sinha 1989) add random effects to the
models allowing for variation to mitigate non-independence between responses (edges)
from the same subject (nodes) (Winter 2013). Based on the quasi essential graph the work
in (Nguyen 2012) proposes to compare two graphs (i.e., Bayesian networks) by testing
and comparing multiple hypotheses on their edges. Recently, generalized hypergeometric
ensembles (Casiraghi et al. 2016) have been proposed as a framework for model selection
and statistical hypothesis testing of finite, directed and weighted networks that allows
encoding several topological patterns such as block models where homophily plays an
important role in linkage decision. In contrast to our work, neither of these approaches
is based on Bayesian hypothesis testing, which avoids some fundamental issues of classic
frequentist statistics.
Background
In this paper, we focus on both node-attributed and dyad-attributed multigraphs with
unweighted edges that have no identity of their own. That is, each pair of nodes (dyad) can be
connected by multiple indistinguishable edges, and features are available for the individual
nodes or dyads.
Node-attributed multigraphs We formally define this as: Let $G = (V, E, F)$ be an
unweighted attributed multigraph with $V = (v_1, \ldots, v_n)$ being a list of nodes,
$E = \{(v_i, v_j)\} \in V \times V$ a multiset of either directed or undirected edges, and a set of
feature vectors $F = (f_1, \ldots, f_n)$. Each feature vector $f_i = (f_i[1], \ldots, f_i[c])^T$ maps a node
$v_i$ to $c$ (numeric or categorical) attribute values. The graph structure is captured by an
adjacency matrix $M_{n \times n} = (m_{ij})$, where $m_{ij}$ is the multiplicity of edge $(v_i, v_j)$ in $E$ (i.e.,
number of edges between nodes $v_i$ and $v_j$). By definition, the total number of multiedges
is $l = |E| = \sum_{ij} m_{ij}$.
Figure 1a shows an example unweighted attributed multigraph: nodes represent
authors, and undirected edges represent co-authorship in scientific articles. The adjacency
matrix of this graph, which accounts for the multiplicity of edges, is shown in Fig. 1b.
Feature vectors (node attributes) are described in Fig. 1c. Thus, for this particular case,
we account for $n = 4$ nodes, $l = 44$ multiedges, and $c = 6$ attributes.
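The definition above can be illustrated with a small sketch that builds the multiplicity matrix $M$ from a multiset of edges. The helper name `adjacency_matrix` and the toy edge list are assumptions made for illustration only; they are not the values of Fig. 1 and not part of the JANUS implementation.

```python
from collections import Counter

def adjacency_matrix(nodes, edges, directed=False):
    """Return the n x n multiplicity matrix M for a multiset of edges."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    M = [[0] * n for _ in range(n)]
    for (u, v), count in Counter(edges).items():
        i, j = index[u], index[v]
        M[i][j] += count
        if not directed and i != j:
            M[j][i] += count  # mirror undirected edges into the lower triangle
    return M

# Hypothetical toy data (NOT the co-authorship counts of Fig. 1).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "C"), ("C", "D"), ("C", "D")]

M = adjacency_matrix(nodes, edges)
num_multiedges = len(edges)  # l = |E|
```

Because undirected edges are mirrored here, $l$ corresponds to the sum over the upper triangle of $M$ (including the diagonal) in this sketch, not to the sum over all cells.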
Dyad-attributed networks As an alternative to attributed nodes, we also consider
multigraphs in which each dyad (pair of nodes) is associated with a set of features
$\hat{F} = (\hat{f}_{11}, \ldots, \hat{f}_{nn})$. Each feature vector
$\hat{f}_{ij} = (\hat{f}_{ij}[1], \ldots, \hat{f}_{ij}[c])^T$ maps the pair of nodes $(v_i, v_j)$ to
$c$ (numeric or categorical) attribute values. The values of each feature can be represented
in a separate $n \times n$ matrix. As an important special case of dyad-attributed networks, we
study multiplex networks. In these networks, all dyad features are integer-valued. Thus,
each feature can be interpreted as (or can be derived from) a separate multigraph over the
same set of nodes. In our setting, the main idea is then to try and explain the occurrence
of a multiset of edges $E$ in one multigraph $G$ with nodes $V$ by using other multigraphs $\hat{G}$
on the same node set.
Bayesian hypothesis testing Our approach compares hypotheses on edge formation
based on techniques from Bayesian hypothesis testing (Kruschke 2014; Singer et al. 2015).
The elementary Bayes’ theorem states for parameters $\theta$, given data $D$ and a hypothesis $H$,
that:
$$\underbrace{P(\theta \mid D, H)}_{\text{posterior}} = \frac{\overbrace{P(D \mid \theta, H)}^{\text{likelihood}} \; \overbrace{P(\theta \mid H)}^{\text{prior}}}{\underbrace{P(D \mid H)}_{\text{marginal likelihood}}} \qquad (1)$$
As observed data $D$, we use the adjacency matrix $M$, which encodes edge counts. $\theta$
refers to the model parameters, which in our scenario correspond to the probabilities of
individual edges. $H$ denotes a hypothesis under investigation. The likelihood describes
how likely we are to observe data $D$ given parameters $\theta$ and a hypothesis $H$. The prior is
the distribution of parameters we believe in before seeing the data; in other words, the
prior encodes our hypothesis $H$. The posterior represents an adjusted distribution of
parameters after we observe $D$. Finally, the marginal likelihood (also called evidence)
represents the probability of the data $D$ given a hypothesis $H$.
In our approach, we exploit the sensitivity of the marginal likelihood on the prior to
compare and rank different hypotheses: more plausible hypotheses imply higher evidence
for data D. Formally, Bayes Factors can be employed for comparing two hypotheses. These
are computed as the ratio between the respective marginal likelihood scores. The strength
of a Bayes factor can be judged using available interpretation tables (Kass and Raftery
1995). While in many cases determining the marginal likelihood is computationally chal-
lenging and requires approximate solutions, we can rely on exact and fast-to-compute
solutions in the models employed in this paper.
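For reference, the marginal likelihood in the denominator of Eq. (1) and the Bayes factor described above can be written out as follows. This is the standard textbook form and is not quoted from the paper; the symbol $\mathrm{BF}_{1,2}$ is introduced here only for illustration.

```latex
\begin{align}
P(D \mid H) &= \int P(D \mid \theta, H)\, P(\theta \mid H)\, d\theta, \\
\mathrm{BF}_{1,2} &= \frac{P(D \mid H_1)}{P(D \mid H_2)}, \qquad
\log \mathrm{BF}_{1,2} = \log P(D \mid H_1) - \log P(D \mid H_2).
\end{align}
```

The integral makes explicit why the evidence is sensitive to the prior: a prior that places its mass on parameter values under which the data are likely yields a larger $P(D \mid H)$.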
Approach
In this section, we describe the main steps towards a hypothesis-driven Bayesian
approach for understanding edge formation in unweighted attributed multigraphs. To
that end, we propose intuitive models for edge formation (Section “Generative edge formation
models”), a flexible toolbox to formally specify belief in the model parameters
(Section “Constructing belief matrices”), a way of computing proper (Dirichlet) priors
from these beliefs (Section “Eliciting a Dirichlet prior”), computation of the marginal like-
lihood in this scenario (Section “Computation of the marginal likelihood”), and guidelines
on how to interpret the results (Section “Application of the method and interpretation of
results”). We subsequently discuss these issues one-by-one.
Generative edge formation models
We propose two variations of our approach, which employ two different types of
generative edge formation models in multigraphs.
Global model First, we utilize a simple global model, in which a fixed number of graph
edges are randomly and independently drawn from the set of all potential edges in
the graph $G$ by sampling with replacement. Each edge $(v_i, v_j)$ is sampled from a categorical
distribution with parameters $\theta_{ij}$, $1 \le i \le n$, $1 \le j \le n$, $\sum_{ij} \theta_{ij} = 1$:
$(v_i, v_j) \sim \mathrm{Categorical}(\theta_{ij})$. This means that each edge is associated with one probability
$\theta_{ij}$ of being drawn next. Figure 2a shows the maximum likelihood global model for the
network shown in Fig. 1. Since this is an undirected graph, inverse edges can be ignored,
resulting in $n(n+1)/2$ potential edges/parameters.
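As a minimal sketch of the global model, the following code estimates the maximum likelihood parameters from a multiplicity matrix and draws edges from the resulting categorical distribution. The function names `global_mle` and `sample_edges` and the toy matrix are hypothetical and chosen only for this illustration.

```python
import random

def global_mle(M, directed=True):
    """Maximum likelihood estimate of theta for the global categorical model."""
    n = len(M)
    # Undirected graphs use only the upper triangle (including the diagonal),
    # which gives n(n+1)/2 parameters.
    cells = [(i, j) for i in range(n) for j in range(0 if directed else i, n)]
    total = sum(M[i][j] for i, j in cells)
    return {(i, j): M[i][j] / total for i, j in cells}

def sample_edges(theta, num_edges, seed=0):
    """Draw edges independently, with replacement, from Categorical(theta)."""
    rng = random.Random(seed)
    cells = list(theta)
    weights = [theta[c] for c in cells]
    return rng.choices(cells, weights=weights, k=num_edges)

# Hypothetical symmetric multiplicity matrix of an undirected multigraph.
M = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]

theta = global_mle(M, directed=False)
sampled = sample_edges(theta, num_edges=10)
```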
Local models As an alternative, we can also focus on a local level. Here, we model to
which other node a specific node $v$ will connect given that any new edge starting from
$v$ is formed. We implement this by using a set of $n$ separate models for the outgoing
edges of the ego-networks (i.e., the 1-hop neighborhood) of each of the $n$ nodes. The
ego-network model for node $v_i$ is built by drawing randomly and independently a number
of nodes $v_j$ by sampling with replacement and adding an edge from $v_i$ to this node.
Each node $v_j$ is sampled from a categorical distribution with parameters $\theta_{ij}$,
$1 \le i \le n$, $1 \le j \le n$, $\forall i: \sum_j \theta_{ij} = 1$: $v_j \sim \mathrm{Categorical}(\theta_{ij})$.
The parameters $\theta_{ij}$ can be written as a matrix; the value in cell $(i, j)$ specifies the
probability that a newly formed edge with source node $v_i$ will have the destination node $v_j$.
Thus, all values within one row always sum up to one. Local models can be applied for
undirected and directed graphs (cf. also in Section “Discussion”). In the directed case, we
model only the outgoing edges of the ego-network. Figure 2b depicts the maximum likelihood
local models for our introductory example.
Fig. 2 Multigraph models: This figure shows two ways of modeling the undirected multigraph shown in
Fig. 1. (a) The global or graph-based model models the whole graph as a single distribution. (b) The local or
neighbour-based model models each node as a separate distribution
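For the local models, the maximum likelihood parameters amount to a row-wise normalization of the multiplicity matrix, as sketched below. The name `local_mle`, the treatment of empty rows, and the toy matrix are assumptions made for this example.

```python
def local_mle(M):
    """Row-wise maximum likelihood estimates: one categorical model per node."""
    theta = []
    for row in M:
        total = sum(row)
        # A node without outgoing edges has no defined estimate;
        # this sketch simply returns zeros for such a row.
        theta.append([m / total if total else 0.0 for m in row])
    return theta

# Hypothetical multiplicity matrix (not the data of Fig. 1).
M = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]

theta_local = local_mle(M)
```

Each row of `theta_local` is a separate probability distribution over destination nodes, which is why only ratios within a row matter in the local case.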
Hypothesis elicitation
The main idea of our approach is to encode our beliefs in edge formation as Bayesian
priors over the model parameters. As a common choice, we employ Dirichlet distributions
as the conjugate priors of the categorical distribution. Thus, we assume that the
model parameters $\theta$ are drawn from a Dirichlet distribution with hyperparameters $\alpha$:
$\theta \sim \mathrm{Dir}(\alpha)$. Similar to the model parameters themselves, the Dirichlet prior (or multiple
priors for the local models) can be specified in a matrix. We will choose the parameters
$\alpha$ in such a way that they reflect a specific belief about edge formation. For that purpose,
we first specify matrices that formalize these beliefs, then we compute the Dirichlet
parameters $\alpha$ from these beliefs.
Constructing belief matrices
We specify hypotheses about edge formation as belief matrices $B = (b_{ij})$. These are $n \times n$
matrices, in which each cell $b_{ij} \in \mathbb{R}$ represents a belief of having an edge from node $v_i$ to
node $v_j$. To express a belief that an edge occurs more often (compared to other edges) we
set $b_{ij}$ to a higher value.
Node-attributed multigraphs In general, users have a large freedom to generate belief
matrices. However, typical construction principles are to assume that nodes with spe-
cific attributes are more popular and thus edges connecting these attributes receive
higher multiplicity, or to assume that nodes that are similar with respect to one or more
attributes are more likely to form an edge, cf. (Papadopoulos et al. 2012). Ideally, the
elicitation of belief matrices is based on existing theories.
For example, based on the information shown in Fig. 1, one could “believe” that two
authors collaborate more frequently together if: (1) they both are from the same country,
(2) they share the same gender, (3) they have high positions, or (4) they are popular in
terms of number of articles and citations. We capture each of these beliefs in one matrix.
One implementation of the matrices for our example beliefs could be:
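$B_1$ (same country): $b_{ij} := 0.9$ if $f_i[\text{country}] = f_j[\text{country}]$ and $0.1$ otherwise
$B_2$ (same gender): $b_{ij} := 0.9$ if $f_i[\text{gender}] = f_j[\text{gender}]$ and $0.1$ otherwise
$B_3$ (hierarchy): $b_{ij} := f_i[\text{position}] \cdot f_j[\text{position}]$
$B_4$ (popularity): $b_{ij} := f_i[\text{articles}] + f_j[\text{articles}] + f_i[\text{citations}] + f_j[\text{citations}]$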
Figure 3a shows the matrix representation of belief B1, and Fig. 3b its respective
row-wise normalization for the local model case. While belief matrices are identically
structured for local and global models, the ratio between parameters in different rows is
crucial for the global model, but irrelevant for local ones.
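The four example beliefs can be sketched in code as follows. The attribute values, the numeric encoding of `position`, and the helper `belief_matrix` are hypothetical; they illustrate the construction principle and do not reproduce the data of Fig. 1.

```python
def belief_matrix(features, cell_belief):
    """Build an n x n belief matrix B from per-node feature dictionaries."""
    n = len(features)
    return [[cell_belief(features[i], features[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical node attributes (assumed values, position encoded as a rank).
authors = [
    {"country": "DE", "gender": "f", "position": 3, "articles": 20, "citations": 100},
    {"country": "DE", "gender": "m", "position": 2, "articles": 10, "citations": 40},
    {"country": "AT", "gender": "f", "position": 1, "articles": 5, "citations": 10},
    {"country": "AT", "gender": "m", "position": 2, "articles": 8, "citations": 30},
]

def same_country(fi, fj):
    return 0.9 if fi["country"] == fj["country"] else 0.1

def same_gender(fi, fj):
    return 0.9 if fi["gender"] == fj["gender"] else 0.1

def hierarchy(fi, fj):
    return fi["position"] * fj["position"]

def popularity(fi, fj):
    return fi["articles"] + fj["articles"] + fi["citations"] + fj["citations"]

B1 = belief_matrix(authors, same_country)
B2 = belief_matrix(authors, same_gender)
B3 = belief_matrix(authors, hierarchy)
B4 = belief_matrix(authors, popularity)
```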
Dyad-attributed networks For the particular case of dyad-attributed networks, beliefs
are described using the underlying mechanisms of secondary multigraphs. For instance, a
co-authorship network, where every node represents an author with no additional information
or attribute, could be explained by a citation network under the hypothesis that
if two authors frequently cite each other, they are more likely to also co-author together.
Thus, the adjacency (feature) matrices $(\hat{F})$ of secondary multigraphs can be directly used
as belief matrices $B = (b_{ij})$. However, we can express additional beliefs by transforming
the matrices. As an example, we can formalize the belief that the presence of a feature
tends to inhibit the formation of edges in the data by setting $b_{ij} := -\mathrm{sigm}(f_{ij})$, where $\mathrm{sigm}$
is a sigmoid function such as the logistic function.
Fig. 3 Prior belief: This figure illustrates the three main phases of prior elicitation. That is, (a) a matrix
representation of belief $B_1$, where authors are more likely to collaborate with each other if they are from the
same country. (b) $B_1$ normalized row-wise using the local model interpretation. (c) Prior elicitation for $\kappa = 4$; i.e.,
$\alpha_{ij} = \frac{b_{ij}}{Z} \times \kappa + 1$
Eliciting a Dirichlet prior.
In order to obtain the hyperparameters $\alpha$ of a prior Dirichlet distribution, we utilize
the pseudo-count interpretation of the parameters $\alpha_{ij}$ of the Dirichlet distribution, i.e.,
a value of $\alpha_{ij}$ can be interpreted as $\alpha_{ij} - 1$ previous observations of the respective
event for $\alpha_{ij} \ge 1$. We distribute pseudo-counts proportionally to a belief matrix. Consequently,
the hyperparameters can be expressed as: $\alpha_{ij} = \frac{b_{ij}}{Z} \times \kappa + 1$, where $\kappa$ is
the concentration parameter of the prior. The normalization constant $Z$ is computed
as the sum of all entries of the belief matrix in the global model, and as the respective
row sum in the local case. We suggest setting $\kappa = n \times k$ for the local models,
$\kappa = n^2 \times k$ for the directed global case, and $\kappa = \frac{n(n+1)}{2} \times k$ for the undirected global case, with
$k \in \{0, 1, \ldots, 10\}$. A high value of $\kappa$ expresses a strong belief in the prior parameters. A similar
alternative method to obtain Dirichlet priors is the trial roulette method (Singer et al.
2015). For the global model variation, all $\alpha$ values are parameters for the same Dirichlet
distribution, whereas in the local model variation, each row parametrizes a separate
Dirichlet distribution. Figure 3c shows the prior elicitation of belief $B_1$ for $\kappa = 4$
using the local model.
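The elicitation rule $\alpha_{ij} = \frac{b_{ij}}{Z} \times \kappa + 1$ can be sketched as below. The function name `elicit_prior`, the fallback for all-zero beliefs, and the two-node belief matrix are assumptions for illustration; the global branch normalizes over the full matrix, i.e., it corresponds to the directed case.

```python
def elicit_prior(B, kappa, local=True):
    """Turn a belief matrix B into Dirichlet hyperparameters alpha.

    alpha_ij = b_ij / Z * kappa + 1, where Z is the row sum (local models)
    or the sum of all entries (global model, directed case).
    """
    def scale(b, Z):
        # Assumption of this sketch: if all beliefs are zero,
        # fall back to the flat prior alpha = 1.
        return b / Z * kappa + 1 if Z else 1.0

    if local:
        return [[scale(b, sum(row)) for b in row] for row in B]
    Z = sum(sum(row) for row in B)
    return [[scale(b, Z) for b in row] for row in B]

# Hypothetical belief matrix for two nodes.
B = [[0.9, 0.1],
     [0.1, 0.9]]

alpha_local = elicit_prior(B, kappa=4, local=True)
alpha_global = elicit_prior(B, kappa=4, local=False)
```

In the local case each row receives $\kappa$ pseudo-counts on top of the flat prior, whereas in the global case the $\kappa$ pseudo-counts are shared by the whole matrix.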
Computation of the marginal likelihood
For comparing the relative plausibility of hypotheses, we use the marginal likelihood. This
is the aggregated likelihood over all possible values of the parameters θweighted by the
Dirichlet prior. For our set of local models we can calculate them as:
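$$P(D \mid H) = \prod_{i=1}^{n} \frac{\Gamma\left(\sum_{j=1}^{n} \alpha_{ij}\right)}{\Gamma\left(\sum_{j=1}^{n} (\alpha_{ij} + m_{ij})\right)} \prod_{j=1}^{n} \frac{\Gamma(\alpha_{ij} + m_{ij})}{\Gamma(\alpha_{ij})} \qquad (2)$$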
Recall, $\alpha_{ij}$ encodes our prior belief connecting nodes $v_i$ and $v_j$ in $G$, and $m_{ij}$ are the
actual edge counts. Since we evaluate only a single model in the global case, the product
over rows $i$ of the adjacency matrix can be removed, and we obtain:
$$P(D \mid H) = \frac{\Gamma\left(\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij}\right)}{\Gamma\left(\sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_{ij} + m_{ij})\right)} \prod_{i=1}^{n} \prod_{j=1}^{n} \frac{\Gamma(\alpha_{ij} + m_{ij})}{\Gamma(\alpha_{ij})} \qquad (3)$$
Equation (3) holds for directed networks. In the
undirected case, indices jgo from ito naccounting for only half of the matrix including
the diagonal to avoid inconsistencies. For a detailed derivation of the marginal likelihood
given a Dirichlet-Categorical model see (Tu 2014; Singer et al. 2014). For both models we
focus on the log-marginal likelihoods in practice to avoid underflows.
Bayes factor Formally, we compare the relative plausibility of hypotheses by using so-called
Bayes factors (Kass and Raftery 1995), which are simply the ratios of the marginal
likelihoods for two hypotheses $H_1$ and $H_2$. If the logarithm of this ratio is positive (i.e., the
ratio is greater than one), the first hypothesis is judged as more plausible. The strength of
the Bayes factor can be checked in an interpretation table provided by Kass and Raftery (1995).
Application of the method and interpretation of results
We now showcase an example application of our approach featuring the network shown
in Fig. 1, and demonstrate how results can be interpreted.
Hypotheses We compare four hypotheses (represented as belief matrices) $B_1$, $B_2$, $B_3$,
and $B_4$ elaborated in Section “Hypothesis elicitation”. Additionally, we use the uniform
hypothesis as a baseline. It assumes that all edges are equally likely, i.e., $b_{ij} = 1$ for all
$i, j$. Hypotheses that are not more plausible than the uniform cannot be assumed to capture
relevant underlying mechanisms of edge formation. We also use the data hypothesis
as an upper bound for comparison, which employs the observed adjacency matrix as
belief: $b_{ij} = m_{ij}$.
Calculation and visualization For each hypothesis $H$ and every $\kappa$, we can elicit the
Dirichlet priors (cf. Section “Hypothesis elicitation”), determine the aggregated marginal
likelihood (cf. Section “Computation of the marginal likelihood”), and compare the plausibility
of each hypothesis with that of the uniform hypothesis at the same $\kappa$ by calculating the
logarithm of the Bayes factor as $\log P(D \mid H) - \log P(D \mid H_{\text{uniform}})$. We suggest two ways of
visualizing the results, i.e., plotting the marginal likelihood values, and showing the Bayes
factors on the y-axis as shown in Fig. 4a and 4b respectively for the local model. In both
cases, the x-axis refers to the concentration parameter $\kappa$. While the visualization showing
the marginal likelihoods directly carries more information, visualizing Bayes factors
makes it easier to spot smaller differences between the hypotheses.
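The calculation described above can be sketched for the local model as a loop over the concentration parameter. All names and the toy data are hypothetical, and the sketch assumes that every row of a belief matrix has a positive sum; it returns the values that would be plotted on the y-axis of a Bayes factor figure.

```python
from math import lgamma

def elicit_local_prior(B, kappa):
    """alpha_ij = b_ij / Z_i * kappa + 1, with Z_i the i-th row sum of B."""
    return [[b / sum(row) * kappa + 1 for b in row] for row in B]

def log_evidence_local(alpha, M):
    """Sum of row-wise Dirichlet-categorical log marginal likelihoods."""
    total = 0.0
    for a_row, m_row in zip(alpha, M):
        total += lgamma(sum(a_row)) - lgamma(sum(a_row) + sum(m_row))
        total += sum(lgamma(a + m) - lgamma(a) for a, m in zip(a_row, m_row))
    return total

def log_bayes_factors(hypotheses, M, ks):
    """Log Bayes factor of each hypothesis against the uniform one, per k."""
    n = len(M)
    uniform = [[1.0] * n for _ in range(n)]
    result = {name: [] for name in hypotheses}
    for k in ks:
        kappa = n * k  # suggested concentration for the local models
        baseline = log_evidence_local(elicit_local_prior(uniform, kappa), M)
        for name, B in hypotheses.items():
            evidence = log_evidence_local(elicit_local_prior(B, kappa), M)
            result[name].append(evidence - baseline)
    return result

# Hypothetical toy multigraph and two hypotheses.
M = [[0, 5, 1],
     [5, 0, 1],
     [1, 1, 0]]
hypotheses = {
    "uniform": [[1.0] * 3 for _ in range(3)],
    "data": M,  # belief equal to the observed counts
}
factors = log_bayes_factors(hypotheses, M, ks=[0, 1, 10])
```

At $k = 0$ every prior collapses to the flat prior, so all log Bayes factors are zero; for larger $k$ the data hypothesis is expected to rank above the uniform baseline.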
Interpretation Every line in Fig. 4a to 4d represents a hypothesis using the local (top) and
global models (bottom). In Fig. 4a and 4c, higher evidence values mean higher plausibility.
Similarly, in Fig. 4b and 4d positive Bayes factors mean that for a given $\kappa$, the hypothesis is
judged to be more plausible than the uniform baseline hypothesis; here, the relative Bayes
factors also provide a ranking. If evidences or Bayes factors are increasing with $\kappa$, we can
interpret this as further evidence for the plausibility of the expressed hypothesis, as this means
that the more we believe in it, the higher the Bayesian approach judges its plausibility. As
a result for our example, we see that the hypothesis believing that two authors are more
likely to collaborate if they are from the same country is the most plausible one (after
the data hypothesis). In this example, all hypotheses appear to be more plausible than
the baseline in both local and global models, but this is not necessarily the case in all
applications.
Fig. 4 Ranking of hypotheses for the introductory example. (a, b) Represent results using the local model and
(c, d) results of the global model. Rankings can be visualized using (a, c) the marginal likelihood or evidence
(y-axis), or (b, d) using Bayes factors (y-axis) by setting the uniform hypothesis as a baseline to compare with;
higher values refer to higher plausibility. The x-axis depicts the concentration parameter $\kappa$. For this example,
from an individual perspective (local model) authors from the multigraph shown in Fig. 1 appear to prefer to
collaborate more often with researchers of the same country rather than due to popularity (i.e., number of
articles and citations). In this particular case, the same holds for the global model. Note that all hypotheses
outperform the uniform, meaning that they all are reasonable explanations of edge formation for the given
graph
Experiments
We demonstrate the utility of our approach on both synthetic and empirical networks.
Synthetic node-attributed multigraph
We start with experiments on a synthetic node-attributed multigraph. Here, we control
the underlying mechanisms of how edges in the network emerge and thus, expect these
also to be good hypotheses for our approach.
Network The network contains 100 nodes where each node is assigned one of two colors
with uniform probability. For each node, we then randomly drew 200 undirected edges
where each edge connects randomly with probability p=0.8 to a different node of the
same color, and with p=0.2 to a node of the opposite color. The adjacency matrix of this
graph is visualized in Fig. 5a.
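The generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_two_color_multigraph`, the seed handling, and the choice to store the multigraph as a symmetric matrix of edge counts are all hypothetical.

```python
import random

def make_two_color_multigraph(n=100, edges_per_node=200, p_same=0.8, seed=0):
    """Return (colors, m) where m[i][j] counts the edges between nodes i and j."""
    rng = random.Random(seed)
    # Each node gets one of two colors with uniform probability.
    colors = [rng.randrange(2) for _ in range(n)]
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        same = [j for j in range(n) if j != i and colors[j] == colors[i]]
        other = [j for j in range(n) if colors[j] != colors[i]]
        for _ in range(edges_per_node):
            # With probability p_same connect to the same color, else the opposite.
            pool = same if rng.random() < p_same else other
            # Assumption: fall back to the other pool if the chosen one is empty.
            pool = pool or same or other
            j = rng.choice(pool)
            m[i][j] += 1
            m[j][i] += 1  # undirected: keep the count matrix symmetric
    return colors, m
```

Because targets are always drawn from nodes other than the source, the diagonal of `m` stays zero, which matches the absence of self-connections noted for this network.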
Hypotheses In addition to the uniform baseline hypothesis, we construct two intuitive
hypotheses based on the node color that express belief in possible edge formation
mechanisms. First, the homophily hypothesis assumes that nodes of the same color are
more likely to have more edges between them. Therefore, we arbitrarily set belief values
b_ij to 80 when nodes v_i and v_j are of the same color, and 20 otherwise. Second, the
heterophily hypothesis expresses the opposite behavior; i.e., b_ij = 80 if the colors of nodes v_i
and v_j are different, and 20 otherwise. An additional selfloop hypothesis only believes in
self-connections (i.e., the diagonal of the adjacency matrix).
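As a rough sketch, the belief matrices for these hypotheses could be built as shown below. The values 80 and 20 come from the text; the function name `color_belief_matrices`, the value 1 used on the diagonal of the selfloop hypothesis, and the constant 1 used for the uniform hypothesis are assumptions made for illustration only.

```python
def color_belief_matrices(colors, high=80, low=20):
    """Build belief matrices (lists of lists) from a list of node colors."""
    n = len(colors)
    nodes = range(n)
    return {
        # Same color -> high belief, different color -> low belief.
        "homophily": [[high if colors[i] == colors[j] else low for j in nodes]
                      for i in nodes],
        # The opposite pattern.
        "heterophily": [[high if colors[i] != colors[j] else low for j in nodes]
                        for i in nodes],
        # Belief only on the diagonal (assumed value: 1).
        "selfloop": [[1 if i == j else 0 for j in nodes] for i in nodes],
        # Every dyad is equally likely (assumed value: 1).
        "uniform": [[1 for _ in nodes] for _ in nodes],
    }
```

Note that a literal reading of the homophily rule also assigns the high value to the diagonal, since a node trivially shares its own color; whether the diagonal should be excluded is left open here.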
Results Figure 5b and 5c show the ranking of hypotheses based on their Bayes factors
compared to the uniform hypothesis for the local and global models respectively. Clearly,
Fig. 5 Ranking of hypotheses for synthetic attributed multigraph. In a, we show the adjacency matrix of a
100-node 2-color random multigraph with a node correlation of 80% for nodes of the same color and 20%
otherwise. One can see the presence of homophily based on more connections between nodes of the same
color; the diagonal is zero as there are no self-connections. In b, c we show the ranking of hypotheses based
on Bayes factors when compared to the uniform hypothesis for the local and global models respectively. As
expected, in general the homophily hypothesis explains the edge formation best (positive Bayes factor and
close to the data curve), while the heterophily and selfloop hypotheses provide no good explanations for
edge formation in both local and global cases—they show negative Bayes factors
in both models the homophily hypothesis is judged as the most plausible. This is expected
and corroborates the fact that network connections are biased towards nodes of the same
color. The heterophily and selfloop hypotheses show negative Bayes factors; thus, they are
not good hypotheses about edge formation in this network. Because the multigraph
lacks selfloops, the Bayes factor of the selfloop hypothesis decreases very quickly with
increasing strength of belief κ.
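To make the reported sign of the Bayes factors concrete, the following sketch computes a log Bayes factor against the uniform hypothesis for a flattened count vector, using the standard closed-form marginal likelihood of a Dirichlet-categorical model. The prior elicitation used here, alpha_k = 1 + κ · b_k / Σ b, is an assumption for illustration and may differ from the elicitation defined earlier in the paper; the function names are hypothetical.

```python
from math import lgamma

def log_evidence(counts, beliefs, kappa):
    """Log marginal likelihood of a Dirichlet-categorical model.

    counts and beliefs are flat lists of equal length.
    Assumed prior: alpha_k = 1 + kappa * beliefs_k / sum(beliefs).
    """
    total_belief = sum(beliefs)
    alphas = [1.0 + kappa * b / total_belief for b in beliefs]
    result = lgamma(sum(alphas)) - lgamma(sum(alphas) + sum(counts))
    for a, c in zip(alphas, counts):
        result += lgamma(a + c) - lgamma(a)
    return result

def log_bayes_factor(counts, beliefs, kappa):
    """Log Bayes factor of a hypothesis against the uniform hypothesis."""
    uniform = [1.0] * len(counts)
    return log_evidence(counts, beliefs, kappa) - log_evidence(counts, uniform, kappa)

# Toy data: most edges fall on the first and last category.
counts = [40, 10, 10, 40]
for kappa in (10, 100, 1000):
    print(kappa,
          log_bayes_factor(counts, [80, 20, 20, 80], kappa),   # matches the data
          log_bayes_factor(counts, [20, 80, 80, 20], kappa))   # contradicts the data
```

On the log scale, a positive value means the hypothesis is more plausible than the uniform baseline and a negative value means it is less plausible, which is the reading used for the curves in Fig. 5.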
Synthetic multiplex network
In this experiment, we control the underlying mechanisms of how edges in a dyad-
attributed multigraph emerge using multiple multigraphs that share the same nodes with
different link structure (i.e., multiplex) and thus, expect these also to be good hypotheses
for JANUS.
Network The network is an undirected configuration model graph (Newman 2003) with
parameters n = 100 (i.e., number of nodes) and degree sequence k = {k_i} drawn from
a power-law distribution of length n and exponent 2.0, where k_i is the degree of node v_i.
The adjacency matrix of this graph is visualized in Fig. 6a.
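A configuration model graph of this kind can be sketched with plain stub matching. This is an illustrative sketch, not the authors' code: sampling degrees by rounding Pareto variates, the optional `max_degree` cap, and bumping one degree to make the sum even are all assumptions, as are the function names.

```python
import random

def powerlaw_degrees(n, exponent=2.0, max_degree=None, seed=0):
    """Draw n integer degrees from a (rounded) power-law distribution."""
    rng = random.Random(seed)
    degrees = []
    for _ in range(n):
        # paretovariate(a) has density ~ x^-(a+1), so a = exponent - 1.
        k = int(round(rng.paretovariate(exponent - 1.0)))
        if max_degree is not None:
            k = min(k, max_degree)  # assumption: cap to keep the sketch small
        degrees.append(k)
    if sum(degrees) % 2 == 1:
        degrees[0] += 1  # stub matching needs an even number of stubs
    return degrees

def configuration_multigraph(degrees, seed=0):
    """Pair degree stubs at random; returns a symmetric matrix of edge counts."""
    rng = random.Random(seed)
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    n = len(degrees)
    m = [[0] * n for _ in range(n)]
    for u, v in zip(stubs[0::2], stubs[1::2]):
        m[u][v] += 1
        if u != v:
            m[v][u] += 1  # a self-loop is stored once on the diagonal
    return m
```

Stub matching naturally produces multiedges and occasional self-loops, so the result is a multigraph whose degree sequence matches `degrees` exactly.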
Fig. 6 Ranking of hypotheses for synthetic multiplex network. In a we show the adjacency matrix of a
configuration model graph of 100 nodes and power-law distributed degree sequence. In b, c the ranking of
hypotheses is shown for the local and global model respectively. As expected, hypotheses are ranked from
small to big values of ε, since small values represent only a few changes in the original adjacency matrix of
the configuration graph. Both models show that when the original graph changes at least 70% of its edges
the new graph cannot be explained better than random (i.e., uniform)
Hypotheses Besides the uniform hypothesis, we include ten more hypotheses derived
from the original adjacency matrix of the configuration model graph, where only a certain
percentage ε of edges is shuffled. The bigger ε, the less plausible the hypothesis, since
more shuffles can drastically modify the original network.
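One possible way to derive such perturbed hypotheses is sketched below. The text does not specify the shuffling procedure, so this sketch assumes that a fraction ε of the edges is picked at random and each picked edge is reassigned to a uniformly random node pair; the function name `shuffle_edges` is hypothetical.

```python
import random

def shuffle_edges(m, epsilon, seed=0):
    """Return a copy of the symmetric count matrix m with a fraction
    epsilon (0.0 to 1.0) of its edges moved to random node pairs."""
    rng = random.Random(seed)
    n = len(m)
    # One entry per edge, counting multiplicity (upper triangle incl. diagonal).
    edges = [(i, j) for i in range(n) for j in range(i, n) for _ in range(m[i][j])]
    n_shuffle = int(round(epsilon * len(edges)))
    picked = set(rng.sample(range(len(edges)), n_shuffle))
    out = [[0] * n for _ in range(n)]
    for idx, (i, j) in enumerate(edges):
        if idx in picked:
            i, j = rng.randrange(n), rng.randrange(n)  # assumed rewiring rule
        out[i][j] += 1
        if i != j:
            out[j][i] += 1
    return out

# Ten hypotheses named as in the text: epsilon10p ... epsilon100p.
toy = [[0, 3, 0], [3, 0, 2], [0, 2, 0]]
hypotheses = {f"epsilon{p}p": shuffle_edges(toy, p / 100, seed=p)
              for p in range(10, 101, 10)}
```

The total number of edges is preserved by construction, so the hypotheses differ from the original graph only in where the edges are placed.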
Results Figure 6b and 6c show the ranking of hypotheses based on their Bayes factors
compared to the uniform hypothesis for the local and global model respectively. In
general, hypotheses are ranked as expected, from small to big values of ε. For instance,
the epsilon10p hypothesis explains the configuration model graph (represented in
Fig. 6a) best, since it only shuffles 10% of all edges (i.e., 10 edges). On the other hand, the
epsilon100p hypothesis shows the worst performance (i.e., its Bayes factor is negative and far
from the data curve) since it shuffles all edges and is therefore more likely to differ
from the original network.
Empirical node-attributed multigraph
Here, we focus on a real-world contact network based on wearable sensors.
Network We study a network capturing interactions of 5 households in rural Kenya
between April 24 and May 12, 2012 (Sociopatterns; Kiti et al. 2016). The undirected
unweighted multigraph contains 75 nodes (persons) and 32,643 multiedges (contacts)
which we aim to explain. For each node, we know information such as gender and age
(encoded into 5 age intervals). Interactions exist within and across households. Figure 7a
shows the adjacency matrix (i.e., number of contacts between two people) of the network.
Household membership of nodes (rows/columns) is shown accordingly.
Hypotheses We investigate edge formation by comparing, next to the uniform baseline
hypothesis, four hypotheses based on node attributes as prior beliefs. (i) The similar
age hypothesis expresses the belief that people of similar age are more likely to interact
with each other. Entries b_ij of the belief matrix B are set to the inverse age distance
between members: b_ij = 1 / (1 + abs(f_i[age] − f_j[age])). (ii) The same household hypothesis believes that
people are more likely to interact with people from the same household. We arbitrarily
set b_ij to 80 if person v_i and person v_j belong to the same household, and 20 otherwise.
(iii) With the same gender hypothesis we hypothesize that the number of same-gender
interactions is higher than the number of different-gender interactions. Therefore, every entry b_ij of
B is set to 80 if persons v_i and v_j are of the same gender, and 20 otherwise. Finally, (iv)
the different gender hypothesis believes that it is more likely to find different-gender than
same-gender interactions; b_ij is set to 80 if person v_i has the opposite gender of person v_j,
and 20 otherwise.
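The four attribute-based belief matrices can be sketched as below. The formula for the similar age hypothesis and the values 80 and 20 follow the text; the function names and the encoding of ages as integer interval indices are assumptions for illustration.

```python
def similar_age_belief(ages):
    """b_ij = 1 / (1 + |age_i - age_j|), with ages given as interval indices."""
    n = len(ages)
    return [[1.0 / (1.0 + abs(ages[i] - ages[j])) for j in range(n)]
            for i in range(n)]

def same_attribute_belief(values, high=80, low=20):
    """High belief when two nodes share the attribute value (household, gender)."""
    n = len(values)
    return [[high if values[i] == values[j] else low for j in range(n)]
            for i in range(n)]

def different_attribute_belief(values, high=80, low=20):
    """High belief when two nodes differ in the attribute value."""
    n = len(values)
    return [[high if values[i] != values[j] else low for j in range(n)]
            for i in range(n)]

# Toy example with three persons.
ages = [0, 1, 4]                 # index of the age interval
households = ["H1", "H1", "H2"]
genders = ["F", "M", "F"]
beliefs = {
    "similar_age": similar_age_belief(ages),
    "same_household": same_attribute_belief(households),
    "same_gender": same_attribute_belief(genders),
    "different_gender": different_attribute_belief(genders),
}
```

Only the relative sizes of the entries matter for a hypothesis, so the similar age matrix (values between 0 and 1) and the 80/20 matrices can be compared despite their different scales once they are normalized during prior elicitation.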
Results Figure 7b and 7c show the ranking of hypotheses based on Bayes
factors using the uniform hypothesis as baseline for the local and global model,
respectively. The local model (Fig. 7b) indicates that the same household hypothesis explains the
data best, since it is ranked first and is more plausible than the uniform. The
similar age hypothesis is also plausible, as shown by its positive Bayes factors. Both the
same and different gender hypotheses show negative Bayes factors when compared to the
uniform hypothesis, suggesting that they are not good explanations of edge formation in
Fig. 7 Ranking of hypotheses for Kenya contact network. a shows the adjacency matrix of the network with
node ordering according to household membership. Darker cells indicate more contacts. b, c display the
ranking of hypotheses based on Bayes factors, using the uniform hypothesis as baseline for the local and
global model respectively. Using the local model (b), the same household hypothesis ranks highest, followed by
the similar age hypothesis, which also provides positive Bayes factors. On the other hand, the same and
different gender hypotheses are less plausible than the baseline (uniform edge formation) in both the local
and global case. In the global case (c), all hypotheses are bad representations of edge formation in the Kenya
contact network. This is due to the fact that interactions are very sparse, even within households. Results are
consistent for all κ
this network. This gives us a better understanding of potential mechanisms producing
underlying edges. People prefer to contact people from the same household and similar
age, but not based on gender preferences. Additional experiments could further refine
these hypotheses (e.g., combining them). In the global model (Fig. 7c),
all hypotheses are bad explanations of the Kenya network. Only the same household
hypothesis tends to rise above the uniform for higher values of κ, but it remains far from the
data curve. This happens because the interaction network is very sparse (even
within the same households); thus, any hypothesis with a dense belief matrix will likely fall
below or very close to the uniform.
Empirical multiplex network
This empirical dataset consists of four real-world social networks, each of them extracted
from Twitter interactions of a particular set of users.
Network We obtained the Higgs Twitter dataset from SNAP (SNAP Higgs Twitter
datasets). This dataset was built upon the interactions of users regarding the discovery
of a new particle with the features of the elusive Higgs boson on the 4th of July
2012 (De Domenico et al. 2013). Specifically, we are interested in characterizing edge
formation in the reply network, a directed unweighted multigraph which encodes the replies
that a person v_i sent to a person v_j during the event. This graph contains 38,918 nodes
and 36,902 multiedges (if all edges from the same dyad are merged, it accounts for 32,523
weighted edges).
Hypotheses We aim to characterize the reply network by incorporating other
networks (sharing the same nodes but with a different network structure) as prior beliefs. In
this way we can learn whether the interactions present in the reply network can be better
explained by a retweet, mention, or following (social) network. The retweet hypothesis
expresses our belief that the number of replies is proportional to the number of
retweets. Hence, beliefs b_ij are set to the number of times user v_i retweeted a post from
user v_j. Similarly, the mention hypothesis states that the number of replies is
proportional to the number of mentions. Therefore, every entry b_ij is set to the number
of times user v_i mentioned user v_j during the event. The social hypothesis captures
our belief that users are more likely to reply to their friends (in the Twitter jargon:
followees or people they follow) than to the rest of users. Thus, we set b_ij to 1 if user v_i
follows user v_j and 0 otherwise. Finally, we combine all the above networks to construct
the retweet-mention-social hypothesis, which captures all previous hypotheses at once. In
other words, it reflects our belief that users are more likely to reply to their friends and (at
the same time) the number of replies is proportional to the number of retweets and
mentions. Therefore, the adjacency matrix for this hypothesis is simply the sum of the three
networks described above.
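The combined hypothesis is an element-wise sum of the three layers. For a network of this size a sparse representation is convenient, so the sketch below assumes each layer is a dictionary mapping a directed pair (i, j) to its weight; this representation and the function name `combine_layers` are illustrative assumptions.

```python
from collections import Counter

def combine_layers(*layers):
    """Element-wise sum of sparse belief matrices given as {(i, j): weight}."""
    combined = Counter()
    for layer in layers:
        combined.update(layer)  # Counter.update adds weights for shared keys
    return combined

# Toy layers for three users.
retweet = {(0, 1): 2}             # user 0 retweeted user 1 twice
mention = {(0, 1): 1, (1, 2): 3}  # number of mentions
social = {(0, 1): 1, (2, 0): 1}   # 1 if i follows j
retweet_mention_social = combine_layers(retweet, mention, social)
```

Since the social layer is binary while the retweet and mention layers are counts, dyads with many mentions or retweets dominate the sum, which is consistent with the combined hypothesis being driven mainly by one of its components.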
Results The results shown in Fig. 8 suggest that the mention hypothesis explains the
reply network very well, since it has been ranked first and it is very close to the data curve,
in both Fig. 8a and 8b for the local and global models, respectively. The retweet-mention-
social hypothesis also indicates plausibility since it outperforms the uniform (i.e., positive
Bayes factors). However, if we look at each hypothesis individually, we can see that the
combined hypothesis is dominated mainly by the mention hypothesis. The social hypoth-
esis is also a good explanation of the number of replies since it outperforms the uniform
hypothesis. The retweet and selfloop hypotheses, on the other hand, show negative Bayes factors,
suggesting that they are not good explanations of edge formation in the reply network. Note
that the retweet curve in the local model has a very strong tendency to go below the
uniform for higher values of κ. These results suggest that the number of replies
is proportional to the number of mentions and that people usually prefer to reply to other
users within their social network (i.e., followees).
Discussion
Next, we discuss some aspects and open questions related to the proposed approach.
Comparison to existing method While we have already demonstrated the plausibility of
JANUS based on synthetic datasets, we want to discuss how our results compare to exist-
ing state-of-the-art methods. A simple alternative approach to evaluate the plausibility
Fig. 8 Ranking of hypotheses for Reply Higgs Network. a, b Ranking of hypotheses based on Bayes factors
when compared to the uniform hypothesis using multiplexes for the local and global models respectively. In
both cases, the mention hypothesis explains the reply network best, since it is ranked first and very close to
the data curve. This might be because a reply inherits a mention of the user who originally posted
the tweet. We can see that the combined retweet-mention-social hypothesis is the second best
explanation of the reply network. This is mainly due to the mention hypothesis, which performs far
better than the other two (social and retweet). The social hypothesis can also be considered a good
explanation since it outperforms the uniform. The retweet hypothesis tends to perform worse than the
uniform in both cases for increasing values of κ. Similarly, the selfloop hypothesis drops below the
uniform since there are only very few selfloops in the reply network data
of beliefs as expressed by the belief matrices is to compute a Pearson correlation coeffi-
cient between the entries in the belief matrix and the respective entries in the adjacency
matrix of the network. To circumvent the difficulties of correlating matrices, they can be
flattened to vectors that are then passed to the correlation calculation. Then, hypotheses
can be ranked according to their resulting correlation against the data. However, by flat-
tening the matrices, we disregard the direct relationship between nodes in the matrix and
introduce inherent dependencies to the individual data points of the vectors used for Pear-
son calculation. To tackle this issue, one can utilize the Quadratic Assignment Procedure
(QAP) as mentioned in Section “Related work”. QAP is a widely used technique for testing
hypotheses on dyadic data (e.g., social networks). It extends the simple Pearson correla-
tion calculation step by a significance test accounting for the underlying link structure in
the given network using shuffling techniques. For a comparison with our approach, we
executed QAP for all datasets and hypotheses presented in Section “Experiments” using
the qaptest function included in the statnet (Handcock et al. 2008; Handcock et al.
2016) package in R (R Core Team 2016).
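For readers unfamiliar with QAP, the idea can be sketched in a few lines. The code below is a simplified re-implementation for illustration only and is not the statnet `qaptest` function: it correlates the off-diagonal entries of the belief and adjacency matrices and builds a null distribution by relabeling the nodes (permuting rows and columns together). Ignoring the diagonal, the one-sided p-value, and all names are assumptions of this sketch.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)

def qap_test(data, belief, n_perm=1000, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(data)
    idx = [(i, j) for i in range(n) for j in range(n) if i != j]
    x = [belief[i][j] for i, j in idx]
    r_obs = pearson(x, [data[i][j] for i, j in idx])
    hits = 0
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        # Relabel nodes: rows and columns are permuted simultaneously,
        # which preserves the dependence structure within the network.
        y_perm = [data[p[i]][p[j]] for i, j in idx]
        if pearson(x, y_perm) >= r_obs:
            hits += 1
    return r_obs, hits / n_perm
```

The `ValueError` branch illustrates why a correlation-based ranking cannot include a constant belief matrix such as the uniform hypothesis.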
Overall, we find in all experiments strong similarities between the ranking provided by
the correlation coefficients of QAP and our rankings according to JANUS. As an example,
Table 1 shows the correlation coefficients and p-values obtained with QAP for each
hypothesis tested on the synthetic multiplex described in Section “Synthetic multiplex
network” as well as the ranking of hypotheses obtained from JANUS for the local
and global model (leaving the uniform hypothesis out). However, in other datasets
minor differences in the ordering of the hypotheses could be observed between the two
approaches.
Compared to QAP, JANUS yields several advantages, but also some disadvantages. First,
by utilizing our belief matrix as priors over parameter configurations instead of fixed
parameter configurations themselves, we allow for tolerance in the parameter
specification. Exploring different values of tolerance expressed by our parameter κ allows for
Table 1 QAP on synthetic dyad-attributed network (multiplex): List of correlation coefficients for
each hypothesis tested. Last two columns show ranking of hypotheses according to JANUS for the
local and global models. By omitting the uniform hypothesis in JANUS (rank 7) we can see that the
ranking of hypotheses by correlation aligns with the rankings given by JANUS for the multiplex given
in Section “Synthetic multiplex network”
Hypothesis     Correlation coefficient   p-value   JANUS ranking (local)   JANUS ranking (global)
Epsilon10p     0.939                     0.0**     1                       1
Epsilon20p     0.863                     0.0**     2                       2
Epsilon30p     0.787                     0.0**     3                       3
Epsilon40p     0.704                     0.0**     4                       4
Epsilon50p     0.636                     0.0**     5                       5
Epsilon60p     0.461                     0.0**     6                       6
Epsilon70p     0.352                     0.0**     8                       8
Epsilon80p     0.242                     0.0**     9                       9
Epsilon90p     0.142                     0.0**     10                      10
Epsilon100p    0.010                     0.238     11                      11
Statistically highly significant p-values (p<0.001) are marked by (**)
more fine-grained and advanced insights into the relative plausibility of hypotheses. In
contrast, simple correlation takes the hypothesis as it is and calculates a single correlation
coefficient that does not allow for tolerances.
Second, by building upon Bayesian statistics, the significance (or decisiveness) of results
in our approach is determined by Bayes factors, a Bayesian alternative to traditional
p-value testing. Instead of just measuring evidence against one null hypothesis, Bayes
factors allow us to directly gather evidence in favor of a hypothesis compared to another
hypothesis, which is arguably more suitable for ranking.
Third, QAP and MRQAP, and subsequently correlation and regression, are subject to
multiple assumptions which our generative Bayesian approach circumvents. Currently,
we employ QAP with simplistic linear Pearson correlation coefficients. However, one
could argue that count data (multiplicity of edges) warrants advanced generalized linear
models such as Poisson regression or Negative Binomial regression models.
Furthermore, our approach intuitively allows modeling not only the overall network, but
also the ego-networks of the individual nodes using the local models presented above.
Finally, correlation coefficients cannot be applied to all hypotheses. Specifically, it is not
possible to compute them for the uniform hypothesis, since in this case all values in the
flattened vector are identical. However, our method currently does not sufficiently account
for dependencies within the network, as is done by specialized QAP significance tests.
Exploring this issue and extending our Bayesian approach in this direction will be a key
subject of future work.
Runtime performance A typical concern often associated with Bayesian procedures
is the excessive runtime requirements, especially if calculating marginal likelihoods
is necessary. However, the network models employed in this paper allow us to calculate
the marginal likelihoods (and consequently also the Bayes factors) efficiently in closed
form. The resulting runtimes are not only competitive with alternative methods
such as QAP and MRQAP; in our experiments, JANUS ran up to 400 times faster than MRQAP,
as MRQAP requires many data reshuffles and regression fits. Furthermore,
the calculation (of Bayesian evidence) could easily be distributed onto several
computational units, cf. (Becker et al. 2016).
Local vs global model In this paper, we presented two variations of our approach,
i.e., a local and a global model. Although both model substantially different generation
processes (an entire network vs. a set of ego-networks), our experiments have shown that
hypotheses in the global scenario are ranked mostly the same as the ones using the local
model. This is also to be expected to some degree since the constructed hypotheses did
not explicitly express a belief that outgoing links are more likely for some nodes.
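The structural difference between the two variants can be sketched as follows, assuming the standard Dirichlet-categorical marginal likelihood: the global variant treats all dyads as categories of one distribution, while the local variant uses one distribution per source node and sums the log evidences. The function names are hypothetical, the prior pseudo-counts `alpha` are taken as given, and for simplicity the global variant uses every matrix entry (as in the directed case) rather than one entry per undirected dyad.

```python
from math import lgamma

def dc_log_evidence(counts, alphas):
    """Log marginal likelihood of one Dirichlet-categorical distribution."""
    result = lgamma(sum(alphas)) - lgamma(sum(alphas) + sum(counts))
    return result + sum(lgamma(a + c) - lgamma(a) for a, c in zip(alphas, counts))

def global_log_evidence(m, alpha):
    """One categorical distribution over all entries of the matrix."""
    flat_counts = [c for row in m for c in row]
    flat_alphas = [a for row in alpha for a in row]
    return dc_log_evidence(flat_counts, flat_alphas)

def local_log_evidence(m, alpha):
    """One categorical distribution per row (ego-network); log evidences add up."""
    return sum(dc_log_evidence(row_m, row_a) for row_m, row_a in zip(m, alpha))

# Toy directed multigraph with two nodes and a flat prior.
m = [[0, 3], [1, 0]]
alpha = [[1.0, 1.0], [1.0, 1.0]]
print(global_log_evidence(m, alpha), local_log_evidence(m, alpha))
```

The two values differ because the local variant does not have to explain how many edges each source node emits, only where they go; rankings can still agree when hypotheses carry no belief about node activity.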
Inconsistency of local model For directed networks, the local ego-network models can
assemble a full graph model by defining a probability distribution of edges for every source
node. For undirected networks, this is not directly possible as e.g., the ego-network model
for v_A generated an edge from v_A to v_B, but the ego-network model for node v_B did not
generate any edge to v_A. Note that this does not affect our comparison of hypotheses as
we characterize the network.
Single Edges As mentioned in Section “Background”, JANUS focuses on multigraphs,
meaning that edges might appear more than once. This is because we assume that a given
node v_i, with some probability p_ij, will be connected multiple times to any other node v_j
in the local models. The same applies to the global model where we assume that a given
edge (v_i, v_j) will appear multiple times within the graph with some probability p_ij. For
the specific case of single edges (i.e., unweighted graphs), where m_ij ∈ {0, 1}, one might
consider other probabilistic models to represent such graphs.
Sparse data-connections Most real networks exhibit small world properties such as
high clustering coefficient and fat-tailed degree distributions meaning that the adjacency
matrices are sparse. While the comparison still judges the relative plausibility, all hypotheses
perform weakly compared to the data curve, as shown in Fig. 7. As an alternative, one
might want to limit our beliefs to only those edges that exist in the network, i.e., we would
then only build hypotheses on how edge multiplicity varies between edges.
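A minimal sketch of this alternative is to mask a belief matrix with the observed adjacency matrix, so that belief is only placed on dyads with at least one edge. The function name `restrict_to_existing_edges` is hypothetical, and how the masked beliefs would enter the prior is left open.

```python
def restrict_to_existing_edges(belief, m):
    """Keep belief b_ij only where the observed multiplicity m_ij is positive."""
    return [[b if c > 0 else 0 for b, c in zip(belief_row, count_row)]
            for belief_row, count_row in zip(belief, m)]

# Toy example: a dense belief matrix and a sparse observed multigraph.
belief = [[5, 5], [5, 5]]
observed = [[0, 2], [0, 1]]
masked = restrict_to_existing_edges(belief, observed)
```

After masking, hypotheses differ only in how they distribute belief across existing edges, i.e., in what they say about edge multiplicity.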
Other limitations and future work The main intent of this work is the introduction of
a hypothesis-driven Bayesian approach for understanding edge formation in networks.
To that end, we showcased this approach on simple categorical models that warrant
extensions, e.g., by incorporating appropriate models for other types of networks such as
weighted or temporal networks. We can further investigate how to build good hypotheses
by leveraging all node attributes, and infer subnetworks that best fit each of the given
hypotheses. In the future, we also plan an extensive comparison to other methods such
as mixed-effects models and p* models. Ultimately, our models also warrant extensions
to adhere to the degree sequence in the network, e.g., in the direction of multivariate
hypergeometric distributions as recently proposed in (Casiraghi et al. 2016).
Conclusions
In this paper, we have presented a Bayesian framework that facilitates the understand-
ing of edge formation in node-attributed and dyad-attributed multigraphs. The main idea
is based on expressing hypotheses as beliefs in parameters (i.e., multiplicity of edges),
incorporating them as priors, and utilizing Bayes factors for comparing their plausibility. We
proposed simple local and global Dirichlet-categorical models and showcased their utility
on synthetic and empirical data. For illustration purposes our examples are based on small
networks. We also tested our approach with larger networks, obtaining identical results. We
briefly compared JANUS with existing methods and discussed some advantages and
disadvantages over the state-of-the-art QAP. In the future, our concepts can be extended to further
models such as models adhering to fixed degree sequences. We hope that our work
contributes new ideas to the research line of understanding edge formation in complex
networks.
Acknowledgements
This work was partially funded by DFG German Science Fund research projects “KonSKOE” and “PoSTs II”.
Availability of data and materials
The data sets supporting the results of this article are openly available on the Web. The source code and data for
toy-example and synthetic experiments can be found on GitHub: https://github.com/lisette-espin/JANUS. The rest of
data sets can be found in their respective project websites: Kenya contact network in
http://www.sociopatterns.org/datasets/kenyan-households-contact-network/, and the Higgs Twitter dataset in
https://snap.stanford.edu/data/higgs-twitter.html.
Authors’ contributions
LE, PS and FL conceived and designed the experiments. LE performed the experiments. LE, PS and FL analyzed the data.
LE, PS and FL contributed reagents/materials/analysis tools. LE, PS, FL and MS wrote the paper. All authors read
and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Received: 18 March 2017 Accepted: 25 May 2017
References
Adamic LA, Glance N (2005) The political blogosphere and the 2004 us election: divided they blog. In: Proceedings of the
3rd Int. Workshop on Link Discovery. ACM, New York. pp 36–43. doi:10.1145/1134271.1134277
Becker M, Mewes H, Hotho A, Dimitrov D, Lemmerich F, Strohmaier M (2016) Sparktrails: A mapreduce implementation of
hyptrails for comparing hypotheses about human trails. In: Proceedings of the 25th International Conference
Companion on World Wide Web. International World Wide Web Conferences Steering Committee, Republic and
Canton of Geneva. pp 17–18. doi:10.1145/2872518.2889380
Casiraghi G, Nanumyan V, Scholtes I, Schweitzer F (2016) Generalized hypergeometric ensembles: Statistical hypothesis
testing in complex networks. CoRR abs/1607.02441. arXiv:1607.02441
De Domenico M, Lima A, Mougel P, Musolesi M (2013) The anatomy of a scientific rumor. Sci Rep 3:2980
Espín-Noboa L (2016) JANUS. https://github.com/lisette-espin/JANUS. Accessed 10 Mar 2017
Espín-Noboa L, Lemmerich F, Strohmaier M, Singer P (2017) A hypotheses-driven bayesian approach for understanding
edge formation in attributed multigraphs. In: International Workshop on Complex Networks and Their Applications.
Springer, Cham. pp 3–16. doi:10.1007/978-3-319-50901-3_1
Goldenberg A, Zheng AX, Fienberg SE, Airoldi EM (2010) A survey of statistical network models. Found Trends® Mach
Learn 2(2):129–233
Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M (2008) statnet: Software tools for the representation,
visualization, analysis and simulation of network data. J Stat Softw 24(1):1–11
Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Bender-deMoll S, Morris M (2016) Statnet: Software Tools
for the Statistical Analysis of Network Data. The Statnet Project (http://www.statnet.org).
R package version 2016.4. CRAN.R-project.org/package=statnet. Accessed 31 May 2017
Holland PW, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. J Am Stat Assoc
76(373):33–50
Hubert L, Schultz J (1976) Quadratic assignment as a general data analysis strategy. Br J Math Stat Psychol 29(2):190–241
Karrer B, Newman ME (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107
Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795
Kim M, Leskovec J (2011) Modeling social networks with node attributes using the multiplicative attribute graph model.
In: UAI 2011, Barcelona, Spain, July 14–17, 2011. pp 400–409
Kiti MC, Tizzoni M, Kinyanjui TM, Koech DC, Munywoki PK, Meriac M, Cappa L, Panisson A, Barrat A, Cattuto C, et al (2016)
Quantifying social contacts in a household setting of rural kenya using wearable proximity sensors. EPJ Data Sci 5(1):1
Kleineberg KK, Boguñá M, Serrano MÁ, Papadopoulos F (2016) Hidden geometric correlations in real multiplex
networks. Nature Physics 12:1076–1081. http://dx.doi.org/10.1038/nphys3812
Krackhardt D (1988) Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Soc Netw
10(4):359–381
Kruschke J (2014) Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press, Boston
Martin T, Ball B, Karrer B, Newman M (2013) Coauthorship and citation patterns in the physical review. Phys Rev E
88(1):012814
Moreno S, Neville J (2013) Network hypothesis testing using mixed kronecker product graph models. In: Data Mining
(ICDM), Dallas, Texas. IEEE. pp 1163–1168
Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256
Nguyen HT (2012) Multiple hypothesis testing on edges of graph: a case study of bayesian networks. https://hal.archives-
ouvertes.fr/hal-00657166
Papadopoulos F, Kitsak M, Serrano MÁ, Boguñá M, Krioukov D (2012) Popularity versus similarity in growing networks.
Nature 489(7417):537–540
Pfeiffer III JJ, Moreno S, La Fond T, Neville J, Gallagher B (2014) Attributed graph models: Modeling network structure with
correlated attributes. In: WWW. ACM, New York. pp 831–842
R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing,
Vienna, Austria. https://www.R-project.org/. Accessed 31 May 2017
Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social
networks. Soc Netw 29(2):173–191
Sampson SF (1968) A Novitiate in a Period of Change: An Experimental and Case Study of Social Relationships. Cornell
University, Ithaca
Schwiebert L, Gupta SK, Weinmann J (2001) Research challenges in wireless networks of biomedical sensors. In:
Proceedings of the 7th Annual International Conference on Mobile Computing and Networking. ACM, New York.
pp 151–165
Shah KR, Sinha BK (1989) Mixed Effects Models. In: Theory of Optimal Designs. Springer, New York. pp 85–96
Singer P, Helic D, Taraghi B, Strohmaier M (2014) Detecting memory and structure in human navigation patterns using
markov chain models of varying order. PloS One 9(7):102070
Singer P, Helic D, Hotho A, Strohmaier M (2015) Hyptrails: A bayesian approach for comparing hypotheses about human
trails on the web. WWW, International World Wide Web Conferences Steering Committee, Republic and Canton of
Geneva. pp 1003–1013. doi:10.1145/2736277.2741080
SNAP Higgs Twitter datasets. https://snap.stanford.edu/data/higgs-twitter.html. Accessed 15 Aug 2016
Snijders T, Spreen M, Zwaagstra R (1995) The use of multilevel modeling for analysing personal networks: Networks of
cocaine users in an urban area. J Quant Anthropol 5(2):85–105
Snijders TA (2011) Statistical models for social networks. Rev Sociol 37:131–153
Sociopatterns. http://www.sociopatterns.org/datasets/kenyan-households-contact-network/. Accessed 26 Aug 2016
Tu S (2014) The dirichlet-multinomial and dirichlet-categorical models for bayesian inference. Computer Science Division,
UC Berkeley
Winter B (2013) Linear models and linear mixed effects models in r with linguistic applications. arXiv:1308.5499
Xiang R, Neville J, Rogati M (2010) Modeling relationship strength in online social networks. In: WWW. ACM, New York.
pp 981–990. doi:10.1145/1772690.1772790