Conference Paper

A Hypotheses-driven Bayesian Approach for Understanding Edge Formation in Attributed Multigraphs

Author affiliation:
  • GESIS – Leibniz Institute for the Social Sciences

Abstract

Understanding edge formation represents a key question in network analysis. Various approaches have been postulated across disciplines, ranging from network growth models to statistical (regression) methods. In this work, we extend this existing arsenal of methods with a hypotheses-driven Bayesian approach that allows one to intuitively compare hypotheses about edge formation on attributed multigraphs. We model the multiplicity of edges using a simple categorical model and propose to express hypotheses as priors encoding our belief about parameters. Using Bayesian model comparison techniques, we compare the relative plausibility of hypotheses that may be motivated by previous theories about edge formation based on popularity or similarity. We demonstrate the utility of our approach on synthetic and empirical data. This work is relevant for researchers interested in studying mechanisms that explain edge formation in networks.
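The core computation behind such a comparison can be sketched in a few lines: under a categorical model with a Dirichlet prior, the marginal likelihood (evidence) of the observed edge multiplicities has a closed form, and two hypotheses encoded as pseudo-counts can be compared via their Bayes factor. This is a minimal illustration, not the authors' implementation; the counts and the two prior vectors are invented.

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(counts, alpha):
    """Log marginal likelihood of a Dirichlet-categorical model:
    counts[k] = observed multiplicity of edge k, alpha[k] = prior pseudo-counts."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# Toy multigraph: multiplicities of four possible edges.
counts = np.array([9, 5, 1, 1])

# Two hypotheses expressed as Dirichlet pseudo-counts with equal total mass.
h_popularity = np.array([8.0, 4.0, 2.0, 2.0])   # favours high-degree endpoints
h_uniform    = np.array([4.0, 4.0, 4.0, 4.0])   # all edges equally likely

log_bf = log_evidence(counts, h_popularity) - log_evidence(counts, h_uniform)
print(f"log Bayes factor (popularity vs. uniform): {log_bf:.2f}")
```

A positive log Bayes factor indicates that the data are more plausible under the popularity-style prior than under the uniform one.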


... Clearly, this approach does not exploit all the information available in the data, and therefore may produce sub-optimal results [1]. A step forward in the analysis of weighted networks has recently been proposed in [6]. The authors introduce a Bayesian approach to edge formation, which allows one to encode a broad class of hypotheses about the formation of weighted edges in complex networks. ...
Article
We introduce a statistical method to investigate the impact of dyadic relations on complex networks generated from repeated interactions. It is based on generalised hypergeometric ensembles, a class of statistical network ensembles developed recently. We represent different types of known relations between system elements by weighted graphs, separated in the different layers of a multiplex network. With our method we can regress the influence of each relational layer, the independent variables, on the interaction counts, the dependent variables. Moreover, we can test the statistical significance of the relations as explanatory variables for the observed interactions. To demonstrate the power of our approach and its broad applicability, we will present examples based on synthetic and empirical data.
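The regression described here is based on generalised hypergeometric ensembles, for which no sketch is attempted; as a rough stand-in for the overall workflow (relational layers as covariates, dyadic interaction counts as the response, with significance tests per layer), a plain Poisson count regression is shown below. The covariate names and the simulated data are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_dyads = 200

# Two relational layers (independent variables) observed on each dyad,
# e.g. prior friendship and shared group membership (illustrative names).
friendship = rng.binomial(1, 0.3, n_dyads)
shared_grp = rng.binomial(1, 0.5, n_dyads)

# Interaction counts per dyad (dependent variable), generated for the demo.
rate = np.exp(0.2 + 0.8 * friendship + 0.4 * shared_grp)
counts = rng.poisson(rate)

X = sm.add_constant(np.column_stack([friendship, shared_grp]))
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.summary())  # coefficients and p-values per relational layer
```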
Article
Full-text available
Understanding edge formation represents a key question in network analysis. Various approaches have been postulated across disciplines, ranging from network growth models to statistical (regression) methods. In this work, we extend this existing arsenal of methods with JANUS, a hypothesis-driven Bayesian approach that allows one to intuitively compare hypotheses about edge formation in multigraphs. We model the multiplicity of edges using a simple categorical model and propose to express hypotheses as priors encoding our belief about parameters. Using Bayesian model comparison techniques, we compare the relative plausibility of hypotheses that may be motivated by previous theories about edge formation based on popularity or similarity. We demonstrate the utility of our approach on synthetic and empirical data. JANUS is relevant for researchers interested in studying mechanisms explaining edge formation in networks from both empirical and methodological perspectives.
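To make the "hypotheses as priors" step concrete, the sketch below turns a belief vector over candidate edges into Dirichlet pseudo-counts at several concentration levels and compares the resulting evidences, which is the general pattern such comparisons follow. The elicitation rule (normalise, scale by kappa, add one pseudo-count) and all numbers are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(counts, alpha):
    """Closed-form log evidence of counts under a Dirichlet-categorical model."""
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def elicit_prior(belief, kappa):
    """Turn a non-negative belief vector over possible edges into Dirichlet
    pseudo-counts: normalise, then scale by the concentration kappa."""
    belief = np.asarray(belief, float)
    return 1.0 + kappa * belief / belief.sum()   # +1 keeps the prior proper

counts = np.array([9, 5, 1, 1])                  # observed edge multiplicities
beliefs = {"popularity": np.array([4.0, 2.0, 1.0, 1.0]),
           "uniform":    np.ones(4)}

for kappa in (1, 10, 100):                       # growing trust in each hypothesis
    ev = {name: log_evidence(counts, elicit_prior(b, kappa))
          for name, b in beliefs.items()}
    print(kappa, {k: round(v, 2) for k, v in ev.items()})
```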
Article
Full-text available
Statistical ensembles define probability spaces of all networks consistent with given aggregate statistics and have become instrumental in the analysis of relational data on networked systems. Their numerical and analytical study provides the foundation for the inference of topological patterns, the definition of network-analytic measures, as well as for model selection and statistical hypothesis testing. Contributing to the foundation of these important data science techniques, in this article we introduce generalized hypergeometric ensembles, a framework of analytically tractable statistical ensembles of finite, directed and weighted networks. This framework can be interpreted as a generalization of the classical configuration model, which is commonly used to randomly generate networks with a given degree sequence or distribution. Our generalization rests on the introduction of dyadic link propensities, which capture the degree-corrected tendencies of pairs of nodes to form edges between each other. Studying empirical and synthetic data, we show that our approach provides broad perspectives for community detection, model selection and statistical hypothesis testing.
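To illustrate how degrees and dyadic propensities combine in this kind of ensemble, the toy sketch below builds a matrix of stub-pairing combinations from a degree sequence, multiplies it by assumed dyadic propensities, and draws multi-edge counts. The exact ensemble is a multivariate hypergeometric distribution; the multinomial draw here is a simplified stand-in, and the degree sequence and propensity values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
out_deg = np.array([3, 2, 1, 2])     # illustrative degree sequence
in_deg  = np.array([2, 2, 2, 2])
m = out_deg.sum()                    # number of edges to place

# Configuration-model stub-pairing combinations for each directed dyad (i, j).
xi = np.outer(out_deg, in_deg).astype(float)
np.fill_diagonal(xi, 0.0)            # no self-loops in this toy example

# Dyadic propensities: extra tendency of some pairs to connect (e.g. homophily).
omega = np.ones_like(xi)
omega[0, 1] = omega[1, 0] = 3.0      # assumed: nodes 0 and 1 are 'similar'

weights = (xi * omega).ravel()
edges = rng.multinomial(m, weights / weights.sum()).reshape(xi.shape)
print(edges)                         # sampled multi-edge counts per dyad
```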
Article
Full-text available
Electronic supplementary material: The online version of this article (doi:10.1140/epjds/s13688-016-0084-2) contains supplementary material.
Article
Full-text available
One of the most frequently used models for understanding human navigation on the Web is the Markov chain model, where Web pages are represented as states and hyperlinks as probabilities of navigating from one page to another. Predominantly, human navigation on the Web has been thought to satisfy the memoryless Markov property stating that the next page a user visits only depends on her current page and not on previously visited ones. This idea has found its way in numerous applications such as Google's PageRank algorithm and others. Recently, new studies suggested that human navigation may better be modeled using higher order Markov chain models, i.e., the next page depends on a longer history of past clicks. Yet, this finding is preliminary and does not account for the higher complexity of higher order Markov chain models which is why the memoryless model is still widely used. In this work we thoroughly present a diverse array of advanced inference methods for determining the appropriate Markov chain order. We highlight strengths and weaknesses of each method and apply them for investigating memory and structure of human navigation on the Web. Our experiments reveal that the complexity of higher order models grows faster than their utility, and thus we confirm that the memoryless model represents a quite practical model for human navigation on a page level. However, when we expand our analysis to a topical level, where we abstract away from specific page transitions to transitions between topics, we find that the memoryless assumption is violated and specific regularities can be observed. We report results from experiments with two types of navigational datasets (goal-oriented vs. free form) and observe interesting structural differences that make a strong argument for more contextual studies of human navigation in future work.
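The paper surveys several inference methods for choosing the Markov chain order; the snippet below shows just one standard criterion (BIC) on a toy navigation sequence, to illustrate how higher-order models are penalised for their extra parameters. The sequence is random and the state space size is invented.

```python
import numpy as np
from collections import Counter

def log_lik(seq, order):
    """Maximum-likelihood log-likelihood of a Markov chain of the given order."""
    ctx, trans = Counter(), Counter()
    for i in range(order, len(seq)):
        c, s = tuple(seq[i - order:i]), seq[i]
        ctx[c] += 1
        trans[(c, s)] += 1
    return sum(n * np.log(n / ctx[c]) for (c, _), n in trans.items())

rng = np.random.default_rng(2)
pages = list(rng.integers(0, 5, size=2000))      # toy navigation sequence, 5 pages

for k in (1, 2, 3):
    ll = log_lik(pages, k)
    n_params = (5 ** k) * (5 - 1)                # free parameters of an order-k chain
    bic = -2 * ll + n_params * np.log(len(pages) - k)
    print(f"order {k}: logL = {ll:.1f}, BIC = {bic:.1f}")
```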
Article
Full-text available
This text is a conceptual introduction to mixed effects modeling with linguistic applications, using the R programming environment. The reader is introduced to linear modeling and assumptions, as well as to mixed effects/multilevel modeling, including a discussion of random intercepts, random slopes and likelihood ratio tests. The example used throughout the text focuses on the phonetic analysis of voice pitch data.
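The tutorial itself works in R with lme4; purely as a Python analogue of its running example (voice pitch with a per-speaker random intercept), a hedged sketch using statsmodels is shown below. The data are simulated and all variable names are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
subjects = np.repeat(np.arange(12), 10)           # 12 speakers, 10 items each
polite = np.tile([0, 1], 60)                      # politeness condition
subj_icpt = rng.normal(0, 20, 12)[subjects]       # random intercept per speaker
pitch = 200 - 15 * polite + subj_icpt + rng.normal(0, 10, 120)

df = pd.DataFrame({"pitch": pitch, "polite": polite, "subject": subjects})

# Random-intercept model: fixed effect of politeness, varying baseline per speaker.
m = smf.mixedlm("pitch ~ polite", df, groups=df["subject"]).fit()
print(m.summary())
```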
Article
Full-text available
Directed graph (or digraph) data arise in many fields, especially in contemporary research on structures of social relationships. We describe an exponential family of distributions that can be used for analyzing such data. A substantive rationale for the general model is presented, and several special cases are discussed along with some possible substantive interpretations. A computational algorithm based on iterative scaling procedures for use in fitting data is described, as are the results of a pilot simulation study. An example using previously reported empirical data is worked out in detail. An extension to multiple relationship data is discussed briefly.
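For orientation, one common way to write the dyad-independent exponential family sketched here (the Holland-Leinhardt p1 model) is given below; the notation is the standard one rather than anything defined on this page.

```latex
% p1 family for a digraph X = (x_{ij}): overall rate theta, reciprocity rho,
% expansiveness (out-degree) effects alpha_i and attractiveness (in-degree)
% effects beta_j.
\Pr(X = x) \;\propto\; \exp\Big( \theta\, x_{++}
    \;+\; \rho \sum_{i<j} x_{ij} x_{ji}
    \;+\; \sum_i \alpha_i\, x_{i+}
    \;+\; \sum_j \beta_j\, x_{+j} \Big),
\qquad x_{i+} = \sum_j x_{ij},\quad x_{+j} = \sum_i x_{ij},\quad x_{++} = \sum_{i,j} x_{ij}.
```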
Article
Full-text available
The principle that 'popularity is attractive' underlies preferential attachment, which is a common explanation for the emergence of scaling in growing networks. If new connections are made preferentially to more popular nodes, then the resulting distribution of the number of connections possessed by nodes follows power laws, as observed in many real networks. Preferential attachment has been directly validated for some real networks (including the Internet), and can be a consequence of different underlying processes based on node fitness, ranking, optimization, random walks or duplication. Here we show that popularity is just one dimension of attractiveness; another dimension is similarity. We develop a framework in which new connections optimize certain trade-offs between popularity and similarity, instead of simply preferring popular nodes. The framework has a geometric interpretation in which popularity preference emerges from local optimization. As opposed to preferential attachment, our optimization framework accurately describes the large-scale evolution of technological (the Internet), social (trust relationships between people) and biological (Escherichia coli metabolic) networks, predicting the probability of new links with high precision. The framework that we have developed can thus be used for predicting new links in evolving networks, and provides a different perspective on preferential attachment as an emergent phenomenon.
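A stripped-down sketch of the popularity-similarity trade-off is shown below: each new node attaches to the existing nodes that minimise the product of birth rank (a popularity proxy) and angular distance (a similarity proxy). The parameters and the linear trade-off are illustrative simplifications of the published model, which also includes effects such as popularity fading.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 2                            # nodes to add, links per new node (illustrative)
angles = rng.uniform(0, 2 * np.pi, n)    # each node's position on a 'similarity' circle
edges = []

for t in range(1, n):
    s = np.arange(t)                                 # existing nodes, ordered by birth
    d_ang = np.abs(angles[:t] - angles[t])
    d_ang = np.minimum(d_ang, 2 * np.pi - d_ang)     # angular (similarity) distance
    score = (s + 1) * d_ang                          # popularity x similarity trade-off
    for target in np.argsort(score)[:m]:             # attach to the best trade-offs,
        edges.append((t, int(target)))               # not simply to the most popular node

degrees = np.bincount(np.array(edges).ravel(), minlength=n)
print(degrees.max(), degrees.mean())                 # older nodes accumulate more links
```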
Article
Full-text available
Statistical models for social networks as dependent variables must represent the typical network dependencies between tie variables such as reciprocity, homophily, transitivity, etc. This review first treats models for single (cross-sectionally observed) networks and then for network dynamics. For single networks, the older literature concentrated on conditionally uniform models. Various types of latent space models have been developed: for discrete, general metric, ultrametric, Euclidean, and partially ordered spaces. Exponential random graph models were proposed long ago but now are applied more and more thanks to the non-Markovian social circuit specifications that were recently proposed. Modeling network dynamics is less complicated than modeling single network observations because dependencies are spread out in time. For modeling network dynamics, continuous-time models are more fruitful. Actor-oriented models here provide a model that can represent many dependencies in a flexible way. Strong model development is now going on to combine the features of these models and to extend them to more complicated outcome spaces.
Article
Full-text available
Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.
Article
Full-text available
Implanted biomedical devices have the potential to revolutionize medicine. Smart sensors, which are created by combining sensing materials with integrated circuitry, are being considered for several biomedical applications such as a glucose level monitor or a retina prosthesis. These devices require the capability to communicate with an external computer system (base station) via a wireless interface. The limited power and computational capabilities of smart sensor based biological implants present research challenges in several aspects of wireless networking due to the need for having a bio-compatible, fault-tolerant, energy-efficient, and scalable design. Further, embedding these sensors in humans adds additional requirements. For example, the wireless networking solutions should be ultra-safe and reliable, work trouble-free in different geographical locations (although implants are typically not expected to move, they shouldn't restrict the movements of their human host), and require minimal maintenance. This necessitates application-specific solutions which are vastly different from traditional solutions. In this paper, we describe the potential of biomedical smart sensors. We then explain the challenges for wireless networking of human-embedded smart sensor arrays and our preliminary approach for wireless networking of a retina prosthesis. Our aim is to motivate vigorous research in this area by illustrating the need for more application-specific and novel approaches toward developing wireless networking solutions for human-implanted smart sensors.
Chapter
In most design set-ups, such as block designs or row-column designs, classification effects such as block effects or row (column) effects are regarded as fixed. When these are instead considered random variables, we have one or more additional sources of information for estimating treatment effect parameters. Such models are known as mixed effects models. In experimental design, they were first introduced by Yates (1939).
Article
Article Outline: Glossary; Definition of the Subject; Introduction; Notation and Terminology; Dependence Hypotheses; Bernoulli Random Graph (Erdös-Rényi) Models; Dyadic Independence Models; Markov Random Graphs; Simulation and Model Degeneracy; Social Circuit Dependence: Partial Conditional Dependence Hypotheses; Social Circuit Specifications; Estimation; Goodness of Fit and Comparisons with Markov Models; Further Extensions and Future Directions; Bibliography. Exponential random graph models, also known as p∗ models, constitute a family of statistical models for social networks. The importance of this modeling framework lies in its capacity to represent social structural effects commonly observed in many human social networks, including general degree-based effects as well as reciprocity and transitivity, and, at the node level, homophily and attribute-based activity and popularity effects. The models can be derived from explicit hypotheses about dependencies among network ties. They are parametrized in terms of the prevalence of small subgraphs (configurations) in the network and can be interpreted as describing the combinations of local social processes from which a given network emerges. The models are estimable from data and readily simulated. Versions of the models have been proposed for univariate and multivariate networks, valued networks, bipartite graphs and for longitudinal network data. Nodal attribute data can be incorporated in social selection models, and through an analogous framework for social influence models. The modeling approach was first proposed in the statistical literature in the mid-1980s, building on previous work in the spatial statistics and statistical mechanics literature. In the 1990s, the models were picked up and extended by the social networks research community. In this century, with the development of effective estimation and simulation procedures, there has been a growing understanding of certain inadequacies in the original form of the models. Recently developed specifications for these models have shown a substantial improvement in fitting real social network data, to the point where for many network data sets a large number of graph features can be successfully reproduced by the fitted models.
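As a compact reference, the general exponential random graph (p∗) form described above is usually written as follows; the configuration statistics g_A and parameters η_A are standard notation, not symbols defined on this page.

```latex
% General ERGM/p* form: the probability of a graph x is governed by counts
% g_A(x) of small configurations A (edges, reciprocated dyads, triangles,
% attribute-based terms, ...), each with a parameter eta_A, and a normalising
% constant kappa(eta).
\Pr(X = x) \;=\; \frac{1}{\kappa(\eta)} \exp\Big( \sum_{A} \eta_A\, g_A(x) \Big)
```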
Article
Graphs are an important interactive visualization tool. In machine learning, a graph can be built from observational data to represent pictorially the characteristics of a complex system. Normally, the difference between graphs can be used to predict the variance between systems. However, with little data it is hard to capture the real difference. Ensemble methods have therefore been proposed: they use multiple models to obtain better predictive performance, combining multiple hypotheses to form a better hypothesis that makes good predictions for a particular problem. In this work we propose a new ensemble approach for graph data: multiple hypothesis testing on the edges of a graph. The paper describes how to use this approach to compare two sets of graph-based models. To demonstrate the interest of the proposed approach, we experiment on two sets of simulated Bayesian networks.
Conference Paper
Online social networks have become ubiquitous to today's society and the study of data from these networks has improved our understanding of the processes by which relationships form. Research in statistical relational learning focuses on methods to exploit correlations among the attributes of linked nodes to predict user characteristics with greater accuracy. Concurrently, research on generative graph models has primarily focused on modeling network structure without attributes, producing several models that are able to replicate structural characteristics of networks such as power law degree distributions or community structure. However, there has been little work on how to generate networks with real-world structural properties and correlated attributes. In this work, we present the Attributed Graph Model (AGM) framework to jointly model network structure and vertex attributes. Our framework learns the attribute correlations in the observed network and exploits a generative graph model, such as the Kronecker Product Graph Model (KPGM) and Chung Lu Graph Model (CL), to compute structural edge probabilities. AGM then combines the attribute correlations with the structural probabilities to sample networks conditioned on attribute values, while keeping the expected edge probabilities and degrees of the input graph model. We outline an efficient method for estimating the parameters of AGM, as well as a sampling method based on Accept-Reject sampling to generate edges with correlated attributes. We demonstrate the efficiency and accuracy of our AGM framework on two large real-world networks, showing that AGM scales to networks with hundreds of thousands of vertices, as well as having high attribute correlation.
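A bare-bones sketch of the idea of combining structural edge probabilities with attribute affinities is shown below; Chung-Lu probabilities stand in for the structural component, and the acceptance step is only a flavour of accept-reject sampling. It omits the framework's correction that preserves the input model's expected degrees, and the degrees, attributes and affinity values are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
deg = np.array([5, 4, 3, 3, 2, 2, 1, 1], float)      # target degrees (illustrative)
attr = rng.integers(0, 2, len(deg))                  # one binary attribute per node
affinity = np.array([[0.7, 0.3],                     # assumed attribute affinity:
                     [0.3, 0.7]])                    # same-valued pairs link more often

# Chung-Lu structural probabilities stand in for the structural graph model.
p_struct = np.minimum(np.outer(deg, deg) / deg.sum(), 1.0)
np.fill_diagonal(p_struct, 0.0)

# Accept-reject flavoured sampling: keep structurally proposed edges with a
# probability proportional to the attribute affinity of their endpoints.
accept = affinity[attr[:, None], attr[None, :]] / affinity.max()
adj = rng.random(p_struct.shape) < p_struct * accept
adj = np.triu(adj, 1)                                # undirected, no self-loops
print(np.argwhere(adj))                              # sampled edge list
```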
Conference Paper
The recent interest in networks (social, physical, communication, information, etc.) has fueled a great deal of research on the analysis and modeling of graphs. However, many of the analyses have focused on a single large network (e.g., a subnetwork sampled from Facebook). Although several studies have compared networks from different domains or samples, they largely focus on empirical exploration of network similarities rather than explicit tests of hypotheses. This is in part due to a lack of statistical methods to determine whether two large networks are likely to have been drawn from the same underlying graph distribution. Research on across-network hypothesis testing methods has been limited by (i) difficulties associated with obtaining a set of networks to reason about the underlying graph distribution, and (ii) limitations of current statistical models of graphs that make it difficult to represent variations across networks. In this paper, we exploit the recent development of mixed Kronecker Product Graph Models, which accurately capture the natural variation in real-world graphs, to develop a model-based approach for hypothesis testing in networks.
Article
In a 1935 paper and in his book Theory of Probability, Jeffreys developed a methodology for quantifying the evidence in favor of a scientific theory. The centerpiece was a number, now called the Bayes factor, which is the posterior odds of the null hypothesis when the prior probability on the null is one-half. Although there has been much discussion of Bayesian hypothesis testing in the context of criticism of P-values, less attention has been given to the Bayes factor as a practical tool of applied statistics. In this article we review and discuss the uses of Bayes factors in the context of five scientific applications in genetics, sports, ecology, sociology, and psychology, and emphasize several points about their practical use.
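For reference, the Bayes factor is the ratio of marginal likelihoods of the data under two hypotheses, and it converts prior odds into posterior odds; when the prior odds are one (prior probability one-half on each hypothesis), the posterior odds equal the Bayes factor, which is the statement made above.

```latex
% Bayes factor of H_1 against H_0 for data D, and its role in updating odds.
B_{10} \;=\; \frac{\Pr(D \mid H_1)}{\Pr(D \mid H_0)},
\qquad
\underbrace{\frac{\Pr(H_1 \mid D)}{\Pr(H_0 \mid D)}}_{\text{posterior odds}}
\;=\; B_{10} \times
\underbrace{\frac{\Pr(H_1)}{\Pr(H_0)}}_{\text{prior odds}}
```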
Article
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful, e.g., for improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors on the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaptation of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains, including website navigation, business reviews and online music plays. Our work expands the repertoire of methods available for studying human trails on the Web.
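The evidence this construction relies on is standard: with independent Dirichlet priors on the rows of a first-order Markov chain, the marginal likelihood of the observed transition counts factorises over rows, and hypotheses can be ranked by this evidence as the prior concentration grows. Written out (notation n_{ij}, α_{ij} is standard, not taken from this page):

```latex
% Marginal likelihood of transition counts n_{ij} under row-wise Dirichlet
% priors with pseudo-counts alpha_{ij} elicited from a hypothesis H.
\Pr(D \mid H) \;=\; \prod_i
  \frac{\Gamma\!\big(\sum_j \alpha_{ij}\big)}{\Gamma\!\big(\sum_j (n_{ij} + \alpha_{ij})\big)}
  \prod_j \frac{\Gamma(n_{ij} + \alpha_{ij})}{\Gamma(\alpha_{ij})}
```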
Article
An introduction to doing Bayesian data analysis
This full-day tutorial shows you how to do Bayesian data analysis, hands on. The software is free. The intended audience is graduate students and other researchers who want a ground-floor introduction to Bayesian data analysis. No mathematical expertise is presumed. If you can handle a few minutes of summation notation like ∑_i x_i and integral notation like ∫ x dx, you're good to go. Complete computer programs will be worked through, step by step. Topics:
  • Familiarization with software: R, BRugs, BUGS. See installation instructions before arriving at the tutorial.
  • Uncertainty and Bayes' rule: Application to the rational estimation of parameters and models, given data.
  • Markov chain Monte Carlo: Why it's needed, how it works, and doing it in BUGS.
  • Hierarchical models: Flexibility for modeling individual differences, group effects, repeated measures, etc.
  • Bayesian (multiple) linear regression: Bayesian inference reveals trade-offs in credible regression coefficients.
  • Bayesian analysis of variance: Encourages thorough multiple comparisons, with no need for balanced designs.
  • Bayesian power analysis and replication probability: Straightforward meaning and computation.
Article
A large number of published studies have examined the properties of either networks of citation among scientific papers or networks of coauthorship among scientists. Here we study an extensive data set covering more than a century of physics papers published in the Physical Review, which allows us to construct both citation and coauthorship networks for the same set of papers. We analyze these networks to gain insight into temporal changes in citation and collaboration over the long time period of the data, as well as correlations and interactions between the two. Among other things, we investigate the change over time in the number of publishing authors, the number of papers they publish, and the number of others with whom they collaborate, changes in the typical number of citations made and received, the extent to which individuals tend to cite themselves or their collaborators more than others, the extent to which they cite themselves or their collaborators more quickly after publication, and the extent to which they tend to return the favor of a citation from another scientist.
Article
This article provides an introductory summary to the formulation and application of exponential random graph models for social networks. The possible ties among nodes of a network are regarded as random variables, and assumptions about dependencies among these random tie variables determine the general form of the exponential random graph model for the network. Examples of different dependence assumptions and their associated models are given, including Bernoulli, dyad-independent and Markov random graph models. The incorporation of actor attributes in social selection models is also reviewed. Newer, more complex dependence assumptions are briefly outlined. Estimation procedures are discussed, including new methods for Monte Carlo maximum likelihood estimation. We foreshadow the discussion taken up in other papers in this special edition: that the homogeneous Markov random graph models of Frank and Strauss [Frank, O., Strauss, D., 1986. Markov graphs. Journal of the American Statistical Association 81, 832–842] are not appropriate for many observed networks, whereas the new model specifications of Snijders et al. [Snijders, T.A.B., Pattison, P., Robins, G.L., Handcock, M. New specifications for exponential random graph models. Sociological Methodology, in press] offer substantial improvement.
Article
The quadratic assignment paradigm developed in operations research is discussed as a general approach to data analysis tasks characterized by the use of proximity matrices. Data analysis problems are first classified as being either static or nonstatic. The term "static" implies the evaluation of a detailed substantive hypothesis that is posited without the aid of actual data. Alternatively, the term "nonstatic" suggests a search for a particular type of relational structure within the obtained proximity matrix and without prior statement of a specific conjecture. Although the static class of problems is directly related to several inference procedures commonly used in classical statistics, the major emphases in this paper are on applying a general computational heuristic to attack the nonstatic problem and on using the quadratic assignment orientation to discuss a variety of research tactics of importance in the behavioral sciences and, particularly, in psychology. An extensive set of numerical examples is given illustrating the application of the search procedure to hierarchical clustering, the identification of homogeneous object subsets, linear and circular seriation, and a discrete version of multidimensional scaling.
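A minimal QAP-style permutation test for association between two proximity matrices is sketched below: the correlation of the off-diagonal entries is compared against a null distribution obtained by permuting the rows and columns of one matrix with the same node permutation. The toy matrices and their sizes are invented.

```python
import numpy as np

def qap_corr_test(A, B, n_perm=2000, seed=0):
    """QAP test: correlate off-diagonal entries of two proximity matrices, then
    build the null by relabelling the nodes of A (permuting rows and columns
    together, which preserves its dyadic structure)."""
    rng = np.random.default_rng(seed)
    mask = ~np.eye(len(A), dtype=bool)
    obs = np.corrcoef(A[mask], B[mask])[0, 1]
    null = np.empty(n_perm)
    for k in range(n_perm):
        p = rng.permutation(len(A))
        null[k] = np.corrcoef(A[np.ix_(p, p)][mask], B[mask])[0, 1]
    return obs, (np.abs(null) >= abs(obs)).mean()     # two-sided p-value

rng = np.random.default_rng(6)
friend = rng.binomial(1, 0.3, (15, 15))               # toy dyadic matrices
advice = (friend + rng.binomial(1, 0.2, (15, 15))) > 0
print(qap_corr_test(friend.astype(float), advice.astype(float)))
```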
Article
In this paper, we study the linking patterns and discussion topics of political bloggers. Our aim is to measure the degree of interaction between liberal and conservative blogs, and to uncover any differences in the structure of the two communities. Specifically, we analyze the posts of 40 "A-list" blogs over the period of two months preceding the U.S. Presidential Election of 2004, to study how often they referred to one another and to quantify the overlap in the topics they discussed, both within the liberal and conservative communities, and also across communities. We also study a single day snapshot of over 1,000 political blogs. This snapshot captures blogrolls (the list of links to other blogs frequently found in sidebars), and presents a more static picture of a broader blogosphere. Most significantly, we find differences in the behavior of liberal and conservative blogs, with conservative blogs linking to each other more frequently and in a denser pattern.
Article
This paper argues that the quadratic assignment procedure (QAP) is superior to OLS for testing hypotheses in both simple and multiple regression models based on dyadic data, such as found in network analysis. A model of autocorrelation is proposed that is consistent with the assumptions of dyadic data. Results of Monte Carlo simulations indicate that OLS analysis is statistically biased, with the degree of bias varying as a function of the amount of structural autocorrelation. On the other hand, the simulations demonstrate that QAP is relatively unbiased. The Sampson data are used to illustrate the QAP multiple regression procedure and a general method of testing whether the results are statistically biased.
Conference Paper
Networks arising from social, technological and natural domains exhibit rich connectivity patterns, and nodes in such networks are often labeled with attributes or features. We address the question of modeling the structure of networks where nodes have attribute information. We present a Multiplicative Attribute Graph (MAG) model that considers nodes with categorical attributes and models the probability of an edge as the product of individual attribute link formation affinities. We develop a scalable variational expectation maximization parameter estimation method. Experiments show that the MAG model reliably captures network connectivity and also provides insights into how different attributes shape the network structure.
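The multiplicative edge probability described above can be illustrated in a few lines: each attribute contributes one factor taken from a per-attribute affinity matrix, and the product gives the dyad's edge probability. The number of nodes, attributes and the affinity values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_nodes, n_attr = 6, 3
attrs = rng.integers(0, 2, (n_nodes, n_attr))        # categorical (binary) attributes

# One link-affinity matrix per attribute (illustrative values): entry [a, b] is
# the contribution of attribute values (a, b) to the edge probability.
theta = np.array([[[0.8, 0.4], [0.4, 0.6]]] * n_attr)

def edge_prob(u, v):
    """Multiplicative edge probability: the product of per-attribute affinities."""
    return np.prod([theta[k, attrs[u, k], attrs[v, k]] for k in range(n_attr)])

P = np.array([[edge_prob(u, v) for v in range(n_nodes)] for u in range(n_nodes)])
adj = (rng.random(P.shape) < P) & ~np.eye(n_nodes, dtype=bool)   # sample a graph
print(P.round(2))
```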
Conference Paper
Previous work analyzing social networks has mainly focused on binary friendship relations. However, in online social networks the low cost of link formation can lead to networks with heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together). In this case, the binary friendship indicator provides only a coarse representation of relationship information. In this work, we develop an unsupervised model to estimate relationship strength from interaction activity (e.g., communication, tagging) and user similarity. More specifically, we formulate a link-based latent variable model, along with a coordinate ascent optimization procedure for the inference. We evaluate our approach on real-world data from Facebook, showing that the estimated link weights result in higher autocorrelation and lead to improved classification accuracy.
Article
Stochastic blockmodels have been proposed as a tool for detecting community structure in networks as well as for generating synthetic networks for use as benchmarks. Most blockmodels, however, ignore variation in vertex degree, making them unsuitable for applications to real-world networks, which typically display broad degree distributions that can significantly affect the results. Here we demonstrate how the generalization of blockmodels to incorporate this missing element leads to an improved objective function for community detection in complex networks. We also propose a heuristic algorithm for community detection using this objective function or its non-degree-corrected counterpart and show that the degree-corrected version dramatically outperforms the uncorrected one in both real-world and synthetic networks.
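The objective referred to here is commonly written, up to constants and within the degree-corrected formulation, as the profile log-likelihood below; m_{rs} and κ_r are the standard symbols, not notation defined on this page.

```latex
% Degree-corrected blockmodel objective for a partition g (up to constants):
% m_{rs} is the number of edges between groups r and s, and kappa_r is the sum
% of degrees of the nodes in group r. Maximising this over partitions performs
% degree-corrected community detection.
\mathcal{L}(g) \;=\; \sum_{r,s} m_{rs} \,\log \frac{m_{rs}}{\kappa_r\,\kappa_s}
```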
Article
Thesis (Ph.D.), Cornell University, 1968. Includes bibliographical references (leaves 554–575). Microfilm.
Tu, S.: The Dirichlet-Multinomial and Dirichlet-Categorical Models for Bayesian Inference. Computer Science Division, UC Berkeley (2014)