Article

Extended stochastic block models with application to criminal networks


Abstract

Reliably learning group structures among nodes in network data is challenging in several applications. We are particularly motivated by studying covert networks that encode relationships among criminals. These data are subject to measurement errors, and exhibit a complex combination of an unknown number of core-periphery, assortative and disassortative structures that may unveil key architectures of the criminal organization. The coexistence of these noisy block patterns limits the reliability of routinely-used community detection algorithms, and requires extensions of model-based solutions to realistically characterize the node partition process, incorporate information from node attributes, and provide improved strategies for estimation and uncertainty quantification. To cover these gaps, we develop a new class of extended stochastic block models (ESBM) that infer groups of nodes having common connectivity patterns via Gibbs-type priors on the partition process. This choice encompasses many realistic priors for criminal networks, covering solutions with fixed, random and infinite number of possible groups, and facilitates the inclusion of node attributes in a principled manner. Among the new alternatives in our class, we focus on the Gnedin process as a realistic prior that allows the number of groups to be finite, random and subject to a reinforcement process coherent with criminal networks. A collapsed Gibbs sampler is proposed for the whole ESBM class, and refined strategies for estimation, prediction, uncertainty quantification and model selection are outlined. The ESBM performance is illustrated in realistic simulations and in an application to an Italian mafia network, where we unveil key complex block structures, mostly hidden from state-of-the-art alternatives.
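For orientation, the partition mechanism at the core of the ESBM is the predictive rule shared by all Gibbs-type priors; the display below is the standard form of that rule (our summary, not an excerpt from the paper). Writing z_1, ..., z_n for the group labels of the first n nodes, k for the current number of groups, and n_h for the size of group h, a Gibbs-type prior with discount parameter σ and weights V_{n,k} implies

\Pr(z_{n+1} = h \mid z_1, \dots, z_n) = \frac{V_{n+1,k}}{V_{n,k}} (n_h - \sigma), \quad h = 1, \dots, k, \qquad \Pr(z_{n+1} = k+1 \mid z_1, \dots, z_n) = \frac{V_{n+1,k+1}}{V_{n,k}}.

Specific choices of σ and V_{n,k} recover the Dirichlet-multinomial, the Dirichlet process, the Pitman-Yor process and the Gnedin process singled out in the abstract, which is what lets a single model class cover fixed, random and infinite numbers of groups.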


... Here we propose a variant of this move to better match our framework. Indeed, as a prior distribution for the clustering variables and the number of clusters, we adopt an MFM model, along with the supervision idea introduced by Legramanti et al. (2022). In this case, the absorb-eject (AE) move can further facilitate the estimation of clusters, but such a step can only be defined if the framework allows for the existence of empty clusters. ...
... Some exogenous node attributes may be available when analyzing real datasets, and, in this case, we leverage the supervision idea proposed by Legramanti et al. (2022) to account for such information in the modeling. We use c = {c_1, c_2, . . . ...
... We leverage conjugate prior distributions to marginalize a number of model parameters from the posterior distribution in (13). This methodology is also known as "collapsing" and has already been exploited in, for example, McDaid et al. (2012), Wyse and Friel (2012), Ryan et al. (2017), Rastelli, Latouche, et al. (2018), Legramanti et al. (2022), and Lu et al. (2024). By proposing the conjugate prior distributions: ...
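To make the "collapsing" referenced in these snippets concrete: in a binary SBM with node allocations z and independent Beta(a, b) priors on the block probabilities, the edge probabilities integrate out analytically, leaving (a standard result, in our notation)

p(Y \mid z) = \prod_{h \le k} \frac{B(a + m_{hk},\, b + \bar{m}_{hk})}{B(a, b)},

where B(·,·) is the Beta function, m_{hk} is the number of observed edges between groups h and k, and \bar{m}_{hk} the number of non-edges. Collapsed Gibbs samplers for the ESBM and its relatives iterate over exactly this kind of marginal.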
Preprint
The latent position network model (LPM) is a popular approach for the statistical analysis of network data. A central aspect of this model is that it assigns nodes to random positions in a latent space, such that the probability of an interaction between each pair of individuals or nodes is determined by their distance in this latent space. A key feature of this model is that it allows one to visualize nuanced structures via the latent space representation. The LPM can be further extended to the Latent Position Cluster Model (LPCM), to accommodate the clustering of nodes by assuming that the latent positions are distributed following a finite mixture distribution. In this paper, we extend the LPCM to accommodate missing network data and apply this to non-negative discrete weighted social networks. By treating missing data as "unusual" zero interactions, we propose a combination of the LPCM with the zero-inflated Poisson distribution. Statistical inference is based on a novel partially collapsed Markov chain Monte Carlo algorithm, where a Mixture-of-Finite-Mixtures (MFM) model is adopted to automatically determine the number of clusters and optimal group partitioning. Our algorithm features a truncated absorb-eject move, which is a novel adaptation of an idea commonly used in collapsed samplers, within the context of MFMs. Another aspect of our work is that we illustrate our results on 3-dimensional latent spaces, maintaining clear visualizations while achieving more flexibility than 2-dimensional models. The performance of this approach is illustrated via two carefully designed simulation studies, as well as four different publicly available real networks, where some interesting new perspectives are uncovered.
... Our simulation studies in Section 4 show that this fully Bayesian perspective provides remarkable performance improvements over both heuristic and model-based solutions for learning hierarchical multiscale structures in network data, including state-of-the-art generalizations of stochastic block models (e.g., Clauset et al., 2008). These gains are confirmed in applications to multiple corrupted measurements of a Mafia-type criminal network (Calderoni et al., 2017; Legramanti et al., 2022) (see Section 5.1), and to replicated brain network data observed from different individuals (Craddock et al., 2013; Kiar et al., 2017) (see Section 5.2). In the former case, phylnet unveils a previously-unexplored tree-based reconstruction of the organizational architecture of the Mafia group under analysis, and highlights criminals with highly peculiar positions within the hierarchy. ...
... of V = 84 criminals to 47 monitored summits of the criminal organization, as reported in the judicial documents. Legramanti et al. (2022) and Lu et al. (2025) map this information into a single network with edges denoting either dichotomized (Legramanti et al., 2022) or weighted (Lu et al., 2025) co-attendances among each pair of criminals to the monitored summits. Motivated by the above considerations, we consider instead a hypothetical scenario in which each of M = 10 law-enforcement agencies has monitored a different subset of 35 summits sampled at random and with replacement from the total of 47. ...
Preprint
Latent space models for network data characterize each node through a vector of latent features whose pairwise similarities define the edge probabilities among pairs of nodes. Although this formulation has led to successful implementations and impactful extensions, the overarching focus has been on directly inferring node embeddings through the latent features rather than learning the generative process underlying the embedding. This focus prevents borrowing information among the features of different nodes and fails to infer complex higher-level architectures regulating the formation of the network itself. For example, routinely-studied networks often exhibit multiscale structures informing on nested modular hierarchies among nodes that could be learned via tree-based representations of dependencies among latent features. We pursue this direction by developing an innovative phylogenetic latent space model that explicitly characterizes the generative process of the nodes' feature vectors via a branching Brownian motion, with branching structure parametrized by a phylogenetic tree. This tree constitutes the main object of interest and is learned under a Bayesian perspective to infer tree-based modular hierarchies among nodes that explain heterogeneous multiscale patterns in the network. Identifiability results are derived along with posterior consistency theory, and the inference potentials of the newly-proposed model are illustrated in simulations and two real-data applications from criminology and neuroscience, where our formulation learns core structures hidden to state-of-the-art alternatives.
... Although there has been a recent adoption of sophisticated statistical methods to study criminal networks (e.g., Malm et al., 2017; Charette and Papachristos, 2017; Calderoni et al., 2017; Bright et al., 2019; Diviák et al., 2019; Gollini et al., 2020; Cavallaro et al., 2020; Legramanti et al., 2022), none of the currently-available solutions provides a generative model that can flexibly incorporate, and infer, core structures of illicit organizations not only in the strength of the observed ties among criminals, but also in systematic sparsity patterns, to ultimately unveil the nature of the efficiency-security tradeoffs. In fact, popular link-prediction methods mainly rely on descriptive solutions applied to a dichotomized version of the observed weighted criminal networks (Berlusconi et al., 2016; Calderoni et al., 2020). ...
... A noteworthy attempt to move towards more structured model-based representations of the complex group interactions in modern criminal networks can be found in the extended stochastic block model of Legramanti et al. (2022). Albeit providing important advancements in inference on redundancy patterns relative to previous studies relying on community detection algorithms (Girvan and Newman, 2002; Newman, 2006; Blondel et al., 2008) and spectral clustering methods (Von Luxburg, 2007), this perspective still focuses on dichotomized versions of weighted criminal networks. ...
... In fact, in order to infer an excess of zero ties pointing towards systematic obscuration mechanisms, it is necessary to possess a benchmark distribution for the weighted interactions which allows one to quantify the extent to which the total number of observed zero connections is unusual relative to those expected under such a distribution. As clarified in Section 2, the proposed ZIP-SBM addresses these challenges via a novel generalization of extended stochastic block models (Legramanti et al., 2022) in the context of weighted and sparse criminal networks where the focus is to quantify efficiency-security tradeoff architectures. This is accomplished by avoiding data dichotomization prior to statistical modeling while relying on block-specific zero-inflated Poisson distributions, rather than Bernoulli ones, for the ties among groups of redundant criminals. ...
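The block-specific zero-inflation described here has a simple generative form. In our notation (the paper's parametrization may differ), a weighted tie y between two criminals whose groups form block (h, k), with obscuration probability π_{hk} and interaction rate λ_{hk}, follows

p(y = 0) = \pi_{hk} + (1 - \pi_{hk})\, e^{-\lambda_{hk}}, \qquad p(y = w) = (1 - \pi_{hk})\, e^{-\lambda_{hk}} \lambda_{hk}^{w} / w!, \quad w = 1, 2, \dots,

so the excess of zero ties (security) is quantified separately from the intensity of the observed interactions (efficiency) within each block.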
Preprint
Criminal networks arise from the unique attempt to balance a need of establishing frequent ties among affiliates to facilitate the coordination of illegal activities, with the necessity to sparsify the overall connectivity architecture to hide from law enforcement. This efficiency-security tradeoff is also combined with the creation of groups of redundant criminals that exhibit similar connectivity patterns, thus guaranteeing resilient network architectures. State-of-the-art models for such data are not designed to infer these unique structures. In contrast to such solutions we develop a computationally-tractable Bayesian zero-inflated Poisson stochastic block model (ZIP-SBM), which identifies groups of redundant criminals with similar connectivity patterns, and infers both overt and covert block interactions within and across such groups. This is accomplished by modeling weighted ties (corresponding to counts of interactions among pairs of criminals) via zero-inflated Poisson distributions with block-specific parameters that quantify complex patterns in the excess of zero ties in each block (security) relative to the distribution of the observed weighted ties within that block (efficiency). The performance of ZIP-SBM is illustrated in simulations and in a study of summit co-attendances in a complex Mafia organization, where we unveil efficiency-security structures adopted by the criminal organization that were hidden to previous analyses.
... [Figure caption: side colors refer to layer (locali) division; see also Section 5.] ... membership information as a categorical attribute, and then rely on available attribute-assisted methods for single-layered networks (e.g., Tallberg, 2004; Xu et al., 2012; Yang et al., 2013; Sweet, 2015; Newman and Clauset, 2016; Zhang et al., 2016; Binkiewicz et al., 2017; Stanley et al., 2019; Yan and Sarkar, 2021; Mele et al., 2022; Legramanti et al., 2022). Although these solutions provide useful extensions of community detection algorithms, spectral clustering and SBMs, none of them jointly addresses the above desiderata. ...
... The extensive simulation studies (Section 4) and the application to a multilayer criminal network (Section 5) showcase the practical gains in point estimation (including empirical evidence of frequentist posterior consistency as V grows), uncertainty quantification and prediction of pEx-SBMs, when compared to state-of-the-art competitors (Blondel et al., 2008; Zhang et al., 2016; Binkiewicz et al., 2017; Côme et al., 2021; Legramanti et al., 2022). Finally, as highlighted in Section 6, although the focus is on the ubiquitous binary undirected node-colored networks, suitable adaptations of the novel modeling framework underlying pEx-SBMs facilitate the inclusion of other relevant network settings. ...
... The proposed pEx-SBMs crucially address this shortcoming by embedding the layers' division information into the prior for the allocation vector z. Conversely, single-layered SBMs, which combine (1) with Dirichlet-multinomial (Nowicki and Snijders, 2001), Dirichlet process (Kemp et al., 2006), mixture-of-finite-mixture (Geng et al., 2019) or general unsupervised Gibbs-type (Legramanti et al., 2022) priors for z, would be conceptually and practically suboptimal, since these priors are not designed to incorporate structure from layer division. Recalling Section 1 and Figure 1, one expects nodes in the same layer to be more likely to exhibit similar connectivity patterns, with these patterns possibly varying both within and across layers. ...
Preprint
Multilayer networks generalize single-layered connectivity data in several directions. These generalizations include, among others, settings where multiple types of edges are observed among the same set of nodes (edge-colored networks) or where a single notion of connectivity is measured between nodes belonging to different pre-specified layers (node-colored networks). While progress has been made in statistical modeling of edge-colored networks, principled approaches that flexibly account for both within and across layer block-connectivity structures while incorporating layer information through a rigorous probabilistic construction are still lacking for node-colored multilayer networks. We fill this gap by introducing a novel class of partially exchangeable stochastic block models specified in terms of a hierarchical random partition prior for the allocation of nodes to groups, whose number is learned by the model. This goal is achieved without jeopardizing probabilistic coherence, uncertainty quantification and derivation of closed-form predictive within- and across-layer co-clustering probabilities. Our approach facilitates prior elicitation, the understanding of theoretical properties and the development of yet-unexplored predictive strategies for both the connections and the allocations of future incoming nodes. Posterior inference proceeds via a tractable collapsed Gibbs sampler, while performance is illustrated in simulations and in a real-world criminal network application. The notable gains achieved over competitors clarify the importance of developing general stochastic block models based on suitable node-exchangeability structures coherent with the type of multilayer network being analyzed.
... In this part, we introduce our new shrinkage prior. Our prior draws on insights from the network literature and, in particular, is based on a stochastic block model (SBM; see, e.g., Legramanti et al., 2022). ...
... We use an SSVS prior (George and McCulloch, 1993) to shrink the elements of the precision matrix to zero. However, instead of assuming that the indicators that control whether a given precision parameter should be forced to zero or not arise from a Bernoulli with a common (and often fixed) prior inclusion probability, we model the latter using an SBM (see, e.g., Holland et al., 1983; Nowicki and Snijders, 2001; Legramanti et al., 2022). The resulting model endogenously detects clusters and thus introduces shrinkage on the VAR precision matrix while taking possible within- and cross-cluster linkages into account. ...
... A binary representation of the graph can be obtained through the adjacency matrix ∆, which has elements δ_ij = 1 if an edge exists between nodes i and j and δ_ij = 0 otherwise. It is worth stressing that, in contrast to papers such as Legramanti et al. (2022), our adjacency matrix is latent and controls whether the elements in Ω are shrunk to zero or not. ...
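Pieced together from these snippets, the construction is a spike-and-slab prior whose inclusion indicators inherit block structure from an SBM. A schematic rendering, with Gaussian spike and slab components as in standard SSVS (our notation and an assumed parametrization, since the snippets are truncated):

\omega_{ij} \mid \delta_{ij} \sim (1 - \delta_{ij})\, N(0, \tau_0^2) + \delta_{ij}\, N(0, \tau_1^2), \quad \tau_0^2 \ll \tau_1^2, \qquad \delta_{ij} \mid z \sim \mathrm{Bernoulli}(\theta_{z_i z_j}), \quad \theta_{hk} \sim \mathrm{Beta}(a, b).

Two shocks assigned to the same cluster thus share a common, typically higher, prior probability of being conditionally related, which is the within- and cross-cluster shrinkage described above.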
Preprint
Full-text available
Commonly used priors for Vector Autoregressions (VARs) induce shrinkage on the autoregressive coefficients. Introducing shrinkage on the error covariance matrix is sometimes done but, in the vast majority of cases, without considering the network structure of the shocks and by placing the prior on the lower Cholesky factor of the precision matrix. In this paper, we propose a prior on the VAR error precision matrix directly. Our prior, which resembles a standard spike and slab prior, models variable inclusion probabilities through a stochastic block model that clusters shocks into groups. Within groups, the probability of having relations across group members is higher (inducing less sparsity) whereas relations across groups imply a lower probability that members of each group are conditionally related. We show in simulations that our approach recovers the true network structure well. Using a US macroeconomic data set, we illustrate how our approach can be used to cluster shocks together and that this feature leads to improved density forecasts.
... Specifically, we model individual ant-to-ant interaction networks via stochastic block models (Nowicki and Snijders, 2001), which are a variant of the mixture model in (1). See Kemp et al. (2006); Geng et al. (2019); Legramanti et al. (2022) for other applications of discrete nonparametric priors in community detection tasks. ...
... By modeling the latent partition via the Dirichlet process prior pr(Π_{n,s} | α) as in equation (2), we can flexibly find a grouping of the nodes with a similar edge distribution and thus infer the number of ant worker communities without pre-specifying an upper bound on the number of clusters. See Legramanti et al. (2022) and references therein for a description of the posterior sampling algorithm. ...
... As is evident from the leftmost panel of Figure 4, the Stirling-gamma prior enables additional vagueness about K_n as a direct consequence of choosing b = 0.2. We obtain the posterior partition in each model by running a collapsed Gibbs sampler as in Legramanti et al. (2022) for 40,000 iterations, treating the first 10,000 as burn-in. The full conditional for α in both Stirling-gamma processes is provided by Theorem 4, setting N = 4 and n = 149. ...
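Several snippets on this page refer to the collapsed Gibbs sampler of Legramanti et al. (2022) without showing it. Below is a minimal, deliberately brute-force sketch of one sweep, assuming a binary undirected network with zero diagonal, Beta(a, b) block priors and a CRP(alpha) partition prior (the σ = 0 Gibbs-type case); the function names are ours, and the full collapsed marginal is recomputed for every candidate move, which is slow but easy to verify.

import numpy as np
from scipy.special import betaln

def log_marginal(Y, z, a=1.0, b=1.0):
    # Collapsed log p(Y | z): Beta(a, b) block probabilities integrated out.
    labels = np.unique(z)
    out = 0.0
    for idx, h in enumerate(labels):
        for k in labels[idx:]:
            rows, cols = np.where(z == h)[0], np.where(z == k)[0]
            block = Y[np.ix_(rows, cols)]
            if h == k:  # within-block: unordered pairs, halve the symmetric sum
                pairs = len(rows) * (len(rows) - 1) // 2
                edges = int(block.sum()) // 2
            else:
                pairs = len(rows) * len(cols)
                edges = int(block.sum())
            out += betaln(a + edges, b + pairs - edges) - betaln(a, b)
    return out

def gibbs_sweep(Y, z, alpha=1.0, a=1.0, b=1.0, rng=None):
    # One collapsed sweep: reallocate each node given all the others, with
    # CRP weights n_h for existing groups and alpha for a brand-new group.
    rng = np.random.default_rng(rng)
    for i in range(Y.shape[0]):
        others = np.delete(np.arange(Y.shape[0]), i)
        labels, sizes = np.unique(z[others], return_counts=True)
        cands = list(labels) + [int(z.max()) + 1]   # unused label = new group
        logw = []
        for h, w in zip(cands, list(sizes) + [alpha]):
            z[i] = h
            logw.append(np.log(w) + log_marginal(Y, z, a, b))
        logw = np.array(logw)
        p = np.exp(logw - logw.max())
        z[i] = rng.choice(cands, p=p / p.sum())
    return z

Starting from, say, z = np.zeros(V, dtype=int) and calling gibbs_sweep repeatedly reproduces the overall logic of the runs described above (40,000 sweeps with a 10,000-sweep burn-in), up to the full-conditional updates for α, which are specific to the Stirling-gamma construction of that paper.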
Preprint
Full-text available
Dirichlet process mixtures are particularly sensitive to the value of the so-called precision parameter, which controls the behavior of the underlying latent partition. Randomization of the precision through a prior distribution is a common solution, which leads to more robust inferential procedures. However, existing prior choices do not allow for transparent elicitation, due to the lack of analytical results. We introduce and investigate a novel prior for the Dirichlet process precision, the Stirling-gamma distribution. We study the distributional properties of the induced random partition, with an emphasis on the number of clusters. Our theoretical investigation clarifies the reasons for the improved robustness properties of the proposed prior. Moreover, we show that, under specific choices of its hyperparameters, the Stirling-gamma distribution is conjugate to the random partition of a Dirichlet process. We illustrate with an ecological application the usefulness of our approach for the detection of communities of ant workers.
... Even traditional model-based solutions present some critical issues, namely in the specification of the number of clusters and the incorporation of node attributes. Hence, to analyze the considered brain network, we opt for the extended stochastic block model by Legramanti et al. (2022b), which allows one to infer the number of clusters and to incorporate node attributes. ...
... The extended stochastic block model (ESBM) proposed by Legramanti et al. (2022b) addresses these issues through a model-based framework that: (i) quantifies uncertainty in the inferred clustering through a Bayesian approach; (ii) allows the number of clusters to be fixed or random, and asymptotically finite or infinite, depending on the application; (iii) facilitates the incorporation of node attributes, favoring clusters that are homogeneous with respect to such attributes. ...
Conference Paper
Full-text available
Brain networks typically exhibit clusters of nodes with similar connectivity patterns. Moreover, for each node (brain region), attributes are available in the form of hemisphere and lobe memberships. Clustering brain regions based on their connectivity patterns and their attributes is then of substantial statistical interest when analyzing brain networks. However, the algorithms available for this task lack uncertainty quantification. Even traditional model-based solutions present some critical issues, namely in the specification of the number of clusters and the incorporation of node attributes. Hence, to analyze the considered brain network, we opt for the extended stochastic block model by Legramanti et al. (2022b), which allows one to infer the number of clusters and to incorporate node attributes.
... Gao et al. (2020) provide posterior concentration rates for the edge probabilities and show that their posterior mean achieves the minimax rate. Legramanti et al. (2022b) employ Gibbs-type partition priors, which generalise both the CRP and mixtures with a random number of components. ...
... In general, these approaches also require specification of prior edge inclusion probabilities jointly with the block structure prior. For instance, Kemp et al. (2006), Geng et al. (2018) and Legramanti et al. (2022b) place Beta distributions on the edge probabilities, and Reyes and Rodríguez (2016) and Jiang and Tokdar (2021) add structure by using different priors for within- and between-block edge probabilities, while Tan and De Iorio (2019) use a DP to build a joint prior on the partition of nodes and edge probabilities. Additionally, they extend the model to a degree-corrected blockmodel, i.e., they introduce a popularity parameter for each node. ...
... Note that the second equality uses the property that the priors on all remaining parameters in the model, such as β_i, are the same under M and M′, in such a way that we recover the same model specification as M when z = z′ in M′. Now, an estimate B̂ for B is obtained by plugging in the usual (in terms of sample frequency) estimate of p(z | Y) derived from the MCMC chain, while p(z′) is readily computed by numerical quadrature (e.g., Legramanti et al., 2022b). ...
Article
Full-text available
Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data. Inference is often focused on estimating individual edges in the latent graph. Nonetheless, there is increasing interest in inferring more complex structures, such as communities, for multiple reasons, including more effective information retrieval and better interpretability. Stochastic blockmodels offer a powerful tool to detect such structure in a network. We thus propose to exploit advances in random graph theory and embed them within the graphical models framework. A consequence of this approach is the propagation of the uncertainty in graph estimation to large-scale structure learning. We consider Bayesian nonparametric stochastic blockmodels as priors on the graph. We extend such models to consider clique-based blocks and to multiple graph settings, introducing a novel prior process based on a Dependent Dirichlet process. Moreover, we devise a tailored computation strategy of Bayes factors for block structure based on the Savage-Dickey ratio to test for presence of larger structure in a graph. We demonstrate our approach in simulations as well as on real data applications in finance and transcriptomics.
... Geng et al. (2018) place a mixture of finite mixtures prior on the partition and obtain posterior consistency results for the number of blocks. Legramanti et al. (2022) employ Gibbs-type partition priors which generalise both the CRP and the mixture of finite mixtures. In general, these approaches require also specification of prior edge inclusion probabilities jointly with the block structure prior. ...
... In general, these approaches also require specification of prior edge inclusion probabilities jointly with the block structure prior. For instance, Kemp et al. (2006), Geng et al. (2018) and Legramanti et al. (2022) place Beta distributions on the edge probabilities, while Tan and De Iorio (2019) use a DP to build a joint prior on the partition of nodes and edge probabilities. Additionally, they extend the model to a degree-corrected blockmodel, i.e., they introduce a popularity parameter for each node. ...
... More in detail, the Bayes factor for the relative evidence of z = z′ (model M′) over M is (e.g., Legramanti et al., 2022) p ...
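As far as these truncated snippets allow reconstruction, the quantity being computed in both excerpts is the prior-to-posterior ratio for a point hypothesis on the partition: for the model M′ that fixes z = z′ against the encompassing model M,

B = \frac{p(z = z' \mid Y)}{p(z = z')},

with the numerator estimated by the relative frequency of z′ in the MCMC output and the denominator available in closed form from the partition prior. This is the Savage-Dickey-type calculation named in the abstract below.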
Preprint
Full-text available
Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data. Inference is often focused on estimating individual edges in the latent graph. Nonetheless, there is increasing interest in inferring more complex structures, such as communities, for multiple reasons, including more effective information retrieval and better interpretability. Stochastic blockmodels offer a powerful tool to detect such structure in a network. We thus propose to exploit advances in random graph theory and embed them within the graphical models framework. A consequence of this approach is the propagation of the uncertainty in graph estimation to large-scale structure learning. We consider Bayesian nonparametric stochastic blockmodels as priors on the graph. We extend such models to consider clique-based blocks and to multiple graph settings, introducing a novel prior process based on a dependent Dirichlet process. Moreover, we devise a tailored computation strategy of Bayes factors for block structure based on the Savage-Dickey ratio to test for presence of larger structure in a graph. We demonstrate our approach in simulations as well as on real data applications in finance and transcriptomics.
... Any Gibbs-type process can be represented hierarchically, involving a suitable prior distribution for the key parameters H, α, and γ. Prior distributions for H in the σ < 0 case are discussed in Gnedin (2010) and De Blasi et al. (2013); see also Miller and Harrison (2018) for applications to mixture models and Legramanti et al. (2022) for employment in stochastic block models. In the σ = 0 case, a popular choice is the semi-conjugate Gamma prior for α as in Escobar and West (1995), with the Stirling-gamma prior of Zito et al. (2024) a recent alternative. ...
Preprint
Full-text available
Statistical inference on biodiversity has a rich history going back to R. A. Fisher. An influential ecological theory suggests the existence of a fundamental biodiversity number, denoted α, which coincides with the precision parameter of a Dirichlet process (DP). In this paper, motivated by this theory, we develop Bayesian nonparametric methods for statistical inference on biodiversity, building on the literature on Gibbs-type priors. We argue that σ-diversity is the most natural extension of the fundamental biodiversity number and discuss strategies for its estimation. Furthermore, we develop novel theory and methods starting with an Aldous-Pitman (AP) process, which serves as the building block for any Gibbs-type prior with a square-root growth rate. We propose a modeling framework that accommodates the hierarchical structure of Linnean taxonomy, offering a more refined approach to quantifying biodiversity. The analysis of a large and comprehensive dataset on Amazon tree flora provides a motivating application.
... We use publicly available static network data, specifically: (1) the London juvenile gang network [19], (2) the 'Ndrangheta network [20], (3) the New York cocaine trafficking ring [21], and (4) the Madrid train bombing terrorist networks [22]. These networks were selected to provide diversity in size, topology, and organizational goals, allowing for a comparative examination of intervention effectiveness (see Table 1). ...
Preprint
Full-text available
Criminal networks such as human trafficking rings are threats to the rule of law, democracy and public safety in our global society. Network science provides invaluable tools to identify key players and design interventions for Law Enforcement Agencies (LEAs), e.g., to dismantle their organisation. However, poor data quality and the adaptiveness of criminal networks through self-organization make effective disruption extremely challenging. Although there exists a large body of work building and applying network scientific tools to attack criminal networks, these works often implicitly assume that the network measurements are accurate and complete. Moreover, there is thus far no comprehensive understanding of the impacts of data quality on the downstream effectiveness of interventions. This work investigates the relationship between data quality and intervention effectiveness based on classical graph theoretic and machine learning-based approaches. Decentralization emerges as a major factor in network robustness, particularly under conditions of incomplete data, which renders attack strategies largely ineffective. Moreover, the robustness of centralized networks can be boosted using simple heuristics, making targeted attacks even less feasible. Consequently, we advocate for a more cautious application of network science in disrupting criminal networks, the continuous development of an interoperable intelligence ecosystem, and the creation of novel network inference techniques to address data quality challenges.
... Compared to the Dirichlet process, Gibbs-type priors exhibit a predictive distribution which involves more information, that is, the sample size and the number of clusters (refer to the sufficientness postulates for Gibbs-type priors of Bacallado et al., 2017). The class of Gibbs-type priors encompasses BNP processes which are widely used, for instance in species sampling problems (Arbel et al., 2017; Cesari et al., 2014; Favaro et al., 2009, 2012; Lijoi et al., 2007b), survival analysis (Jara et al., 2010), network inference (Caron & Fox, 2017; Legramanti et al., 2022), linguistics (Teh & Jordan, 2010) and mixture modeling (Ishwaran & James, 2001; Lijoi et al., 2007a; Lijoi, Mena, & Prünster, 2005). Miller and Harrison (2018), Frühwirth-Schnatter et al. (2021) and Argiento and De Iorio (2022) study the connection between mixtures of finite mixtures and BNP mixtures with Gibbs-type priors. ...
Article
Full-text available
Bayesian nonparametric mixture models are common for modeling complex data. While these models are well‐suited for density estimation, recent results proved posterior inconsistency of the number of clusters when the true number of components is finite, for the Dirichlet process and Pitman–Yor process mixture models. We extend these results to additional Bayesian nonparametric priors such as Gibbs‐type processes and finite‐dimensional representations thereof. The latter include the Dirichlet multinomial process, the recently proposed Pitman–Yor, and normalized generalized gamma multinomial processes. We show that mixture models based on these processes are also inconsistent in the number of clusters and discuss possible solutions. Notably, we show that a postprocessing algorithm introduced for the Dirichlet process can be extended to more general models and provides a consistent method to estimate the number of components.
... Random partitions are integral to a variety of Bayesian clustering methods, with applications to text analysis (Blei et al. 2003; Blei 2012), genetics (Pritchard et al. 2000; Falush et al. 2003), entity resolution (Binette and Steorts 2022) and community detection (Legramanti et al. 2022), to name but a few. The most widely used random partition models are those based on Dirichlet processes and Pitman-Yor processes (Antoniak 1974; Sethuraman 1994; Ishwaran and James 2003), most notably the famed Chinese Restaurant Process (CRP; Aldous 1985). ...
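Since the CRP anchors so many of the works on this page, a minimal simulator (our illustration, not code from any cited paper) makes the seating rule concrete: item t + 1 joins existing cluster h with probability n_h / (t + α) and opens a new cluster with probability α / (t + α).

import numpy as np

def sample_crp(n, alpha, rng=None):
    # Draw one partition of n items from a Chinese Restaurant Process.
    rng = np.random.default_rng(rng)
    z = [0]                                   # the first item opens cluster 0
    for t in range(1, n):
        sizes = np.bincount(z)                # current cluster sizes n_h
        probs = np.append(sizes, alpha) / (t + alpha)
        z.append(int(rng.choice(len(probs), p=probs)))
    return np.array(z)

Under an ESC model, by contrast, no such sequential scheme is available, which is precisely the projectivity issue the abstract below addresses.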
Article
Full-text available
Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks and genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet Process or Pitman–Yor process mixture models, samples a priori from ESC models cannot be easily obtained in a sequential fashion and instead require the use of rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop faster methods for generating samples a priori from these models compared with the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work.
... In biology, networks encode which pairs of genes or proteins are co-expressed or are involved in the same pathways (Kovács et al. 2019). In the social sciences, networks arise naturally in the form of social network data (Granovetter 1973;Traud et al. 2012;Legramanti et al. 2022). ...
Preprint
Full-text available
Latent space models play an important role in the modeling and analysis of network data. Under these models, each node has an associated latent point in some (typically low-dimensional) geometric space, and network formation is driven by this unobserved geometric structure. The random dot product graph (RDPG) and its generalization (GRDPG) are latent space models under which this latent geometry is taken to be Euclidean. These latent vectors can be efficiently and accurately estimated using well-studied spectral embeddings. In this paper, we develop a minimax lower bound for estimating the latent positions in the RDPG and the GRDPG models under the two-to-infinity norm, and show that a particular spectral embedding method achieves this lower bound. We also derive a minimax lower bound for the related task of subspace estimation under the two-to-infinity norm that holds in general for low-rank plus noise network models, of which the RDPG and GRDPG are special cases. The lower bounds are achieved by a novel construction based on Hadamard matrices.
... (2019) extends this model by the use of a finite regime Pitman-Yor (PY) process (Pitman and Yor, 1997), which is reviewed in the online Appendix A.3. A further generalization is in Legramanti et al. (2022), in which the Gibbs-type prior formulation is used instead of the Pitman-Yor prior, for an application to criminal networks. ...
Preprint
Full-text available
Random graphs have been widely used in statistics, for example in network and social interaction analysis. In some applications, the data may contain an inherent hierarchical ordering among their vertices, which prevents any directed edge between pairs of vertices that do not respect this order. For example, in bibliometrics, older papers cannot cite newer ones. In such situations, the resulting graph forms a Directed Acyclic Graph. In this article, we propose an extension of the popular Stochastic Block Model (SBM) to account for the presence of a latent hierarchical ordering in the data. The proposed approach includes a topological ordering in the likelihood of the model, which allows a directed edge to have positive probability only if the corresponding pair of vertices respects the order. This latent ordering is treated as an unknown parameter and endowed with a prior distribution. We describe how to formalize the model and perform posterior inference for a Bayesian nonparametric version of the SBM in which both the hierarchical ordering and the number of latent blocks are learnt from the data. Finally, an illustration with a real-world dataset from bibliometrics is presented. Additional supplementary materials are available online.
... Compared to the Dirichlet process, Gibbs-type priors exhibit a predictive distribution which involves more information, that is, the sample size and the number of clusters (refer to the sufficientness postulates for Gibbs-type priors of Bacallado et al., 2017). The class of Gibbs-type priors encompasses BNP processes which are widely used, for instance in species sampling problems (Arbel et al., 2017; Cesari et al., 2014; Favaro et al., 2009, 2012; Lijoi et al., 2007b), survival analysis (Jara et al., 2010), network inference (Caron & Fox, 2017; Legramanti et al., 2022), linguistics (Teh & Jordan, 2010) and mixture modeling (Ishwaran & James, 2001; Lijoi et al., 2007a; Lijoi, Mena, & Prünster, 2005). Miller and Harrison (2018), Frühwirth-Schnatter et al. (2021) and Argiento and De Iorio (2022) study the connection between mixtures of finite mixtures and BNP mixtures with Gibbs-type priors. ...
Preprint
Full-text available
Bayesian nonparametric mixture models are common for modeling complex data. While these models are well-suited for density estimation, their application for clustering has some limitations. Miller and Harrison (2014) proved posterior inconsistency in the number of clusters when the true number of clusters is finite for Dirichlet process and Pitman–Yor process mixture models. In this work, we extend this result to additional Bayesian nonparametric priors such as Gibbs-type processes and finite-dimensional representations of them. The latter include the Dirichlet multinomial process and the recently proposed Pitman–Yor and normalized generalized gamma multinomial processes. We show that mixture models based on these processes are also inconsistent in the number of clusters and discuss possible solutions. Notably, we show that a post-processing algorithm introduced by Guha et al. (2021) for the Dirichlet process extends to more general models and provides a consistent method to estimate the number of components.
Article
Criminal networks arise from the attempt to balance a need of establishing frequent ties among affiliates to facilitate coordination of illegal activities, with the necessity to sparsify the overall connectivity architecture to hide from law enforcement. This efficiency-security trade-off is also combined with the creation of groups of redundant criminals that exhibit similar connectivity patterns, thus guaranteeing resilient network architectures. State-of-the-art models for such data are not designed to infer these unique structures. In contrast to such solutions, we develop a tractable Bayesian zero-inflated Poisson stochastic block model (ZIP–SBM), which identifies groups of redundant criminals having similar connectivity patterns, and infers both overt and covert block interactions within and across these groups. This is accomplished by modelling the weighted ties (corresponding to counts of interactions among pairs of criminals) via zero-inflated Poisson distributions with block-specific parameters that quantify complex patterns in the excess of zero ties in each block (security) relative to the distribution of the observed weighted ties within that block (efficiency). The performance of ZIP–SBM is illustrated in simulations and in a study of summit co-attendances in a complex Mafia organization, where we unveil efficiency-security structures adopted by the criminal organization that were hidden to previous analyses.
Article
Advances in next‐generation sequencing technology have enabled the high‐throughput profiling of metagenomes and accelerated microbiome studies. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co‐occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high‐dimensional and compositional, suffering from uneven sampling depth, over‐dispersion, and zero‐inflation. These characteristics often challenge the reliability of the current methods for microbiome community detection. To study the microbiome co‐occurrence network and perform community detection, we propose a generalized Bayesian stochastic block model that is tailored for microbiome data analysis where the data are transformed using the recently developed modified centered‐log ratio transformation. Our model also allows us to leverage taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred by using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non‐informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women. To the best of our knowledge, this is the first time the urinary microbiome co‐occurrence network structure in postmenopausal women has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies.
Article
Complex networks make it possible to represent and characterize the interactions between entities in various complex systems, which widely exist in the real world and usually generate vast amounts of data about all the elements, their behaviors and their interactions over time. Studies concentrating on new network analysis approaches and methodologies are vital because of the diversity and ubiquity of complex networks. The stochastic block model (SBM), based on Bayesian theory, is a statistical network model. SBMs are essential tools for analyzing complex networks since they have the advantages of interpretability, expressiveness, flexibility and generalization. Thus, designing diverse SBMs and their learning algorithms for various networks has become an intensively researched topic in network analysis and data mining. In this paper, we review, in a comprehensive and in-depth manner, SBMs for different types of networks (i.e., model extensions), existing methods (including parameter estimation and model selection) for learning optimal SBMs for given networks, and SBMs combined with deep learning. Finally, we provide an outlook on future research directions for SBMs.
Article
Full-text available
The interplay between (criminal) organizations and (law enforcement) disruption strategies is critical in criminology and social network analysis. Like legitimate businesses, criminal enterprises thrive by fulfilling specific demands and navigating their unique challenges, including balancing operational visibility and security. This study aims at comprehending criminal networks’ internal dynamics, resilience to law enforcement interventions, and robustness to changes in external conditions. Using a model based on evolutionary game theory, we analyze these networks as collaborative assemblies of roles, considering expected costs, potential benefits, and the certainty of expected outcomes. Here, we show that criminal organizations exhibit strong hysteresis effects, with increased resilience and robustness once established, challenging the effectiveness of traditional law enforcement strategies focused on deterrence through increased punishment. The hysteresis effect defines optimal thresholds for the formation or dissolution of criminal organisations. Our findings indicate that interventions of similar magnitude can lead to vastly different outcomes depending on the existing state of criminality. This result suggests that the relationship between stricter punishment and its deterrent effect on organized crime is complex and sometimes non-linear. Furthermore, we demonstrate that network structure, specifically interconnectedness (link density) and assortativity of specialized skills, significantly influences the formation and stability of criminal organizations, underscoring the importance of considering social connections and the accessibility of roles in combating organized crime. These insights contribute to a deeper understanding of the systemic nature of criminal behavior from an evolutionary perspective and highlight the need for adaptive, strategic approaches in policy-making and law enforcement to disrupt criminal networks effectively.
Chapter
In analyzing brain networks, it is of notable interest to cluster together nodes, representing brain regions, that share the same connectivity patterns, i.e., common parameters for the generative process of the edges, which in turn represent connections among brain regions. Based on the neuroscience theory that neighboring regions are more likely to connect, the anatomical coordinates of each region can be leveraged, together with edges, to guide the node partition, thus favoring clusters of neighboring regions with similar connectivity patterns. In light of this, to analyze the considered weighted brain network, we propose a two-fold generalization of the extended stochastic block model by [11]: (i) we adopt a Poisson likelihood for the edge weights, and (ii) we specify a spatial cohesion function that encourages neighboring regions to be clustered together. The performance of the proposed method on brain network data illustrates the potential gains of leveraging spatial node covariates in network clustering.
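The Poisson extension in (i) preserves the collapsed-sampler tractability of the binary ESBM as long as conjugacy is retained. Assuming, for illustration, independent Gamma(a, b) priors on the block-specific rates (the chapter's exact specification, including the spatial cohesion function, is richer), the weights y_1, ..., y_n of the n node pairs in a block integrate to

p(y_1, \dots, y_n) = \frac{b^{a}}{\Gamma(a)} \cdot \frac{\Gamma(a + s)}{(b + n)^{a + s}} \prod_{i=1}^{n} \frac{1}{y_i!}, \qquad s = \sum_{i=1}^{n} y_i,

which simply replaces, block by block, the Beta-Bernoulli marginal of the binary model.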
Article
Discrete random probability measures stand out as effective tools for Bayesian clustering. The investigation in the area has been very lively, with a strong emphasis on nonparametric procedures based on either the Dirichlet process or on more flexible generalizations, such as the normalized random measures with independent increments (NRMIs). The literature on finite-dimensional discrete priors is much more limited and mostly confined to the standard Dirichlet-multinomial model. While such a specification may be attractive due to conjugacy, it suffers from considerable limitations when it comes to addressing clustering problems. In order to overcome these, we introduce a novel class of priors that arise as the hierarchical compositions of finite-dimensional random discrete structures. Despite the analytical hurdles such a construction entails, we are able to characterize the induced random partition and determine explicit expressions of the associated urn scheme and of the posterior distribution. A detailed comparison with (infinite-dimensional) NRMIs is also provided: indeed, informative bounds for the discrepancy between the partition laws are obtained. Finally, the performance of our proposal over existing methods is assessed on a real application where we study a publicly available dataset from the Italian education system comprising the scores of a mandatory nationwide test.
Preprint
Full-text available
We consider the Bayesian mixture of finite mixtures (MFMs) and Dirichlet process mixture (DPM) models for clustering. Recent asymptotic theory has established that DPMs overestimate the number of clusters for large samples and that estimators from both classes of models are inconsistent for the number of clusters under misspecification, but the implications for finite sample analyses are unclear. The final reported estimate after fitting these models is often a single representative clustering obtained using an MCMC summarisation technique, but it is unknown how well such a summary estimates the number of clusters. Here we investigate these practical considerations through simulations and an application to gene expression data, and find that (i) DPMs overestimate the number of clusters even in finite samples, but only to a limited degree that may be correctable using appropriate summaries, and (ii) misspecification can lead to considerable overestimation of the number of clusters in both DPMs and MFMs, but results are nevertheless often still interpretable. We provide recommendations on MCMC summarisation and suggest that although the more appealing asymptotic properties of MFMs provide strong motivation to prefer them, results obtained using MFMs and DPMs are often very similar in practice.
Article
Full-text available
Finding a set of nested partitions of a dataset is useful for uncovering relevant structure at different scales, and is often dealt with using a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent variable model (DLVM) for which this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the partition. Addressing the known problem of sub-optimal local maxima found by greedy hill-climbing heuristics, we introduce a new hybrid algorithm based on a genetic algorithm which allows the space of solutions to be explored efficiently. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number K of clusters as well as the clusters themselves. Starting from this natural partition, the second step of the methodology is based on a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by considering the Dirichlet cluster proportion prior parameter α as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of α, enabling a simple functional form of the merge decision criterion. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies on simulated as well as real settings, and its results are shown to be particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper.
Article
Full-text available
Network data often exhibit block structures characterized by clusters of nodes with similar patterns of edge formation. When such relational data are complemented by additional information on exogenous node partitions, these sources of knowledge are typically included in the model to supervise the cluster assignment mechanism or to improve inference on edge probabilities. Although these solutions are routinely implemented, there is a lack of formal approaches to test if a given external node partition is in line with the endogenous clustering structure encoding stochastic equivalence patterns among the nodes in the network. To fill this gap, we develop a formal Bayesian testing procedure which relies on the calculation of the Bayes factor between a stochastic block model with known grouping structure defined by the exogenous node partition and an infinite relational model that allows the endogenous clustering configurations to be unknown, random and fully revealed by the block–connectivity patterns in the network. A simple Markov chain Monte Carlo method for computing the Bayes factor and quantifying uncertainty in the endogenous groups is proposed. This strategy is evaluated in simulations, and in applications studying brain networks of Alzheimer’s patients.
Article
Full-text available
Compared to other types of social networks, criminal networks present particularly hard challenges, due to their strong resilience to disruption, which poses severe hurdles to Law-Enforcement Agencies (LEAs). Herein, we borrow methods and tools from Social Network Analysis (SNA) to (i) unveil the structure and organization of Sicilian Mafia gangs, based on two real-world datasets, and (ii) gain insights as to how to efficiently reduce the Largest Connected Component (LCC) of two networks derived from them. Mafia networks have peculiar features in terms of the links distribution and strength, which makes them very different from other social networks, and extremely robust to exogenous perturbations. Analysts also face difficulties in collecting reliable datasets that accurately describe the gangs' internal structure and their relationships with the external world, which is why earlier studies are largely qualitative, elusive and incomplete. An added value of our work is the generation of two real-world datasets, based on raw data extracted from juridical acts, relating to a Mafia organization that operated in Sicily during the first decade of 2000s. We created two different networks, capturing phone calls and physical meetings, respectively. Our analysis simulated different intervention procedures: (i) arresting one criminal at a time (sequential node removal); and (ii) police raids (node block removal). In both the sequential and the node-block removal intervention procedures, the Betweenness centrality was the most effective strategy in prioritizing the nodes to be removed. For instance, when targeting the top 5% nodes with the largest Betweenness centrality, our simulations suggest a reduction of up to 70% in the size of the LCC. We also identified that, due to the peculiar type of interactions in criminal networks (namely, the distribution of the interactions' frequency), no significant differences exist between weighted and unweighted network analysis. Our work has significant practical applications for perturbing the operations of criminal and terrorist networks.
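The "police raid" (node block removal) experiment described above is easy to reproduce on any of the released networks. A toy sketch with networkx (our code, not the authors'; the sequential variant would instead recompute centralities after every single arrest):

import networkx as nx

def raid_by_betweenness(G, frac=0.05):
    # Remove the top `frac` share of nodes by betweenness centrality in one
    # block, and report the relative shrinkage of the largest connected
    # component (LCC); e.g. a return value of 0.7 means a 70% reduction.
    H = G.copy()
    lcc_before = max(len(c) for c in nx.connected_components(H))
    bc = nx.betweenness_centrality(H)
    k = max(1, int(frac * H.number_of_nodes()))
    H.remove_nodes_from(sorted(bc, key=bc.get, reverse=True)[:k])
    lcc_after = max((len(c) for c in nx.connected_components(H)), default=0)
    return 1 - lcc_after / lcc_before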
Article
Full-text available
Highly centralised enterprises run by criminals share similar traits which, if recognised, can help the criminal investigative process. While conducting a complex confederacy investigation, law enforcement agents should not only identify the key participants but also grasp the nature of the inter-connections between the criminals, in order to understand and determine the modus operandi of an illicit operation. We studied community detection in criminal networks using graph theory and formally introduce an algorithm that opens a new perspective on community detection compared to the traditional methods used to model the relations between objects. Community structure, generally described as densely connected nodes with similar patterns of links, is an important property of complex networks. Our method differs from traditional methods by allowing law enforcement agencies to compare the detected communities and thereby assume a different viewpoint of the criminal network; in the paper we compare our algorithm to the well-known Girvan-Newman method. We consider this method an alternative, or an addition, to the traditional community detection methods mentioned earlier, as the proposed algorithm allows, and assists in, the detection of different patterns and structures of the same community for enforcement agencies and researchers. Community detection in this setting has not been extensively researched; hence, we identified it as a research gap in this domain and developed a new method of criminal community detection.
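For reference, the Girvan-Newman baseline that the paper compares against is available in networkx; a short usage sketch on a stand-in graph follows (the authors' own algorithm is not reproduced here).

```python
# Usage sketch of the Girvan-Newman baseline, via networkx; the karate
# club graph stands in for a criminal network.
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
# girvan_newman yields successively finer partitions; inspect the first few.
for communities in itertools.islice(girvan_newman(G), 3):
    print([sorted(c) for c in communities])
```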
Article
Full-text available
There have been rapid developments in model-based clustering of graphs, also known as block modelling, over the last ten years or so. We review the different approaches and extensions proposed for different aspects of this area, such as the type of graph, the clustering approach, the inference approach, and whether the number of groups is selected or estimated. We also review models that combine block modelling with topic modelling and/or longitudinal modelling, with regard to how these models deal with multiple types of data. How different approaches cope with various issues is summarised and compared, to meet practitioners' demand for a concise overview of the current status of these areas of the literature.
Article
Full-text available
Data quality is considered to be among the greatest challenges in research on covert networks. This study identifies six aspects of network data collection, namely nodes, ties, attributes, levels, dynamics, and context. Addressing these aspects presents challenges, but also opens theoretical and methodological opportunities. Furthermore, specific issues arise in this research context, stemming from the use of secondary data and the problem of missing data. While each of these issues and challenges has some specific solution in the literature on organized crime and social networks, the main argument of this paper is for a more systematic and general approach to dealing with them. To this end, three potentially synergistic and combinable techniques are proposed, one for each stage of data collection – biographies for data extraction, graph databases for data storage, and checklists for data reporting. The paper concludes by discussing the use of statistical models to analyse covert networks, and the cultivation of relations within the research community and between researchers and practitioners.
Article
Full-text available
The stochastic block model (SBM) is a probabilistic model for community structure in networks. Typically, only the adjacency matrix is used to perform SBM parameter inference. In this paper, we consider circumstances in which nodes have an associated vector of continuous attributes that is also used to learn the node-to-community assignments and the corresponding SBM parameters. Although this assumption is not realistic for every application, our model assumes that the attributes associated with the nodes in a network's community can be described by a common multivariate Gaussian model. In this augmented, attributed SBM, the objective is to simultaneously learn the SBM connectivity probabilities along with the multivariate Gaussian parameters describing each community. Although there are recent examples in the literature that combine connectivity and attribute information to inform community detection, our model is the first augmented stochastic block model to handle multiple continuous attributes. In biological data, for example, this provides the flexibility to augment connectivity information with continuous measurements from multiple experimental modalities. Because the lack of labeled network data often makes community detection results difficult to validate, we highlight the usefulness of our model for two network prediction tasks: link prediction and collaborative filtering. As a result of fitting this attributed stochastic block model, one can predict the attribute vector or the connectivity pattern of a new node given the complementary source of information (connectivity or attributes, respectively). We also highlight two biological examples where the attributed stochastic block model provides satisfactory performance in the link prediction and collaborative filtering tasks.
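A minimal generative sketch of such an attributed SBM, with illustrative parameter values not taken from the paper: block-dependent Bernoulli edges plus one Gaussian attribute model per community.

```python
# Minimal generative sketch of an attributed SBM: block-dependent edge
# probabilities plus one multivariate Gaussian per community for the
# node attributes (illustrative parameter values, not the paper's).
import numpy as np

rng = np.random.default_rng(0)
n, K, d = 90, 3, 2
z = rng.integers(0, K, size=n)                    # community labels
Theta = np.full((K, K), 0.05) + np.eye(K) * 0.30  # assortative edge probs
mu = rng.normal(0, 3, size=(K, d))                # per-block attribute means

# Adjacency: Bernoulli draws with block-dependent probabilities.
P = Theta[z][:, z]
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T                    # undirected, no self-loops

# Attributes: Gaussian noise around the block mean.
X = mu[z] + rng.normal(0, 1, size=(n, d))
print(A.sum() // 2, "edges;", X.shape, "attribute matrix")
```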
Article
Full-text available
The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the analogues, in graph inference, of several canonical tenets of classical Euclidean inference: in particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.
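The workhorse of this paradigm, adjacency spectral embedding followed by clustering, can be sketched in a few lines; `dim` is an assumed embedding dimension and the toy SBM below is purely illustrative.

```python
# Minimal sketch of adjacency spectral embedding (ASE) plus clustering,
# the core pipeline of the RDPG paradigm; A is any symmetric adjacency
# matrix and `dim` an assumed embedding dimension.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def ase(A, dim):
    """Embed nodes as rows of U |S|^{1/2} from the top-`dim` eigenpairs."""
    vals, vecs = eigsh(A.astype(float), k=dim, which="LM")
    return vecs * np.sqrt(np.abs(vals))

# Toy two-block SBM to embed.
rng = np.random.default_rng(1)
n = 100
z = np.repeat([0, 1], n // 2)
P = np.where(z[:, None] == z[None, :], 0.3, 0.05)
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T

Xhat = ase(A, dim=2)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(Xhat)
print((labels == labels[0]).sum(), "nodes grouped with node 0")
```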
Article
Full-text available
The community structure that is observed in empirical networks has been of particular interest in the statistics literature, with a strong emphasis on the study of block models. We study an important network feature called node popularity, which is closely associated with community structure. Neither the classical stochastic block model nor its degree-corrected extension can satisfactorily capture the dynamics of node popularity as observed in empirical networks. We propose a popularity-adjusted block model for flexible and realistic modelling of node popularity. We establish consistency of likelihood modularity for community detection as well as estimation of node popularities and model parameters, and demonstrate the advantages of the new modularity over the degree-corrected block model modularity in simulations. By analysing the political blogs network, the British Members of Parliament network and the ‘Digital bibliography and library project’ bibliographical network, we illustrate that improved empirical insights can be gained through this methodology.
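In one common formalization of the popularity-adjusted block model (notation ours, so treat details as an assumption), each node i carries a popularity parameter λ_{ik} toward every community k and, with z_i denoting node i's community,

$$ \Pr(A_{ij}=1) \;=\; \lambda_{i z_j}\,\lambda_{j z_i}, $$

which nests the degree-corrected model (λ_{ik} = θ_i ω_{z_i k} for suitable ω) and the classical SBM (λ_{ik} constant within blocks).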
Article
Full-text available
Relational data are usually highly incomplete in practice, which inspires us to leverage side information to improve the performance of community detection and link prediction. This paper presents a Bayesian probabilistic approach that incorporates various kinds of node attributes, encoded in binary form, in relational models with a Poisson likelihood. Our method works flexibly with both directed and undirected relational networks. Inference can be done by efficient Gibbs sampling that leverages the sparsity of both networks and node attributes. Extensive experiments show that our models achieve state-of-the-art link prediction results, especially with highly incomplete relational data.
Article
Full-text available
Typologies are intended to assist researchers in understanding complex social phenomena. This paper reviews the current literature on organised crime typologies and argues that the majority of organised crime typologies are reflected to some extent in a typology developed by the United Nations Office on Drugs and Crime (UNODC) in 2002. Organised crime typologies can be categorised into three groups: models that focus on the physical structure and operation of an organised crime group (OCG), models that focus on the activities of OCGs, and models that focus on the social, cultural and historical conditions that facilitate organised crime activity. This paper only discusses models that examine the physical structure and operation of an OCG; the UNODC typology is exclusively focused on structural elements. Typologies of organised crime structure have developed largely in isolation from each other and appear disparate. This paper analyses the formation of each typology to establish their individual elements. It then identifies which typologies, and which of their respective characteristics, can be aligned with or distinguished from the UN typology. The value of this review is that it will enable greater uniformity and consistency in academic discussion of organised crime typologies.
Article
Full-text available
Latent stochastic block models are flexible statistical models that are widely used in social network analysis. In recent years, efforts have been made to extend these models to temporal dynamic networks, whereby the connections between nodes are observed at a number of different times. In this paper we extend the original stochastic block model by using a Markovian property to describe the evolution of nodes' cluster memberships over time. We recast the problem of clustering the nodes of the network into a model-based context, and show that the integrated completed likelihood can be evaluated analytically for a number of likelihood models. Then, we propose a scalable greedy algorithm to maximise this quantity, thereby estimating both the optimal partition and the ideal number of groups in a single inferential framework. Finally we propose applications of our methodology to both real and artificial datasets.
Article
Full-text available
Criminal organizations tend to be clustered to reduce risks of detection and information leaks. Yet, the literature exploring the relevance of subgroups for their internal structure is so far very limited. The paper applies methods of community analysis to explore the structure of a criminal network representing the individuals’ co-participation in meetings. It draws from a case study on a large law enforcement operation (“Operazione Infinito”) tackling the ‘Ndrangheta, a mafia organization from Calabria, a southern Italian region. The results show that the network is indeed clustered and that communities are associated, in a non-trivial way, with the internal organization of the ‘Ndrangheta into different “locali” (similar to mafia families). Furthermore, the results of community analysis can improve the prediction of the “locale” membership of the criminals (up to two thirds of any random sample of nodes) and the leadership roles (above 90% precision in classifying nodes as either bosses or non-bosses). The implications of these findings on the interpretation of the structure and functioning of the criminal network are discussed.
Article
Full-text available
Community detection in networks is one of the most popular topics of modern network science. Communities, or clusters, are usually groups of vertices having higher probability of being connected to each other than to members of other groups, though other patterns are possible. Identifying communities is an ill-defined problem. There are no universal protocols on the fundamental ingredients, like the definition of community itself, nor on other crucial issues, like the validation of algorithms and the comparison of their performances. This has generated a number of confusions and misconceptions, which undermine the progress in the field. We offer a guided tour through the main aspects of the problem. We also point out strengths and weaknesses of popular methods, and give directions to their use.
Article
Full-text available
This is the authors' post-print version of Calderoni, Francesco, and Carlo Piccardi. 2014. “Uncovering the Structure of Criminal Organizations by Community Analysis: The Infinito Network.” In 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), 301–8. Marrakech: IEEE Computer Society Press. doi:10.1109/SITIS.2014.20. Criminal organizations tend to be clustered to reduce risks of detection and information leaks. Yet, the literature has so far neglected to explore the relevance of subgroups for their internal structure. The paper focuses on a case study drawing from a large law enforcement operation ('Operazione Infinito'). It applies methods of community analysis to explore the structure of a 'Ndrangheta (a mafia from Calabria, a southern Italian region) network representing the individuals' co-participation in meetings. The results show that the network is significantly clustered and that communities are partially associated with the internal organization of the 'Ndrangheta into different 'locali' (similar to mafia families). The implications of these findings on the interpretation of the structure and functioning of the criminal network are discussed.
Article
Full-text available
Many methods have been proposed for community detection in networks, but most of them do not take into account additional information on the nodes that is often available in practice. In this paper, we propose a new joint community detection criterion that uses both the network edge information and the node features to detect community structures. One advantage our method has over existing joint detection approaches is the flexibility of learning the impact of different features which may differ across communities. Another advantage is the flexibility of choosing the amount of influence the feature information has on communities. The method is asymptotically consistent under the block model with additional assumptions on the feature distributions, and performs well on simulated and real networks.
Article
Full-text available
For many networks of scientific interest we know both the connections of the network and information about the network nodes, such as the age or gender of individuals in a social network, geographic location of nodes in the Internet, or cellular function of nodes in a gene regulatory network. Here we demonstrate how this "metadata" can be used to improve our analysis and understanding of network structure. We focus in particular on the problem of community detection in networks and develop a mathematically principled approach that combines a network and its metadata to detect communities more accurately than can be done with either alone. Crucially, the method does not assume that the metadata are correlated with the communities we are trying to find. Instead the method learns whether a correlation exists and correctly uses or ignores the metadata depending on whether they contain useful information. The learned correlations are also of interest in their own right, allowing us to make predictions about the community membership of nodes whose network connections are unknown. We demonstrate our method on synthetic networks with known structure and on real-world networks, large and small, drawn from social, biological, and technological domains.
Article
Full-text available
Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to classical algorithms which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficulties in visualizing it add to this problem. In a Bayesian analysis, the posterior of a real-valued parameter of interest is often summarized by reporting a point estimate such as the posterior mean along with 95% credible intervals to characterize uncertainty. In this paper, we extend these ideas to develop appropriate point estimates and credible sets to summarize the posterior of clustering structure based on decision and information theoretic techniques.
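One standard recipe of this kind, sketched below under the assumption that MCMC label draws are already available: form the posterior similarity matrix and report the sampled partition minimizing Binder's loss, a simple member of the decision-theoretic family the paper develops.

```python
# Minimal sketch of summarizing a posterior over partitions: build the
# posterior similarity matrix from MCMC draws and pick the sampled
# partition minimizing Binder's loss (up to constants).
import numpy as np

def similarity_matrix(draws):
    """draws: (S, n) array of sampled label vectors."""
    draws = np.asarray(draws)
    S, n = draws.shape
    psm = np.zeros((n, n))
    for z in draws:
        psm += (z[:, None] == z[None, :])
    return psm / S

def binder_point_estimate(draws):
    psm = similarity_matrix(draws)
    losses = []
    for z in np.asarray(draws):
        co = (z[:, None] == z[None, :])          # co-clustering matrix
        losses.append(np.abs(co - psm).sum())    # Binder loss, up to constants
    return np.asarray(draws)[int(np.argmin(losses))], psm

draws = [[0, 0, 1, 1], [0, 0, 1, 2], [0, 0, 1, 1], [1, 1, 0, 0]]
zhat, psm = binder_point_estimate(draws)
print(zhat)
print(psm)
```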
Article
Full-text available
Stochastic blockmodels and variants thereof are among the most widely used approaches to community detection for social networks and relational data. A stochastic blockmodel partitions the nodes of a network into disjoint sets, called communities. The approach is inherently related to clustering with mixture models; and raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution, however, for stochastic blockmodels, the conditional independence assumption given the communities of the endpoints among different edges is usually violated in practice. In this regard, we propose composite likelihood BIC (CL-BIC) to select the number of communities, and we show it is robust against possible misspecifications in the underlying stochastic blockmodel assumptions. We derive the requisite methodology and illustrate the approach using both simulated and real data. Supplementary materials containing the relevant computer code are available online.
Article
Full-text available
This article looks at three Italian mafia organizations (Cosa Nostra, Camorra, and ‘Ndrangheta). It applies an organizational approach to the understanding of violence in mafia organizations by studying the relationship between their organizational orders and their criminal behavior. The article identifies two different organizational orders, vertical and horizontal, and demonstrates that Italian mafias, although operating in similar environments, can greatly differ from each other, and over time, in terms of their organizational model. Findings suggest that mafias with a vertical organizational order, due to the presence of higher levels of coordination, (1) have greater control over conflict, as shown by the lower number of “ordinary” murders; and (2) have greater capacity to fight state repression, as shown by the greater number of “high-profile” assassinations (e.g. of politicians, magistrates, and other institutional members) that they carry out. Evidence is provided using a mixed-methods approach that combines a qualitative, organizational analysis of historical and judiciary sources, in order to reconstruct the organizational models and their evolution over time, with a quantitative analysis of assassination trends, in order to relate organizational orders to the use of violence.
Article
Full-text available
The integrated likelihood (also called the marginal likelihood or the normalizing constant) is a central quantity in Bayesian model selection and model averaging. It is defined as the integral over the parameter space of the likelihood times the prior density. The Bayes factor for model comparison and Bayesian testing is a ratio of integrated likelihoods, and the model weights in Bayesian model averaging are proportional to the integrated likelihoods. We consider the estimation of the integrated likelihood from posterior simulation output, aiming at a generic method that uses only the likelihoods from the posterior simulation iterations. The key is the harmonic mean identity, which says that the reciprocal of the integrated likelihood is equal to the posterior harmonic mean of the likelihood. The simplest estimator based on the identity is thus the harmonic mean of the likelihoods. While this is an unbiased and simulation-consistent estimator, its reciprocal can have infinite variance and so it is unstable in general. We describe two methods for stabilizing the harmonic mean estimator. In the first one, the parameter space is reduced in such a way that the modified estimator involves a harmonic mean of heavier-tailed densities, thus resulting in a finite-variance estimator.
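In generic notation, the identity and the resulting simple estimator read

$$ \frac{1}{p(y)} \;=\; \mathbb{E}\!\left[\left.\frac{1}{p(y\mid\theta)}\,\right|\,y\right] \qquad\Longrightarrow\qquad \hat{p}(y) \;=\; \left(\frac{1}{S}\sum_{s=1}^{S}\frac{1}{p\!\left(y\mid\theta^{(s)}\right)}\right)^{-1}, $$

where θ^(1), …, θ^(S) are draws from the posterior distribution.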
Article
Full-text available
The study of criminal networks using traces from heterogeneous communication media is acquiring increasing importance in today's society. The use of communication media such as phone calls and online social networks leaves digital traces in the form of metadata that can be used for this type of analysis. The goal of this work is twofold: first, we provide a theoretical framework for the problem of detecting and characterizing criminal organizations in networks reconstructed from phone call records. Then, we introduce an expert system to support law enforcement agencies in the task of unveiling the underlying structure of criminal networks hidden in communication data. This platform allows for statistical network analysis, community detection and visual exploration of mobile phone network data. It allows forensic investigators to deeply understand hierarchies within criminal organizations, discovering members who play a central role and provide connections among sub-groups. Our work concludes by illustrating the adoption of our computational framework for a real-world criminal investigation.
Article
Full-text available
Social network analysis is the study of how links between a set of actors are formed. Typically, it is believed that links are formed in a structured manner, which may be due to, for example, political or material incentives, and which often may not be directly observable. The stochastic blockmodel represents this structure using latent groups which exhibit different connective properties, so that conditional on the group membership of two actors, the probability of a link being formed between them is represented by a connectivity matrix. The mixed membership stochastic blockmodel extends this model to allow actors membership of different groups, depending on the interaction in question, providing further flexibility. Attribute information can also play an important role in explaining network formation. Network models which do not explicitly incorporate covariate information require the analyst to compare fitted network models to additional attributes in a post-hoc manner. We introduce the mixed membership of experts stochastic blockmodel, an extension to the mixed membership stochastic blockmodel which incorporates covariate actor information into the existing model. The method is illustrated with application to the Lazega Lawyers dataset. Model and variable selection methods are also discussed.
Chapter
The Valencia International Meetings on Bayesian Statistics, held every four years, provide the main forum for researchers in the area of Bayesian Statistics to come together to present and discuss frontier developments in the field. Covering a broad range of applications and models, including genetics, computer vision and computation, the resulting proceedings provide a definitive, up-to-date overview encompassing a wide range of theoretical and applied research. This eighth proceedings includes edited and refereed versions of 20 invited papers plus extensive and in-depth discussion along with 19 extended four page abstracts of the best presentations offering a wide perspective of the developments in Bayesian statistics over the last four years.
Article
Network studies of organized crime (OC) normally explore two key relational issues: the internal structure of groups and the interactions among groups. The paper first discusses in depth two data sources that have been used to address these questions – phone wiretaps and police-generated “events” – and reviews issues of validity, reliability and sampling. Next, it discusses challenges related to OC network data in general, focusing on the ‘double boundary specification’ problem and the time span of data collection. We conclude by arguing that structural analysis cannot be divorced from a deep contextual (qualitative) knowledge of the cases. The paper refers to concrete research dilemmas and solutions faced by scholars, including ourselves.
Book
The field of community detection has been expanding greatly since the 1980s, with a remarkable diversity of models and algorithms developed in different communities like machine learning, computer science, network science, social science, and statistical physics. Various fundamental questions remain nonetheless unsettled, such as: Are there really communities? Algorithms may output community structures, but are these meaningful or artefacts? Can we always extract the communities when they are present; fully, partially? And what is a good benchmark to measure the performance of algorithms, and how good are the current algorithms? This monograph describes recent developments aiming at answering these questions in the context of block models. Addressing the issues from an information-theoretic viewpoint, the author gives a comprehensive description of the historical and recent work that has led to key new concepts in the various recovery requirements for community detection. The monograph provides a compact introduction to community detection, which enables the reader to apply these techniques in applications such as understanding sociological behavior, protein-protein interactions, gene expression, recommendation systems, medical prognosis, DNA 3D folding, image segmentation, natural language processing, product-customer segmentation, webpage sorting, and many more.
Article
Over the past decade, a considerable literature has emerged within criminology stemming from the collection of social network data and the adoption of social network analysis by a cadre of scholars. We review recent contributions to four areas of crime research: co-offending networks, illicit networks, gang-rivalry networks, and neighborhoods and crime. Our review highlights potential pitfalls that one might encounter when using social networks in criminological research and points to fruitful directions for further research. In particular, we recommend paying special attention to the clear specifications of what ties in the network are assumed to be doing, potential measurement weaknesses that can arise when using police or investigative data to construct a network, and understanding dynamic social network processes related to criminological outcomes. We envision a bright future in which the social network perspective will be more fully integrated into criminological theories, analyses, and applications.
Article
Brokerage is crucial for dark networks. In analyzing communications among criminals, which naturally induce bipartite networks, previous studies have focused on classic Freeman betweenness, conceived for one-mode matrices and thus possibly biasing the results. We explore different betweenness centrality measures, including three inspired by the dual projection approach recently suggested by Everett and Borgatti (2013). We test these measures in identifying criminal leaders in a meeting participation network. Despite the expected high correlations among them, the measures yield different node rankings, capturing different characteristics of brokerage. Overall, the dual projection approaches are more successful than classic approaches in identifying the criminal leaders.
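A rough sketch of the dual-projection idea, using networkx and a toy actor-event graph; the combination rule at the end is a crude illustration, not one of the paper's three measures.

```python
# Minimal sketch of the dual-projection idea: project the bipartite
# actor-event graph onto actors and onto events, compute betweenness in
# each projection, and combine the scores (rough illustration only).
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
actors = ["a1", "a2", "a3", "a4"]
events = ["m1", "m2", "m3"]
B.add_nodes_from(actors, bipartite=0)
B.add_nodes_from(events, bipartite=1)
B.add_edges_from([("a1", "m1"), ("a2", "m1"), ("a2", "m2"),
                  ("a3", "m2"), ("a3", "m3"), ("a4", "m3")])

GA = bipartite.projected_graph(B, actors)   # actor-actor co-attendance
GE = bipartite.projected_graph(B, events)   # event-event shared actors

bc_actors = nx.betweenness_centrality(GA)
bc_events = nx.betweenness_centrality(GE)
# One crude combination: an actor's score plus the scores of its events.
combined = {a: bc_actors[a] + sum(bc_events[e] for e in B[a]) for a in actors}
print(sorted(combined, key=combined.get, reverse=True))
```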
Article
We consider the analysis of spectral clustering algorithms for community detection under a stochastic block model (SBM). A general spectral clustering algorithm consists of three steps: (1) regularization of an appropriate adjacency or Laplacian matrix, (2) a form of spectral truncation, and (3) a k-means type algorithm in the reduced spectral domain. By varying each step, one can obtain different spectral algorithms. In light of recent developments in refining consistency results for spectral clustering, we identify the necessary bounds at each of these three steps, and then derive and compare consistency results for some existing spectral algorithms as well as a new variant that we propose. The focus of the paper is on providing a better understanding of the analysis of spectral methods for community detection, with an emphasis on the bipartite setting, which has received less theoretical consideration. We show how the variations in the spectral truncation step are reflected in the consistency results under a general SBM. We also investigate the necessary bounds for the k-means step in some detail, allowing one to replace this step with any algorithm (k-means type or otherwise) that guarantees the necessary bound. We discuss some of the neglected aspects of the bipartite setting, e.g., the role of the mismatch between the communities of the two sides on the performance of spectral methods. Finally, we show how the consistency results can be extended beyond SBMs to the problem of clustering inhomogeneous random graph models that can be approximated by SBMs in a certain sense.
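The three-step template is compact enough to sketch directly; below, the regularization parameter `tau` and the number of communities `K` are assumed inputs, and the planted model is illustrative only.

```python
# Minimal sketch of the three-step template: (1) regularize the
# Laplacian, (2) truncate the spectrum, (3) k-means in the reduced
# spectral domain; `tau` and `K` are assumed inputs.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def regularized_spectral_clustering(A, K, tau=None):
    A = A.astype(float)
    d = A.sum(1)
    if tau is None:
        tau = d.mean()                     # a common regularization choice
    Dinv = 1.0 / np.sqrt(d + tau)
    L = Dinv[:, None] * A * Dinv[None, :]  # step 1: regularized normalization
    _, vecs = eigsh(L, k=K, which="LA")    # step 2: spectral truncation
    rows = vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(rows)  # step 3

rng = np.random.default_rng(2)
n = 120
z = np.repeat([0, 1, 2], n // 3)
P = np.where(z[:, None] == z[None, :], 0.25, 0.04)
A = np.triu((rng.random((n, n)) < P).astype(float), 1); A = A + A.T
print(regularized_spectral_clustering(A, K=3)[:12])
```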
Article
Statistical theory has mostly focused on static networks observed as a single snapshot in time. In reality, networks are generally dynamic, and it is of substantial interest to discover the clusters within each network to visualize and model their connectivities. We propose the persistent communities by eigenvector smoothing algorithm for detecting time-varying community structure and apply it to a recent dataset in which gene expression is measured during a broad range of developmental periods in rhesus monkey brains. The analysis suggests the existence of change points as well as periods of persistent community structure; these are not well estimated by standard methods due to the small sample size of any one developmental period or region of the brain.
Article
Actor-event data are common in sociological settings, whereby one registers the pattern of attendance of a group of social actors to a number of events. We focus on 79 members of the Noordin Top terrorist network, who were monitored attending 45 events. The attendance or non-attendance of the terrorists at events defines the social fabric, such as group coherence and social communities. The aim of the analysis of such data is to learn about this social structure. Actor-event data are often transformed to actor-actor data in order to be further analysed by network models, such as stochastic block models. This transformation, and such analyses, leads to a natural loss of information, particularly when one is interested in identifying, possibly overlapping, subgroups or communities of actors on the basis of their attendances to events. In this paper we propose an actor-event model for overlapping communities of terrorists, which simplifies interpretation of the network. We propose a mixture model with overlapping clusters for the analysis of the binary actor-event network data, called manet, and develop a Bayesian procedure for inference. After a simulation study, we show how this analysis of the terrorist network has clear interpretative advantages over the more traditional approaches of network analysis.
Article
Evaluating the marginal likelihood in Bayesian analysis is essential for model selection. Estimators based on a single Markov chain Monte Carlo sample from the posterior distribution include the harmonic mean estimator and the inflated density ratio estimator. We propose a new class of Monte Carlo estimators based on this single Markov chain Monte Carlo sample. This class can be thought of as a generalization of the harmonic mean and inflated density ratio estimators using a partition weighted kernel (likelihood times prior). We show that our estimator is consistent and has better theoretical properties than the harmonic mean and inflated density ratio estimators. In addition, we provide guidelines on choosing optimal weights. Simulation studies were conducted to examine the empirical performance of the proposed estimator. We further demonstrate the desirable features of the proposed estimator with two real data sets: one is from a prostate cancer study using an ordinal probit regression model with latent variables; the other is for the power prior construction from two Eastern Cooperative Oncology Group phase III clinical trials using the cure rate survival model with similar objectives.
Article
Many models and methods are now available for network analysis, but model selection and tuning remain challenging. Cross-validation is a useful general tool for these tasks in many settings, but is not directly applicable to networks since splitting network nodes into groups requires deleting edges and destroys some of the network structure. Here we propose a new network cross-validation strategy based on splitting edges rather than nodes, which avoids losing information and is applicable to a wide range of network problems. We provide a theoretical justification for our method in a general setting, and in particular show that the method has good asymptotic properties under the stochastic block model. Numerical results on both simulated and real networks show that our approach performs well for a number of model selection and parameter tuning tasks.
Article
Social networks exhibit two key topological features: global sparsity and local density. That is, the overall propensity for interaction between any two randomly selected actors is infinitesimal, but for any given individual there is massive variability in the propensity to interact with others in the network. Further, the relevant scientific questions typically depend on the scale of analysis. In this paper, we propose a class of network models that represent network structure on multiple scales and enable statistical inference about relevant population-level parameters. Specifically, we capture global graph structure using a mixture of projective models that capture local graph structures. This approach allows us to differentially invest modeling effort within subgraphs of high density, often termed communities, while maintaining a parsimonious structure between said subgraphs. We illustrate the utility of our method using data on household relations from Karnataka, India.
Article
A fundamental problem in network analysis is clustering the nodes into groups, each of which shares a similar connectivity pattern. Existing algorithms for community detection assume the knowledge of the number of clusters or estimate it a priori using various selection criteria and subsequently estimate the community structure. Ignoring the uncertainty in the first stage may lead to erroneous clustering, particularly when the community structure is vague. We instead propose a coherent probabilistic framework (MFM-SBM) for simultaneous estimation of the number of communities and the community structure, adapting recently developed Bayesian nonparametric techniques to network models. An efficient Markov chain Monte Carlo (MCMC) algorithm is proposed which obviates the need to perform reversible jump MCMC on the number of clusters. The methodology is shown to outperform recently developed community detection algorithms in a variety of synthetic data examples and in benchmark real datasets. We derive non-asymptotic bounds on the marginal posterior probability of the true configuration, and subsequently use them to prove a clustering consistency result which is, to the best of our knowledge, novel in the Bayesian context.
Article
In this paper we propose a conceptually straightforward method to estimate the marginal data density (also called the marginal likelihood). We show that the marginal likelihood is equal to the prior mean of the conditional density of the data given the vector of parameters restricted to a certain subset of the parameter space, A, times the reciprocal of the posterior probability of the subset A. This identity motivates the use of an Arithmetic Mean estimator based on simulation from the prior distribution restricted to any (reasonable) subset of the parameter space. By trimming this space, regions of relatively low likelihood are removed, thereby improving the efficiency of the Arithmetic Mean estimator. We show that the adjusted Arithmetic Mean estimator is unbiased and consistent.
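In generic notation, the identity underlying the estimator is

$$ p(y) \;=\; \frac{\mathbb{E}_{\theta\sim p(\theta)}\!\left[\,p(y\mid\theta)\,\mathbf{1}_{A}(\theta)\,\right]}{\Pr(\theta\in A\mid y)}, $$

so one averages the likelihood over prior draws restricted to A and divides by an estimate of the posterior mass of A; trimming A to exclude low-likelihood regions improves efficiency.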
Article
This article offers some remarks on a few critical issues related to explanation in criminal network research. It first discusses two distinct perspectives on networks, namely a substantive approach that views networks as a distinct form of organisation, and an instrumental one that interprets networks as a collection of nodes and attributes. The latter stands at the basis of Social Network Analysis. This work contends that the instrumental approach is better suited to test hypotheses, as it does not assume any structure a priori, but derives it from the data. Moreover, social network techniques can be applied to investigate criminal networks while rejecting the notion of networks as a distinct form of organisation. Next, the article discusses some potential pitfalls associated with the instrumental approach and cautions against an over-reliance on structural measures alone when interpreting real-world networks. It then stresses the need to complement these measures with additional qualitative evidence. Finally, the article discusses the use of Quadratic Assignment Procedure regression models as a viable strategy to test hypotheses based on criminal network data.
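A minimal sketch of the QAP idea mentioned above: correlate two relational matrices and build the reference distribution by permuting the rows and columns of one matrix with a common node relabelling (a generic illustration, not tied to any specific criminal-network dataset).

```python
# Minimal sketch of a Quadratic Assignment Procedure (QAP) test:
# correlate two n x n relational matrices, then rebuild the null by
# permuting rows and columns of one matrix with the same permutation.
import numpy as np

def qap_correlation_test(X, Y, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    off = ~np.eye(n, dtype=bool)                 # ignore the diagonal
    obs = np.corrcoef(X[off], Y[off])[0, 1]
    null = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(n)
        Yp = Y[np.ix_(p, p)]                     # relabel nodes jointly
        null[b] = np.corrcoef(X[off], Yp[off])[0, 1]
    pval = (np.abs(null) >= abs(obs)).mean()
    return obs, pval

rng = np.random.default_rng(1)
X = rng.random((30, 30)); X = (X + X.T) / 2
Y = X + rng.normal(0, 0.2, X.shape); Y = (Y + Y.T) / 2
print(qap_correlation_test(X, Y, n_perm=500))
```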
Article
In this paper we present the results of a study of the Sicilian Mafia organization using Social Network Analysis. The study investigates the network structure of a Mafia organization, describing its evolution and highlighting its plasticity to interventions targeting membership and its resilience to disruption caused by police operations. We analyze two different datasets about Mafia gangs, built by examining different digital trails and judicial documents spanning a period of ten years: the former dataset includes the phone contacts among suspected individuals, while the latter consists of the relationships among individuals actively involved in various criminal offenses. Our report illustrates the limits of traditional investigation methods like tapping: criminals high up in the organization hierarchy do not occupy the most central positions in the criminal network, and oftentimes do not appear in the reconstructed criminal network at all. However, we also suggest possible strategies of intervention, as we show that although criminal networks (i.e., the network encoding mobsters and crime relationships) are extremely resilient to different kinds of attacks, contact networks (i.e., the network reporting suspects and reciprocated phone calls) are much more vulnerable and their analysis can yield extremely valuable insights.
Article
Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of variance allocation models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.
Article
Community detection is a fundamental problem in network analysis with many methods available to estimate communities. Most of these methods assume that the number of communities is known, which is often not the case in practice. We propose a simple and very fast method for estimating the number of communities based on the spectral properties of certain graph operators, such as the non-backtracking matrix and the Bethe Hessian matrix. We show that the method performs well under several models and a wide range of parameters, and is guaranteed to be consistent under several asymptotic regimes. We compare the new method to several existing methods for estimating the number of communities and show that it is both more accurate and more computationally efficient.
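A common concrete instance of this recipe uses the Bethe Hessian H(r) = (r² − 1)I − rA + D and counts its negative eigenvalues at r = sqrt(average degree); the sketch below follows that standard recipe and is not necessarily the paper's exact estimator.

```python
# Minimal sketch of spectral estimation of the number of communities via
# the Bethe Hessian H(r) = (r^2 - 1) I - r A + D, counting its negative
# eigenvalues at r = sqrt(average degree); a common recipe in this
# literature, not necessarily the exact estimator of the paper.
import numpy as np

def estimate_K_bethe_hessian(A):
    A = A.astype(float)
    n = A.shape[0]
    d = A.sum(1)
    r = np.sqrt(d.mean())
    H = (r**2 - 1) * np.eye(n) - r * A + np.diag(d)
    return int((np.linalg.eigvalsh(H) < 0).sum())

rng = np.random.default_rng(3)
n = 300
z = np.repeat([0, 1, 2], n // 3)
P = np.where(z[:, None] == z[None, :], 0.20, 0.02)
A = np.triu((rng.random((n, n)) < P).astype(float), 1); A = A + A.T
print(estimate_K_bethe_hessian(A))   # ideally 3 for this planted model
```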
Article
Extracting communities using existing community detection algorithms yields dense sub-networks that are difficult to analyse. Extracting a smaller sample that embodies the relationships of a list of suspects is an important part of the beginning of an investigation. In this paper, we present the efficacy of our shortest paths network search algorithm (SPNSA), which begins with an "algorithm feed", a small subset of nodes of particular interest, and builds an investigative sub-network. The algorithm feed may consist of known criminals or suspects, or persons of influence. This sets our approach apart from existing community detection algorithms. We apply the SPNSA to the Enron Dataset of e-mail communications, starting with those convicted of money laundering in relation to the collapse of Enron as the algorithm feed. The algorithm produces sparse and small sub-networks that could feasibly identify a list of persons and relationships to be further investigated. In contrast, we show that identifying sub-networks of interest using either community detection algorithms or a k-Neighbourhood approach produces sub-networks of much larger size and complexity. When the 18 top managers of Enron were used as the algorithm feed, the resulting sub-network identified 4 convicted criminals that were not managers and so not part of the algorithm feed. We also directly tested the SPNSA by removing one of the convicted criminals from the algorithm feed and re-running the algorithm; in 5 out of 9 cases the left-out criminal appeared in the resulting sub-network.
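A rough reading of the idea, not the authors' exact algorithm: keep the union of shortest paths between every pair of feed nodes as the investigative sub-network.

```python
# Minimal sketch in the spirit of the SPNSA: starting from a small
# "algorithm feed" of suspects, keep the union of shortest paths between
# every pair of feed nodes (a rough reading of the idea, not the
# authors' exact algorithm).
import itertools
import networkx as nx

def shortest_path_subnetwork(G, feed):
    keep = set(feed)
    for u, v in itertools.combinations(feed, 2):
        if nx.has_path(G, u, v):
            keep.update(nx.shortest_path(G, u, v))
    return G.subgraph(keep).copy()

G = nx.les_miserables_graph()
feed = ["Valjean", "Javert", "Cosette"]      # stand-in suspect list
sub = shortest_path_subnetwork(G, feed)
print(sub.number_of_nodes(), "nodes,", sub.number_of_edges(), "edges")
```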
Article
A natural Bayesian approach for mixture models with an unknown number of components is to take the usual finite mixture model with Dirichlet weights, and put a prior on the number of components---that is, to use a mixture of finite mixtures (MFM). While inference in MFMs can be done with methods such as reversible jump Markov chain Monte Carlo, it is much more common to use Dirichlet process mixture (DPM) models because of the relative ease and generality with which DPM samplers can be applied. In this paper, we show that, in fact, many of the attractive mathematical properties of DPMs are also exhibited by MFMs---a simple exchangeable partition distribution, restaurant process, random measure representation, and in certain cases, a stick-breaking representation. Consequently, the powerful methods developed for inference in DPMs can be directly applied to MFMs as well. We illustrate with simulated and real data, including high-dimensional gene expression data.
Article
The stochastic block model (SBM) provides a popular framework for modeling community structures in networks. However, more attention has been devoted to problems concerning estimating the latent node labels and the model parameters than the issue of choosing the number of blocks. We consider an approach based on the log likelihood ratio statistic and analyze its asymptotic properties under model misspecification. We show the limiting distribution of the statistic in the case of underfitting is normal and obtain its convergence rate in the case of overfitting. These conclusions remain valid in the dense and semi-sparse regimes. The results enable us to derive the correct order of the penalty term for model complexity and arrive at a likelihood-based model selection criterion that is asymptotically consistent. In practice, the likelihood function can be estimated by more computationally efficient variational methods, allowing the criterion to be applied to moderately large networks.
Article
In a 1935 paper and in his book Theory of Probability, Jeffreys developed a methodology for quantifying the evidence in favor of a scientific theory. The centerpiece was a number, now called the Bayes factor, which is the posterior odds of the null hypothesis when the prior probability on the null is one-half. Although there has been much discussion of Bayesian hypothesis testing in the context of criticism of P-values, less attention has been given to the Bayes factor as a practical tool of applied statistics. In this article we review and discuss the uses of Bayes factors in the context of five scientific applications in genetics, sports, ecology, sociology, and psychology. We emphasize the following points:
Article
Biological and social systems consist of myriad interacting units. The interactions can be intuitively represented in the form of a graph or network. Measurements of these graphs can reveal the underlying structure of these interactions, which provides insight into the systems that generated the graphs. Moreover, in applications such as neuroconnectomics, social networks, and genomics, graph data is accompanied by contextualizing measures on each node. We leverage these node covariates to help uncover latent communities in a graph, using a modification of spectral clustering. Statistical guarantees are provided under a joint mixture model that we call the Node Contextualized Stochastic Blockmodel, including a bound on the mis-clustering rate. For most simulated conditions, covariate assisted spectral clustering yields superior results relative to both regularized spectral clustering without node covariates and an adaptation of canonical correlation analysis. We apply covariate assisted spectral clustering to large brain graphs derived from diffusion MRI data, using the node locations or neurological region membership as covariates. In both cases, covariate assisted spectral clustering yields clusters that are easier to interpret neurologically.
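A minimal sketch of the covariate-assisted operator: add a scaled covariate similarity α·XXᵀ to a regularized graph matrix and cluster its leading eigenvectors; α, τ and the planted model below are illustrative assumptions.

```python
# Minimal sketch of covariate-assisted spectral clustering: add a scaled
# covariate similarity alpha * X X^T to a regularized graph matrix and
# cluster the leading eigenvectors; `alpha` is a tuning weight.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def casc(A, X, K, alpha=0.1, tau=None):
    A = A.astype(float)
    d = A.sum(1)
    tau = d.mean() if tau is None else tau
    Dinv = 1.0 / np.sqrt(d + tau)
    L = Dinv[:, None] * A * Dinv[None, :]     # regularized graph term
    M = L + alpha * (X @ X.T)                 # covariate-assisted operator
    _, vecs = eigsh(M, k=K, which="LA")
    return KMeans(n_clusters=K, n_init=10).fit_predict(vecs)

rng = np.random.default_rng(4)
n, K = 90, 3
z = np.repeat(np.arange(K), n // K)
P = np.where(z[:, None] == z[None, :], 0.12, 0.06)   # weak graph signal
A = np.triu((rng.random((n, n)) < P).astype(float), 1); A = A + A.T
X = np.eye(K)[z] + rng.normal(0, 0.5, (n, K))        # informative covariates
print(casc(A, X, K)[:15])
```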
Article
The stochastic block model and its variants have been a popular tool in analyzing large network data with community structures. In this paper we develop an efficient network cross-validation (NCV) approach to determine the number of communities, as well as to choose between the regular stochastic block model and the degree-corrected block model. The proposed NCV method is based on a block-wise node-pair splitting technique, combined with an integrated step of community recovery using sub-blocks of the adjacency matrix. We prove that the probability of under-selection vanishes as the number of nodes increases, under mild conditions satisfied by a wide range of popular community recovery algorithms. The solid performance of our method is also demonstrated in extensive simulations and a data example.
Article
A two-parameter family of exchangeable partitions with a simple updating rule is introduced. The partition is identified with a randomized version of a standard symmetric Dirichlet species-sampling model with finitely many types. A power-like distribution for the number of types is derived.
Article
In this paper we analyze the asymptotic behaviour of Gibbs-type priors, which represent a natural generalization of the Dirichlet process. After determining their topological support, we investigate their consistency according to the “what if”, or frequentist, approach, which postulates the existence of a “true” distribution P₀. We provide a full taxonomy of their limiting behaviours: consistency holds essentially always for discrete P₀, whereas inconsistency may occur for diffuse P₀. Such findings are further illustrated by means of three special cases admitting closed form expressions and exhibiting a wide range of asymptotic behaviours. For both Gibbs-type priors and discrete nonparametric priors in general, the possible inconsistency should not be interpreted as evidence against their use tout court. It rather represents an indication that they are designed for modeling discrete distributions and evidence against their use in the case of diffuse P₀.
Article
Community detection algorithms are fundamental tools that allow us to uncover organizational principles in networks. When detecting communities, there are two possible sources of information one can use: the network structure, and the features and attributes of nodes. Even though communities form around nodes that have common edges and common attributes, typically, algorithms have only focused on one of these two data modalities: community detection algorithms traditionally focus only on the network structure, while clustering algorithms mostly consider only node attributes. In this paper, we develop Communities from Edge Structure and Node Attributes (CESNA), an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes. CESNA statistically models the interaction between the network structure and the node attributes, which leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure. CESNA has a linear runtime in the network size and is able to process networks an order of magnitude larger than comparable approaches. Last, CESNA also helps with the interpretation of detected communities by finding relevant node attributes for each community.