Statistics and Computing (2024) 34:72
https://doi.org/10.1007/s11222-024-10385-w
ORIGINAL PAPER
Fast generation of exchangeable sequences of clusters data
Keith Levin¹ · Brenda Betancourt²
¹ Department of Statistics, University of Wisconsin–Madison, 1300 University Ave, Madison, WI, USA
² Statistics & Data Science, NORC at the University of Chicago, 4350 East–West Highway, Bethesda, MD, USA
Correspondence: Keith Levin, kdlevin@wisc.edu; Brenda Betancourt, betancourt-brenda@norc.org
Received: 2 November 2023 / Accepted: 10 January 2024 / Published online: 7 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks and genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet Process or Pitman–Yor process mixture models, samples a priori from ESC models cannot be easily obtained in a sequential fashion and instead require the use of rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop faster methods for generating samples a priori from these models compared with the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work.
Keywords Random partition · Microclustering · Bell polynomials · Renewal theory
1 Introduction
Random partitions are integral to a variety of Bayesian clustering methods, with applications to text analysis (Blei et al. 2003; Blei 2012), genetics (Pritchard et al. 2000; Falush et al. 2003), entity resolution (Binette and Steorts 2022) and community detection (Legramanti et al. 2022), to name but a few. The most widely used random partition models are those based on Dirichlet processes and Pitman–Yor processes (Antoniak 1974; Sethuraman 1994; Ishwaran and James 2003), most notably the famed Chinese Restaurant Process (CRP; Aldous 1985). A drawback of these models is that they generate partitions in which one or more cells of the partition grows linearly in the number of observations n. This property is undesirable in applications to, for example, record linkage and social network modeling, where data commonly exhibit a large number of small clusters. For these applications, a different mechanism is needed that better captures the growth of cluster sizes with n.
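To make the linear-growth behavior concrete, the following minimal sketch (ours, not part of the original article) simulates CRP partitions in Python and reports the size of the largest cluster; the concentration parameter alpha and the seed are arbitrary illustrative choices. Under the CRP, the largest cluster occupies a roughly stable fraction of the n observations, i.e., it grows linearly in n.

```python
import random

def sample_crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Simulate cluster sizes from a Chinese Restaurant Process on n observations.

    Each new observation joins an existing cluster with probability proportional
    to that cluster's current size, or opens a new cluster with probability
    proportional to alpha.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        total = sum(sizes) + alpha
        r = rng.uniform(0, total)
        cum = 0.0
        for k, s in enumerate(sizes):
            cum += s
            if r < cum:
                sizes[k] += 1
                break
        else:
            sizes.append(1)  # observation opens a new cluster
    return sizes

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        sizes = sample_crp_cluster_sizes(n, alpha=1.0)
        # The ratio max(sizes) / n stays roughly stable as n grows: the largest
        # cluster grows linearly in n, the behavior microclustering models avoid.
        print(n, max(sizes), round(max(sizes) / n, 3))
```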
The solution to this issue is to deploy models with the microclustering property, whereby the size of the largest cluster grows sublinearly in the number of observations n. Early attempts in this direction appeared in Miller et al. (2015); Zanella et al. (2016). The authors were motivated by record linkage applications (Binette and Steorts 2022) where clusters are expected to remain small even as the number of observations increases. This initial class of models, constructed under the Kolchin representation of Gibbs partitions (Kolchin 1971), places a prior κ on the number of clusters K, and then draws from a distribution μ over cluster sizes conditional on K. This approach is comparatively simple, admitting an algorithm that facilitates sampling a priori and a posteriori similar to the CRP. Unfortunately, the distributions of the number of clusters and the size of a randomly chosen cluster are not straightforwardly related to the priors κ and μ. More to the point, it is not yet theoretically proven that this family of models indeed exhibits the microclustering property in general.
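As a rough illustration of this two-stage construction (a sketch under illustrative assumptions, not the algorithm of Miller et al. 2015 or Zanella et al. 2016), one can draw K from κ and then K cluster sizes independently from μ; here κ is taken to be a shifted Poisson and μ a geometric distribution purely for concreteness.

```python
import numpy as np

def sample_kolchin_partition(kappa_mean=5.0, mu_p=0.3, seed=0):
    """Two-stage (Kolchin-type) partition sampler: K ~ kappa, then cluster
    sizes N_1, ..., N_K drawn i.i.d. from mu given K.

    Illustrative choices only: kappa = 1 + Poisson(kappa_mean) so that K >= 1,
    and mu = Geometric(mu_p) supported on {1, 2, ...}.
    """
    rng = np.random.default_rng(seed)
    K = 1 + rng.poisson(kappa_mean)          # number of clusters
    sizes = rng.geometric(mu_p, size=K)      # i.i.d. cluster sizes
    labels = np.repeat(np.arange(K), sizes)  # one cluster label per observation
    rng.shuffle(labels)                      # exchangeable ordering of observations
    return sizes, labels

if __name__ == "__main__":
    sizes, labels = sample_kolchin_partition()
    print("K =", len(sizes), "n =", int(sizes.sum()),
          "sizes:", sorted(map(int, sizes), reverse=True))
```

Note that under this construction the total number of observations n = N_1 + ... + N_K is itself random, which is one reason the induced distributions of the number of clusters and of a randomly chosen cluster's size, given n, are not easy to read off from κ and μ.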
More recently, Betancourt et al. (2022) considered a different approach to microclustering, called Exchangeable Sequences of Clusters (ESC) models.
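Because ESC models are not projective, the abstract notes that a priori samples are typically obtained by rejection or importance sampling rather than sequentially. A minimal sketch of one natural rejection scheme consistent with that description is given below (our illustration, not the authors' implementation): cluster sizes are drawn i.i.d. from μ until their running sum reaches n, and the draw is accepted only if the sum hits n exactly; μ is taken to be geometric purely for illustration.

```python
import numpy as np

def sample_esc_sizes_rejection(n, mu_p=0.3, seed=0, max_tries=100_000):
    """Rejection sampler for cluster sizes drawn i.i.d. from mu and conditioned
    on summing exactly to n (illustrative mu = Geometric(mu_p) on {1, 2, ...}).

    Returns the accepted size vector and the number of attempts needed.
    """
    rng = np.random.default_rng(seed)
    for attempt in range(1, max_tries + 1):
        sizes, total = [], 0
        while total < n:
            s = int(rng.geometric(mu_p))
            sizes.append(s)
            total += s
        if total == n:            # accept only an exact fit, otherwise retry
            return np.array(sizes), attempt
    raise RuntimeError("no exact fit found; increase max_tries")

if __name__ == "__main__":
    sizes, attempts = sample_esc_sizes_rejection(n=200)
    print("K =", len(sizes), "attempts =", attempts,
          "sizes:", sorted(map(int, sizes), reverse=True))
```

Schemes of this kind can require many repeated attempts before an exact fit is found; the renewal-theoretic, closed-form expressions developed in this paper are aimed at generating such samples more efficiently.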
Article
Clustering in high-dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack probabilistic interpretation and rely on heuristics for estimation of the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale and devising algorithms that are able to effectively explore the posterior space is an open problem. Based on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that entails defining a likelihood on pairwise distances between observations. The novelty of the approach consists in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small dissimilarities among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modelling strategy has interesting connections with existing proposals in the literature. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in simulation and an application in digital numismatics. Supplementary Material with code is available online.
Article
Reliably learning group structures among nodes in network data is challenging in several applications. We are particularly motivated by studying covert networks that encode relationships among criminals. These data are subject to measurement errors, and exhibit a complex combination of an unknown number of core-periphery, assortative and disassortative structures that may unveil key architectures of the criminal organization. The coexistence of these noisy block patterns limits the reliability of routinely-used community detection algorithms, and requires extensions of model-based solutions to realistically characterize the node partition process, incorporate information from node attributes, and provide improved strategies for estimation and uncertainty quantification. To cover these gaps, we develop a new class of extended stochastic block models (ESBM) that infer groups of nodes having common connectivity patterns via Gibbs-type priors on the partition process. This choice encompasses many realistic priors for criminal networks, covering solutions with fixed, random and infinite number of possible groups, and facilitates the inclusion of node attributes in a principled manner. Among the new alternatives in our class, we focus on the Gnedin process as a realistic prior that allows the number of groups to be finite, random and subject to a reinforcement process coherent with criminal networks. A collapsed Gibbs sampler is proposed for the whole ESBM class, and refined strategies for estimation, prediction, uncertainty quantification and model selection are outlined. The ESBM performance is illustrated in realistic simulations and in an application to an Italian mafia network, where we unveil key complex block structures, mostly hidden from state-of-the-art alternatives.
Article
Classical model-based partitional clustering algorithms, such as k-means or mixture of Gaussians, provide only loose and indirect control over the size of the resulting clusters. In this work, we present a family of probabilistic clustering models that can be steered towards clusters of desired size by providing a prior distribution over the possible sizes, allowing the analyst to fine-tune exploratory analysis or to produce clusters of suitable size for future down-stream processing. Our formulation supports arbitrary multimodal prior distributions, generalizing the previous work on clustering algorithms searching for clusters of equal size or algorithms designed for the microclustering task of finding small clusters. We provide practical methods for solving the problem, using integer programming for making the cluster assignments, and demonstrate that we can also automatically infer the number of clusters.
Article
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme—integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as structured entity resolution (record linkage or deduplication). Here, we review motivational applications and seminal papers that have led to the growth of this area. We review modern probabilistic and Bayesian methods in statistics, computer science, machine learning, database management, economics, political science, and other disciplines that are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Last, we discuss current research topics of practical importance.
Article
We describe extensions to the method of Pritchard et al. for inferring population structure from multilocus genotype data. Most importantly, we develop methods that allow for linkage between loci. The new model accounts for the correlations between linked loci that arise in admixed populations (“admixture linkage disequilibrium”). This modification has several advantages, allowing (1) detection of admixture events farther back into the past, (2) inference of the population of origin of chromosomal regions, and (3) more accurate estimates of statistical uncertainty when linked loci are used. It is also of potential use for admixture mapping. In addition, we describe a new prior model for the allele frequencies within each population, which allows identification of subtle population subdivisions that were not detectable using the existing method. We present results applying the new methods to study admixture in African-Americans, recombination in Helicobacter pylori, and drift in populations of Drosophila melanogaster. The methods are implemented in a program, structure, version 2.0, which is available at http://pritch.bsd.uchicago.edu.
Article
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points — the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.
Article
Notwithstanding its title, the reader will not find in this book a systematic account of this huge subject. Certain classical aspects have been passed by, and the true title ought to be "Various questions of elementary combinatorial analysis". For instance, we only touch upon the subject of graphs and configurations, but there exists a very extensive and good literature on this subject. For this we refer the reader to the bibliography at the end of the volume. The true beginnings of combinatorial analysis (also called combinatory analysis) coincide with the beginnings of probability theory in the 17th century. For about two centuries it vanished as an autonomous subject. But the advance of statistics, with an ever-increasing demand for configurations as well as the advent and development of computers, have, beyond doubt, contributed to reinstating this subject after such a long period of negligence. For a long time the aim of combinatorial analysis was to count the different ways of arranging objects under given circumstances. Hence, many of the traditional problems of analysis or geometry which are concerned at a certain moment with finite structures, have a combinatorial character. Today, combinatorial analysis is also relevant to problems of existence, estimation and structuration, like all other parts of mathematics, but exclusively for finite sets.
Article
Many popular random partition models, such as the Chinese restaurant process and its two-parameter extension, fall in the class of exchangeable random partitions, and have found wide applicability in model-based clustering, population genetics, ecology or network analysis. While the exchangeability assumption is sensible in many cases, it has some strong implications. In particular, Kingman's representation theorem implies that the size of the clusters necessarily grows linearly with the sample size; this feature may be undesirable for some applications, as recently pointed out by Miller et al. (2015). We present here a flexible class of non-exchangeable random partition models which are able to generate partitions whose cluster sizes grow sublinearly with the sample size, and where the growth rate is controlled by one parameter. Along with this result, we provide the asymptotic behaviour of the number of clusters of a given size, and show that the model can exhibit a power-law behavior, controlled by another parameter. The construction is based on completely random measures and a Poisson embedding of the random partition, and inference is performed using a Sequential Monte Carlo algorithm. Additionally, we show how the model can also be directly used to generate sparse multigraphs with power-law degree distributions and degree sequences with sublinear growth. Finally, experiments on real datasets emphasize the usefulness of the approach compared to a two-parameter Chinese restaurant process.
Article
Size-constrained clustering (SCC) refers to the dual problem of using observations to determine latent cluster structure while at the same time assigning observations to the unknown clusters subject to an analyst-defined constraint on cluster sizes. While several approaches have been proposed, SCC remains a difficult problem due to the combinatorial dependency between observations introduced by the size constraints. Here we reformulate SCC as a decision problem and introduce a novel loss function to capture various types of size constraints. As opposed to prior work, our approach is uniquely suited to situations in which size constraints reflect an external limitation or desire rather than an internal feature of the data generation process. To demonstrate our approach, we develop a Bayesian mixture model for clustering respondents using both simulated and real categorical survey data. Our motivation for the development of this decision-theoretic approach to SCC was to determine optimal team assignments for a Harry Potter themed scavenger hunt based on categorical survey data from participants.