Content available from Statistics and Computing
Statistics and Computing (2024) 34:72
https://doi.org/10.1007/s11222-024-10385-w
ORIGINAL PAPER
Fast generation of exchangeable sequences of clusters data
Keith Levin¹ · Brenda Betancourt²
Received: 2 November 2023 / Accepted: 10 January 2024 / Published online: 7 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable
Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the
observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes
grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks and
genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast
to more traditional Dirichlet Process or Pitman–Yor process mixture models, samples a priori from ESC models cannot be
easily obtained in a sequential fashion and instead require the use of rejection or importance sampling. In this work, drawing
on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models
and develop faster methods for generating samples a priori from these models compared with the existing state of the art. In
the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was
unknown prior to this work.
Keywords Random partition · Microclustering · Bell polynomials · Renewal theory
1 Introduction
Random partitions are integral to a variety of Bayesian clustering methods, with applications to
text analysis (Blei et al. 2003; Blei 2012), genetics (Pritchard et al. 2000; Falush et al. 2003),
entity resolution (Binette and Steorts 2022) and community detection (Legramanti et al. 2022), to
name but a few. The most widely used random partition models are those based on Dirichlet processes
and Pitman–Yor processes (Antoniak 1974; Sethuraman 1994; Ishwaran and James 2003), most notably
the famed Chinese Restaurant Process (CRP; Aldous 1985). A drawback of these models is that they
generate partitions in which one or more cells of the partition grow linearly in the number of
observations n. This property is undesirable in applications to, for example, record linkage and
social network modeling, where data commonly exhibit a large number of small clusters. For these
applications, a different mechanism is needed that better captures the growth of cluster sizes
with n.

Keith Levin
kdlevin@wisc.edu
Brenda Betancourt
betancourt-brenda@norc.org
¹ Department of Statistics, University of Wisconsin–Madison, 1300 University Ave, Madison, WI, USA
² Statistics & Data Science, NORC at the University of Chicago, 4350 East–West Highway, Bethesda, MD, USA
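
The linear growth of CRP clusters noted in the opening paragraph of this section can be seen in a small simulation. The following is a minimal sketch, not code from the paper: the function name `sample_crp_cluster_sizes`, the concentration parameter `alpha = 1.0` and the chosen values of n are all illustrative assumptions. It uses the standard sequential seating rule, in which a new observation opens a new cluster with probability alpha / (i + alpha) and otherwise joins an existing cluster with probability proportional to its size.

```python
import random


def sample_crp_cluster_sizes(n, alpha, rng):
    """Seat n observations sequentially under a CRP; return the cluster sizes."""
    sizes = []
    for i in range(n):
        # i observations are already seated; a new cluster opens
        # with probability alpha / (i + alpha).
        if rng.random() * (i + alpha) < alpha:
            sizes.append(1)
        else:
            # Otherwise join an existing cluster with probability
            # proportional to its current size.
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] += 1
    return sizes


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
for n in (100, 1000, 10000):
    sizes = sample_crp_cluster_sizes(n, alpha=1.0, rng=rng)
    # Report the number of clusters and the fraction of the data
    # occupied by the largest cluster.
    print(n, len(sizes), max(sizes) / n)
```

Under the CRP, the fraction of observations in the largest cluster does not vanish as n grows, which is the behavior that microclustering models are designed to avoid.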
One way to address this issue is to use models with the microclustering property, whereby the size
of the largest cluster grows sublinearly in the number of observations n. Early attempts in this
direction appeared in Miller et al. (2015) and Zanella et al. (2016). The authors were motivated by
record linkage applications (Binette and Steorts 2022), where clusters are expected to remain small
even as the number of observations increases. This initial class of models, constructed under the
Kolchin representation of Gibbs partitions (Kolchin 1971), places a prior κ on the number of
clusters K, and then draws from a distribution μ over cluster sizes conditional on K. This approach
is comparatively simple, admitting an algorithm that facilitates sampling a priori and a posteriori
similar to the CRP. Unfortunately, the distributions of the number of clusters and the size of a
randomly chosen cluster are not straightforwardly related to the priors κ and μ. More to the point,
it has not yet been proven that this family of models exhibits the microclustering property in
general.
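
The two-stage construction described above (draw the number of clusters K from κ, then draw cluster sizes from μ given K) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of Miller et al. (2015) or Zanella et al. (2016). The uniform choice for κ, the geometric choice for μ, and all function names are hypothetical. The sketch also assumes that the K sizes are drawn independently, that the total number of observations is whatever the sizes sum to, and that labels are assigned by a uniform shuffle, none of which is spelled out in the text.

```python
import random


def draw_num_clusters(rng):
    # Hypothetical stand-in for the prior kappa: uniform on {1, ..., 50}.
    return rng.randint(1, 50)


def draw_cluster_size(rng, p=0.5):
    # Hypothetical stand-in for mu: geometric on {1, 2, ...}
    # with success probability p.
    size = 1
    while rng.random() >= p:
        size += 1
    return size


def sample_two_stage_partition(rng):
    """Draw K, then K cluster sizes, then a random labeling of the observations."""
    k = draw_num_clusters(rng)
    sizes = [draw_cluster_size(rng) for _ in range(k)]
    # One label per observation: cluster c appears sizes[c] times.
    labels = [c for c, s in enumerate(sizes) for _ in range(s)]
    # Assumption: observations are assigned to clusters uniformly at random.
    rng.shuffle(labels)
    return sizes, labels


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
sizes, labels = sample_two_stage_partition(rng)
print("K =", len(sizes), "n =", len(labels), "largest cluster =", max(sizes))
```

In this sketch the sample size n is an output of the procedure and is not fixed in advance, so the induced distributions of K and of a typical cluster size at a given n are only indirectly controlled by the two priors.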
More recently, Betancourt et al. (2022) considered a different approach to microclustering, called Exchangeable