Abstract
Clustering in high dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack probabilistic interpretation and rely on heuristics for estimating the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale, and devising algorithms that are able to effectively explore the posterior space remains an open problem. Building on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that entails defining a likelihood on the pairwise distances between observations. The novelty of the approach consists in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small dissimilarities among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modelling strategy has interesting connections with existing proposals in the literature. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in simulations and in an application to digital numismatics. Supplementary Material with code is available online.
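For orientation, the cohesion and repulsion idea can be written schematically as a likelihood on the pairwise dissimilarities d_ij, with one density rewarding small within-cluster distances and another rewarding separation between clusters. The notation below is ours and the paper's exact parameterisation may differ:

```latex
p\bigl(\{d_{ij}\}\mid \mathcal{C}_1,\dots,\mathcal{C}_K\bigr)
\;\propto\;
\prod_{k=1}^{K}\;
\prod_{\substack{i<j\\ i,j\in\mathcal{C}_k}} f_{\mathrm{coh}}(d_{ij})
\;\times\;
\prod_{k<l}\;
\prod_{\substack{i\in\mathcal{C}_k\\ j\in\mathcal{C}_l}} f_{\mathrm{rep}}(d_{ij}),
```

where f_coh places mass on small within-cluster dissimilarities (cohesion) and f_rep on larger dissimilarities to observations in other clusters (repulsion).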
... This amounts to approximately 1,350 hours of work, making the analysis of large hoards almost infeasible. However, recent studies have begun to address this challenge using Computer Vision and Machine Learning techniques, offering a more efficient approach to identifying die links in vast collections [21, 30]. ...
... Although this is not the case for coin classification [1], the small amount of publicly available labeled data also makes it more difficult to use Deep Learning techniques for automatic feature extraction. Automated analysis therefore still relies on the development of scoring techniques tailored to the datasets studied [21, 30, 46]. Thus, it is important to highlight the analysis tools that perform well in these specific tasks, and that could be used in the future to accurately label datasets. ...
... In the domain of coin die link detection, some recent works are very promising [21,30,46]. However, the datasets used in these works are not publicly available, and the source codes for computing coin dissimilarities have not been released online. ...
The analyses of ancient coins, and especially the identification of those struck with the same die, provides invaluable information for archaeologists and historians. Nowadays, these die links are identified manually, which makes the process laborious, if not impossible when big treasures are discovered as the number of comparisons is too large. This study introduces advances that promise to streamline and enhance archaeological coin analysis. Our contributions include: 1) First publicly accessible labeled dataset of coin pictures (329 images) for die link detection, facilitating method benchmarking; 2) Novel SSIM-based scoring method for rapid and accurate discrimination of coin pairs, outperforming current techniques used in this research field; 3) Evaluation of clustering techniques using our score, demonstrating near-perfect die link identification. We provide datasets, to foster future research and the development of even more powerful tools for archaeology, and more particularly for numismatics.
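As an illustration of an SSIM-based pairwise score of the kind described above, the following is a minimal sketch using scikit-image. It assumes two aligned greyscale images of identical shape and is not the authors' released scoring code.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Return the structural similarity between two aligned greyscale coin images."""
    # data_range is required for float images; here it is taken from the inputs.
    data_range = float(max(img_a.max(), img_b.max()) - min(img_a.min(), img_b.min()))
    return structural_similarity(img_a, img_b, data_range=data_range)
```

Higher scores would suggest a possible die link; a clustering step such as the one evaluated in the paper can then be run on the resulting score matrix.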
... Mara (2022) notes that the use of mathematical and statistical methods in archaeology occurred quite early, at the forefront of Digital Humanities (Binford 1965;Clarke 1973). Since the 1970s, digital methods have been used in a variety of archaeological investigations, for instance mapping archaeological sites through the application of GIS, the reconstruction of ancient sites and artefacts through 2D or 3D modelling, sorting and typologizing various artefacts using unsupervised or semi-supervised statistical models, and the digitization and archiving of museum and/or private collections (Papaioannou, Karabassi, and Theoharis 2002;Rasheed and Nordin 2015;Khunti 2018;Tuno et al. 2022;Natarajan et al. 2023). ...
... These approaches often result in poor performance due to loss of spatial relationships between the different impressions on the coin. To the best of our knowledge, only a limited number of the existing proposals explicitly address the problem of die analysis, that is, the determination of the number of dies/casts/minting tools corresponding to a sample of coins and their exact partition (Natarajan et al. 2023). Moreover, current methods are mainly based on pairwise comparisons and at best use heuristics to determine the number of dies (Taylor 2020). ...
... We performed a series of pre-processing steps following Natarajan et al. (2023). Specifically, we convert the images to greyscale and resize them to 300 × 300 pixels. ...
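The pre-processing described in this excerpt (greyscale conversion and resizing to 300 × 300 pixels) can be reproduced with a few lines of Pillow. This is only a sketch; details such as interpolation and normalisation may differ from the original pipeline.

```python
import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    """Load a coin image, convert to greyscale, and resize to 300 x 300 pixels."""
    img = Image.open(path).convert("L").resize((300, 300))
    return np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]
```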
Over the past two decades, Digital Humanities has transformed the landscape of humanities and social sciences, enabling advanced computational analysis and interpretation of extensive datasets. Recent initiatives in Southeast Asia, particularly in Singapore, focus on categorizing and archiving historical data such as artwork, literature and, most notably, archaeological artefacts. This study illustrates the profound potential of Digital Humanities through the application of statistical methods on two distinct artefact datasets. Specifically, we present the results of an automated minting study of mid-first millennium CE struck and cast ‘Rising Sun’ coinage from mainland Southeast Asia, while subsequently utilizing unsupervised statistical methods on 2D images of 13th–14th-century earthenware ceramics excavated from the precolonial St. Andrew’s Cathedral site in central Singapore. This research offers a comparative assessment showcasing the transformative impact of statistics-based approaches on the interpretation and analysis of diverse archaeological materials and within Digital Humanities overall.
... Duan and Dunson (2021), Rigon et al. (2023), and Natarajan et al. (2023) introduce a new class of Bayesian Distance Clustering (BDC) models to address the previous issues. Let T = {C_1, . . . ...
... This approach addresses the lack of probabilistic interpretation in distance-based clustering methods and the lack of computational scalability in traditional probabilistic Bayesian models. See Natarajan et al. (2023) and Rigon et al. (2023) for a theoretical justification. ...
... However, the previous models for the partition T have a large support which grows according to the Bell number of N , making posterior exploration through MCMC difficult and leading to inference algorithms that are not scalable. This paper builds upon the BDC model in Natarajan et al. (2023) and introduces new partition models for BDC with the accompanying inference algorithms. We take inspiration from K-Medoids (Kaufman and Rousseeuw, 2009) and infer probabilistically which observations are the "medoids" instead of performing inference on the random partition model T directly. ...
In the era of Big Data, scalable and accurate clustering algorithms for high-dimensional data are essential. We present new Bayesian Distance Clustering (BDC) models and inference algorithms with improved scalability while maintaining the predictive accuracy of modern Bayesian non-parametric models. Unlike traditional methods, BDC models the distance between observations rather than the observations directly, offering a compromise between the scalability of distance-based methods and the enhanced predictive power and probabilistic interpretation of model-based methods. However, existing BDC models still rely on performing inference on the partition model to group observations into clusters. The support of this partition model grows exponentially with the dataset's size, complicating posterior space exploration and leading to many costly likelihood evaluations. Inspired by K-medoids, we propose using tessellations in discrete space to simplify inference by focusing the learning task on finding the best tessellation centers, or "medoids." Additionally, we extend our models to effectively handle multi-view data, such as data comprised of clusters that evolve across time, enhancing their applicability to complex datasets. The real data application in numismatics demonstrates the efficacy of our approach.
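To make the medoid idea concrete, the sketch below shows the deterministic K-medoids-style assignment step given a dissimilarity matrix and a set of candidate medoids. In the models described above the medoids are objects of Bayesian inference rather than fixed inputs, so this is only an illustration with hypothetical inputs.

```python
import numpy as np

def assign_to_medoids(D: np.ndarray, medoids: list[int]) -> np.ndarray:
    """Assign each observation to its closest medoid.

    D is an n x n matrix of pairwise dissimilarities and `medoids` is a list
    of indices of the current medoids (illustrative input, not inferred here).
    """
    medoids = np.asarray(medoids)
    nearest = np.argmin(D[:, medoids], axis=1)  # position within `medoids`
    return medoids[nearest]                     # cluster label = medoid index
```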
... Our uncertainty based construction, and the subsequent matching of the keypoints between image pairs, is inspired by the work of [12], who consider Gaussian process landmarking on manifolds for analysis and comparison of three-dimensional anatomical shapes in evolutionary biology [13]. We utilize the pairwise dissimilarities calculated from the Gaussian process based features for Bayesian distance based microclustering [14]. In the Bayesian framework, the number of dies used to strike the sample, a key unknown [15], is an object of inference, to be predicted in conjunction with the die labels. ...
... We utilized parallel computing resources to calculate all pairwise dissimilarities, since the number of pairs is quadratic in the number of images. Clustering is then based on the dissimilarity scores via a chaperones algorithm implementation of the Bayesian distance microclustering model of [14]. Expert domain knowledge can be used to inform prior hyper-parameters. ...
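Since the number of pairs grows quadratically (n(n-1)/2 for n images), the pairwise dissimilarity computation is usually the bottleneck. A minimal sketch with SciPy is shown below, using a plain Euclidean distance on hypothetical per-image feature vectors in place of the Gaussian-process keypoint dissimilarity used in the cited work.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
features = rng.standard_normal((100, 64))   # hypothetical per-image feature vectors

# Condensed vector of n*(n-1)/2 pairwise distances, then the full n x n matrix.
D = squareform(pdist(features, metric="euclidean"))
```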
... The distances d ij are used to cluster the images via the Bayesian distance microclustering algorithm developed in [14], to which we refer for further details. The Bayesian distance based approach defines a likelihood directly on the distances between observations [49], representing a middle ground between model based clustering [50], which for high-dimensional data is computationally prohibitive, and distance-based clustering, such as hierarchical or k-means clustering which lack a probabilistic background. ...
Die analysis is an essential numismatic method, and an important tool of ancient economic history. Yet, manual die studies are too labor-intensive to comprehensively study large coinages such as those of the Roman Empire. We address this problem by proposing a model for unsupervised computational die analysis, which can reduce the time investment necessary for large-scale die studies by several orders of magnitude, in many cases from years to weeks. From a computer vision viewpoint, die studies present a challenging unsupervised clustering problem, because they involve an unknown and large number of highly similar semantic classes of imbalanced sizes. We address these issues through determining dissimilarities between coin faces derived from specifically devised Gaussian process-based keypoint features in a Bayesian distance clustering framework. The efficacy of our method is demonstrated through an analysis of 1135 Roman silver coins struck between 64 and 66 C.E.
... Historically, this process requires experts to manually examine coins for subtle similarities and differences, making it both time-consuming and subjective (Heinecke et al., 2021). Advances in image processing now allow for systematic analysis of die characteristics, such as engraver marks, symbols, and minor design inconsistencies, linking coins with greater precision and efficiency than ever before (McCord-Taylor 2020; Natarajan et al., 2023; Harris et al., 2024). By leveraging feature extraction algorithms and neural networks to compare and cluster coins from the same die, automated die studies have significantly improved both efficiency and consistency. ...
... 2021). Automated die studies are achieving ever-increasing levels of accuracy (Cabral et al., 2024; Natarajan et al., 2023), driven by advancements in computational techniques and machine learning. However, the effectiveness of these studies can be significantly enhanced by integrating more robust and efficient object detection methods. ...
In this work we investigate the application of advanced object detection techniques to digital numismatics, focussing on the analysis of historical coins. Leveraging models such as Contrastive Language-Image Pre-training (CLIP), we develop a flexible framework for identifying and classifying specific coin features using both image and textual descriptions. By examining two distinct datasets, modern Russian coins featuring intricate "Saint George and the Dragon" designs and degraded 1st millennium AD Southeast Asian coins bearing Hindu-Buddhist symbols, we evaluate the efficacy of different detection algorithms in search and classification tasks. Our results demonstrate the superior performance of larger CLIP models in detecting complex imagery, while traditional methods excel in identifying simple geometric patterns. Additionally, we propose a statistical calibration mechanism to enhance the reliability of similarity scores in low-quality datasets. This work highlights the transformative potential of integrating state-of-the-art object detection into digital numismatics, enabling more scalable, precise, and efficient analysis of historical artifacts. These advancements pave the way for new methodologies in cultural heritage research, artefact provenance studies, and the detection of forgeries.
... However, in either case, HMMs with a variable number of states are prone to overfitting, and hence such algorithms can lead to an unnecessarily large number of similar states (Duan and Dunson, 2018). Recent advancements in the field of mixture modelling have introduced the use of repulsive priors, which promote parsimony in the model (Petralia et al., 2012; Quinlan et al., 2021; Natarajan et al., 2023). These repulsive prior distributions serve to impose constraints on the proximity of state parameters, which discourages similar states from being created. ...
... Unavoidably, this particular form of penalty applied to the parameter space also affects the selection of the number of states (Natarajan et al., 2023). An additional advantage associated with incorporating a repulsive prior into HMMs, whether with fixed or variable dimensions, is the mitigation of overfitting. ...
Hidden Markov models (HMMs) offer a robust and efficient framework for analyzing time series data, modelling both the underlying latent state progression over time and the observation process, conditional on the latent state. However, a critical challenge lies in determining the appropriate number of underlying states, often unknown in practice. In this paper, we employ a Bayesian framework, treating the number of states as a random variable and employing reversible jump Markov chain Monte Carlo to sample from the posterior distributions of all parameters, including the number of states. Additionally, we introduce repulsive priors for the state parameters in HMMs, and hence avoid overfitting issues and promote parsimonious models with dissimilar state components. We perform an extensive simulation study comparing performance of models with independent and repulsive prior distributions on the state parameters, and demonstrate our proposed framework on two ecological case studies: GPS tracking data on muskox in Antarctica and acoustic data on Cape gannets in South Africa. Our results highlight how our framework effectively explores the model space, defined by models with different latent state dimensions, while leading to latent states that are distinguished better and hence are more interpretable, enabling better understanding of complex dynamic systems.
... i.) NOSE models the entire abrupt change process directly through θ(t) (≡ θ) rather than aggregating all sets of segment parameters as in prevailing methods. In this sense, NOSE can be viewed as an infinite-dimensional extension of StepSignalMargiLike (Du et al., 2016), which represents the abrupt change scheme through a finite-dimensional vector. We may try to explain the success of NOSE from the perspective of cohesion and repulsion in clustering (Natarajan et al., 2023). To some extent, change-point detection may be viewed as an ordered clustering task on sequential data. ...
... Those data points within the same segment can be viewed as a cluster. Quoting Natarajan et al. (2023), "clusters are composed of objects which have small dissimilarities among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion)". Intuitively, jump size may be viewed as a metric of dissimilarity between data points. ...
Change-point detection has long been an important and active research area with broad applications in economics, medical research, genetics, social science, etc. It covers the number, locations, and jump sizes of change-points. However, most existing methods, whether Bayesian or frequentist, focus on segment parameters or features. We propose an innovative non-segmental approach to detect multiple change-points by concentrating the abrupt changes into a global distributional parameter, which is characterized by a function of states of the system. We construct a class of discrete spike and Cauchy-slab variate priors, which is distinguished from existing two-group mixture priors by a dynamic rate of a latent indicator. A 3-sigma discrimination criterion for change-points is built, with sigma being the standard deviation of the sequence of differences between marginal maximum a posteriori estimates on two adjacent discretized states. It intrinsically guarantees a reasonable false positive rate to prevent over-detection. The proposed method is powerful and robust in unveiling structural changes of the autoregression model for house prices in London and the linear regression model for age-specific fertility in the US, as well as giving consistent detection results for shifts of mean or scale on known data sets. Abundant simulations demonstrate that the proposed method outperforms state-of-the-art alternatives in finite-sample performance. An R package, BaRDCP, is developed for detecting changes in mean, scale, and the regression or autocorrelation coefficient.
... While ESC models are more interpretable and have better-developed theory than previously proposed microclustering models, there is no known relationship between the cluster size distribution µ and the number of clusters K under these models. Recently, Natarajan et al. (2021) (Proposition 2) established the distribution of the number of clusters K for the case where µ is a shifted negative binomial, one of the specific models first proposed by Betancourt et al. (to appear). Bystrova et al. (2020) established the behavior of K under a related class of Gibbs-type processes. ...
... Example: Negative Binomial Cluster Sizes. By way of illustration, we consider the model that has received the most attention to date in the microclustering literature (see, e.g., Zanella et al. 2016; Betancourt et al. to appear; Natarajan et al. 2021), the ESC-NB model. Under this model, µ takes the form of a shifted negative binomial distribution, ...
Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks and genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet Process or Pitman–Yor process mixture models, samples a priori from ESC models cannot be easily obtained in a sequential fashion and instead require the use of rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop faster methods for generating samples a priori from these models compared with the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work.
... Unfortunately, we do not expect this condition to hold in complete generality for gradient-bridged posterior models. This is a common limitation of loss-based generalized Bayes approaches [Natarajan et al., 2024, Rigon et al., 2023]. Nonetheless, we can identify salient sufficient conditions that ensure the marginalization condition, such as separability of the loss over i. ...
Many statistical problems include model parameters that are defined as the solutions to optimization sub-problems. These include classical approaches such as profile likelihood as well as modern applications involving flow networks or Procrustes distances. In such cases, the likelihood of the data involves an implicit function, often complicating inferential procedures and entailing prohibitive computational cost. In this article, we propose an intuitive and tractable posterior inference approach for this setting. We introduce a class of continuous models that handle implicit function values using the first-order optimality of the sub-problems. Specifically, we apply a shrinkage kernel to the gradient norm, which retains a probabilistic interpretation within a generative model. This can be understood as a generalization of the Gibbs posterior framework to newly enable concentration around partial minimizers in a subset of the parameters. We show that this method, termed the gradient-bridged posterior, is amenable to efficient posterior computation, and enjoys theoretical guarantees, establishing a Bernstein--von Mises theorem for asymptotic normality. The advantages of our approach are highlighted on a synthetic flow network experiment and an application to data integration using Procrustes distances.
... In a preprint [15], Heinecke et al. propose an automated ("unsupervised") approach based on a similar pipeline as CADS, with Gaussian process keypoint extraction and a Bayesian distance microclustering algorithm [29]. However, their code is only partially released, which makes it hard to reproduce their results. ...
Die studies are fundamental to quantifying ancient monetary production, providing insights into the relationship between coinage, politics, and history. The process requires tedious manual work, which limits the size of the corpora that can be studied. Few works have attempted to automate this task, and none have been properly released and evaluated from a computer vision perspective. We propose a fully automatic approach that introduces several innovations compared to previous methods. First, we rely on fast and robust local descriptor matching that is configured automatically. Second, the core of our proposal is a clustering-based approach that uses an intrinsic metric (one that does not need the ground-truth labels) to determine its critical hyper-parameters. We validate the approach on two corpora of Greek coins, propose an automatic implementation and evaluation of previous baselines, and show that our approach significantly outperforms them.
... This appeals when the likelihood is inaccessible or intractable; there is a well-established literature on partial information settings, including methods based on composite likelihood [Lindsay, 1988, Varin et al., 2011], partial likelihood [Sinha et al., 2003, Dunson and Taylor, 2005], pairwise likelihood [Jensen and Künsch, 1994], and others. Recently, there has been a burgeoning interest in loss-based Bayesian models, including works involving classification loss [Polson and Scott, 2011] or distance-based losses [Duan and Dunson, 2021, Rigon et al., 2023, Natarajan et al., 2023]. Loss-based generalized Bayes models typically use a probability distribution called the "Gibbs posterior" [Jiang and Tanner, 2008], taking the form: ...
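The excerpt is truncated at the formula; for reference, the Gibbs posterior is conventionally written as follows, with loss ℓ, a weight or learning rate w, and prior π (notation ours):

```latex
\pi_w(\theta \mid y) \;\propto\; \exp\{-\, w \, \ell(\theta; y)\}\, \pi(\theta).
```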
Optimization is widely used in statistics, thanks to its efficiency for delivering point estimates on useful spaces, such as those satisfying low cardinality or combinatorial structure. To quantify uncertainty, Gibbs posterior exponentiates the negative loss function to form a posterior density. Nevertheless, Gibbs posteriors are supported in a high-dimensional space, and do not inherit the computational efficiency or constraint formulations from optimization. In this article, we explore a new generalized Bayes approach, viewing the likelihood as a function of data, parameters, and latent variables conditionally determined by an optimization sub-problem. Marginally, the latent variable given the data remains stochastic, and is characterized by its posterior distribution. This framework, coined "bridged posterior", conforms to the Bayesian paradigm. Besides providing a novel generative model, we obtain a positively surprising theoretical finding that under mild conditions, the √ n-adjusted posterior distribution of the parameters under our model converges to the same normal distribution as that of the canonical integrated posterior. Therefore, our result formally dispels a long-held belief that partial optimization of latent variables may lead to underestimation of parameter uncertainty. We demonstrate the practical advantages of our approach under several settings, including maximum-margin classification, latent normal models, and harmonization of multiple networks.
... An alternative definition of repulsive mixture is provided by Malsiner-Walli et al. [2017], whose approach encourages nearby components to merge into groups at a first hierarchical level and then enforces between-group separation at the second level. A similar idea has been employed in Natarajan et al. [2021] in the context of distance-based clustering, where the repulsive term appears at the likelihood level. Finally, we note that Fúquene et al. [2019] propose the use of Non-Local Priors (NLP) to select the number of components, characterised by improved parsimony obtained through the inclusion of a penalty term, and leading to well-separated components with non-negligible weight, interpretable as distinct subpopulations. ...
Mixture models are commonly used in applications with heterogeneity and overdispersion in the population, as they allow the identification of subpopulations. In the Bayesian framework, this entails the specification of suitable prior distributions for the weights and location parameters of the mixture. Widely used are Bayesian semi-parametric models based on mixtures with infinite or random number of components, such as Dirichlet process mixtures or mixtures with random number of components. Key in this context is the choice of the kernel for cluster identification. Despite their popularity, the flexibility of these models and prior distributions often does not translate into interpretability of the identified clusters. To overcome this issue, clustering methods based on repulsive mixtures have been recently proposed. The basic idea is to include a repulsive term in the prior distribution of the atoms of the mixture, which favours mixture locations far apart. This approach is increasingly popular and allows one to produce well-separated clusters, thus facilitating the interpretation of the results. However, the resulting models are usually not easy to handle due to the introduction of unknown normalising constants. Exploiting results from statistical mechanics, we propose in this work a novel class of repulsive prior distributions based on Gibbs measures. Specifically, we use Gibbs measures associated to joint distributions of eigenvalues of random matrices, which naturally possess a repulsive property. The proposed framework greatly simplifies the computations needed for the use of repulsive mixtures due to the availability of the normalising constant in closed form. We investigate theoretical properties of such class of prior distributions, and illustrate the novel class of priors and their properties, as well as their clustering performance, on benchmark datasets.
... , y_n ∈ R^p. Although a variety of models have been proposed in the literature (Neal, 2003; Teh et al., 2007; Duan & Dunson, 2021; Natarajan et al., 2021), Bayesian mixtures constitute a direct approach for model-based clustering; see Fruhwirth-Schnatter et al. (2019) for a recent review. In mixture models, it is assumed that data are generated from m (either random or fixed) homogeneous populations. ...
Model-based clustering of moderate or large dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering by assuming a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2020). The authors use a factor-analytic representation and assume a mixture model for the latent factors. However, performance can deteriorate in the presence of model misspecification. Assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores is shown to yield a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. To favor well-separated clusters of data, the repulsive point process must be anisotropic, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes.
... Another interesting work is that proposed by Heinecke et al. [Heinecke et al., 2021]. Their approach relies on Gaussian Processes to compute keypoints and on Bayesian microclustering [Natarajan et al., 2021] to group coins struck with the same die. Their approach was tested only on Roman coins. ...
Grouping coins according to their die is a problem with many applications in numismatics. This grouping is crucial for understanding the economic history of certain peoples, especially those for whom few written records exist, such as the Celtic peoples. It is a difficult task that requires considerable time and expertise, yet work on the automatic identification of coin dies is very rare. This thesis proposes an automatic tool to determine whether two designs were produced with the same stamp, and in particular whether two coins were struck with the same die. Based on deep-learning registration algorithms, the proposed method made it possible to classify a hoard of around a thousand Riedones coins dating from the second century BCE. This hoard allowed us to build an annotated dataset of 3D acquisitions of coins, called Riedones3D. This database is useful for specialists in Celtic coinage, but also for the computer vision community for developing new die recognition algorithms. Rigorous evaluations on Riedones3D and on other Celtic works show the value of the proposed method; in particular, it can adapt to unseen designs. Finally, we propose a new registration algorithm that can adapt to any type of sensor. Thanks to this algorithm, a specialist can potentially use faster or less expensive sensors to acquire the coins or engraved designs.
Model-based clustering of moderate or large dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering by assuming a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2023). The authors use a factor-analytic representation and assume a mixture model for the latent factors. However, performance can deteriorate in the presence of model misspecification. Assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores is shown to yield a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. The repulsive point process must be anisotropic to favour well-separated clusters of data, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes. We illustrate our model in simulations as well as a plant species co-occurrence dataset.
Mixture models are commonly used in applications with heterogeneity and overdispersion in the population, as they allow the identification of subpopulations. In the Bayesian framework, this entails the specification of suitable prior distributions for the weights and locations of the mixture. Despite their popularity, the flexibility of these models often does not translate into the interpretability of the clusters. To overcome this issue, repulsive mixture models have been recently proposed. The basic idea is to include a repulsive term in the distribution of the atoms of the mixture, favouring mixture locations far apart. This approach induces well-separated clusters, aiding the interpretation of the results. However, these models are usually not easy to handle due to unknown normalizing constants. We exploit results from equilibrium statistical mechanics, where the molecular chaos hypothesis implies that nearby particles spread out over time. In particular, we exploit the connection between random matrix theory and statistical mechanics and propose a novel class of repulsive prior distributions based on Gibbs measures associated with joint distributions of eigenvalues of random matrices. The proposed framework greatly simplifies computations thanks to the availability of the normalizing constant in closed form. We investigate the theoretical properties and clustering performance of the proposed distributions.
Bayesian cluster analysis offers substantial benefits over algorithmic approaches by providing not only point estimates but also uncertainty in the clustering structure and patterns within each cluster. An overview of Bayesian cluster analysis is provided, including both model-based and loss-based approaches, along with a discussion on the importance of the kernel or loss selected and prior specification. Advantages are demonstrated in an application to cluster cells and discover latent cell types in single-cell RNA sequencing data to study embryonic cellular development. Lastly, we focus on the ongoing debate between finite and infinite mixtures in a model-based approach and robustness to model misspecification. While much of the debate and asymptotic theory focuses on the marginal posterior of the number of clusters, we empirically show that quite a different behaviour is obtained when estimating the full clustering structure.
This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.
Loss-based clustering methods, such as k-means and its variants, are standard tools for finding groups in data. However, the lack of quantification of uncertainty in the estimated clusters is a disadvantage. Model-based clustering based on mixture models provides an alternative, but such methods face computational problems and large sensitivity to the choice of kernel. This article proposes a generalized Bayes framework that bridges between these paradigms through the use of Gibbs posteriors. In conducting Bayesian updating, the loglikelihood is replaced by a loss function for clustering, leading to a rich family of clustering methods. The Gibbs posterior represents a coherent updating of Bayesian beliefs without needing to specify a likelihood for the data, and can be used for characterizing uncertainty in clustering. We consider losses based on Bregman divergence and pairwise similarities, and develop efficient deterministic algorithms for point estimation along with sampling algorithms for uncertainty quantification. Several existing clustering algorithms, including k-means, can be interpreted as generalized Bayes estimators under our framework, and hence we provide a method of uncertainty quantification for these approaches; for example, allowing calculation of the probability a data point is well clustered.
Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging on properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.
As a means of improving analysis of biological shapes, we propose a greedy algorithm for sampling a Riemannian manifold based on the uncertainty of a Gaussian process. This is known to produce a near optimal experimental design with the manifold as the domain, and appears to outperform the use of user-placed landmarks in representing geometry of biological objects. We provide an asymptotic analysis for the decay of the maximum conditional variance, which is frequently employed as a greedy criterion for similar variance- or uncertainty-based sequential experimental design strategies, to our knowledge this is the first result of this type for experimental design. The key observation is to link the greedy algorithm with reduced basis methods in the context of model reduction for partial differential equations. We apply the proposed landmarking algorithm to geometric morphometrics, a branch of evolutionary biology focusing on the analysis and comparisons of anatomical shapes, and compare the automatically sampled landmarks with the "ground truth" landmarks manually placed by evolutionary anthropologists, the results suggest that Gaussian process landmarks perform equally well or better, in terms of both spatial coverage and downstream statistical analysis. We expect this approach will find additional applications in other fields of research.
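The greedy criterion described here (repeatedly picking the point of maximum conditional variance and conditioning the Gaussian process on it) can be sketched as follows. This is a generic implementation with a squared-exponential kernel, not the authors' code; the function name and arguments are illustrative, and X holds the coordinates of candidate points (e.g., mesh vertices).

```python
import numpy as np

def greedy_gp_landmarks(X: np.ndarray, n_landmarks: int,
                        lengthscale: float = 1.0, jitter: float = 1e-9) -> list[int]:
    """Select landmark indices by greedily maximising GP conditional variance."""
    # Squared-exponential kernel on the candidate points.
    sq = np.sum(X**2, axis=1)
    K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / lengthscale**2)

    n = X.shape[0]
    var = np.diag(K).copy()      # current conditional variances
    V = np.zeros((n, 0))         # incremental (partial Cholesky) factors
    chosen = []
    for _ in range(n_landmarks):
        j = int(np.argmax(var))  # greedy step: largest conditional variance
        chosen.append(j)
        # Conditional covariance of all points with the chosen point j.
        cov_j = K[:, j] - V @ V[j, :]
        v_new = cov_j / np.sqrt(var[j] + jitter)
        var = var - v_new**2     # rank-one variance update after conditioning on j
        V = np.hstack([V, v_new[:, None]])
    return chosen
```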
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
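A common formalisation of the microclustering property in this literature, stated here for orientation (see the cited papers for the precise definitions), is that the largest cluster is asymptotically negligible relative to the sample size:

```latex
\frac{M_n}{n} \;\xrightarrow{\;p\;}\; 0 \quad \text{as } n \to \infty,
\qquad M_n = \text{size of the largest cluster among the first } n \text{ data points.}
```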
Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to classical algorithms which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficulties in visualizing it add to this problem. In a Bayesian analysis, the posterior of a real-valued parameter of interest is often summarized by reporting a point estimate such as the posterior mean along with 95% credible intervals to characterize uncertainty. In this paper, we extend these ideas to develop appropriate point estimates and credible sets to summarize the posterior of clustering structure based on decision and information theoretic techniques.
Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about what requirements and what idea of "true clusters" their research is based on, because clustering becomes scientific not through uniqueness but through transparent and open communication. The idea of "natural kinds" is a human construct, but it highlights the human experience that the reality outside the observer's control seems to make certain distinctions between categories inevitable. Various desirable characteristics of clusterings and various approaches to define a context-dependent truth are listed, and I discuss what impact these ideas can have on the comparison of clustering methods, and the choice of a clustering method and related decisions in practice.
Modeling structure in complex networks using Bayesian non-parametrics makes it possible to specify flexible model structures and infer the adequate model complexity from the observed data. This paper provides a gentle introduction to non-parametric Bayesian modeling of complex networks: using an infinite mixture model as a running example, we go through the steps of deriving the model as an infinite limit of a finite parametric model, inferring the model parameters by Markov chain Monte Carlo, and checking the model's fit and predictive performance. We explain how advanced non-parametric models for complex networks can be derived and point out relevant literature.
In this paper we address the problem of choosing a single clustering estimate ĉ based on an MCMC sample of clusterings c(1), c(2), . . . , c(M) from the posterior distribution of a Bayesian cluster model. Methods to derive ĉ based on the posterior similarity matrix, a matrix with entries P(c_i = c_j | y), the posterior probabilities that the observations i and j are in the same cluster, are reviewed and discussed. Minimization of a commonly used loss function for this purpose by Binder (1978) is shown to be equivalent to maximizing the Rand index between estimated and true clustering. We propose a new criterion for choosing an estimated clustering, the posterior expected adjusted Rand index with the true clustering, which outperforms Binder's loss, MAP and an ad hoc criterion in a simulation study. An application to Fisher's Iris data is also provided. Keywords: Adjusted Rand index; Bayesian inference; Cluster analysis; Markov chain Monte Carlo; Loss functions.
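The posterior similarity matrix referred to here is straightforward to estimate from MCMC output; a minimal sketch (not the paper's implementation) is:

```python
import numpy as np

def posterior_similarity(labels: np.ndarray) -> np.ndarray:
    """Estimate the posterior similarity matrix from MCMC cluster labels.

    `labels` has shape (M, n): one row of cluster assignments per MCMC draw.
    Entry (i, j) of the result estimates P(c_i = c_j | y) as the fraction of
    draws in which observations i and j fall in the same cluster.
    """
    labels = np.asarray(labels)
    M, n = labels.shape
    psm = np.zeros((n, n))
    for draw in labels:
        psm += (draw[:, None] == draw[None, :])
    return psm / M
```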
A survey of recent developments in the theory and application of composite likelihood is provided, building on the review paper of Varin (2008). A range of application areas, including geostatistics, spatial extremes, and space-time models, as well as clustered and longitudinal data and time series, are considered. The important area of applications to statistical genetics is omitted, in light of Larribe and Fearnhead (2011). Emphasis is given to the development of the theory, and the current state of knowledge on efficiency and robustness of composite likelihood inference.
A statistical approach to a posteriori blockmodeling for graphs is proposed. The model assumes that the vertices of the graph are partitioned into two unknown blocks and that the probability of an edge between two vertices depends only on the blocks to which they belong. Statistical procedures are derived for estimating the probabilities of edges and for predicting the block structure from observations of the edge pattern only. ML estimators can be computed using the EM algorithm, but this strategy is practical only for small graphs. A Bayesian estimator, based on Gibbs sampling, is proposed. This estimator is practical also for large graphs. When ML estimators are used, the block structure can be predicted based on predictive likelihood. When Gibbs sampling is used, the block structure can be predicted from posterior predictive probabilities. A side result is that when the number of vertices tends to infinity while the probabilities remain constant, the block structure can be recovered correctly with probability tending to 1.
We consider the problem of detecting features, such as minefields or seismic faults, in spatial point processes when there is substantial clutter. We use model-based clustering based on a mixture model for the process, in which features are assumed to generate points according to highly linear multivariate normal densities, and the clutter arises according to a spatial Poisson process. Nonlinear features are represented by several densities, giving a piecewise linear representation. Hierarchical model-based clustering provides a first estimate of the features, and this is then refined using the EM algorithm. The number of features is estimated from an approximation to its posterior distribution. The method gives good results for the minefield and seismic fault problems. Software to implement it is available on the World Wide Web.
We propose a probability model for random partitions in the presence of covariates. In other words, we develop a model-based clustering algorithm that exploits available covariates. The motivating application is predicting time to progression for patients in a breast cancer trial. We proceed by reporting a weighted average of the responses of clusters of earlier patients. The weights should be determined by the similarity of the new patient's covariate with the covariates of patients in each cluster. We achieve the desired inference by defining a random partition model that includes a regression on covariates. Patients with similar covariates are a priori more likely to be clustered together. Posterior predictive inference in this model formalizes the desired prediction. We build on product partition models (PPM). We define an extension of the PPM to include a regression on covariates by including in the cohesion function a new factor that increases the probability of experimental units with similar covariates to be included in the same cluster. We discuss implementations suitable for any combination of continuous, categorical, count, and ordinal covariates. An implementation of the proposed model as an R package is available for download.
Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.
We propose a randomized greedy search algorithm to find a point estimate for a random partition based on a loss function and posterior Monte Carlo samples. Given the large size and awkward discrete nature of the search space, the minimization of the posterior expected loss is challenging. Our approach is a stochastic search based on a series of greedy optimizations performed in a random order and is embarrassingly parallel. We consider several loss functions, including Binder loss and variation of information. We note that criticisms of Binder loss are the result of using equal penalties of misclassification and we show an efficient means to compute Binder loss with potentially unequal penalties. Furthermore, we extend the original variation of information to allow for unequal penalties and show no increased computational costs. We provide a reference implementation of our algorithm. Using a variety of examples, we show that our method produces clustering estimates that better minimize the expected loss and are obtained faster than existing methods. Supplementary materials for this article are available online.
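For reference, Binder loss with possibly unequal misclassification penalties a and b (the form alluded to above; notation is ours) compares an estimate ĉ with a clustering c pairwise:

```latex
L(\hat{c}, c) \;=\; \sum_{i<j}
\Bigl[\, a\,\mathbf{1}\{c_i = c_j,\ \hat{c}_i \neq \hat{c}_j\}
\;+\; b\,\mathbf{1}\{c_i \neq c_j,\ \hat{c}_i = \hat{c}_j\} \Bigr],
```

so the usual Binder loss corresponds to a = b.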
Computer Vision: Algorithms and Applications explores the variety of techniques used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both in specialized applications such as image search and autonomous navigation, as well as for fun, consumer-level tasks that students can apply to their own personal photos and videos.
More than just a source of “recipes,” this exceptionally authoritative and comprehensive textbook/reference takes a scientific approach to the formulation of computer vision problems. These problems are then analyzed using the latest classical and deep learning models and solved using rigorous engineering principles.
Topics and features:
• Structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses
• Incorporates totally new material on deep learning and applications such as mobile computational photography, autonomous navigation, and augmented reality
• Presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects
• Includes 1,500 new citations and 200 new figures that cover the tremendous developments from the last decade
• Provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, estimation theory, datasets, and software
Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.
About the Author
Dr. Richard Szeliski has more than 40 years’ experience in computer vision research, most recently at Facebook and Microsoft Research, where he led the Computational Photography and Interactive Visual Media groups. He is currently an Affiliate Professor at the University of Washington where he co-developed (with Steve Seitz) the widely adopted computer vision curriculum on which this book is based.
We introduce a new statistical model for patterns of linkage disequilibrium (LD) among multiple SNPs in a population sample. The model overcomes limitations of existing approaches to understanding, summarizing, and interpreting LD by (i) relating patterns of LD directly to the underlying recombination process; (ii) considering all loci simultaneously, rather than pairwise; (iii) avoiding the assumption that LD necessarily has a “block-like” structure; and (iv) being computationally tractable for huge genomic regions (up to complete chromosomes). We examine in detail one natural application of the model: estimation of underlying recombination rates from population data. Using simulation, we show that in the case where recombination is assumed constant across the region of interest, recombination rate estimates based on our model are competitive with the very best of current available methods. More importantly, we demonstrate, on real and simulated data, the potential of the model to help identify and quantify fine-scale variation in recombination rate from population data. We also outline how the model could be useful in other contexts, such as in the development of more efficient haplotype-based methods for LD mapping.
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points — the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.
There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given on how to include concrete prior knowledge on the partition. For example, we are motivated by an epidemiological application, in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.
Choosing the number of mixture components remains an elusive challenge. Model selection criteria can be either overly liberal or conservative and return poorly separated components of limited practical use. We formalize non-local priors (NLPs) for mixtures and show how they lead to well-separated components with non-negligible weight, interpretable as distinct subpopulations. We also propose an estimator for posterior model probabilities under local priors and NLPs, showing that Bayes factors are ratios of posterior-to-prior empty cluster probabilities. The estimator is widely applicable and helps to set thresholds to drop unoccupied components in overfitted mixtures. We suggest default prior parameters based on multimodality for Normal/T mixtures and minimal informativeness for categorical outcomes. We characterize theoretically the NLP-induced sparsity, derive tractable expressions and algorithms. We fully develop normal, binomial and product binomial mixtures but the theory, computation and principles hold more generally. We observed a serious lack of sensitivity of the Bayesian information criterion, insufficient parsimony of the Akaike information criterion and a local prior, and a mixed behaviour of the singular Bayesian information criterion. We also considered overfitted mixtures; their performance was competitive but depended on tuning parameters. Under our default prior elicitation NLPs offered a good compromise between sparsity and power to detect meaningfully separated components.
A thoroughly revised and updated edition of this introduction to modern statistical methods for shape analysis. Shape analysis is an important tool in the many disciplines where objects are compared using geometrical features. Examples include comparing brain shape in schizophrenia; investigating protein molecules in bioinformatics; and describing growth of organisms in biology. This book is a significant update of the highly-regarded 'Statistical Shape Analysis' by the same authors. The new edition lays the foundations of landmark shape analysis, including geometrical concepts and statistical techniques, and extends to include analysis of curves, surfaces, images and other types of object data. Key definitions and concepts are discussed throughout, and the relative merits of different approaches are presented. The authors have included substantial new material on recent statistical developments and offer numerous examples throughout the text. Concepts are introduced in an accessible manner, while retaining sufficient detail for more specialist statisticians to appreciate the challenges and opportunities of this new field. Computer code has been included for instructional use, along with exercises to enable readers to implement the applications themselves in R and to follow the key ideas by hands-on analysis. Statistical Shape Analysis: with Applications in R will offer a valuable introduction to this fast-moving research area for statisticians and other applied scientists working in diverse areas, including archaeology, bioinformatics, biology, chemistry, computer science, medicine, morphometrics and image analysis.
Employing nonparametric methods for density estimation has become routine in Bayesian statistical practice. Models based on discrete nonparametric priors such as Dirichlet Process Mixture (DPM) models are very attractive choices due to their flexibility and tractability. However, a common problem in fitting DPMs or other discrete models to data is that they tend to produce a large number of (sometimes) redundant clusters. In this work we propose a method that produces parsimonious mixture models (i.e. mixtures that discourage the creation of redundant clusters), without sacrificing flexibility or model fit. This method is based on the idea of repulsion, that is, that any two mixture components are encouraged to be well separated. We propose a family of d-dimensional probability densities whose coordinates tend to repel each other in a smooth way. The induced probability measure has a close relation with Gibbs Measures, Graph Theory and Point Processes. We investigate its global properties and explore its use in the context of mixture models for density estimation. Computational techniques are detailed and we illustrate its usefulness with some well-known data sets and a small simulation study.
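To make the repulsion idea above concrete, the following minimal sketch evaluates a generic pairwise repulsion term on a set of component locations; the penalty g(d) = 1 - exp(-tau * d^2) is a hypothetical choice used only for illustration and is not the density family proposed in that work.

import numpy as np

def log_repulsion(locations, tau=1.0):
    """Generic pairwise repulsion term: sum of log g(d_ij) over component pairs.

    g(d) = 1 - exp(-tau * d^2) is close to 0 when two components nearly
    coincide (strong penalty) and close to 1 when they are far apart.
    Illustrative choice only, not the density proposed in the paper.
    """
    locations = np.asarray(locations, dtype=float)
    n = locations.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((locations[i] - locations[j]) ** 2)
            total += np.log1p(-np.exp(-tau * d2))
    return total

# Well-separated components receive a higher (less negative) repulsion term.
print(log_repulsion([[0.0], [0.1]]))   # nearly coincident -> strongly penalised
print(log_repulsion([[0.0], [3.0]]))   # well separated -> mild penalty

Adding such a term to a mixture prior makes configurations with nearly coincident components essentially impossible, which is what discourages redundant clusters.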
A new flexible prior for Bayesian image analysis and reservoir modelling is defined in terms of interacting coloured Voronoi cells described by a certain nearest-neighbour Markov point process. This prior can be defined in both two and three (as well as higher) dimensions, and simple MCMC algorithms can be used for drawing inference from the posterior distribution. Various 2D and 3D applications are considered.
We propose a random partition distribution indexed by pairwise similarity information such that partitions compatible with the similarities are given more probability. The use of pairwise similarities, in the form of distances, is common in some clustering algorithms (e.g., hierarchical clustering), but we show how to use this type of information to define a prior partition distribution for flexible Bayesian modeling. A defining feature of the distribution is that it allocates probability among partitions within a given number of subsets, but it does not shift probability among sets of partitions with different numbers of subsets. Our distribution places more probability on partitions that group similar items yet keeps the total probability of partitions with a given number of subsets constant. The distribution of the number of subsets (and its moments) is available in closed-form and is not a function of the similarities. Our formulation has an explicit probability mass function (with a tractable normalizing constant) so the full suite of MCMC methods may be used for posterior inference. We compare our distribution with several existing partition distributions, showing that our formulation has attractive properties. We provide three demonstrations to highlight the features and relative performance of our distribution.
This document is due to appear as a chapter of the forthcoming Handbook of Approximate Bayesian Computation (ABC) edited by S. Sisson, Y. Fan, and M. Beaumont. Since the earliest work on ABC, it has been recognised that using summary statistics is essential to produce useful inference results. This is because ABC suffers from a curse of dimensionality effect, whereby using high dimensional inputs causes large approximation errors in the output. It is therefore crucial to find low dimensional summaries which are informative about the parameter inference or model choice task at hand. This chapter reviews the methods which have been proposed to select such summaries, extending the previous review paper of Blum et al. (2013) with recent developments. Related theoretical results on the ABC curse of dimensionality and sufficiency are also discussed.
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.
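The property referred to above is usually formalised as follows (a sketch of the standard definition, with M_n denoting the size of the largest cluster in a partition of n data points):

\[
\frac{M_n}{n} \;\xrightarrow{\;p\;}\; 0 \qquad \text{as } n \to \infty,
\]

so every cluster occupies an asymptotically negligible fraction of the data, in contrast with exchangeable models for which the largest cluster grows linearly in n.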
We discuss the use of the determinantal point process (DPP) as a prior for latent structure in biomedical applications, where inference often centers on the interpretation of latent features as biologically or clinically meaningful structure. Typical examples include mixture models, where the terms of the mixture are meant to represent clinically meaningful subpopulations (of patients, genes, etc.); as well as generalizations of latent partition schemes including feature allocation models. We discuss a class of repulsive priors on latent mixture components and propose the DPP as an attractive prior for feature-specific parameters, when the goal is again to interpret those as clinically relevant structure. We illustrate the advantages of DPP priors in three case studies, including inference in mixture models for magnetic resonance images (MRI) and for protein expression, and a feature allocation model for gene expression using data from The Cancer Genome Atlas. An important part of our argument is an efficient and straightforward posterior simulation method. We implement a variation of reversible jump Markov chain Monte Carlo simulation for inference under the DPP prior, using a density with respect to the unit rate Poisson process.
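The repulsion induced by a DPP can be illustrated with the discrete L-ensemble formulation, in which the probability of a configuration is proportional to the determinant of a kernel matrix; near-duplicate points produce nearly collinear kernel columns and hence near-zero probability. The sketch below uses a Gaussian kernel purely for illustration and is not the continuous-state construction (density with respect to the unit rate Poisson process) used in the paper.

import numpy as np

def dpp_log_weight(points, lengthscale=1.0):
    """Log-determinant of a Gaussian (RBF) kernel matrix.

    In an L-ensemble DPP the probability of a finite configuration is
    proportional to det(L), so configurations with near-duplicate points
    receive weight close to zero.
    """
    x = np.asarray(points, dtype=float)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    L = np.exp(-0.5 * d2 / lengthscale**2)
    sign, logdet = np.linalg.slogdet(L)
    return logdet

print(dpp_log_weight([[0.0], [0.05], [2.0]]))  # two near-duplicates -> very low weight
print(dpp_log_weight([[0.0], [1.0], [2.0]]))   # spread out -> higher weight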
A natural Bayesian approach for mixture models with an unknown number of components is to take the usual finite mixture model with Dirichlet weights, and put a prior on the number of components; that is, to use a mixture of finite mixtures (MFM). While inference in MFMs can be done with methods such as reversible jump Markov chain Monte Carlo, it is much more common to use Dirichlet process mixture (DPM) models because of the relative ease and generality with which DPM samplers can be applied. In this paper, we show that, in fact, many of the attractive mathematical properties of DPMs are also exhibited by MFMs: a simple exchangeable partition distribution, restaurant process, random measure representation, and in certain cases, a stick-breaking representation. Consequently, the powerful methods developed for inference in DPMs can be directly applied to MFMs as well. We illustrate with simulated and real data, including high-dimensional gene expression data.
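In standard notation (a sketch of the generic hierarchy; the prior p_K on the number of components, the Dirichlet parameter gamma and the base measure H are all generic), the mixture of finite mixtures can be written as

\[
\begin{aligned}
K &\sim p_K, \qquad (\pi_1,\dots,\pi_K)\mid K \sim \mathrm{Dirichlet}(\gamma,\dots,\gamma),\\
z_i \mid \pi &\sim \mathrm{Categorical}(\pi), \qquad \theta_c \sim H, \qquad y_i \sim f(\,\cdot \mid \theta_{z_i}),
\end{aligned}
\]

and marginalizing over the weights and K yields the exchangeable partition distribution and restaurant-process representation mentioned above.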
We consider the problem of finding a geometrically consistent set of point matches between two images. We assume that local descriptors have provided a set of candidate matches, which may include many outliers. We then seek the largest subset of these correspondences that can be aligned perfectly using a nonrigid deformation that exerts a bounded distortion. We formulate this as a constrained optimization problem and solve it using a constrained, iterative reweighted least-squares algorithm. In each iteration of this algorithm we solve a convex quadratic program obtaining a globally optimal match over a subset of the bounded distortion transformations. We further prove that a sequence of such iterations converges monotonically to a critical point of our objective function. We show experimentally that this algorithm produces excellent results on a number of test sets, in comparison to several state-of-the-art approaches.
Due to the dimension and the dependency structure of genetic data, composite likelihood methods have found their natural place in the statistical methodology involving such data. After a brief description of the type of data one encounters in population genetic studies, we introduce the questions of interest concerning the main genetic parameters in population genetics, and present an up-to-date review on how composite likelihoods have been used to estimate these parameters.
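For readers unfamiliar with the construction, a composite likelihood replaces the full likelihood by a product of low-dimensional likelihood factors; a common pairwise version, written in standard notation as a sketch rather than as any of the specific estimators reviewed above, is

\[
L_C(\theta; y) \;=\; \prod_{i<j} f(y_i, y_j; \theta)^{w_{ij}},
\]

where the weights w_{ij} can be used to down-weight or drop pairs that carry little information, trading statistical efficiency for computational tractability.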
We explore the effect of dimensionality on the “nearest neighbor” problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10–15 dimensions.
These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10–15) dimensionality!
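The concentration effect described above is easy to reproduce empirically; the short simulation below (i.i.d. uniform data used only as an example workload) reports the relative contrast between the farthest and nearest neighbour of a random query point as the dimension grows.

import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n=1000, dim=2):
    """(farthest - nearest) / nearest distance from a random query point."""
    data = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(data - query, axis=1)
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim=dim), 3))
# The contrast typically collapses as the dimension grows, so the nearest
# neighbour is barely closer than the farthest point.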
A Markovian approach to the specification of spatial stochastic interaction for irregularly distributed data points is reviewed. Three specific methods of statistical analysis are proposed; the first two are generally applicable whilst the third relates only to "normally" distributed variables. Some reservations are expressed and the need for practical investigations is emphasized.
Product partition models assume that observations in different components of a random partition of the data are independent given the partition. If the probability distribution of random partitions is in a certain product form prior to making the observations, it is also in product form given the observations. The product model thus provides a convenient machinery for allowing the data to weight the partitions likely to hold; and inference about particular future observations may then be made by first conditioning on the partition, and then averaging over all partitions. This model is applied to fatalities in manned rocket launches, using data from the SOYUZ, APOLLO, SHUTTLE, and post-Challenger SHUTTLE programs in the Soviet Union and the United States. The combination of these data suggests that the chance of a fatality in the next shuttle launch is about .03, after allowing for the possibility that the older programs are of slight relevance to the present shuttle program.
This paper establishes a general formulation for Bayesian model-based clustering, in which subset labels are exchangeable, and items are also exchangeable, possibly up to covariate effects. The notational framework is rich enough to encompass a variety of existing procedures, including some recently discussed methodologies involving stochastic search or hierarchical clustering, but more importantly allows the formulation of clustering procedures that are optimal with respect to a specified loss function. Our focus is on loss functions based on pairwise coincidences, that is, whether pairs of items are clustered into the same subset or not. Optimisation of the posterior expected loss function can be formulated as a binary integer programming problem, which can be readily solved by standard software when clustering a modest number of items, but quickly becomes impractical as problem scale increases. To combat this, a new heuristic item-swapping algorithm is introduced. This performs well in our numerical experiments, on both simulated and real data examples. The paper includes a comparison of the statistical performance of the (approximate) optimal clustering with earlier methods that are model-based but ad hoc in their detailed definition.
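As a concrete instance of a pairwise-coincidence loss, the sketch below computes the posterior expected loss of a candidate clustering from a matrix of pairwise coincidence probabilities; equal costs for the two error types are assumed (a Binder-type loss), and the item-swapping heuristic of the paper is not reproduced.

import numpy as np

def expected_binder_loss(candidate, coincidence):
    """Posterior expected pairwise-coincidence (Binder-type) loss.

    candidate   : integer label vector of length n.
    coincidence : n x n matrix with P[i, j] = Pr(items i and j co-clustered).
    Equal misclassification costs are assumed for the two error types.
    """
    candidate = np.asarray(candidate)
    same = (candidate[:, None] == candidate[None, :]).astype(float)
    P = np.asarray(coincidence, dtype=float)
    iu = np.triu_indices(len(candidate), k=1)   # count each unordered pair once
    return np.sum(same[iu] * (1.0 - P[iu]) + (1.0 - same[iu]) * P[iu])

# Toy coincidence matrix for four items and two candidate clusterings.
P = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
print(expected_binder_loss([0, 0, 1, 1], P))   # low loss: matches the structure
print(expected_binder_loss([0, 1, 0, 1], P))   # higher loss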
The class of species sampling mixture models is introduced as an extension of semiparametric models based on the Dirichlet process to models based on the general class of species sampling priors, or equivalently the class of all exchangeable urn distributions. Using Fubini calculus in conjunction with Pitman (1995, 1996), we derive characterizations of the posterior distribution in terms of a posterior partition distribution that extend the results of Lo (1984) for the Dirichlet process. These results provide a better understanding of models and have both theoretical and practical applications. To facilitate the use of our models we generalize the work in Brunner, Chan, James and Lo (2001) by extending their weighted Chinese restaurant (WCR) Monte Carlo procedure, an i.i.d. sequential importance sampling (SIS) procedure for approximating posterior mean functionals based on the Dirichlet process, to the case of approximation of mean functionals and additionally their posterior laws in species sampling mixture models. We also discuss collapsed Gibbs sampling, Pólya urn Gibbs sampling and a Pólya urn SIS scheme. Our framework allows for numerous applications, including multiplicative counting process models subject to weighted gamma processes, as well as nonparametric and semiparametric hierarchical models based on the Dirichlet process, its two-parameter extension, the Pitman-Yor process and finite dimensional Dirichlet priors.
Estimation of variance and covariance components has importance in various substantive fields such as animal breeding and evolutionary biology among others. The most popular methods of variance components estimation are maximum likelihood (ML), restricted maximum likelihood (REML), analysis of variance and covariance (ANOVA) and minimum quadratic norm (MINQUE). All these methods are computationally intensive. This computational barrier is particularly limiting in data obtained from large animal breeding experiments involving multiple traits. The purpose of this paper is to introduce a new method, which we call maximum composite likelihood (MCL), for the estimation of variance and covariance components. This method is as generally applicable as the method of maximum likelihood: to cases where designs are balanced or unbalanced, involving mixed effects and multiple traits or designs where random effects are correlated to each other. The MCL approach, in contrast to ML/REML or ANOVA, however, does not require inversion of matrices. As a consequence the computational burden is reduced from O(N³) to O(N²), where N denotes the total sample size. Moreover, and in contrast to the ML/REML estimating functions, the estimating functions obtained for MCL, after a minor modification, are shown to possess a unique solution, thus guaranteeing convergence of the numerical optimization routine. Conditions are specified that assure consistency and asymptotic normality of these estimators. These results do not depend on the assumption of a Gaussian distribution of the random effects. A simulation study indicates that there is only a small loss of statistical efficiency in using MCL as compared to REML but a substantial gain in computational efficiency.
This work considers probability models for partitions of a set of n elements using a predictive approach, i.e., models that are specified in terms of the conditional probability of either joining an already existing cluster or forming a new one. The inherent structure can be motivated by resorting to hierarchical models of either parametric or nonparametric nature. Parametric examples include the product partition models (PPMs) and the model-based approach of Dasgupta and Raftery (J. Amer. Statist. Assoc. 93 (1998) 294), while nonparametric alternatives include the Dirichlet process, and more generally, the species sampling models (SSMs). Under exchangeability, PPMs and SSMs induce the same type of partition structure. The methods are discussed in the context of outlier detection in normal linear regression models and of (univariate) density estimation.
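As a concrete instance of the predictive construction described above, the Dirichlet-process case (the Chinese restaurant process, stated here as a sketch in standard notation) allocates item n+1 according to

\[
\Pr(\text{item } n+1 \text{ joins } S_j \mid \rho_n) = \frac{|S_j|}{\alpha + n},
\qquad
\Pr(\text{item } n+1 \text{ starts a new cluster} \mid \rho_n) = \frac{\alpha}{\alpha + n},
\]

where |S_j| is the current size of cluster S_j and alpha > 0 is the concentration parameter; product partition models and species sampling models generalize the form of these predictive weights.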
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
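A minimal sketch of the detect-describe-match pipeline described above, using OpenCV's SIFT implementation with Lowe's ratio test; the image file names are placeholders, and the Hough-transform clustering and pose-verification steps of the full recognition approach are omitted.

import cv2

# Hypothetical file names; any pair of greyscale images will do.
img1 = cv2.imread("coin_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("coin_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                     # available in OpenCV >= 4.4
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative matches survive the ratio test")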
Product partition models assume that observations in different components of a random partition of the data are independent. If the probability distribution of random partitions is in a certain product form prior to making the observations, it is also in product form given the observations. The product model thus provides a convenient machinery for allowing the data to weight the partitions likely to hold; and inference about particular future observations may then be made by first conditioning on the partition and then averaging over all partitions. These models apply with special computational simplicity to change point problems, where the partitions divide the sequence of observations into components within which different regimes hold. We show, with appropriate selection of prior product models, that the observations can eventually determine approximately the true partition.
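In the usual notation (a sketch with generic cohesion c(.) and marginal likelihood m(.)), the product form described above and its conjugacy under observation read

\[
p(\rho = \{S_1,\dots,S_k\}) \;\propto\; \prod_{j=1}^{k} c(S_j),
\qquad
p(\rho \mid y) \;\propto\; \prod_{j=1}^{k} c(S_j)\, m(y_{S_j}),
\]

so conditioning on the data simply rescales each block's cohesion by the marginal likelihood of the observations in that block, preserving the product structure.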
Multidimensional scaling is the problem of representing n objects geometrically by n points, so that the interpoint distances correspond in some sense to experimental dissimilarities between objects. In just what sense distances and dissimilarities should correspond has been left rather vague in most approaches, thus leaving these approaches logically incomplete. Our fundamental hypothesis is that dissimilarities and distances are monotonically related. We define a quantitative, intuitively satisfying measure of goodness of fit to this hypothesis. Our technique of multidimensional scaling is to compute that configuration of points which optimizes the goodness of fit. A practical computer program for doing the calculations is described in a companion paper.
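A brief sketch of nonmetric multidimensional scaling in the spirit of the monotone-relationship hypothesis above, using scikit-learn on a toy precomputed dissimilarity matrix (the random data and all settings are illustrative assumptions):

import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
objects = rng.random((30, 8))                 # toy "objects" in 8 dimensions
D = squareform(pdist(objects))                # stand-in for experimental dissimilarities

# Nonmetric MDS: only the rank order of the dissimilarities is used,
# matching the monotone-relationship hypothesis described above.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords.shape, round(mds.stress_, 4))    # 2-D configuration and its stress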
Natural populations of living organisms often have complex histories consisting of phases of expansion and decline, and the migratory patterns within them may fluctuate over space and time. When parts of a population become relatively isolated, e.g., due to geographical barriers, stochastic forces reshape certain DNA characteristics of the individuals over generations such that they reflect the restricted migration and mating/reproduction patterns. Such populations are typically termed as genetically structured and they may be statistically represented in terms of several clusters between which DNA variations differ clearly from each other. When detailed knowledge of the ancestry of a natural population is lacking, the DNA characteristics of a sample of current generation individuals often provide a wealth of information in this respect. Several statistical approaches to model-based clustering of such data have been introduced, and in particular, the Bayesian approach to modeling the genetic structure of a population has attained a vivid interest among biologists. However, the possibility of utilizing spatial information from sampled individuals in the inference about genetic clusters has been incorporated into such analyses only very recently. While the standard Bayesian hierarchical modeling techniques through Markov chain Monte Carlo simulation provide flexible means for describing even subtle patterns in data, they may also result in computationally challenging procedures in practical data analysis. Here we develop a method for modeling the spatial genetic structure using a combination of analytical and stochastic methods. We achieve this by extending a novel theory of Bayesian predictive classification with the spatial information available, described here in terms of a colored Voronoi tessellation over the sample domain. Our results for real and simulated data sets illustrate well the benefits of incorporating spatial information to such an analysis.
This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from one clustering to another. The basic properties of VI are presented and discussed. We focus on two kinds of properties: (1) those that help one build intuition about the new criterion (in particular, it is shown that VI is a true metric on the space of clusterings), and (2) those that pertain to the comparability of VI values over different experimental conditions. As the latter properties have rarely been discussed explicitly before, other existing comparison criteria are also examined in their light. Finally we present the VI from an axiomatic point of view, showing that it is the only "sensible" criterion for comparing partitions that is both aligned to the lattice and convexly additive. As a consequence, we prove an impossibility result for comparing partitions: there is no criterion for comparing partitions that simultaneously satisfies the above two desirable properties and is bounded.
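The criterion can be computed directly from two label vectors as VI = H(C) + H(C') - 2 I(C, C'); a minimal sketch (toy labels, natural logarithms):

import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    """Entropy (in nats) of a clustering given as a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def variation_of_information(a, b):
    """VI(a, b) = H(a) + H(b) - 2 I(a, b); zero iff the clusterings coincide."""
    return entropy(a) + entropy(b) - 2.0 * mutual_info_score(a, b)

a = [0, 0, 0, 1, 1, 2]
b = [0, 0, 1, 1, 1, 2]
print(round(variation_of_information(a, b), 4))
print(round(variation_of_information(a, a), 10))   # 0 for identical clusterings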
This paper presents a Bayesian nonlinear approach for the analysis of spatial count data. It extends the Bayesian partition methodology of Holmes, Denison, and Mallick (1999, Bayesian partitioning for classification and regression, Technical Report, Imperial College, London) to handle data that involve counts. A demonstration involving incidence rates of leukemia in New York state is used to highlight the methodology. The model allows us to make probability statements on the incidence rates around point sources without making any parametric assumptions about the nature of the influence between the sources and the surrounding location.
We propose a new method for approximate Bayesian statistical inference on the basis of summary statistics. The method is suited to complex problems that arise in population genetics, extending ideas developed in this setting by earlier authors. Properties of the posterior distribution of a parameter, such as its mean or density curve, are approximated without explicit likelihood calculations. This is achieved by fitting a local-linear regression of simulated parameter values on simulated summary statistics, and then substituting the observed summary statistics into the regression equation. The method combines many of the advantages of Bayesian statistical inference with the computational efficiency of methods based on summary statistics. A key advantage of the method is that the nuisance parameters are automatically integrated out in the simulation step, so that the large numbers of nuisance parameters that arise in population genetics problems can be handled without difficulty. Simulation results indicate computational and statistical efficiency that compares favorably with those of alternative methods previously proposed in the literature. We also compare the relative efficiency of inferences obtained using methods based on summary statistics with those obtained directly from the data using MCMC.
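A deliberately simple sketch of the rejection-plus-regression-adjustment idea on a toy normal-mean problem (the weighting kernel of the full method is omitted, and none of the population-genetics machinery of the paper is reproduced):

import numpy as np

rng = np.random.default_rng(1)

# Toy problem (illustration only): infer a normal mean, using the sample mean
# and standard deviation as summary statistics.
y_obs = rng.normal(2.0, 1.0, size=50)
s_obs = np.array([y_obs.mean(), y_obs.std()])

n_sim, keep = 50_000, 500
theta = rng.uniform(-5, 5, n_sim)                      # prior draws
sims = rng.normal(theta[:, None], 1.0, (n_sim, 50))
s = np.column_stack([sims.mean(axis=1), sims.std(axis=1)])

# Rejection step: keep the simulations whose summaries are closest to s_obs.
dist = np.linalg.norm(s - s_obs, axis=1)
idx = np.argsort(dist)[:keep]

# Local-linear regression adjustment: regress the accepted parameters on their
# summaries and project onto the observed summaries.
X = np.column_stack([np.ones(keep), s[idx] - s_obs])
beta, *_ = np.linalg.lstsq(X, theta[idx], rcond=None)
theta_adj = theta[idx] - (s[idx] - s_obs) @ beta[1:]

print(round(theta[idx].mean(), 3), round(theta_adj.mean(), 3))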
Many stochastic simulation approaches for generating observations from a posterior distribution depend on knowing a likelihood function. However, for many complex probability models, such likelihoods are either impossible or computationally prohibitive to obtain. Here we present a Markov chain Monte Carlo method for generating observations from a posterior distribution without the use of likelihoods. It can also be used in frequentist applications, in particular for maximum-likelihood estimation. The approach is illustrated by an example of ancestral inference in population genetics. A number of open problems are highlighted in the discussion.
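A minimal sketch of likelihood-free MCMC on a toy normal-mean problem with a flat prior and symmetric random-walk proposal, so the acceptance step reduces to checking that the simulated summary falls within a tolerance of the observed one (all settings are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)

# Toy target (illustration only): normal mean with a flat prior on [-5, 5].
y_obs = rng.normal(1.0, 1.0, size=40)
s_obs = y_obs.mean()
eps, n_iter, step = 0.1, 20_000, 0.5

def simulate_summary(theta):
    return rng.normal(theta, 1.0, size=40).mean()

chain, theta = [], 0.0
for _ in range(n_iter):
    prop = theta + step * rng.standard_normal()
    if -5 <= prop <= 5 and abs(simulate_summary(prop) - s_obs) <= eps:
        # Flat prior and symmetric proposal, so the Metropolis-Hastings
        # ratio reduces to 1 once the tolerance condition is met.
        theta = prop
    chain.append(theta)

print(round(np.mean(chain[5_000:]), 3))  # posterior mean estimate after burn-in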