Article

From GEM back to Dirichlet via Hoppe's Urn


Abstract

Generalising the beta law obtained under the GEM/Poisson–Dirichlet distribution in Hirth [12], we undertake here an analogous construction which results in the Dirichlet law. Our proof makes use of Hoppe's Pólya-like urn model in population genetics.


... , µ(k) and $A_i = \{i\}$ for $1 \leq i \leq k$, the property that $D(A_1), \ldots, D(A_k)$ is given by a Dirichlet distribution was first stated in a population genetics context in [12]; see also [26]. ...
... Specifically, as noted in the introduction, when X is finite we have that ν is a Dirichlet distribution with parameters $\{\theta\mu_j\}_{j \in X}$ (cf. [12], [26]). ...
Preprint
We consider the connections among `clumped' residual allocation models (RAMs), a general class of stick-breaking processes including Dirichlet processes, and the occupation laws of certain discrete space time-inhomogeneous Markov chains related to simulated annealing and other applications. An intermediate structure is introduced in a given RAM, where proportions between successive indices in a list are added or clumped together to form another RAM. In particular, when the initial RAM is a Griffiths-Engen-McCloskey (GEM) sequence and the indices are given by the random times that an auxiliary Markov chain jumps away from its current state, the joint law of the intermediate RAM and the locations visited in the sojourns is given in terms of a `disordered' GEM sequence, and an induced Markov chain. Through this joint law, we identify a large class of `stick breaking' processes as the limits of empirical occupation measures for associated time-inhomogeneous Markov chains.
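The stick-breaking and clumping steps described in this abstract can be illustrated with a minimal sketch. It assumes the standard residual allocation form of GEM(θ), in which the k-th proportion is V_k(1 − V_1)⋯(1 − V_{k−1}) with V_i independent Beta(1, θ) variables, truncated after a finite number of sticks. The function names `gem_sticks` and `clump` and the fixed `cuts` list are hypothetical; in the abstract the clumping indices are random jump times of an auxiliary Markov chain, which this sketch does not model.

```python
import random


def gem_sticks(theta, k, seed=None):
    """Return the first k GEM(theta) proportions and the unbroken remainder."""
    rng = random.Random(seed)
    remaining = 1.0  # length of stick not yet allocated
    sticks = []
    for _ in range(k):
        v = rng.betavariate(1.0, theta)  # residual fraction V_i ~ Beta(1, theta)
        sticks.append(remaining * v)
        remaining *= 1.0 - v
    return sticks, remaining


def clump(sticks, cuts):
    """Add up proportions over consecutive index blocks [0, c1), [c1, c2), ...

    `cuts` is an increasing list of indices (an illustrative assumption).
    """
    bounds = [0] + list(cuts) + [len(sticks)]
    return [sum(sticks[a:b]) for a, b in zip(bounds, bounds[1:])]


sticks, rest = gem_sticks(theta=2.0, k=20, seed=1)
clumped = clump(sticks, cuts=[3, 7])
```

Clumping preserves total mass, so `sum(clumped)` equals `sum(sticks)` up to floating-point error, and `sum(sticks) + rest` equals 1.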
... solves the recurrence. This shows that $\chi_{\mu_1} := \sum_{k \geq 1} \mu_k \xi(k)$ has a beta(pθ, (1 − p)θ) distribution in this case (see [23]). ...
Article
Suppose some random resource (energy, mass or space) $\chi \geq 0$ is to be shared at random between (possibly infinitely many) species (atoms or fragments). Assume $\mathbb{E}\chi = \theta < \infty$ and suppose the amount of each individual share is necessarily bounded from above by 1. This random partitioning model can naturally be identified with the study of infinitely divisible random variables with Lévy measure concentrated on the unit interval. Special emphasis is put on these special partitioning models in the Poisson-Kingman class. The masses attached to the atoms of such partitions are sorted in decreasing order. Considering nearest-neighbor spacings yields a partition of unity which also deserves special interest. For such partition models, various statistical questions are addressed, among which: correlation structure, cumulative energy of the first $K$ largest items, partition function, threshold and covering statistics, weighted partition, and Rényi's, typical and size-biased fragment sizes. Several physical images are supplied. When the unbounded Lévy measure of $\chi$ is $\theta x^{-1}\,\mathbf{I}(x \in (0,1))\,dx$, the spacings partition has the Griffiths-Engen-McCloskey or GEM$(\theta)$ distribution and $\chi$ follows the Dickman distribution. The induced partition models have many remarkable peculiarities which are outlined. The case with finitely many (Poisson) fragments in the partition law is also briefly addressed. Here, the Lévy measure is bounded.
Article
We consider the generalization of the Pólya urn scheme with possibly infinitely many colors, as introduced in [37], [4], [5], and [6]. For countably many colors, we prove almost sure convergence of the urn configuration under the uniform ergodicity assumption on the associated Markov chain. The proof uses a stochastic coupling of the sequence of chosen colors with a branching Markov chain on a weighted random recursive tree as described in [6], [31], and [26]. Using this coupling we estimate the covariance between any two selected colors. In particular, we re-prove the limit theorem for the classical urn models with finitely many colors.
Article
The prime factorization of a random integer has a GEM/Poisson-Dirichlet distribution, as transparently proved by Donnelly and Grimmett [8]. By similarity to the arc-sine law for the mean distribution of the divisors of a random integer, the DDT theorem due to Deshouillers, Dress and Tenenbaum [6] (see also Tenenbaum [24, II.6.2, p. 233]), we obtain an arc-sine law in the GEM/Poisson-Dirichlet context. In this context we also investigate the distribution of the number of components larger than $\varepsilon$, which corresponds to the number of prime factors larger than $n^{\varepsilon}$.
Article
The Ewens sampling formula in population genetics can be viewed as a probability measure on the group of permutations of a finite set of integers. Functional limit theory for processes defined through partial sums of dependent variables with respect to the Ewens sampling formula is developed. Techniques from probabilistic number theory are used to establish necessary and sufficient conditions for weak convergence of the associated dependent process to a process with independent increments. Not many results on the necessity part are known in the literature.
Article
A Markov process of partitions of the natural numbers is constructed by defining a Pólya-like urn model. The marginal distributions of this process are the Ewens' sampling distributions of population genetics. Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/46944/1/285_2004_Article_BF00275863.pdf
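The urn named in this abstract can be simulated in a few lines. The sketch below assumes the usual description of Hoppe's urn: one black ball of mass θ and coloured balls of mass 1; drawing the black ball adds a ball of a new colour, and drawing a coloured ball adds another ball of the same colour. The function name `hoppe_urn` and its parameters are illustrative, not taken from the paper.

```python
import random


def hoppe_urn(n, theta, seed=None):
    """Run n draws of a Hoppe-style urn.

    Returns the number of balls of each colour, listed in order of
    first appearance; the list is a random partition of n.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[j] = number of balls of the j-th colour to appear
    for drawn in range(n):
        # Total mass is theta (black ball) plus one per coloured ball.
        if rng.random() * (theta + drawn) < theta:
            sizes.append(1)  # black ball drawn: introduce a new colour
        else:
            # Coloured ball drawn: its colour is chosen proportionally to size.
            j = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[j] += 1
    return sizes


block_sizes = hoppe_urn(n=50, theta=1.5, seed=0)
```

The first draw always creates a new colour, since only the black ball is present, and the block sizes always sum to `n`.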
Article
The behaviour of a Pólya-like urn which generates Ewens' sampling formula in population genetics is investigated. Connections are made with work of Watterson and Kingman and to the Poisson-Dirichlet distribution. The order in which novel types occur in the urn is shown to parallel the age distribution of the infinitely many alleles diffusion model and consequences of this property are explored. Finally the urn process is related to Kingman's coalescent with mutation to provide a rigorous basis for this parallel.
Article
A random mapping partitions the set {1, 2, ···, m} into components, where i and j are in the same component if some functional iterate of i equals some functional iterate of j. We consider various functionals of these partitions and of samples from them, including the number of components of ‘small’ size and of size O(m) as m → ∞, the size of the largest component, the number of components, and various symmetric functionals of the normalized component sizes. In many cases exact results, while available, are uninformative, and we consider various approximations. Numerical and simulation results are also presented. A central tool for many calculations is the ‘frequency spectrum’, both exact and asymptotic.
Article
The heaps process (also known as a Tsetlin library) provides a model for a self-regulating filing system. Items are requested from time to time according to their popularity and returned to the top of the heap after use. The size-biased permutation of a collection of popularities is a particular random permutation of those popularities, which arises naturally in a number of applications and is of independent interest. For a slightly non-standard formulation of the heaps process we prove that it converges to the size-biased permutation of its initial distribution. This leads to a number of new characterizations of the property of invariance under size-biased permutation, notably what might be described as invariance under ‘partial size-biasing’ of any order. Finally, we consider in detail the heaps process with Poisson–Dirichlet initial distribution, exhibiting the tractable nature of its equilibrium distribution and explicitly calculating a number of quantities of interest.
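The size-biased permutation mentioned in this abstract can be sketched for a finite list of popularities. The sketch assumes the standard definition: items are drawn one at a time without replacement, each with probability proportional to its weight among the items still remaining. The function name `size_biased_permutation` is hypothetical, and all weights are assumed to be strictly positive.

```python
import random


def size_biased_permutation(weights, seed=None):
    """Return the weights reordered by size-biased sampling without replacement."""
    rng = random.Random(seed)
    pool = list(weights)  # items not yet placed
    out = []
    while pool:
        # Pick an index with probability proportional to its remaining weight.
        j = rng.choices(range(len(pool)), weights=pool)[0]
        out.append(pool.pop(j))
    return out


popularities = [0.5, 0.3, 0.15, 0.05]
reordered = size_biased_permutation(popularities, seed=7)
```

The output is always a rearrangement of the input, with larger popularities tending to appear earlier.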
Article
Concepts of independence for nonnegative continuous random variables, X1,⋯, Xk, subject to the constraint Σ Xi = 1 are developed. These concepts provide a means of modeling random vectors of proportions which is useful in analyzing certain kinds of data; and which may be of interest in quantifying prior opinions about multinomial parameters. A generalization of the Dirichlet distribution is given, and its relation to the Dirichlet is simply indicated by means of the concepts. The concepts are used to obtain conclusions of biological interest for data on bone composition in rats and scute growth in turtles.
Article
We consider a certain randomized geometric series of relative abundances and the limit of the Dirichlet distribution used in the derivation of the logarithmic series distribution. When an individual is chosen at random from either of the populations, the abundance of the species it represents is shown to have the same distribution. The maximum likelihood estimate for the geometric series model with fixed abundances is evaluated, and a bias correction is proposed.
Article
We calculate posterior distributions associated with a version of the Poisson-Dirichlet distribution called the GEM. The GEM has been shown (by several authors) to be the limiting stationary distribution for allele frequencies listed in age order associated with the neutral infinite alleles model. In view of this result, we use our posterior distributions to calculate Bayes estimators for the frequency of the oldest allele given a sample.
Article
It is impossible to choose at random a probability distribution on a countably infinite set in a manner invariant under permutations of that set. However, approximations to such a choice can be made by considering exchangeable probability measures on the class of probability distributions over a finite set, and letting the size of that set increase without limit. Under suitable conditions the resulting probabilities, when arranged in descending order, have non‐degenerate limiting distributions. These apparently arcane considerations lead to rather concrete conclusions in certain problems in applied probability.
Article
The random splitting model of Lloyd and Williams (1988) is generalised, to allow first beta splitting distributions and secondly, discrete splitting distributions. Analogues of some earlier results are developed. In particular, the equality of the average value of the largest split and the probability that this is achieved at the first split continues to hold.
Article
A characteristic property of the Ewens sampling formula is shown to follow from the invariance under size-biased sampling of the Poisson-Dirichlet distribution.
Article
A random integer N, drawn uniformly from the set {1, 2, …, n}, has a prime factorization of the form $N = \alpha_1 \alpha_2 \cdots \alpha_M$ where $\alpha_1 \geq \alpha_2 \geq \cdots \geq \alpha_M$. We establish the asymptotic distribution, as n → ∞, of the vector $A(n) = (\log \alpha_i / \log N : i \geq 1)$ in a transparent manner. By randomly re-ordering the components of A(n), in a size-biased manner, we obtain a new vector B(n) whose asymptotic distribution is the GEM distribution with parameter 1; this is a distribution on the infinite-dimensional simplex of vectors $(x_1, x_2, \ldots)$ having non-negative components with unit sum. Using a standard continuity argument, this entails the weak convergence of A(n) to the corresponding Poisson–Dirichlet distribution on this simplex; this result was obtained by Billingsley [3].
Article
An expression is found for the stationary density of the allele frequencies, in the infinitely-many alleles model. It is assumed that all alleles are neutral, that there is a constant mutation rate, and that the population is sufficiently large for the diffusion model to apply. Bounds on some moments are calculated from the density, and some applications are made to the problem of which allele is oldest in a population. A postulate made by Ewens (1972), concerning the distribution of allele numbers in a finite random sample from the neutral diffusion population, is shown to be correct.
Article
Classical population genetics theory was largely directed towards processes relating to the future. Present theory, by contrast, focuses on the past, and in particular is motivated by the desire to make inferences about the evolutionary processes which have led to the presently observed patterns and nature of genetic variation. There are many connections between the classical prospective theory and the new retrospective theory. However, the retrospective theory introduces ideas not appearing in the classical theory, particularly those concerning the ancestry of the genes in a sample or in the entire population. It also introduces two important new distributions into the scientific literature, namely the Poisson-Dirichlet and the GEM: these are important not only in population genetics, but also in a very wide range in science and mathematics. Some of these are discussed. Population genetics theory has been greatly enriched by the introduction of many new concepts relating to the past evolution of biological populations.
Article
Let $X_1, X_2, \cdots, X_k$ be positive random variables such that $\sum^k_{i = 1} X_i < 1$. It is shown, under the assumption of continuous pdf's, that if $X_i / (1 - \sum_{j \neq i} X_j)$ is independent of the set $\{X_j;\ j \neq i\}$ for every $i = 1, 2, \cdots, k$, then $X_1, X_2, \cdots, X_k$ have a Dirichlet distribution, namely $f(x_1, x_2, \cdots, x_k) = \Gamma\big\lbrack\sum^{k + 1}_{i = 1} \alpha_i\big\rbrack\big\lbrack\prod^k_{i = 1} \frac{x_i^{\alpha_i - 1}}{\Gamma(\alpha_i)}\big\rbrack(1 - \sum^k_{i = 1}x_i)^{\alpha_{k + 1}},$ $\alpha_i$ positive, $x_i \geq 0$, $\sum^k_{i = 1} x_i < 1$.
Article
The Silences of the Archives, the Renown of the Story. The Martin Guerre affair has been told many times since Jean de Coras and Guillaume Lesueur published their stories in 1561. It is in many ways a perfect intrigue, with uncanny resemblance, persuasive deception and a surprising end when the two Martins stood face to face, memory to memory, before captivated judges and a guilty-feeling Bertrande de Rols. The historian wanted to go beyond the known story in order to discover the world of the heroes. This research led to disappointments and surprises as documents were discovered concerning the environment of Artigat’s inhabitants and, thanks to notarial contracts, bearing directly on the main characters. Along the way, study of the works of Coras and Lesueur took a new direction. Coming back to the affair a quarter century later did not result in finding new documents (some are perhaps still buried in Spanish archives), but by going back over her tracks, the historian could only be struck by the silences of the archives that refuse to reveal their secrets and, at the same time, by the possible openings they suggest, by the intuition that almost invisible threads link characters and events here and there.
Article
Ranked and size-biased permutations are particular functions on the set of probability measures on the simplex. They represent two recently studied schemes for relabelling groups in certain stochastic models, and are of particular interest in describing the limiting behaviour of such models. We prove that the ranked permutations of a sequence of measures converge if and only if the size-biased permutations converge, and give conditions under which weak convergence of measures guarantees weak convergence of both permutations. Applications include a proof of the fact that the GEM distribution is the size biased permutation of the Poisson-Dirichlet and a new proof of the fact that when labelled in a particular way, normalized cycle lengths in a random permutation converge to the GEM distribution. These techniques also allow some problems concerned with the random splitting of an interval to be related to known results in other fields.
Article
The probability that the kth largest prime factor of a number n is at most $n^x$ is shown to approach a limit $F_k(x)$ as n → ∞. Several interesting properties of $F_k(x)$ are explored, and numerical tables are given. These results are applied to the analysis of an algorithm commonly used to find all prime factors of a given number. The average number of digits in the kth largest prime factor of a random m-digit number is shown to be asymptotically equivalent to the average length of the kth longest cycle in a permutation on m objects.
Article
Without using the prime number theorem, we obtain the asymptotics of the $r$th largest prime divisor of a harmonically distributed random positive integer $N$; harmonic asymptotics are obtained from asymptotics of the zeta distribution via Tauberian methods. (Knuth and Trabb-Pardo need a strong form of the prime number theorem to obtain the distributions when $N$ is uniformly distributed.) A trick brings in Poisson variates, and then we can use the methods developed for the fractional length of the $r$th longest cycle in a random permutation.
Article
Models for selectively neutral mutation, in which mutation always yields a new allele, seem always to lead, in the limit of large population size, to a sampling formula first propounded by Ewens in 1972. It is shown that the asymptotic validity of the Ewens formula is equivalent to a certain limiting joint distribution for the allele proportions in the population, arranged in descending order. The familiar diffusion approximations are corollaries of this limiting distribution, and therefore share the apparent robustness of the sampling formula.
Article
A process analogous to Kingman's coalescent is introduced to describe the genealogy of populations evolving according to the infinitely-many neutral alleles model. The process records population frequencies in old and new classes, and labels the new classes in order of decreasing age. Its marginal distribution is characterized in a form which is amenable to explicit calculations and the transition densities of the associated K-allele models follow readily from this representation.
Exchangeability and related topics. In: École d'Été de Probabilités de Saint-Flour XIII, Lecture Notes in Mathematics 1117, pp. 1-198
  • D Aldous
Anwendungen eines Polya-Urnenmodells in der Populationsgenetik
  • G Notter