Conference Paper

Rates of convergence for the cluster tree

Conference: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada.
Source: DBLP


For a density f on R^d, a high-density cluster is any connected component of {x : f(x) ≥ λ}, for some λ > 0. The set of all high-density clusters forms a hierarchy called the cluster tree of f. We present a procedure for estimating the cluster tree given samples from f. We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem.
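The definition above can be made concrete with a small sketch. The code below is only an illustration of the definition, not the estimation procedure from the paper: it assumes a known one-dimensional density (a hypothetical two-component Gaussian mixture chosen for this example) and approximates the connected components of {x : f(x) ≥ λ} on a grid. The names `density` and `high_density_clusters` and the grid parameters are assumptions introduced here.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def density(x):
    # Hypothetical example density (an assumption for illustration):
    # an equal-weight mixture of two Gaussians centred at -2 and +2.
    return 0.5 * gaussian_pdf(x, -2.0, 0.5) + 0.5 * gaussian_pdf(x, 2.0, 0.5)


def high_density_clusters(f, lam, lo, hi, n_grid=2001):
    """Approximate the connected components of {x : f(x) >= lam} on [lo, hi].

    In one dimension each component is an interval, so a component is
    reported as a (left, right) pair of grid points. Resolution is limited
    by the grid spacing.
    """
    step = (hi - lo) / (n_grid - 1)
    clusters = []
    start = None  # left end of the run currently being scanned, if any
    prev = None   # last grid point seen inside the current run
    for i in range(n_grid):
        x = lo + i * step
        if f(x) >= lam:
            if start is None:
                start = x
            prev = x
        elif start is not None:
            clusters.append((start, prev))
            start = None
    if start is not None:
        clusters.append((start, prev))
    return clusters


if __name__ == "__main__":
    # Scanning several levels shows the hierarchy: one cluster at a low
    # level, which splits into two nested clusters at a higher level.
    for lam in (1e-4, 0.1):
        print(lam, high_density_clusters(density, lam, -5.0, 5.0))
```

Because raising λ can only shrink the level set, every cluster found at the higher level lies inside a cluster found at the lower level, which is the nesting that makes the collection a tree.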

  • Source
    • "Note that the persistence of the cluster structure over a small range of levels ρ ∈ (ρ*, ρ**] is assumed either explicitly or implicitly in basically all density-based clustering approaches that deal with several levels ρ, see e.g. [5] [17]."
    ABSTRACT: The clusters of a distribution are often defined by the connected components of a density level set. However, this definition depends on the user-specified level. We address this issue by proposing a simple, generic algorithm, which uses an almost arbitrary level set estimator to estimate the smallest level at which there is more than one connected component. In the case where this algorithm is fed with histogram-based level set estimates, we provide a finite sample analysis, which is then used to show that the algorithm consistently estimates both the smallest level and the corresponding connected components. We further establish rates of convergence for the two estimation problems, and last but not least, we present a simple, yet adaptive strategy for determining the width parameter of the involved density estimator in a data-dependent way.
    Full-text · Article · Sep 2014 · The Annals of Statistics
  • Source
    • "The present results are based in part on earlier conference versions, namely Chaudhuri and Dasgupta (2010) and Kpotufe and von Luxburg (2011). The result of Chaudhuri and Dasgupta (2010) analyzes the consistency of the first cluster tree estimator (see next section) but provides no pruning method for the estimator. "
    ABSTRACT: For a density $f$ on ${\mathbb R}^d$, a {\it high-density cluster} is any connected component of $\{x: f(x) \geq \lambda\}$, for some $\lambda > 0$. The set of all high-density clusters forms a hierarchy called the {\it cluster tree} of $f$. We present two procedures for estimating the cluster tree given samples from $f$. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the $k$-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that guarantees, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.
    Preview · Article · Jun 2014 · IEEE Transactions on Information Theory
  • Source
    • "For these procedures, the relevant density levels are the edge weights of G. Frequently, iteration over these levels is done by initializing G with an empty edge set and adding successively more heavily weighted edges, in the manner of traditional single linkage clustering. In this family, the Chaudhuri and Dasgupta algorithm (which is a generalization of Wishart (1969)) is particularly interesting because the authors prove finite sample rates for convergence to the true level set tree (Chaudhuri and Dasgupta 2010). To the best of our knowledge, however, only Stuetzle and Nugent (2010) has a publicly available implementation, in the R package gslclust. "
    ABSTRACT: The level set tree approach of Hartigan (1975) provides a probabilistically based and highly interpretable encoding of the clustering behavior of a dataset. By representing the hierarchy of data modes as a dendrogram of the level sets of a density estimator, this approach offers many advantages for exploratory analysis and clustering, especially for complex and high-dimensional data. Several R packages exist for level set tree estimation, but their practical usefulness is limited by computational inefficiency, absence of interactive graphical capabilities and, from a theoretical perspective, reliance on asymptotic approximations. To make it easier for practitioners to capture the advantages of level set trees, we have written the Python package DeBaCl for DEnsity-BAsed CLustering. In this article we illustrate how DeBaCl's level set tree estimates can be used for difficult clustering tasks and interactive graphical data analysis. The package is intended to promote the practical use of level set trees through improvements in computational efficiency and a high degree of user customization. In addition, the flexible algorithms implemented in DeBaCl enjoy finite sample accuracy, as demonstrated in recent literature on density clustering. Finally, we show the level set tree framework can be easily extended to deal with functional data.
    Preview · Article · Jul 2013