Conference Proceeding

# Rates of convergence for the cluster tree

01/2010; In: Proceedings of Advances in Neural Information Processing Systems 23 (24th Annual Conference on Neural Information Processing Systems, NIPS 2010), held 6-9 December 2010, Vancouver, British Columbia, Canada.
Source: DBLP
##### Article: Consistency of Single Linkage for High-Density Clusters
ABSTRACT: High-density clusters are defined on a population with density f in r dimensions to be the maximal connected sets of the form {x | f(x) ≥ c}. Single-linkage clustering is evaluated for consistency in detecting such high-density clusters; other standard hierarchical techniques, such as average and complete linkage, are hopelessly inconsistent for these clusters. The asymptotic consistency of single linkage depends closely on the percolation problem of Broadbent and Hammersley: if small spheres are removed at random from a solid, at what density of spheres will water begin to flow through the solid? If there is a single critical density such that no flow takes place below it, and flow occurs through a single connected set above it, then single linkage is consistent in separating high-density clusters (by disjoint single-linkage clusters that include a positive fraction of sample points in the respective clusters and pass arbitrarily close to all points in the respective clusters). The existence of a single critical point remains a conjecture. A weaker result is proved, showing that single-linkage clusters detect high-density clusters if a sufficiently low valley separates them.
Journal of the American Statistical Association, 01/1981; 76(374):388-394.
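To make the abstract's construction concrete, here is a minimal sketch (not the paper's analysis) of single linkage at a fixed scale r: a union-find links every pair of sample points within distance r, and the resulting connected components play the role of empirical stand-ins for the high-density clusters {x | f(x) ≥ c}. The function name and toy data are illustrative assumptions, not from the source.

```python
import math

def single_linkage_components(points, r):
    """Connected components of the graph joining points within distance r.

    Illustrative sketch: union-find over all pairs at distance <= r.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                union(i, j)

    comps = {}
    for i in range(len(points)):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# Two well-separated blobs: at a small linkage scale they come out
# as two separate components.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(sorted(map(sorted, single_linkage_components(pts, 0.5))))
# → [[0, 1, 2], [3, 4]]
```

Varying r sweeps out the single-linkage hierarchy: as r grows, components merge, which mirrors lowering the level c in {x | f(x) ≥ c}.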
##### Article: Optimal construction of k-nearest neighbor graphs for identifying noisy clusters
ABSTRACT: We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (rather of the order n than of the order log n) to maximize the probability of cluster identification. Secondly, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only. Comment: 31 pages, 2 figures
12/2009;
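The two graph types the abstract compares can be sketched as follows (an illustrative construction, not the paper's code): in the symmetric k-NN graph, i and j are joined if either is among the other's k nearest neighbors; in the mutual k-NN graph, both must be. All names and the toy data are assumptions.

```python
import math

def knn_sets(points, k):
    """For each point, the index set of its k nearest neighbors."""
    sets = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        sets.append(set(order[:k]))
    return sets

def knn_edges(points, k, mutual):
    """Edge set of the mutual (AND) or symmetric (OR) k-NN graph."""
    nn = knn_sets(points, k)
    edges = set()
    for i in range(len(points)):
        for j in nn[i]:
            if not mutual or i in nn[j]:
                edges.add((min(i, j), max(i, j)))
    return edges

# An outlier at x=10 still picks its neighbor inside the chain, so the
# symmetric graph attaches it, while the mutual graph keeps only
# reciprocated links.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(len(knn_edges(pts, 1, mutual=False)),
      len(knn_edges(pts, 1, mutual=True)))
```

The asymmetry shown here is the mechanism behind the paper's observation that the mutual graph behaves differently when isolating the most significant cluster.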
##### Article: Adaptive Hausdorff estimation of density level sets
ABSTRACT: Consider the problem of estimating the $\gamma$-level set $G^*_{\gamma}=\{x:f(x)\geq\gamma\}$ of an unknown $d$-dimensional density function $f$ based on $n$ independent observations $X_1,...,X_n$ from the density. This problem has been addressed under global error criteria related to the symmetric set difference. However, in certain applications a spatially uniform mode of convergence is desirable to ensure that the estimated set is close to the target set everywhere. The Hausdorff error criterion provides this degree of uniformity and, hence, is more appropriate in such situations. It is known that the minimax optimal rate of error convergence for the Hausdorff metric is $(n/\log n)^{-1/(d+2\alpha)}$ for level sets with boundaries that have a Lipschitz functional form, where the parameter $\alpha$ characterizes the regularity of the density around the level of interest. However, the estimators proposed in previous work are nonadaptive to the density regularity and require knowledge of the parameter $\alpha$. Furthermore, previously developed estimators achieve the minimax optimal rate for rather restricted classes of sets (e.g., the boundary fragment and star-shaped sets) that effectively reduce the set estimation problem to a function estimation problem. This characterization precludes level sets with multiple connected components, which are fundamental to many applications. This paper presents a fully data-driven procedure that is adaptive to unknown regularity conditions and achieves near minimax optimal Hausdorff error control for a class of density level sets with very general shapes and multiple connected components. Comment: Published at http://dx.doi.org/10.1214/08-AOS661 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
The Annals of Statistics, 08/2009.
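For intuition about the level-set problem itself, here is a generic plug-in sketch (not the paper's adaptive procedure): estimate f with a histogram and keep the bins whose estimated density clears the level gamma. The bin width, interval, and data are illustrative assumptions.

```python
def plugin_level_set(sample, gamma, bin_width, lo, hi):
    """Histogram plug-in estimate of {x : f(x) >= gamma} on [lo, hi).

    Returns the list of bins (as intervals) whose density estimate
    count / (n * bin_width) reaches the level gamma.
    """
    nbins = int((hi - lo) / bin_width)
    counts = [0] * nbins
    for x in sample:
        b = min(int((x - lo) / bin_width), nbins - 1)
        counts[b] += 1
    n = len(sample)
    return [(lo + b * bin_width, lo + (b + 1) * bin_width)
            for b in range(nbins)
            if counts[b] / (n * bin_width) >= gamma]

# Dense mass on [0, 1) plus two stragglers: at gamma = 0.5 only the
# dense region survives.
sample = [i / 100 for i in range(100)] + [3.0, 4.0]
print(plugin_level_set(sample, gamma=0.5, bin_width=1.0, lo=0.0, hi=5.0))
# → [(0.0, 1.0)]
```

Because the estimator here uses a fixed bin width, its accuracy depends on the unknown regularity of f near the level gamma; the paper's contribution is precisely a procedure that adapts to that regularity without knowing $\alpha$.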