Rates of convergence for the cluster tree
Kamalika Chaudhuri
UC San Diego
kchaudhuri@ucsd.edu
Sanjoy Dasgupta
UC San Diego
dasgupta@cs.ucsd.edu
Abstract
For a density f on Rd, a high-density cluster is any connected component of {x :
f(x) ≥ λ}, for some λ > 0. The set of all high-density clusters forms a hierarchy
called the cluster tree of f. We present a procedure for estimating the cluster tree
given samples from f. We give finite-sample convergence rates for our algorithm,
as well as lower bounds on the sample complexity of this estimation problem.
1 Introduction
A central preoccupation of learning theory is to understand what statistical estimation based on a
finite data set reveals about the underlying distribution from which the data were sampled. For
classification problems, there is now a well-developed theory of generalization. For clustering,
however, this kind of analysis has proved more elusive.
Consider for instance k-means, possibly the most popular clustering procedure in use today. If
this procedure is run on points X1, ..., Xn from distribution f, and is told to find k clusters, what
do these clusters reveal about f? Pollard [8] proved a basic consistency result: if the algorithm
always finds the global minimum of the k-means cost function (which is NP-hard, see Theorem 3
of [3]), then as n → ∞, the clustering is the globally optimal k-means solution for f. This result,
however impressive, leaves the fundamental question unanswered: is the best k-means solution to f
an interesting or desirable quantity, in settings outside of vector quantization?
In this paper, we are interested in clustering procedures whose output on a finite sample converges
to “natural clusters” of the underlying distribution f. There are doubtless many meaningful ways
to define natural clusters. Here we follow some early work on clustering (for instance, [5]) by
associating clusters with high-density connected regions. Specifically, a cluster of density f is any
connected component of {x : f(x) ≥ λ}, for any λ > 0. The collection of all such clusters forms
an (infinite) hierarchy called the cluster tree (Figure 1).
Are there hierarchical clustering algorithms which converge to the cluster tree? Previous theory
work [5, 7] has provided weak consistency results for the single-linkage clustering algorithm, while
other work [13] has suggested ways to overcome the deficiencies of this algorithm by making it
more robust, but without proofs of convergence. In this paper, we propose a novel way to make
single-linkage more robust, while retaining most of its elegance and simplicity (see Figure 3). We
establish its finite-sample rate of convergence (Theorem 6); the centerpiece of our argument is a
result on continuum percolation (Theorem 11). We also give a lower bound on the problem of
cluster tree estimation (Theorem 12), which matches our upper bound in its dependence on most of
the parameters of interest.
2 Definitions and previous work
Let X be a subset of Rd. We exclusively consider Euclidean distance on X, denoted ‖·‖. Let B(x, r) be the closed ball of radius r around x.
Figure 1: A probability density f on R, and three of its clusters: C1, C2, and C3.
2.1 The cluster tree
We start with notions of connectivity. A path P in S ⊂ X is a continuous 1-1 function P : [0,1] → S. If x = P(0) and y = P(1), we write x ⇝P y, and we say that x and y are connected in S. This relation, "connected in S", is an equivalence relation that partitions S into its connected components. We say S ⊂ X is connected if it has a single connected component.

The cluster tree is a hierarchy each of whose levels is a partition of a subset of X, which we will occasionally call a subpartition of X. Write Π(X) = {subpartitions of X}.

Definition 1 For any f : X → R, the cluster tree of f is a function Cf : R → Π(X) given by

Cf(λ) = connected components of {x ∈ X : f(x) ≥ λ}.

Any element of Cf(λ), for any λ, is called a cluster of f.

For any λ, Cf(λ) is a set of disjoint clusters of X. They form a hierarchy in the following sense.
Lemma 2 Pick any λ′≤ λ. Then:
1. For any C ∈ Cf(λ), there exists C′∈ Cf(λ′) such that C ⊆ C′.
2. For any C ∈ Cf(λ) and C′∈ Cf(λ′), either C ⊆ C′or C ∩ C′= ∅.
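As a concrete illustration of this nesting, one can discretize a one-dimensional density on a grid and read off the connected components of each level set. The sketch below, with a bimodal density and grid resolution of our own choosing (not from the paper), checks property 1 of Lemma 2 numerically.

```python
# Sketch: connected components of {x : f(x) >= lam} on a 1-D grid,
# illustrating the nesting property of Lemma 2.

def level_set_components(f, xs, lam):
    """Maximal runs of consecutive grid points with f(x) >= lam."""
    comps, cur = [], []
    for x in xs:
        if f(x) >= lam:
            cur.append(x)
        elif cur:
            comps.append((cur[0], cur[-1]))
            cur = []
    if cur:
        comps.append((cur[0], cur[-1]))
    return comps

# An illustrative bimodal piecewise-linear density (unnormalized).
def f(x):
    return max(0.0, 1 - abs(x - 1)) + max(0.0, 0.5 - abs(x - 3))

xs = [i * 0.01 for i in range(500)]      # grid on [0, 5)

low  = level_set_components(f, xs, 0.2)  # both modes survive: two clusters
high = level_set_components(f, xs, 0.7)  # only the taller mode survives

# Lemma 2, property 1: every cluster at the higher level λ nests inside
# some cluster at the lower level λ'.
for (a, b) in high:
    assert any(a2 <= a and b <= b2 for (a2, b2) in low)
```

Raising λ can only shrink or split clusters, never merge them; this is what makes the collection of all clusters a tree.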
We will sometimes deal with the restriction of the cluster tree to a finite set of points x1, ..., xn. Formally, the restriction of a subpartition C ∈ Π(X) to these points is defined to be C[x1, ..., xn] = {C ∩ {x1, ..., xn} : C ∈ C}. Likewise, the restriction of the cluster tree is Cf[x1, ..., xn] : R → Π({x1, ..., xn}), where Cf[x1, ..., xn](λ) = Cf(λ)[x1, ..., xn]. See Figure 2 for an example.
2.2 Notion of convergence and previous work
Suppose a sample Xn ⊂ X of size n is used to construct a tree Cn that is an estimate of Cf. Hartigan [5] provided a very natural notion of consistency for this setting.
Definition 3 For any sets A, A′ ⊂ X, let An (resp., A′n) denote the smallest cluster of Cn containing A ∩ Xn (resp., A′ ∩ Xn). We say Cn is consistent if, whenever A and A′ are different connected components of {x : f(x) ≥ λ} (for some λ > 0), P(An is disjoint from A′n) → 1 as n → ∞.

It is well known that if Xn is used to build a uniformly consistent density estimate fn (that is, sup_x |fn(x) − f(x)| → 0), then the cluster tree Cfn is consistent; see the appendix for details. The big problem is that Cfn is not easy to compute for typical density estimates fn: imagine, for instance, how one might go about trying to find level sets of a mixture of Gaussians! Wong and
Figure 2: A probability density f, and the restriction of Cfto a finite set of eight points.
Lane [14] have an efficient procedure that tries to approximate Cfn when fn is a k-nearest neighbor density estimate, but they have not shown that it preserves the consistency property of Cfn.

There is a simple and elegant algorithm that is a plausible estimator of the cluster tree: single linkage (or Kruskal's algorithm); see the appendix for pseudocode. Hartigan [5] has shown that it is consistent in one dimension (d = 1). But he also demonstrates, by a lovely reduction to continuum percolation, that this consistency fails in higher dimension (d ≥ 2). The problem is the requirement that A ∩ Xn ⊂ An: by the time the clusters are large enough that one of them contains all of A, there is a reasonable chance that this cluster will be so big as to also contain part of A′.
With this insight, Hartigan defines a weaker notion of fractional consistency, under which An (resp., A′n) need not contain all of A ∩ Xn (resp., A′ ∩ Xn), but merely a sizeable chunk of it, and ought to be very close (at distance → 0 as n → ∞) to the remainder. He then shows that single linkage has this weaker consistency property for any pair A, A′ for which the ratio of inf{f(x) : x ∈ A ∪ A′} to sup{inf{f(x) : x ∈ P} : paths P from A to A′} is sufficiently large. More recent work by Penrose [7] closes the gap and shows fractional consistency whenever this ratio is > 1.
A more robust version of single linkage has been proposed by Wishart [13]: when connecting points
at distance r from each other, only consider points that have at least k neighbors within distance r
(for some k > 2). Thus initially, when r is small, only the regions of highest density are available for
linkage, while the rest of the data set is ignored. As r gets larger, more and more of the data points
become candidates for linkage. This scheme is intuitively sensible, but Wishart does not provide a
proof of convergence. Thus it is unclear how to set k, for instance.
Stuetzle and Nugent [12] have an appealing top-down scheme for estimating the cluster tree, along
with a post-processing step (called runt pruning) that helps identify modes of the distribution. The
consistency of this method has not yet been established.
Several recent papers [6, 10, 9, 11] have considered the problem of recovering the connected components of {x : f(x) ≥ λ} for a user-specified λ: the flat version of our problem. In particular,
the algorithm of [6] is intuitively similar to ours, though they use a single graph in which each point
is connected to its k nearest neighbors, whereas we have a hierarchy of graphs in which each point
is connected to other points at distance ≤ r (for various r). Interestingly, k-nn graphs are valuable
for flat clustering because they can adapt to clusters of different scales (different average interpoint
distances). But they are challenging to analyze and seem to require various regularity assumptions
on the data. A pleasant feature of the hierarchical setting is that different scales appear at different
levels of the tree, rather than being collapsed together. This allows the use of r-neighbor graphs, and
makes possible an analysis that has minimal assumptions on the data.
3 Algorithm and results
In this paper, we consider a generalization of Wishart’s scheme and of single linkage, shown in
Figure 3. It has two free parameters: k and α. For practical reasons, it is of interest to keep these as
1. For each xi set rk(xi) = inf{r : B(xi, r) contains k data points}.
2. As r grows from 0 to ∞:
(a) Construct a graph Gr with nodes {xi : rk(xi) ≤ r}. Include edge (xi, xj) if ‖xi − xj‖ ≤ αr.
(b) Let Ĉ(r) be the connected components of Gr.

Figure 3: Algorithm for hierarchical clustering. The input is a sample Xn = {x1, ..., xn} from density f on X. Parameters k and α need to be set. Single linkage is (α = 1, k = 2). Wishart suggested α = 1 and larger k.
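A direct implementation of Figure 3 only needs the pairwise distances: the component structure can change only when r passes one of the values rk(xi) or ‖xi − xj‖/α, so it suffices to examine those critical radii. The pure-Python sketch below is one way to realize the procedure (the helper names and the brute-force recomputation of components at each radius are our own choices, made for clarity rather than efficiency).

```python
import math
from itertools import combinations

def robust_single_linkage(points, k=2, alpha=1.0):
    """Sketch of the Figure-3 procedure: for each critical radius r,
    return the connected components of G_r as sets of point indices."""
    n = len(points)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    # r_k(x_i): distance to the k-th nearest point (including x_i itself).
    r_k = [sorted(math.dist(points[i], p) for p in points)[k - 1]
           for i in range(n)]

    parent = list(range(n))            # union-find over point indices
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def components(r):
        active = [i for i in range(n) if r_k[i] <= r]
        for i in range(n):             # reset union-find for this radius
            parent[i] = i
        for i, j in combinations(active, 2):
            if dist[min(i, j), max(i, j)] <= alpha * r:
                parent[find(i)] = find(j)
        comps = {}
        for i in active:
            comps.setdefault(find(i), set()).add(i)
        return list(comps.values())

    # The graph changes only at these radii.
    critical = sorted(set(r_k) | {d / alpha for d in dist.values()})
    return [(r, components(r)) for r in critical]

# Two well-separated groups on the line: each group becomes internally
# connected well before the two groups merge.
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
tree = robust_single_linkage(pts, k=2, alpha=1.0)
```

With k = 2 and α = 1 this reduces to plain single linkage; Wishart's variant keeps α = 1 and raises k so that low-density points enter the graph late.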
small as possible. We provide finite-sample convergence rates for all 1 ≤ α ≤ 2, and we can achieve k ∼ d log n, which we conjecture to be the best possible, if α ≥ √2. Our rates for α = 1 force k to be much larger, exponential in d. It is a fascinating open problem to determine whether the setting (α = 1, k ∼ d log n) yields consistency.
3.1 A notion of cluster salience
Suppose density f is supported on some subset X of Rd. We will show that the hierarchical clustering procedure is consistent in the sense of Definition 3. But the more interesting question is, what
clusters will be identified from a finite sample? To answer this, we introduce a notion of salience.
The first consideration is that a cluster is hard to identify if it contains a thin “bridge” that would
make it look disconnected in a small sample. To control this, we consider a “buffer zone” of width
σ around the clusters.
Definition 4 For Z ⊂ Rd and σ > 0, write Zσ = Z + B(0, σ) = {y ∈ Rd : inf_{z∈Z} ‖y − z‖ ≤ σ}.

An important technical point is that Zσ is a full-dimensional set, even if Z itself is not.
Second, the ease of distinguishing two clusters A and A′depends inevitably upon the separation
between them. To keep things simple, we’ll use the same σ as a separation parameter.
Definition 5 Let f be a density on X ⊂ Rd. We say that A, A′ ⊂ X are (σ, ǫ)-separated if there exists S ⊂ X (separator set) such that:
• Any path in X from A to A′ intersects S.
• sup_{x∈Sσ} f(x) < (1 − ǫ) inf_{x∈Aσ∪A′σ} f(x).

Under this definition, Aσ and A′σ must lie within X, otherwise the right-hand side of the inequality is zero. However, Sσ need not be contained in X.
3.2 Consistency and finite-sample rate of convergence
Here we state the result for α ≥ √2 and k ∼ d log n. The analysis section also has results for 1 ≤ α ≤ 2 and k ∼ (2/α)^d d log n.

Theorem 6 There is an absolute constant C such that the following holds. Pick any δ, ǫ > 0, and run the algorithm on a sample Xn of size n drawn from f, with settings

√2 ≤ α ≤ 2 and k = C · (d log n / ǫ²) · log²(1/δ).

Then there is a mapping r : [0, ∞) → [0, ∞) such that with probability at least 1 − δ, the following holds uniformly for all pairs of connected subsets A, A′ ⊂ X: If A, A′ are (σ, ǫ)-separated (for ǫ and some σ > 0), and if

λ := inf_{x∈Aσ∪A′σ} f(x) ≥ (1 / (vd (σ/2)^d)) · (k/n) · (1 + ǫ/2),    (*)

where vd is the volume of the unit ball in Rd, then:
1. Separation. A ∩ Xn is disconnected from A′ ∩ Xn in Gr(λ).
2. Connectedness. A ∩ Xn and A′ ∩ Xn are each individually connected in Gr(λ).

The two parts of this theorem, separation and connectedness, are proved in Sections 3.3 and 3.4. We mention in passing that this finite-sample result implies consistency (Definition 3): as n → ∞, take kn = (d log n)/ǫn² with any schedule of (ǫn : n = 1, 2, ...) such that ǫn → 0 and kn/n → 0. Under mild conditions, any two connected components A, A′ of {f ≥ λ} are (σ, ǫ)-separated for some σ, ǫ > 0 (see appendix); thus they will get distinguished for sufficiently large n.
3.3 Analysis: separation
The cluster tree algorithm depends heavily on the radii rk(x): the distance within which x's nearest k neighbors lie (including x itself). Thus the empirical probability mass of B(x, rk(x)) is k/n. To show that rk(x) is meaningful, we need to establish that the mass of this ball under density f is also, very approximately, k/n. The uniform convergence of these empirical counts follows from the fact that balls in Rd have finite VC dimension, d + 1. Using uniform Bernstein-type bounds, we get a set of basic inequalities that we use repeatedly.

Lemma 7 Assume k ≥ d log n, and fix some δ > 0. Then there exists a constant Cδ such that with probability > 1 − δ, every ball B ⊂ Rd satisfies the following conditions:

f(B) ≥ (Cδ d log n)/n =⇒ fn(B) > 0

f(B) ≥ k/n + (Cδ/n)√(k d log n) =⇒ fn(B) ≥ k/n

f(B) ≤ k/n − (Cδ/n)√(k d log n) =⇒ fn(B) < k/n

Here fn(B) = |Xn ∩ B|/n is the empirical mass of B, while f(B) = ∫_B f(x) dx is its true mass.

PROOF: See appendix. Cδ = 2Co log(2/δ), where Co is the absolute constant from Lemma 16. □
We will henceforth think of δ as fixed, so that we do not have to repeatedly quantify over it.
Lemma 8 Pick 0 < r < 2σ/(α + 2) such that

vd r^d λ ≥ k/n + (Cδ/n)√(k d log n)

vd r^d λ(1 − ǫ) < k/n − (Cδ/n)√(k d log n)

(recall that vd is the volume of the unit ball in Rd). Then with probability > 1 − δ:
1. Gr contains all points in (Aσ−r ∪ A′σ−r) ∩ Xn and no points in Sσ−r ∩ Xn.
2. A ∩ Xn is disconnected from A′ ∩ Xn in Gr.

PROOF: For (1), any point x ∈ (Aσ−r ∪ A′σ−r) has f(B(x, r)) ≥ vd r^d λ; and thus, by Lemma 7, has at least k neighbors within radius r. Likewise, any point x ∈ Sσ−r has f(B(x, r)) < vd r^d λ(1 − ǫ); and thus, by Lemma 7, has strictly fewer than k neighbors within distance r.

For (2), since points in Sσ−r are absent from Gr, any path from A to A′ in that graph must have an edge across Sσ−r. But any such edge has length at least 2(σ − r) > αr and is thus not in Gr. □

Definition 9 Define r(λ) to be the value of r for which vd r^d λ = k/n + (Cδ/n)√(k d log n).

To satisfy the conditions of Lemma 8, it suffices to take k ≥ 4Cδ²(d/ǫ²) log n; this is what we use.
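Solving Definition 9 for r gives r(λ) = ((k/n + (Cδ/n)√(k d log n)) / (vd λ))^{1/d}. A quick numeric sketch, where Cδ = 1 is an arbitrary stand-in for the constant of Lemma 7 (which the paper leaves abstract) and the other values are purely illustrative:

```python
import math

def unit_ball_volume(d):
    # v_d = pi^(d/2) / Gamma(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def r_of_lambda(lam, k, n, d, C_delta=1.0):
    """Solve v_d r^d lam = k/n + (C_delta/n) sqrt(k d log n) for r (Definition 9)."""
    rhs = k / n + (C_delta / n) * math.sqrt(k * d * math.log(n))
    return (rhs / (unit_ball_volume(d) * lam)) ** (1.0 / d)

# Illustrative numbers.
n, d = 100_000, 2
k = int(4 * d * math.log(n))       # k ~ d log n, as in Theorem 6
r = r_of_lambda(0.5, k, n, d)      # radius at which level lambda = 0.5 emerges
```

As expected from the formula, r(λ) shrinks as λ or n grows: denser clusters are resolved at smaller radii.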
Figure 4: Left: P is a path from x to x′, and π(xi) is the point furthest along the path that is within distance r of xi. Right: The next point, xi+1 ∈ Xn, is chosen from the half-ball of B(π(xi), r) in the direction of xi − π(xi).
3.4 Analysis: connectedness
We need to show that points in A (and similarly A′) are connected in Gr(λ). First we state a simple bound (proved in the appendix) that works if α = 2 and k ∼ d log n; later we consider smaller α.

Lemma 10 Suppose 1 ≤ α ≤ 2. Then with probability ≥ 1 − δ, A ∩ Xn is connected in Gr whenever r ≤ 2σ/(2 + α) and the conditions of Lemma 8 hold, and

vd r^d λ ≥ (2/α)^d · (Cδ d log n)/n.

Comparing this to the definition of r(λ), we see that choosing α = 1 would entail k ≥ 2^d, which is undesirable. We can get a more reasonable setting of k ∼ d log n by choosing α = 2, but we'd like α to be as small as possible. A more refined argument shows that α ≈ √2 is enough.
Theorem 11 Suppose α ≥ √2. Then, with probability > 1 − δ, A ∩ Xn is connected in Gr whenever r ≤ σ/2 and the conditions of Lemma 8 hold, and

vd r^d λ ≥ (4Cδ d log n)/n.

PROOF: We have already made heavy use of uniform convergence over balls. We now also require the class G of half-balls, each of which is the intersection of an open ball and a halfspace through the center of the ball. Formally, each of these functions is defined by a center µ and a unit direction u, and is the indicator function of the set

{z ∈ Rd : ‖z − µ‖ < r, (z − µ) · u > 0}.

We will describe any such set as "the half of B(µ, r) in direction u". If the half-ball lies entirely in Aσ, its probability mass is at least (1/2)vd r^d λ. By uniform convergence over G (which has VC dimension at most 2d + 1), we can then conclude (as in Lemma 7) that if vd r^d λ ≥ (4Cδ d log n)/n, then with probability at least 1 − δ, every such half-ball within Aσ contains at least one data point.

Pick any x, x′ ∈ A ∩ Xn; there is a path P in A with x ⇝P x′. We'll identify a sequence of data points x0 = x, x1, x2, ..., ending in x′, such that for every i, point xi is active in Gr and ‖xi − xi+1‖ ≤ αr. This will confirm that x is connected to x′ in Gr.

To begin with, recall that P is a continuous 1-1 function from [0,1] into A. We are also interested in the inverse P−1, which sends a point on the path back to its parametrization in [0,1]. For any point y ∈ X, define N(y) to be the portion of [0,1] whose image under P lies in B(y, r): that is, N(y) = {0 ≤ z ≤ 1 : P(z) ∈ B(y, r)}. If y is within distance r of P, then N(y) is nonempty. Define π(y) = P(sup N(y)), the furthest point along the path within distance r of y (Figure 4, left).

The sequence x0, x1, x2, ... is defined iteratively; x0 = x, and for i = 0, 1, 2, ... :
• If ‖xi − x′‖ ≤ αr, set xi+1 = x′ and stop.
• By construction, xi is within distance r of path P and hence N(xi) is nonempty.
• Let B be the open ball of radius r around π(xi). The half of B in direction xi − π(xi) must contain a data point; this is xi+1 (Figure 4, right).

The process eventually stops because each π(xi+1) is strictly further along path P than π(xi); formally, P−1(π(xi+1)) > P−1(π(xi)). This is because ‖xi+1 − π(xi)‖ < r, so by continuity of the function P, there are points further along the path (beyond π(xi)) whose distance to xi+1 is still < r. Thus xi+1 is distinct from x0, x1, ..., xi. Since there are finitely many data points, the process must terminate, so the sequence {xi} does constitute a path from x to x′.

Each xi lies in Ar ⊆ Aσ−r and is thus active in Gr (Lemma 8). Finally, the distance between successive points is:

‖xi − xi+1‖² = ‖xi − π(xi) + π(xi) − xi+1‖²
= ‖xi − π(xi)‖² + ‖π(xi) − xi+1‖² − 2(xi − π(xi)) · (xi+1 − π(xi))
≤ 2r² ≤ α²r²,

where the second-last inequality is from the definition of half-ball. □
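The final step says that any point of the half of B(π(xi), r) in direction xi − π(xi) lies within distance √2·r of xi. This geometric fact is easy to sanity-check by Monte Carlo (our own check, not part of the paper's argument; placing π(xi) at the origin is without loss of generality):

```python
import math, random

random.seed(0)
r, d = 1.0, 3

def rand_ball(radius, dim):
    # Rejection sampling: uniform point in the ball.
    while True:
        p = [random.uniform(-radius, radius) for _ in range(dim)]
        if math.hypot(*p) <= radius:
            return p

checked = 0
for _ in range(1000):
    x_i = rand_ball(r, d)       # anywhere in B(pi, r), with pi at the origin
    x_next = rand_ball(r, d)
    if sum(a * b for a, b in zip(x_i, x_next)) <= 0:
        continue                # x_next not in the half-ball toward x_i; skip
    checked += 1
    # ||x_i - x_next||^2 = ||x_i||^2 + ||x_next||^2 - 2 x_i . x_next <= 2 r^2
    assert math.dist(x_i, x_next) <= math.sqrt(2) * r + 1e-12
```

The positive dot product is exactly what kills the cross term in the expansion above, which is why the hops have length at most √2·r and α ≥ √2 suffices.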
To complete the proof of Theorem 6, take k = 4Cδ²(d/ǫ²) log n, which satisfies the requirements of Lemma 8 as well as those of Theorem 11. The relationship that defines r(λ) (Definition 9) then translates into

vd r^d λ = (k/n) · (1 + ǫ/2).

This shows that clusters at density level λ emerge when the growing radius r of the cluster tree algorithm reaches roughly (k/(λ vd n))^{1/d}. In order for (σ, ǫ)-separated clusters to be distinguished, we need this radius to be at most σ/2; this is what yields the final lower bound on λ.
4 Lower bound
We have shown that the algorithm of Figure 3 distinguishes pairs of clusters that are (σ, ǫ)-separated. The number of samples it requires to capture clusters at density ≥ λ is, by Theorem 6,

O( (d / (vd (σ/2)^d λ ǫ²)) · log (d / (vd (σ/2)^d λ ǫ²)) ).

We'll now show that this dependence on σ, λ, and ǫ is optimal. The only room for improvement, therefore, is in constants involving d.
Theorem 12 Pick any ǫ in (0, 1/2), any d > 1, and any σ, λ > 0 such that λ vd−1 σ^d < 1/50. Then there exist: an input space X ⊂ Rd; a finite family of densities Θ = {θi} on X; subsets Ai, A′i, Si ⊂ X such that Ai and A′i are (σ, ǫ)-separated by Si for density θi, and inf_{x∈Ai,σ∪A′i,σ} θi(x) ≥ λ, with the following additional property.

Consider any algorithm that is given n ≥ 100 i.i.d. samples Xn from some θi ∈ Θ and, with probability at least 1/2, outputs a tree in which the smallest cluster containing Ai ∩ Xn is disjoint from the smallest cluster containing A′i ∩ Xn. Then

n = Ω( (1/(vd σ^d λ ǫ² √d)) · log (1/(vd σ^d λ)) ).
PROOF: We start by constructing the various spaces and densities. X is made up of two disjoint regions: a cylinder X0, and an additional region X1 whose sole purpose is as a repository for excess probability mass. Let Bd−1 be the unit ball in Rd−1, and let σBd−1 be this same ball scaled to have radius σ. The cylinder X0 stretches along the x1-axis; its cross-section is σBd−1 and its length is 4(c + 1)σ for some c > 1 to be specified: X0 = [0, 4(c + 1)σ] × σBd−1. Here is a picture of it:
[Figure: the cylinder X0 along the x1-axis, running from 0 to 4(c + 1)σ, with cross-sectional radius σ and tick marks at 4σ, 8σ, 12σ.]
We will construct a family of densities Θ = {θi} on X, and then argue that any cluster tree algorithm that is able to distinguish (σ, ǫ)-separated clusters must be able, when given samples from some θI, to determine the identity of I. The sample complexity of this latter task can be lower-bounded using Fano's inequality (typically stated as in [2], but easily rewritten in the convenient form of [15], see appendix): it is Ω((log |Θ|)/β), for β = max_{i≠j} K(θi, θj), where K(·, ·) is KL divergence.
The family Θ contains c − 1 densities θ1, ..., θc−1, where θi is defined as follows:
• Density λ on [0, 4σi + σ] × σBd−1 and on [4σi + 3σ, 4(c + 1)σ] × σBd−1. Since the cross-sectional area of the cylinder is vd−1 σ^{d−1}, the total mass here is λ vd−1 σ^d (4(c + 1) − 2).
• Density λ(1 − ǫ) on (4σi + σ, 4σi + 3σ) × σBd−1.
• Point masses 1/(2c) at locations 4σ, 8σ, ..., 4cσ along the x1-axis (use arbitrarily narrow spikes to avoid discontinuities).
• The remaining mass, 1/2 − λ vd−1 σ^d (4(c + 1) − 2ǫ), is placed on X1 in some fixed manner (that does not vary between different densities in Θ).
Here is a sketch of θi. The low-density region of width 2σ is centered at 4σi + 2σ on the x1-axis. [Figure: the density profile along the cylinder: density λ, a dip to density λ(1 − ǫ) of width 2σ, and point masses 1/(2c) at regular intervals.]
For any i ≠ j, the densities θi and θj differ only on the cylindrical sections (4σi + σ, 4σi + 3σ) × σBd−1 and (4σj + σ, 4σj + 3σ) × σBd−1, which are disjoint and each have volume 2vd−1σ^d. Thus

K(θi, θj) = 2vd−1σ^d ( λ log(λ/(λ(1 − ǫ))) + λ(1 − ǫ) log(λ(1 − ǫ)/λ) )
= 2vd−1σ^d λ (−ǫ log(1 − ǫ)) ≤ (4/ln 2) vd−1σ^d λ ǫ²

(using ln(1 − x) ≥ −2x for 0 < x ≤ 1/2). This is an upper bound on the β in the Fano bound.
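The KL computation can be sanity-checked numerically; the sketch below evaluates the exact divergence between two such densities (base-2 logs, as used in the Fano bound) and compares it to the stated upper bound. The values of d, σ, λ, ǫ are our own illustrative choices.

```python
import math

# Illustrative parameters (not from the paper).
d, sigma, lam, eps = 3, 0.1, 2.0, 0.3

def unit_ball_volume(k):
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1)

# Volume of each of the two cylindrical sections where theta_i != theta_j.
vol = 2 * unit_ball_volume(d - 1) * sigma ** d

# Exact KL divergence over the two differing sections:
kl = vol * (lam * math.log2(lam / (lam * (1 - eps)))
            + lam * (1 - eps) * math.log2((lam * (1 - eps)) / lam))
# Equivalent closed form, and the claimed upper bound:
closed = vol * lam * (-eps * math.log2(1 - eps))
bound = (4 / math.log(2)) * unit_ball_volume(d - 1) * sigma ** d * lam * eps ** 2

assert math.isclose(kl, closed)
assert kl <= bound
```

The bound is tight up to a constant for small ǫ, since −ln(1 − ǫ) = ǫ + O(ǫ²); this ǫ² scaling is exactly what drives the 1/ǫ² factor in the sample complexity.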
Now define the clusters and separators as follows: for each 1 ≤ i ≤ c − 1,
• Ai is the line segment [σ, 4σi] along the x1-axis,
• A′i is the line segment [4σ(i + 1), 4(c + 1)σ − σ] along the x1-axis, and
• Si = {4σi + 2σ} × σBd−1 is the cross-section of the cylinder at location 4σi + 2σ.

Thus Ai and A′i are one-dimensional sets while Si is a (d − 1)-dimensional set. It can be checked that Ai and A′i are (σ, ǫ)-separated by Si in density θi.

With the various structures defined, what remains is to argue that if an algorithm is given a sample Xn from some θI (where I is unknown), and is able to separate AI ∩ Xn from A′I ∩ Xn, then it can effectively infer I. This has sample complexity Ω((log c)/β). Details are in the appendix. □
There remains a discrepancy of 2^d between the upper and lower bounds; it is an interesting open problem to close this gap. Does the (α = 1, k ∼ d log n) setting (yet to be analyzed) do the job?
Acknowledgments. We thank the anonymous reviewers for their detailed and insightful comments,
and the National Science Foundation for support under grant IIS-0347646.
References
[1] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture
Notes in Artificial Intelligence, 3176:169–207, 2004.
[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 2005.
[3] S. Dasgupta and Y. Freund. Random projection trees for vector quantization. IEEE Transactions on Information Theory, 55(7):3229–3242, 2009.
[4] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
[5] J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American
Statistical Association, 76(374):388–394, 1981.
[6] M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest neighbor graphs
for identifying noisy clusters. Theoretical Computer Science, 410:1749–1764, 2009.
[7] M. Penrose. Single linkage clustering and continuum percolation. Journal of Multivariate
Analysis, 53:94–109, 1995.
[8] D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.
[9] P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level sets. Bernoulli,
15(4):1154–1178, 2009.
[10] A. Rinaldo and L. Wasserman. Generalized density clustering. Annals of Statistics, 38(5):2678–2722, 2010.
[11] A. Singh, C. Scott, and R. Nowak. Adaptive hausdorff estimation of density level sets. Annals
of Statistics, 37(5B):2760–2782, 2009.
[12] W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree
of a density. Journal of Computational and Graphical Statistics, 19(2):397–418, 2010.
[13] D. Wishart. Mode analysis: a generalization of nearest neighbor which reduces chaining effects. In Proceedings of the Colloquium on Numerical Taxonomy held in the University of St. Andrews, pages 282–308, 1969.
[14] M.A. Wong and T. Lane. A kth nearest neighbour clustering procedure. Journal of the Royal
Statistical Society Series B, 45(3):362–368, 1983.
[15] B. Yu. Assouad, Fano and Le Cam. Festschrift for Lucien Le Cam, pages 423–435, 1997.
5 Appendix: using a uniformly consistent density estimate

One way to build a cluster tree is to return Cfn, where fn is a uniformly consistent density estimate.
Lemma 13 Suppose estimator fn of density f (on space X) satisfies sup_{x∈X} |fn(x) − f(x)| ≤ ǫn. Pick any two disjoint sets A, A′ ⊂ X and define

α = inf_{x∈A∪A′} f(x),    β = sup_{paths P from A to A′} inf_{x∈P} f(x).

If α − β > 2ǫn then A, A′ lie entirely in disjoint connected components of Cfn(α − ǫn).

PROOF: A and A′ are each connected in Cfn(α − ǫn). But there is no path from A to A′ in Cfn(λ) for λ > β + ǫn. □
The problem, however, is that computing the level sets of fn is usually not an easy task. Hence we adopt a different approach in this paper.
6 Appendix: single linkage
This procedure for building a hierarchical clustering takes as input a data set x1, ..., xn ∈ Rd.
1. For each data point xi, set r2(xi) = distance from xi to its nearest neighbor.
2. As r grows from 0 to ∞:
(a) Construct a graph Gr with nodes {xi : r2(xi) ≤ r}. Include edge (xi, xj) if ‖xi − xj‖ ≤ r.
(b) Let Ĉ(r) be the connected components of Gr.
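As r grows, this procedure merges components in exactly the order of Kruskal's algorithm, so the merge radii are the edge weights of a minimum spanning tree of the data. A minimal Kruskal sketch (our own illustration) extracting those merge radii:

```python
import math, random
from itertools import combinations

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(30)]

# Kruskal: scanning edges by increasing length, the n-1 edges that join
# distinct components are the radii at which single linkage merges clusters.
edges = sorted((math.dist(p, q), i, j)
               for (i, p), (j, q) in combinations(enumerate(pts), 2))
parent = list(range(len(pts)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

merge_radii = []
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        merge_radii.append(w)

assert len(merge_radii) == len(pts) - 1
assert merge_radii == sorted(merge_radii)   # merges happen at growing radii
```

This equivalence is what makes plain single linkage cheap to compute, and also what exposes it to the chaining effect that the k-neighbor condition of Figure 3 is designed to suppress.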
7 Appendix: consistency
The following is a straightforward exercise in analysis.
Lemma 14 Suppose density f : Rd → R is continuous and is zero outside a compact subset X ⊂ Rd. Suppose further that for some λ, {x ∈ X : f(x) ≥ λ} has finitely many connected components, among them A and A′. Then there exist σ, ǫ > 0 such that A and A′ are (σ, ǫ)-separated.
PROOF: Let A1, A2, ..., Ak be the connected components of {f ≥ λ}, with A = A1 and A′ = A2. First, each Ai is closed and thus compact. To see this, pick any x ∈ X \ Ai. There must be some x′ on the shortest path from x to Ai with f(x′) < λ (otherwise x ∈ Ai). By continuity of f, there is some ball B(x′, r) on which f < λ; thus this ball doesn't touch Ai. Then B(x, r) doesn't touch Ai.

Next, for any i ≠ j, define ∆ij = inf_{x∈Ai, y∈Aj} ‖x − y‖ to be the distance between Ai and Aj. We'll see that ∆ij > 0. Specifically, define g : Ai × Aj → R by g(a, a′) = ‖a − a′‖. Since g has compact domain, it attains its infimum at some a ∈ Ai, a′ ∈ Aj. Thus ∆ij = ‖a − a′‖ > 0.

Let ∆ = min_{i≠j} ∆ij > 0, and define S to be the set of points at distance exactly ∆/2 from A:

S = {x ∈ X : inf_{y∈A} ‖x − y‖ = ∆/2}.

S separates A from A′. Moreover, it is closed by continuity of ‖·‖, and hence is compact. Define λo = sup_{x∈S} f(x). Since S is compact, f (restricted to S) is maximized at some xo ∈ S. Then λo = f(xo) < λ.
To finish up, set δ = (λ − λo)/3 > 0. By uniform continuity of f, there is some σ > 0 such that f doesn't change by more than δ on balls of radius σ. Then f(x) ≤ λo + δ = λ − 2δ for x ∈ Sσ and f(x) ≥ λ − δ for x ∈ Aσ ∪ A′σ. Thus S is a (σ, δ/(λ − δ))-separator for A, A′. □
8 Appendix: proof of Lemma 7
We start with a standard generalization result due to Vapnik and Chervonenkis; the following version
is a paraphrase of Theorem 5.1 of [1].
Theorem 15 Let G be a class of functions from X to {0, 1} with VC dimension d < ∞, and P a probability distribution on X. Let E denote expectation with respect to P. Suppose n points are drawn independently at random from P; let En denote expectation with respect to this sample. Then for any δ > 0, with probability at least 1 − δ, the following holds for all g ∈ G:

−min(βn √(En g), βn² + βn √(E g)) ≤ E g − En g ≤ min(βn² + βn √(En g), βn √(E g)),

where βn = √((4/n)(d ln 2n + ln(8/δ))).

By applying this bound to the class G of indicator functions over balls, we get the following:

Lemma 16 Suppose Xn is a sample of n points drawn independently at random from a distribution f over X. For any set Y ⊂ X, let fn(Y) = |Xn ∩ Y|/n. There is a universal constant Co > 0 such that for any δ > 0, with probability at least 1 − δ, for any ball B ⊂ Rd:

f(B) ≥ (Co/n)(d log n + log(1/δ)) =⇒ fn(B) > 0

f(B) ≥ k/n + (Co/n)(d log n + log(1/δ) + √(k(d log n + log(1/δ)))) =⇒ fn(B) ≥ k/n

f(B) < k/n − (Co/n)(d log n + log(1/δ) + √(k(d log n + log(1/δ)))) =⇒ fn(B) < k/n

PROOF: The bound f(B) − fn(B) ≤ βn √(f(B)) from Theorem 15 yields

f(B) > βn² =⇒ fn(B) > 0.

For the second bound, we use f(B) − fn(B) ≤ βn² + βn √(fn(B)). It follows that

f(B) ≥ k/n + βn² + βn √(k/n) =⇒ fn(B) ≥ k/n.

For the last bound, we rearrange f(B) − fn(B) ≥ −(βn² + βn √(f(B))) to get

f(B) < k/n − βn² − βn √(k/n) =⇒ fn(B) < k/n. □
Lemma 7 now follows immediately, by taking k ≥ d log n. Since the uniform convergence bounds have error bars of magnitude (d log n)/n, it doesn't make sense to take k any smaller than this.
9 Appendix: proof of Lemma 10
Consider any x, x′ ∈ A ∩ Xn. Since A is connected, there is a path P in A with x ⇝P x′. Fix any 0 < γ < 1. Because the density of A is lower bounded away from zero, it follows by a volume and packing-covering argument that A, and thus P, can be covered by a finite number of balls of diameter γr. Thus we can choose finitely many points z0, z1, ..., zk ∈ P such that x = z0, x′ = zk, and ‖zi+1 − zi‖ ≤ γr.

By Lemma 7, any ball centered in A with radius (α − γ)r/2 contains at least one data point if

vd ((α − γ)r/2)^d λ ≥ (Cδ d log n)/n.    (1)

Assume for the moment that this holds. Then every ball B(zi, (α − γ)r/2) contains at least one data point; call it xi.

By the upper bound on r, each such xi lies in Aσ−r; therefore, by Lemma 8, the xi are all active in Gr. Moreover, consecutive points xi are close together:

‖xi+1 − xi‖ ≤ ‖xi+1 − zi+1‖ + ‖zi+1 − zi‖ + ‖zi − xi‖ ≤ αr.

Therefore, all edges (xi, xi+1) exist in Gr, whereby x is connected to x′ in Gr.

All this assumes that equation (1) holds for some γ > 0. Taking γ → 0 gives the lemma.
10 Appendix: Fano's inequality
Consider the following game played with a predefined, finite class of distributions Θ = {θ1, ..., θℓ}, defined on a common space X:
• Nature picks I ∈ {1, 2, ..., ℓ}.
• Player is given n i.i.d. samples X1, ..., Xn from θI.
• Player then guesses the identity of I.

Fano's inequality [2, 15] gives a lower bound on the number of samples n needed to achieve a certain success probability. It depends on how similar the distributions θi are: the more similar, the more samples are needed. Define

β = (1/ℓ²) Σ_{i,j=1}^{ℓ} K(θi, θj),

where K(·, ·) is KL divergence. Then n needs to be Ω((log ℓ)/β). Here's the formal statement.

Theorem 17 (Fano) Let g : X^n → {1, 2, ..., ℓ} denote Player's computation. If Nature chooses I uniformly at random from {1, 2, ..., ℓ}, then for any 0 < δ < 1,

n ≤ ((1 − δ)(log ℓ) − 1)/β =⇒ Pr(g(X1, ..., Xn) ≠ I) ≥ δ,

where the logarithm is base two.
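As a usage sketch, the theorem can be rearranged into a sample-size threshold: with ℓ hypotheses and pairwise-average divergence β, any estimator whose failure probability is below δ must use more than ((1 − δ) log₂ ℓ − 1)/β samples. The numbers below are purely illustrative.

```python
import math

def fano_sample_lower_bound(num_hypotheses, beta, delta):
    """Threshold from Theorem 17: any estimator drawing at most this many
    samples errs with probability at least delta."""
    return ((1 - delta) * math.log2(num_hypotheses) - 1) / beta

# E.g. 64 hypotheses, beta = 0.01 bits, target error probability 1/2:
n_min = fano_sample_lower_bound(64, 0.01, 0.5)   # = (0.5*6 - 1)/0.01 = 200
```

This is exactly how the bound is deployed in Theorem 12: β is bounded by the KL computation of Section 4, and ℓ = c − 1 counts the candidate densities.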
11 Appendix: proof details for Theorem 12
Once the various structures are defined, the remainder of the proof is broken into two phases. The first will establish that if n is at least a small constant (say, 100), then it must be the case that n = Ω(1/(vd σ^d λ ǫ² √d)). The second part of the proof will then extend this to show that if n is at least this latter quantity, then in fact it must be even larger, yielding the lower bound of the theorem statement.
To start with, choose c to be a small constant, such as 5. Then, even a small sample X_n is likely
to contain all of the c point masses on the x_1-axis (each of which has mass 1/2c). Suppose the
algorithm is promised in advance that the underlying density is one of the c − 1 choices θ_I, and is
subsequently able (with probability at least 1/2) to separate A_I from A′_I. To do this, it must connect
all the point masses within A_I, and all the point masses within A′_I, and yet keep these two groups
apart. In short, this algorithm must be able to determine (with probability at least 1/2) the segment
(4σI + σ, 4σI + 3σ) of lower density, and hence the identity of I.
We can thus apply Fano's inequality to conclude that we need

n = Ω((log c)/β) = Ω(1/(v_{d−1} σ^d λ ǫ^2)) = Ω(1/(v_d σ^d λ ǫ^2 d^{1/2})).

The last equality comes from the formula v_d = π^{d/2}/Γ((d/2)+1), whereupon v_{d−1} = O(v_d d^{1/2}).

Now consider a larger value of c:

c = 1/(8 v_{d−1} σ^d λ) − 1,

and apply the same construction. We have already established that we need n = Ω(c/ǫ^2) samples,
so assume n is at least this large. Then, it is very likely that when the underlying density is θ_I, the
sample X_n will contain the four point masses at 4σ, 4σI, 4σ(I + 1), and 4(c + 1)σ. Therefore, the
clustering algorithm must connect the point at 4σ to that at 4σI and the point at 4σ(I + 1) to that at
4(c + 1)σ, while keeping the two groups apart. Therefore, this algorithm can determine I. Applying
Fano's inequality gives n = Ω((log c)/β), which is the bound in the theorem statement.
12 Appendix: better convergence rates in some instances
Our convergence rates (Theorem 6) contain a condition (*) for clusters of density ≥ λ that are
(σ,ǫ)-separated:

λ ≥ (1/(v_d (σ/2)^d)) · (k/n) · (1 + ǫ/2).
Paraphrased, this means that in order for such clusters to be distinguished, it is sufficient that the
number of data points be at least

n ≥ d 2^d/(v_d σ^d λ ǫ^2),

ignoring logarithmic factors. There is a 2^d factor here that does not appear in the lower bound
(Theorem 12). Can this term be removed?
We now show that this term can be improved in two particular cases.
• When the separation ǫ is not too small, in particular when ǫ > (9/10)^d, the 2^d term can be
improved to (1 + 1/√2)^d, which is roughly 1.7^d.
• If the density f is Lipschitz with parameter ℓ, that is, if

|f(x) − f(x′)| ≤ ℓ‖x − x′‖ for all x, x′ ∈ X,

and if ǫ ≥ 3ℓσ/λ, then the 2^d term can be removed altogether.
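To see what these improvements buy, here is a small tabulation (our own illustration, not part of the paper) of the leading factor in the sufficient sample size for a few dimensions, together with Theorem 18's floor on ǫ:

```python
import math

for d in (2, 5, 10, 20):
    base = 2 ** d                                          # factor from condition (*)
    improved = (1 + 1 / math.sqrt(2)) ** d                 # large-separation case, roughly 1.7^d
    eps_floor = ((7 + 4 * math.sqrt(2)) / 16) ** (d / 2)   # needed: eps > eps_floor <= (9/10)^d
    print(f"d={d:2d}  2^d={base:>8}  1.7^d~{improved:12.1f}  eps_floor={eps_floor:.4f}")
# In the Lipschitz case (Theorem 24), the exponential factor disappears entirely.
```

The gap between the two factors widens quickly with d, while the ǫ floor shrinks geometrically.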
12.1 Better rates when the separation is not too small
Theorem 18 Theorem 6 holds if ǫ > ((7 + 4√2)/16)^{d/2}, and if the condition (*) is replaced by:

λ := inf_{x ∈ A_σ ∪ A′_σ} f(x) ≥ (1/(v_d (2σ/(α+2))^d)) · (k/n) · (1 + ǫ/2).

The lower bound on ǫ is at most (9/10)^d.
Figure 5: Balls B(0,σ) and B(y,r), where ‖y‖ = q; the figure also labels the hyperplane H, the distance h from y to H, the angle θ, and the volume v(q,r,σ).
Analysis overview
The main idea is to bound the volume of B(x,r) \ A_σ for a point x that lies in A_σ but not in A_{σ−r}. If
this volume is small, then most of B(x,r) is inside A_σ, and thus has density λ or more. To establish
this bound, we begin with some notation.

Definition 19 For r, q ≤ σ, define v(q,r,σ) as the volume of the region B(y,r) \ B(0,σ), for any
y such that ‖y‖ = q. (By symmetry this volume is the same for all such y.) For an illustration, see
Figure 5.
Lemma 20 Let y ∈ A_q and r, q ≤ σ. Then vol(B(y,r) \ A_σ) ≤ v(q,r,σ).

PROOF: As y ∈ A_q, there exists some y′ ∈ A such that ‖y − y′‖ ≤ q. Let q′ = ‖y − y′‖. As A_σ
contains B(y′,σ),

v(q′,r,σ) = vol(B(y,r) \ B(y′,σ)) ≥ vol(B(y,r) \ A_σ).

The observation that for q′ ≤ q, v(q′,r,σ) ≤ v(q,r,σ) concludes the lemma. □
Lemma 21 Let r ≤ q = σ/(1 + 1/√2). Then,

v(q,r,σ) ≤ (1/2) ((7 + 4√2)/16)^{d/2} v_d r^d.
PROOF: WLOG, suppose that y lies along the first coordinate axis. If r ≤ σ − q, then the ball
B(0,σ) ⊇ B(y,r), and thus v(q,r,σ) = 0. Thus, for the rest of the proof, we assume that r > σ − q.

Let H be the (d−1)-dimensional hyperplane that contains the intersection of the spherical surfaces
of B(0,σ) and B(y,r) – see Figure 5. By spherical symmetry, H is orthogonal to the first
coordinate axis. Let h be the distance between y and H, and let θ = arccos(h/r).

Any x ∈ B(y,r) \ B(0,σ) also lies to the right of the hyperplane H in Figure 5. v(q,r,σ) is
thus at most the volume of a spherical cap of B(y,r) that subtends an angle θ at the center y. If
0 < θ < π/2, that is, if the center y lies to the left of H, then we can upper-bound this volume by that
of the smallest enclosing hemisphere. A simple calculation shows that the latter is (1/2) v_d r^d sin^d θ.
We first calculate sinθ for r = q. Let x_1 be the first coordinate of any point on the intersection of
the spherical surfaces of B(0,σ) and B(y,r). Then x_1^2 − (x_1 − q)^2 = σ^2 − r^2. Plugging in the
values of q and r in terms of σ, x_1 = σ^2/2q = (1 + √2)σ/(2√2); since x_1 > q, we verify that the
center y is to the left of H. Simple algebra now shows that

sinθ = √(7 + 4√2)/4, from which v(q,r,σ) ≤ (1/2) ((7 + 4√2)/16)^{d/2} v_d r^d.

For smaller r, observe that for a fixed q, the distance h increases with decreasing r, and θ decreases
as well. As the volume of the spherical cap also decreases with decreasing θ, the lemma follows. □
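Lemma 21's cap bound can be sanity-checked numerically. The sketch below is our own (the sampling scheme and names are assumptions, and d = 2 is chosen only for speed): it estimates v(q,r,σ) by counting uniform samples from B(y,r) that fall outside B(0,σ), and compares the estimate with the bound (1/2)((7+4√2)/16)^{d/2} v_d r^d.

```python
import math
import random

def ball_volume(d, r):
    # v_d * r^d, where v_d = pi^(d/2) / Gamma(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def estimate_v(q, r, sigma, d, trials=50000, seed=0):
    # Monte Carlo estimate of v(q,r,sigma) = vol(B(y,r) \ B(0,sigma)) with ||y|| = q;
    # y is placed on the first coordinate axis, as in the proof of Lemma 21
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        # uniform point in B(y, r): Gaussian direction, radius r * U^(1/d)
        g = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        rad = r * rng.random() ** (1 / d)
        p = [rad * x / norm for x in g]
        p[0] += q
        if math.sqrt(sum(x * x for x in p)) > sigma:
            outside += 1
    return outside / trials * ball_volume(d, r)

d, sigma = 2, 1.0
q = sigma / (1 + 1 / math.sqrt(2))
r = q  # the worst case analyzed in the proof
cap_bound = 0.5 * ((7 + 4 * math.sqrt(2)) / 16) ** (d / 2) * ball_volume(d, r)
est = estimate_v(q, r, sigma, d)
```

With these parameters the Monte Carlo estimate lands well below the cap bound, as the lemma predicts.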
Analysis: separation
Lemma 22 Let α ≥ √2, and let q = σ/(1 + 1/√2). Pick 0 < r < 2σ/(α + 2) such that

(v_d r^d − v(q,r,σ)) λ ≥ k/n + (C_δ/n) √(k d log n)
v_d r^d λ(1 − ǫ) < k/n − (C_δ/n) √(k d log n).

Then with probability > 1 − δ:
1. G_r contains all points in (A_q ∪ A′_q) ∩ X_n and no points in S_{σ−r} ∩ X_n.
2. A ∩ X_n is disconnected from A′ ∩ X_n in G_r.
PROOF: Notice first of all that r ≤ q. From Lemma 20, for any point x ∈ (A_q ∪ A′_q), v(q,r,σ) is at
most the volume of B(x,r) that lies outside A_σ ∪ A′_σ; therefore, f(B(x,r)) ≥ (v_d r^d − v(q,r,σ)) λ,
and thus, by Lemma 7, x has at least k neighbors within radius r. Likewise, any point x ∈ S_{σ−r}
has f(B(x,r)) < v_d r^d λ(1 − ǫ); and thus, by Lemma 7, has strictly fewer than k neighbors within
distance r. This establishes (1).

For (2), since points in S_{σ−r} are absent from G_r, any path from A to A′ in that graph must have an
edge across S_{σ−r}. But any such edge has length at least 2(σ − r) > αr and is thus not in G_r. □
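For intuition, the single-radius graph G_r at the heart of these lemmas can be sketched in a few lines. This is our illustrative rendering (brute-force neighbor counting, our own function names), not the paper's pseudocode: a point is kept if it has at least k sample points within distance r (counting the point itself; the exact convention may differ), and active points within distance αr are joined by an edge.

```python
import math
from itertools import combinations

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def single_radius_graph(points, r, k, alpha):
    # active vertices: at least k sample points (counting the point itself) in B(x, r)
    active = [i for i, p in enumerate(points)
              if sum(dist(p, x) <= r for x in points) >= k]
    # edges: pairs of active points at distance <= alpha * r
    edges = [(i, j) for i, j in combinations(active, 2)
             if dist(points[i], points[j]) <= alpha * r]
    return active, edges

def connected_components(active, edges):
    # simple union-find over the active vertices of G_r
    parent = {i: i for i in active}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    comps = {}
    for i in active:
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# two well-separated clumps on the line: G_r keeps them in separate components
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
active, edges = single_radius_graph(pts, r=0.25, k=3, alpha=math.sqrt(2))
comps = connected_components(active, edges)
```

All six points are active at this radius, yet no edge crosses the low-density gap, so the two clumps land in separate components — the separation behavior Lemma 22 formalizes.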
Definition 23 Define r(λ) to be the value of r for which (v_d r^d − v(q,r,σ)) λ = k/n + (C_δ/n) √(k d log n).

To satisfy the conditions of Lemma 22, recall that ǫ > ((7 + 4√2)/16)^{d/2}; it then suffices to take
k ≥ 16 C_δ^2 (d/ǫ^2) log n, and this is what we use.
Analysis: connectedness
To show that points in A (and similarly A′) are connected in G_{r(λ)}, we observe that as all x ∈ A_q ∪
A′_q are active, the arguments of Theorem 11 follow exactly as before, provided r ≤ σ/(1 + 1/√2).
Since α ≥ √2, this condition holds for any r ≤ 2σ/(α + 2).

To complete the proof of Theorem 18, take k = 16 C_δ^2 (d/ǫ^2) log n, which satisfies the requirements
of Lemma 22 as well as those of Theorem 11. The relationship that defines r(λ) (Definition 23)
then translates into

(v_d r^d − v(q,r,σ)) λ = (k/n) (1 + ǫ/2).
This shows that clusters at density level λ emerge when the growing radius r of the cluster tree
algorithm reaches roughly (k/(λ v_d n))^{1/d}. In order for (σ,ǫ)-separated clusters to be distinguished,
we need this radius to be at most 2σ/(α + 2); this is what yields the final lower bound on λ.
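The emergence radius (k/(λ v_d n))^{1/d} is easy to evaluate. A minimal sketch (ours; the parameter values are arbitrary):

```python
import math

def unit_ball_volume(d):
    # v_d = pi^(d/2) / Gamma(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def emergence_radius(k, lam, n, d):
    # radius at which clusters of density level lam appear in the cluster tree
    return (k / (lam * unit_ball_volume(d) * n)) ** (1 / d)

r = emergence_radius(k=50, lam=1.0, n=10000, d=2)
```

In this example, holding k, λ, and n fixed, the radius grows as d increases, reflecting how much slowly a fixed sample fills out a high-dimensional space.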
12.2 Better rates when the density is Lipschitz
In this section, we establish an even sharper upper bound on the rate of convergence, provided the
density f is smooth. In particular, we assume that f is Lipschitz, with a Lipschitz constant ℓ.
Theorem 24 Theorem 6 holds if the density f has Lipschitz constant ℓ and if the condition (*) is
replaced by:

λ := inf_{x ∈ A_σ ∪ A′_σ} f(x) ≥ (1/(v_d σ̃^d)) · (k/n) · (1 + ǫ/2),

where σ̃ = min(σ, λǫ/3ℓ).
As usual, for the analysis we first treat separation, then connectedness.
Lemma 25 Pick 0 < r ≤ σ̃ and α < 2 such that

v_d r^d (λ − σ̃ℓ) ≥ k/n + (C_δ/n) √(k d log n)
v_d r^d (λ(1 − ǫ) + σ̃ℓ) < k/n − (C_δ/n) √(k d log n)

(recall that v_d is the volume of the unit ball in R^d). Then with probability > 1 − δ:
1. G_r contains all points in (A_σ ∪ A′_σ) ∩ X_n and no points in S_σ ∩ X_n.
2. A ∩ X_n is disconnected from A′ ∩ X_n in G_r.
PROOF: For (1), if r ≤ σ̃, then for any point x ∈ (A_σ ∪ A′_σ), the density at any y ∈ B(x,r) is at least
λ − σ̃ℓ. Therefore, f(B(x,r)) ≥ v_d r^d (λ − σ̃ℓ); and thus, by Lemma 7, x has at least k neighbors
within radius r. Likewise, for r ≤ σ̃, any point x ∈ S_σ has f(B(x,r)) < v_d r^d (λ(1 − ǫ) + σ̃ℓ); and
thus, by Lemma 7, has strictly fewer than k neighbors within distance r.

For (2), since points in S_σ are absent from G_r, any path from A to A′ in that graph must have an
edge across S_σ. But any such edge has length at least 2σ > αr (as r ≤ σ̃ ≤ σ and α < 2) and is
thus not in G_r. □
Definition 26 Define r(λ) to be the value of r for which v_d r^d (λ − σ̃ℓ) = k/n + (C_δ/n) √(k d log n).

As σ̃ ≤ λǫ/3ℓ, to satisfy the conditions of Lemma 25, it suffices to take k ≥ 36 C_δ^2 (d/ǫ^2) log n; this
is what we use.
We now need to show that points in A (and similarly A′) are connected in G_{r(λ)}. To show this, note
that under the conditions on ℓ, the proof of Theorem 11 applies for r = σ̃ and α = √2.
To complete the proof of Theorem 24, take k = 36 C_δ^2 (d/ǫ^2) log n, which satisfies the requirements
of Lemma 25 as well as those of Theorem 11. The relationship that defines r(λ) (Definition 26)
then translates into

v_d r^d (λ − σ̃ℓ) = (k/n) (1 + ǫ/2).

This shows that clusters at density level λ emerge when the growing radius r of the cluster tree
algorithm reaches roughly (k/(λ v_d n))^{1/d}. In order for (σ,ǫ)-separated clusters to be distinguished,
we need this radius to be at most σ̃; this is what yields the final lower bound on λ.
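The effect of the Lipschitz assumption can be checked numerically. This sketch (our own illustration; function names and parameter values are assumptions) evaluates the right-hand sides of condition (*) and of Theorem 24's replacement; when ℓ is small enough that σ̃ = σ, the two thresholds differ by exactly the 2^d factor discussed above.

```python
import math

def unit_ball_volume(d):
    # v_d = pi^(d/2) / Gamma(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def rhs_condition_star(sigma, eps, k, n, d):
    # condition (*): lambda >= 1/(v_d (sigma/2)^d) * (k/n) * (1 + eps/2)
    return (k / n) * (1 + eps / 2) / (unit_ball_volume(d) * (sigma / 2) ** d)

def rhs_theorem24(lam, sigma, eps, ell, k, n, d):
    # Lipschitz variant: sigma/2 is replaced by sigma_tilde = min(sigma, lam*eps/(3*ell))
    sigma_tilde = min(sigma, lam * eps / (3 * ell))
    return (k / n) * (1 + eps / 2) / (unit_ball_volume(d) * sigma_tilde ** d)

lam, sigma, eps, ell, k, n, d = 1.0, 1.0, 0.5, 1e-6, 50, 10000, 3
ratio = rhs_condition_star(sigma, eps, k, n, d) / rhs_theorem24(lam, sigma, eps, ell, k, n, d)
# here sigma_tilde = sigma, so the ratio equals 2^d = 8
```

For larger ℓ, σ̃ shrinks below σ and the advantage of the Lipschitz condition erodes accordingly.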