
Rates of convergence for the cluster tree

Kamalika Chaudhuri

UC San Diego

kchaudhuri@ucsd.edu

Sanjoy Dasgupta

UC San Diego

dasgupta@cs.ucsd.edu

Abstract

For a density f on Rd, a high-density cluster is any connected component of {x :

f(x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy

called the cluster tree of f. We present a procedure for estimating the cluster tree

given samples from f. We give finite-sample convergence rates for our algorithm,

as well as lower bounds on the sample complexity of this estimation problem.

1 Introduction

A central preoccupation of learning theory is to understand what statistical estimation based on a

finite data set reveals about the underlying distribution from which the data were sampled. For

classification problems, there is now a well-developed theory of generalization. For clustering,

however, this kind of analysis has proved more elusive.

Consider for instance k-means, possibly the most popular clustering procedure in use today. If

this procedure is run on points X1, ..., Xn from distribution f, and is told to find k clusters, what

do these clusters reveal about f? Pollard [8] proved a basic consistency result: if the algorithm

always finds the global minimum of the k-means cost function (which is NP-hard, see Theorem 3

of [3]), then as n → ∞, the clustering is the globally optimal k-means solution for f. This result,

however impressive, leaves the fundamental question unanswered: is the best k-means solution to f

an interesting or desirable quantity, in settings outside of vector quantization?

In this paper, we are interested in clustering procedures whose output on a finite sample converges

to “natural clusters” of the underlying distribution f. There are doubtless many meaningful ways

to define natural clusters. Here we follow some early work on clustering (for instance, [5]) by

associating clusters with high-density connected regions. Specifically, a cluster of density f is any

connected component of {x : f(x) ≥ λ}, for any λ > 0. The collection of all such clusters forms

an (infinite) hierarchy called the cluster tree (Figure 1).

Are there hierarchical clustering algorithms which converge to the cluster tree? Previous theory

work [5, 7] has provided weak consistency results for the single-linkage clustering algorithm, while

other work [13] has suggested ways to overcome the deficiencies of this algorithm by making it

more robust, but without proofs of convergence. In this paper, we propose a novel way to make

single-linkage more robust, while retaining most of its elegance and simplicity (see Figure 3). We

establish its finite-sample rate of convergence (Theorem 6); the centerpiece of our argument is a

result on continuum percolation (Theorem 11). We also give a lower bound on the problem of

cluster tree estimation (Theorem 12), which matches our upper bound in its dependence on most of

the parameters of interest.

2 Definitions and previous work

Let X be a subset of Rd. We exclusively consider Euclidean distance on X, denoted ‖ · ‖. Let B(x, r) be the closed ball of radius r around x.


Figure 1: A probability density f on R, and three of its clusters: C1, C2, and C3 (at levels λ1, λ2, λ3).

2.1 The cluster tree

We start with notions of connectivity. A path P in S ⊂ X is a continuous 1 − 1 function P : [0, 1] → S. If x = P(0) and y = P(1), we write x ⇝P y, and we say that x and y are connected in S. This relation – "connected in S" – is an equivalence relation that partitions S into its connected components. We say S ⊂ X is connected if it has a single connected component.

The cluster tree is a hierarchy each of whose levels is a partition of a subset of X, which we will

occasionally call a subpartition of X. Write Π(X) = {subpartitions of X}.

Definition 1 For any f : X → R, the cluster tree of f is a function Cf : R → Π(X) given by

Cf(λ) = connected components of {x ∈ X : f(x) ≥ λ}.

Any element of Cf(λ), for any λ, is called a cluster of f.


For any λ, Cf(λ) is a set of disjoint clusters of X. They form a hierarchy in the following sense.

Lemma 2 Pick any λ′ ≤ λ. Then:

1. For any C ∈ Cf(λ), there exists C′ ∈ Cf(λ′) such that C ⊆ C′.

2. For any C ∈ Cf(λ) and C′ ∈ Cf(λ′), either C ⊆ C′ or C ∩ C′ = ∅.

We will sometimes deal with the restriction of the cluster tree to a finite set of points x1,...,xn.

Formally, the restriction of a subpartition C ∈ Π(X) to these points is defined to be C[x1,...,xn] =

{C ∩ {x1,...,xn} : C ∈ C}. Likewise, the restriction of the cluster tree is Cf[x1,...,xn] : R →

Π({x1,...,xn}), where Cf[x1,...,xn](λ) = Cf(λ)[x1,...,xn]. See Figure 2 for an example.
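Definition 1 can be explored numerically. The following sketch (our illustration, not from the paper) approximates Cf(λ) on a grid for a two-bump density on R, in the spirit of Figure 1; the particular mixture density, grid range, and step size are all arbitrary choices of ours.

```python
import math

def gauss(x, mu, s):
    # One-dimensional Gaussian density with mean mu and scale s.
    return math.exp(-((x - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def f(x):
    # An illustrative two-bump density on R (equal-weight mixture).
    return 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)

def level_set_components(f, lam, lo=-6.0, hi=6.0, step=0.01):
    """Connected components of {x : f(x) >= lam}, approximated on a grid.

    Returns a list of (left, right) intervals; each interval is one
    cluster of f at level lam."""
    comps, start, x = [], None, lo
    while x <= hi:
        if f(x) >= lam and start is None:
            start = x                      # entering a component
        elif f(x) < lam and start is not None:
            comps.append((start, x - step))  # leaving a component
            start = None
        x += step
    if start is not None:
        comps.append((start, hi))
    return comps
```

At a high level λ the two modes appear as separate clusters; at a low enough level they merge into one, which is exactly the hierarchy of Lemma 2.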

2.2 Notion of convergence and previous work

Suppose a sample Xn ⊂ X of size n is used to construct a tree Cn that is an estimate of Cf. Hartigan [5] provided a very natural notion of consistency for this setting.

Definition 3 For any sets A, A′ ⊂ X, let An (resp., A′n) denote the smallest cluster of Cn containing A ∩ Xn (resp., A′ ∩ Xn). We say Cn is consistent if, whenever A and A′ are different connected components of {x : f(x) ≥ λ} (for some λ > 0), P(An is disjoint from A′n) → 1 as n → ∞.

It is well known that if Xn is used to build a uniformly consistent density estimate fn (that is, supx |fn(x) − f(x)| → 0), then the cluster tree Cfn is consistent; see the appendix for details.

The big problem is that Cfn is not easy to compute for typical density estimates fn: imagine, for instance, how one might go about trying to find level sets of a mixture of Gaussians! Wong and


Figure 2: A probability density f, and the restriction of Cf to a finite set of eight points.

Lane [14] have an efficient procedure that tries to approximate Cfn when fn is a k-nearest neighbor density estimate, but they have not shown that it preserves the consistency property of Cfn.

There is a simple and elegant algorithm that is a plausible estimator of the cluster tree: single

linkage (or Kruskal’s algorithm); see the appendix for pseudocode. Hartigan [5] has shown that it is

consistent in one dimension (d = 1). But he also demonstrates, by a lovely reduction to continuum

percolation, that this consistency fails in higher dimension d ≥ 2. The problem is the requirement that A ∩ Xn ⊂ An: by the time the clusters are large enough that one of them contains all of A, there is a reasonable chance that this cluster will be so big as to also contain part of A′.

With this insight, Hartigan defines a weaker notion of fractional consistency, under which An (resp., A′n) need not contain all of A ∩ Xn (resp., A′ ∩ Xn), but merely a sizeable chunk of it – and ought to be very close (at distance → 0 as n → ∞) to the remainder. He then shows that single linkage has this weaker consistency property for any pair A, A′ for which the ratio of inf{f(x) : x ∈ A ∪ A′} to sup{inf{f(x) : x ∈ P} : paths P from A to A′} is sufficiently large. More recent work by Penrose [7] closes the gap and shows fractional consistency whenever this ratio is > 1.

A more robust version of single linkage has been proposed by Wishart [13]: when connecting points

at distance r from each other, only consider points that have at least k neighbors within distance r

(for some k > 2). Thus initially, when r is small, only the regions of highest density are available for

linkage, while the rest of the data set is ignored. As r gets larger, more and more of the data points

become candidates for linkage. This scheme is intuitively sensible, but Wishart does not provide a

proof of convergence. Thus it is unclear how to set k, for instance.

Stuetzle and Nugent [12] have an appealing top-down scheme for estimating the cluster tree, along

with a post-processing step (called runt pruning) that helps identify modes of the distribution. The

consistency of this method has not yet been established.

Several recent papers [6, 10, 9, 11] have considered the problem of recovering the connected com-

ponents of {x : f(x) ≥ λ} for a user-specified λ: the flat version of our problem. In particular,

the algorithm of [6] is intuitively similar to ours, though they use a single graph in which each point

is connected to its k nearest neighbors, whereas we have a hierarchy of graphs in which each point

is connected to other points at distance ≤ r (for various r). Interestingly, k-nn graphs are valuable

for flat clustering because they can adapt to clusters of different scales (different average interpoint

distances). But they are challenging to analyze and seem to require various regularity assumptions

on the data. A pleasant feature of the hierarchical setting is that different scales appear at different

levels of the tree, rather than being collapsed together. This allows the use of r-neighbor graphs, and

makes possible an analysis that has minimal assumptions on the data.

3 Algorithm and results

In this paper, we consider a generalization of Wishart’s scheme and of single linkage, shown in

Figure 3. It has two free parameters: k and α. For practical reasons, it is of interest to keep these as


1. For each xi set rk(xi) = inf{r : B(xi, r) contains k data points}.

2. As r grows from 0 to ∞:

(a) Construct a graph Gr with nodes {xi : rk(xi) ≤ r}.
Include edge (xi, xj) if ‖xi − xj‖ ≤ αr.

(b) Let Ĉ(r) be the connected components of Gr.

Figure 3: Algorithm for hierarchical clustering. The input is a sample Xn = {x1, ..., xn} from density f on X. Parameters k and α need to be set. Single linkage is (α = 1, k = 2). Wishart suggested α = 1 and larger k.
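The procedure of Figure 3 can be implemented as a single Kruskal-style sweep: edge (xi, xj) first appears in Gr at radius r = max(rk(xi), rk(xj), ‖xi − xj‖/α), so sorting edges by this activation radius and merging components with union-find recovers every Ĉ(r) at once. Below is a minimal sketch of this idea (our own illustration, not the authors' code; the function names and the brute-force O(n²) neighbor computation are our choices):

```python
import itertools
import math

def cluster_tree(points, k=2, alpha=1.0):
    """Sketch of the Figure 3 procedure via a Kruskal-style sweep.

    Returns merge events (r, i, j): at radius r, the components of G_r
    containing points i and j join."""
    n = len(points)
    dist = lambda a, b: math.dist(a, b)
    # Step 1: r_k(x_i) = distance to the k-th nearest point (including x_i).
    r_k = []
    for i in range(n):
        ds = sorted(dist(points[i], points[j]) for j in range(n))
        r_k.append(ds[k - 1])  # ds[0] == 0 is the point itself
    # Edge (i, j) enters G_r once both endpoints are active (r >= r_k)
    # and the edge fits (||x_i - x_j|| <= alpha * r).
    edges = sorted(
        (max(r_k[i], r_k[j], dist(points[i], points[j]) / alpha), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    # Kruskal-style sweep with union-find.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    merges = []
    for r, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append((r, i, j))
    return merges
```

For example, on the 1-d sample {0, 1, 10, 11} with (k = 2, α = 1), the two nearby pairs merge at radius 1 and the two resulting components merge only at radius 9.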

small as possible. We provide finite-sample convergence rates for all 1 ≤ α ≤ 2 and we can achieve k ∼ d log n, which we conjecture to be the best possible, if α ≥ √2. Our rates for α = 1 force k to be much larger, exponential in d. It is a fascinating open problem to determine whether the setting (α = 1, k ∼ d log n) yields consistency.

3.1 A notion of cluster salience

Suppose density f is supported on some subset X of Rd. We will show that the hierarchical cluster-

ing procedure is consistent in the sense of Definition 3. But the more interesting question is, what

clusters will be identified from a finite sample? To answer this, we introduce a notion of salience.

The first consideration is that a cluster is hard to identify if it contains a thin “bridge” that would

make it look disconnected in a small sample. To control this, we consider a “buffer zone” of width

σ around the clusters.

Definition 4 For Z ⊂ Rd and σ > 0, write Zσ = Z + B(0, σ) = {y ∈ Rd : infz∈Z ‖y − z‖ ≤ σ}.

An important technical point is that Zσ is a full-dimensional set, even if Z itself is not.

Second, the ease of distinguishing two clusters A and A′ depends inevitably upon the separation between them. To keep things simple, we'll use the same σ as a separation parameter.

Definition 5 Let f be a density on X ⊂ Rd. We say that A, A′ ⊂ X are (σ, ǫ)-separated if there exists S ⊂ X (separator set) such that:

• Any path in X from A to A′ intersects S.

• supx∈Sσ f(x) < (1 − ǫ) infx∈Aσ∪A′σ f(x).

Under this definition, Aσ and A′σ must lie within X, otherwise the right-hand side of the inequality is zero. However, Sσ need not be contained in X.

3.2 Consistency and finite-sample rate of convergence

Here we state the result for α ≥ √2 and k ∼ d log n. The analysis section also has results for 1 ≤ α ≤ 2 and k ∼ (2/α)^d d log n.

Theorem 6 There is an absolute constant C such that the following holds. Pick any δ, ǫ > 0, and run the algorithm on a sample Xn of size n drawn from f, with settings

√2 ≤ α ≤ 2 and k = C · (d log n / ǫ²) · log²(1/δ).

Then there is a mapping r : [0, ∞) → [0, ∞) such that with probability at least 1 − δ, the following holds uniformly for all pairs of connected subsets A, A′ ⊂ X: If A, A′ are (σ, ǫ)-separated (for ǫ and some σ > 0), and if

λ := infx∈Aσ∪A′σ f(x) ≥ (1 / (vd (σ/2)^d)) · (k/n) · (1 + ǫ/2)   (*)

where vd is the volume of the unit ball in Rd, then:


1. Separation. A ∩ Xn is disconnected from A′ ∩ Xn in Gr(λ).

2. Connectedness. A ∩ Xn and A′ ∩ Xn are each individually connected in Gr(λ).

The two parts of this theorem – separation and connectedness – are proved in Sections 3.3 and 3.4.

We mention in passing that this finite-sample result implies consistency (Definition 3): as n → ∞, take kn = (d log n)/ǫn² with any schedule of (ǫn : n = 1, 2, ...) such that ǫn → 0 and kn/n → 0. Under mild conditions, any two connected components A, A′ of {f ≥ λ} are (σ, ǫ)-separated for some σ, ǫ > 0 (see appendix); thus they will get distinguished for sufficiently large n.

3.3 Analysis: separation

The cluster tree algorithm depends heavily on the radii rk(x): the distance within which x's k nearest neighbors lie (including x itself). Thus the empirical probability mass of B(x, rk(x)) is k/n. To show that rk(x) is meaningful, we need to establish that the mass of this ball under density f is also, very approximately, k/n. The uniform convergence of these empirical counts follows from the fact that balls in Rd have finite VC dimension, d + 1. Using uniform Bernstein-type bounds, we get a set of basic inequalities that we use repeatedly.
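As a quick numerical sanity check (ours, not part of the paper's analysis), one can sample from a known density and compare the true mass of B(x, rk(x)) with k/n. The uniform density on [0, 1], the seed, and the constants below are arbitrary choices:

```python
import random

def rk(points, x, k):
    """Radius of the smallest ball around x containing k sample points
    (including x itself, as in the paper's convention)."""
    return sorted(abs(p - x) for p in points)[k - 1]

random.seed(0)
n, k = 2000, 100
sample = [random.random() for _ in range(n)]   # uniform density on [0, 1]
x = min(sample, key=lambda p: abs(p - 0.5))    # a point well inside [0, 1]
r = rk(sample, x, k)
# Under the uniform density, the true mass of B(x, r) is its clipped length.
mass = min(x + r, 1.0) - max(x - r, 0.0)
print(abs(mass - k / n))  # small: the ball's true mass is close to k/n
```

The typical deviation is on the order of √k/n, consistent with the Bernstein-type error bars in Lemma 7.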

Lemma 7 Assume k ≥ d log n, and fix some δ > 0. Then there exists a constant Cδ such that with probability > 1 − δ, every ball B ⊂ Rd satisfies the following conditions:

f(B) ≥ (Cδ d log n)/n ⟹ fn(B) > 0

f(B) ≥ k/n + (Cδ/n)√(k d log n) ⟹ fn(B) ≥ k/n

f(B) ≤ k/n − (Cδ/n)√(k d log n) ⟹ fn(B) < k/n

Here fn(B) = |Xn ∩ B|/n is the empirical mass of B, while f(B) = ∫B f(x) dx is its true mass.

PROOF: See appendix. Cδ = 2Co log(2/δ), where Co is the absolute constant from Lemma 16. □

We will henceforth think of δ as fixed, so that we do not have to repeatedly quantify over it.

Lemma 8 Pick 0 < r < 2σ/(α + 2) such that

vd r^d λ ≥ k/n + (Cδ/n)√(k d log n)

vd r^d λ(1 − ǫ) < k/n − (Cδ/n)√(k d log n)

(recall that vd is the volume of the unit ball in Rd). Then with probability > 1 − δ:

1. Gr contains all points in (Aσ−r ∪ A′σ−r) ∩ Xn and no points in Sσ−r ∩ Xn.

2. A ∩ Xn is disconnected from A′ ∩ Xn in Gr.

PROOF: For (1), any point x ∈ (Aσ−r ∪ A′σ−r) has f(B(x, r)) ≥ vd r^d λ; and thus, by Lemma 7, has at least k neighbors within radius r. Likewise, any point x ∈ Sσ−r has f(B(x, r)) < vd r^d λ(1 − ǫ); and thus, by Lemma 7, has strictly fewer than k neighbors within distance r.

For (2), since points in Sσ−r are absent from Gr, any path from A to A′ in that graph must have an edge across Sσ−r. But any such edge has length at least 2(σ − r) > αr and is thus not in Gr. □

Definition 9 Define r(λ) to be the value of r for which vd r^d λ = k/n + (Cδ/n)√(k d log n).

To satisfy the conditions of Lemma 8, it suffices to take k ≥ 4Cδ² (d/ǫ²) log n; this is what we use.


Figure 4: Left: P is a path from x to x′, and π(xi) is the point furthest along the path that is within distance r of xi. Right: The next point, xi+1 ∈ Xn, is chosen from the half-ball of B(π(xi), r) in the direction of xi − π(xi).

3.4 Analysis: connectedness

We need to show that points in A (and similarly A′) are connected in Gr(λ). First we state a simple bound (proved in the appendix) that works if α = 2 and k ∼ d log n; later we consider smaller α.

Lemma 10 Suppose 1 ≤ α ≤ 2. Then with probability ≥ 1 − δ, A ∩ Xn is connected in Gr whenever r ≤ 2σ/(2 + α) and the conditions of Lemma 8 hold, and

vd r^d λ ≥ (2/α)^d · (Cδ d log n)/n.

Comparing this to the definition of r(λ), we see that choosing α = 1 would entail k ≥ 2^d, which is undesirable. We can get a more reasonable setting of k ∼ d log n by choosing α = 2, but we'd like α to be as small as possible. A more refined argument shows that α ≈ √2 is enough.

Theorem 11 Suppose α ≥ √2. Then, with probability > 1 − δ, A ∩ Xn is connected in Gr whenever r ≤ σ/2 and the conditions of Lemma 8 hold, and

vd r^d λ ≥ (4Cδ d log n)/n.

PROOF: We have already made heavy use of uniform convergence over balls. We now also require the class G of half-balls, each of which is the intersection of an open ball and a halfspace through the center of the ball. Formally, each of these functions is defined by a center µ and a unit direction u, and is the indicator function of the set

{z ∈ Rd : ‖z − µ‖ < r, (z − µ) · u > 0}.

We will describe any such set as "the half of B(µ, r) in direction u". If the half-ball lies entirely in Aσ, its probability mass is at least (1/2)vd r^d λ. By uniform convergence over G (which has VC dimension at most 2d + 1), we can then conclude (as in Lemma 7) that if vd r^d λ ≥ (4Cδ d log n)/n, then with probability at least 1 − δ, every such half-ball within Aσ contains at least one data point.

Pick any x, x′ ∈ A ∩ Xn; there is a path P in A with x ⇝P x′. We'll identify a sequence of data points x0 = x, x1, x2, ..., ending in x′, such that for every i, point xi is active in Gr and ‖xi − xi+1‖ ≤ αr. This will confirm that x is connected to x′ in Gr.

To begin with, recall that P is a continuous 1 − 1 function from [0, 1] into A. We are also interested in the inverse P−1, which sends a point on the path back to its parametrization in [0, 1]. For any point y ∈ X, define N(y) to be the portion of [0, 1] whose image under P lies in B(y, r): that is, N(y) = {0 ≤ z ≤ 1 : P(z) ∈ B(y, r)}. If y is within distance r of P, then N(y) is nonempty. Define π(y) = P(sup N(y)), the furthest point along the path within distance r of y (Figure 4, left).

The sequence x0, x1, x2, ... is defined iteratively; x0 = x, and for i = 0, 1, 2, ... :

• If ‖xi − x′‖ ≤ αr, set xi+1 = x′ and stop.


• By construction, xi is within distance r of path P and hence N(xi) is nonempty.

• Let B be the open ball of radius r around π(xi). The half of B in direction xi − π(xi) must contain a data point; this is xi+1 (Figure 4, right).

The process eventually stops because each π(xi+1) is strictly further along path P than π(xi); formally, P−1(π(xi+1)) > P−1(π(xi)). This is because ‖xi+1 − π(xi)‖ < r, so by continuity of the function P, there are points further along the path (beyond π(xi)) whose distance to xi+1 is still < r. Thus xi+1 is distinct from x0, x1, ..., xi. Since there are finitely many data points, the process must terminate, so the sequence {xi} does constitute a path from x to x′.

Each xi lies in Ar ⊆ Aσ−r and is thus active in Gr (Lemma 8). Finally, the distance between successive points is:

‖xi − xi+1‖² = ‖xi − π(xi) + π(xi) − xi+1‖²
= ‖xi − π(xi)‖² + ‖π(xi) − xi+1‖² − 2(xi − π(xi)) · (xi+1 − π(xi))
≤ 2r²
≤ α²r²,

where the second-last inequality is from the definition of half-ball. □

To complete the proof of Theorem 6, take k = 4Cδ² (d/ǫ²) log n, which satisfies the requirements of Lemma 8 as well as those of Theorem 11. The relationship that defines r(λ) (Definition 9) then translates into

vd r^d λ = (k/n) (1 + ǫ/2).

This shows that clusters at density level λ emerge when the growing radius r of the cluster tree algorithm reaches roughly (k/(λ vd n))^{1/d}. In order for (σ, ǫ)-separated clusters to be distinguished, we need this radius to be at most σ/2; this is what yields the final lower bound on λ.

4 Lower bound

We have shown that the algorithm of Figure 3 distinguishes pairs of clusters that are (σ,ǫ)-separated.

The number of samples it requires to capture clusters at density ≥ λ is, by Theorem 6,

O( (d / (vd (σ/2)^d λ ǫ²)) log (d / (vd (σ/2)^d λ ǫ²)) ).

We'll now show that this dependence on σ, λ, and ǫ is optimal. The only room for improvement, therefore, is in constants involving d.

Theorem 12 Pick any ǫ in (0, 1/2), any d > 1, and any σ, λ > 0 such that λvd−1σ^d < 1/50. Then there exist: an input space X ⊂ Rd; a finite family of densities Θ = {θi} on X; subsets Ai, A′i, Si ⊂ X such that Ai and A′i are (σ, ǫ)-separated by Si for density θi, and infx∈Ai,σ∪A′i,σ θi(x) ≥ λ, with the following additional property.

Consider any algorithm that is given n ≥ 100 i.i.d. samples Xn from some θi ∈ Θ and, with probability at least 1/2, outputs a tree in which the smallest cluster containing Ai ∩ Xn is disjoint from the smallest cluster containing A′i ∩ Xn. Then

n = Ω( (1 / (vd σ^d λ ǫ² d^{1/2})) log (1 / (vd σ^d λ)) ).

PROOF: We start by constructing the various spaces and densities. X is made up of two disjoint regions: a cylinder X0, and an additional region X1 whose sole purpose is as a repository for excess probability mass. Let Bd−1 be the unit ball in Rd−1, and let σBd−1 be this same ball scaled to have radius σ. The cylinder X0 stretches along the x1-axis; its cross-section is σBd−1 and its length is 4(c + 1)σ for some c > 1 to be specified: X0 = [0, 4(c + 1)σ] × σBd−1. Here is a picture of it:


[Figure: the cylinder X0, of radius σ and length 4(c + 1)σ, with marks at 4σ, 8σ, 12σ, ... along the x1-axis.]

We will construct a family of densities Θ = {θi} on X, and then argue that any cluster tree algorithm that is able to distinguish (σ, ǫ)-separated clusters must be able, when given samples from some θI, to determine the identity of I. The sample complexity of this latter task can be lower-bounded using Fano's inequality (typically stated as in [2], but easily rewritten in the convenient form of [15], see appendix): it is Ω((log |Θ|)/β), for β = maxi≠j K(θi, θj), where K(·, ·) is KL divergence.

The family Θ contains c − 1 densities θ1, ..., θc−1, where θi is defined as follows:

• Density λ on [0, 4σi + σ] × σBd−1 and on [4σi + 3σ, 4(c + 1)σ] × σBd−1. Since the cross-sectional area of the cylinder is vd−1σ^{d−1}, the total mass here is λvd−1σ^d(4(c + 1) − 2).

• Density λ(1 − ǫ) on (4σi + σ, 4σi + 3σ) × σBd−1.

• Point masses 1/(2c) at locations 4σ, 8σ, ..., 4cσ along the x1-axis (use arbitrarily narrow spikes to avoid discontinuities).

• The remaining mass, 1/2 − λvd−1σ^d(4(c + 1) − 2ǫ), is placed on X1 in some fixed manner (that does not vary between different densities in Θ).

Here is a sketch of θi. The low-density region of width 2σ is centered at 4σi + 2σ on the x1-axis.

[Figure: sketch of θi along the cylinder, showing the point masses 1/2c, the regions of density λ, and the width-2σ region of density λ(1 − ǫ).]

For any i ≠ j, the densities θi and θj differ only on the cylindrical sections (4σi + σ, 4σi + 3σ) × σBd−1 and (4σj + σ, 4σj + 3σ) × σBd−1, which are disjoint and each have volume 2vd−1σ^d. Thus

K(θi, θj) = 2vd−1σ^d ( λ log(λ / (λ(1 − ǫ))) + λ(1 − ǫ) log(λ(1 − ǫ) / λ) ) = 2vd−1σ^d λ (−ǫ log(1 − ǫ)) ≤ (4/ln 2) vd−1σ^d λ ǫ²

(using ln(1 − x) ≥ −2x for 0 < x ≤ 1/2). This is an upper bound on the β in the Fano bound.
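After cancelling the common factor 2vd−1σ^d λ, the final inequality above reduces to −ǫ log2(1 − ǫ) ≤ 2ǫ²/ln 2. A quick numerical check of this step (our illustration; logarithms base two, as in the Fano bound):

```python
import math

def kl_term(eps):
    # -eps * log2(1 - eps): the per-factor KL quantity from the text.
    return -eps * math.log2(1 - eps)

def bound(eps):
    # 2 eps^2 / ln 2, obtained from ln(1 - x) >= -2x on (0, 1/2].
    return 2 * eps * eps / math.log(2)

# The inequality holds across the allowed range of eps:
for eps in [0.01, 0.1, 0.25, 0.5]:
    assert kl_term(eps) <= bound(eps)
```

Note the bound is fairly tight for small ǫ, where −log2(1 − ǫ) ≈ ǫ/ln 2.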

Now define the clusters and separators as follows: for each 1 ≤ i ≤ c − 1,

• Ai is the line segment [σ, 4σi] along the x1-axis,

• A′i is the line segment [4σ(i + 1), 4(c + 1)σ − σ] along the x1-axis, and

• Si = {4σi + 2σ} × σBd−1 is the cross-section of the cylinder at location 4σi + 2σ.

Thus Ai and A′i are one-dimensional sets while Si is a (d − 1)-dimensional set. It can be checked that Ai and A′i are (σ, ǫ)-separated by Si in density θi.

With the various structures defined, what remains is to argue that if an algorithm is given a sample Xn from some θI (where I is unknown), and is able to separate AI ∩ Xn from A′I ∩ Xn, then it can effectively infer I. This has sample complexity Ω((log c)/β). Details are in the appendix. □

There remains a discrepancy of 2^d between the upper and lower bounds; it is an interesting open problem to close this gap. Does the (α = 1, k ∼ d log n) setting (yet to be analyzed) do the job?

Acknowledgments. We thank the anonymous reviewers for their detailed and insightful comments,

and the National Science Foundation for support under grant IIS-0347646.


References

[1] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture

Notes in Artificial Intelligence, 3176:169–207, 2004.

[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 2005.

[3] S. Dasgupta and Y. Freund. Random projection trees for vector quantization. IEEE Transac-

tions on Information Theory, 55(7):3229–3242, 2009.

[4] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Jour-

nal of Machine Learning Research, 10:281–299, 2009.

[5] J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American

Statistical Association, 76(374):388–394, 1981.

[6] M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest neighbor graphs

for identifying noisy clusters. Theoretical Computer Science, 410:1749–1764, 2009.

[7] M. Penrose. Single linkage clustering and continuum percolation. Journal of Multivariate

Analysis, 53:94–109, 1995.

[8] D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.

[9] P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level sets. Bernoulli,

15(4):1154–1178, 2009.

[10] A. Rinaldo and L. Wasserman. Generalized density clustering. Annals of Statistics, 38(5):2678–2722, 2010.

[11] A. Singh, C. Scott, and R. Nowak. Adaptive hausdorff estimation of density level sets. Annals

of Statistics, 37(5B):2760–2782, 2009.

[12] W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree

of a density. Journal of Computational and Graphical Statistics, 19(2):397–418, 2010.

[13] D. Wishart. Mode analysis: a generalization of nearest neighbor which reduces chaining ef-

fects. In Proceedings of the Colloquium on Numerical Taxonomy held in the University of St.

Andrews, pages 282–308, 1969.

[14] M.A. Wong and T. Lane. A kth nearest neighbour clustering procedure. Journal of the Royal

Statistical Society Series B, 45(3):362–368, 1983.

[15] B. Yu. Assouad, Fano and Le Cam. Festschrift for Lucien Le Cam, pages 423–435, 1997.



5 Appendix: using a uniformly consistent density estimate

One way to build a cluster tree is to return Cfn, where fn is a uniformly consistent density estimate.

Lemma 13 Suppose estimator fn of density f (on space X) satisfies

supx∈X |fn(x) − f(x)| ≤ ǫn.

Pick any two disjoint sets A, A′ ⊂ X and define

α = infx∈A∪A′ f(x)

β = sup{infx∈P f(x) : paths P from A to A′}.

If α − β > 2ǫn then A, A′ lie entirely in disjoint connected components of Cfn(α − ǫn).

PROOF: A and A′ are each connected in Cfn(α − ǫn). But there is no path from A to A′ in Cfn(λ) for λ > β + ǫn. □

The problem, however, is that computing the level sets of fn is usually not an easy task. Hence we adopt a different approach in this paper.

6 Appendix: single linkage

This procedure for building a hierarchical clustering takes as input a data set x1,...,xn∈ Rd.

1. For each data point xi, set r2(xi) = distance from xito its nearest neighbor.

2. As r grows from 0 to ∞:

(a) Construct a graph Grwith nodes {xi: r2(xi) ≤ r}.

Include edge (xi,xj) if ?xi− xj? ≤ r.

(b) Let?C(r) be the connected components of Gr.
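As Section 2.2 notes, this is Kruskal's minimum-spanning-tree algorithm in disguise: since r2(xi) ≤ ‖xi − xj‖ for every j, an edge enters Gr exactly when r reaches the interpoint distance. A minimal union-find sketch of this special case (our illustration, not code from the paper):

```python
import itertools
import math

def single_linkage_merges(points):
    """Single linkage as Kruskal's algorithm: process pairwise distances
    in increasing order, merging components. The merge distances are the
    edges of the Euclidean minimum spanning tree."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append((d, i, j))
    return merges  # n - 1 merge events
```

On two well-separated pairs of points, the within-pair merges happen at small r and the final cross-pair merge at the gap distance, mirroring the behavior of Gr.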

7 Appendix: consistency

The following is a straightforward exercise in analysis.

Lemma 14 Suppose density f : Rd → R is continuous and is zero outside a compact subset X ⊂ Rd. Suppose further that for some λ, {x ∈ X : f(x) ≥ λ} has finitely many connected components, among them A and A′. Then there exist σ, ǫ > 0 such that A and A′ are (σ, ǫ)-separated.

PROOF: Let A1, A2, ..., Ak be the connected components of {f ≥ λ}, with A = A1 and A′ = A2.

First, each Ai is closed and thus compact. To see this, pick any x ∈ X \ Ai. There must be some x′ on the shortest path from x to Ai with f(x′) < λ (otherwise x ∈ Ai). By continuity of f, there is some ball B(x′, r) on which f < λ; thus this ball doesn't touch Ai. Then B(x, r) doesn't touch Ai.

Next, for any i ≠ j, define ∆ij = infx∈Ai, y∈Aj ‖x − y‖ to be the distance between Ai and Aj. We'll see that ∆ij > 0. Specifically, define g : Ai × Aj → R by g(a, a′) = ‖a − a′‖. Since g has compact domain, it attains its infimum at some a ∈ Ai, a′ ∈ Aj. Thus ∆ij = ‖a − a′‖ > 0.

Let ∆ = mini≠j ∆ij > 0, and define S to be the set of points at distance exactly ∆/2 from A:

S = {x ∈ X : infy∈A ‖x − y‖ = ∆/2}.

S separates A from A′. Moreover, it is closed by continuity of ‖ · ‖, and hence is compact. Define λo = supx∈S f(x). Since S is compact, f (restricted to S) is maximized at some xo ∈ S. Then λo = f(xo) < λ.


To finish up, set δ = (λ − λo)/3 > 0. By uniform continuity of f, there is some σ > 0 such that f doesn't change by more than δ on balls of radius σ. Then f(x) ≤ λo + δ = λ − 2δ for x ∈ Sσ and f(x) ≥ λ − δ for x ∈ Aσ ∪ A′σ.

Thus S is a (σ, δ/(λ − δ))-separator for A, A′. □

8 Appendix: proof of Lemma 7

We start with a standard generalization result due to Vapnik and Chervonenkis; the following version

is a paraphrase of Theorem 5.1 of [1].

Theorem 15 Let G be a class of functions from X to {0, 1} with VC dimension d < ∞, and P a probability distribution on X. Let E denote expectation with respect to P. Suppose n points are drawn independently at random from P; let En denote expectation with respect to this sample. Then for any δ > 0, with probability at least 1 − δ, the following holds for all g ∈ G:

−min(βn √(En g), βn² + βn √(E g)) ≤ E g − En g ≤ min(βn² + βn √(En g), βn √(E g)),

where βn = √((4/n)(d ln 2n + ln(8/δ))).

By applying this bound to the class G of indicator functions over balls, we get the following:

Lemma 16 Suppose Xn is a sample of n points drawn independently at random from a distribution f over X. For any set Y ⊂ X, let fn(Y) = |Xn ∩ Y|/n. There is a universal constant Co > 0 such that for any δ > 0, with probability at least 1 − δ, for any ball B ⊂ Rd:

f(B) ≥ (Co/n)(d log n + log(1/δ)) ⟹ fn(B) > 0

f(B) ≥ k/n + (Co/n)(d log n + log(1/δ) + √(k(d log n + log(1/δ)))) ⟹ fn(B) ≥ k/n

f(B) < k/n − (Co/n)(d log n + log(1/δ) + √(k(d log n + log(1/δ)))) ⟹ fn(B) < k/n

PROOF: The bound f(B) − fn(B) ≤ βn √(f(B)) from Theorem 15 yields

f(B) > βn² ⟹ fn(B) > 0.

For the second bound, we use f(B) − fn(B) ≤ βn² + βn √(fn(B)). It follows that

f(B) ≥ k/n + βn² + βn √(k/n) ⟹ fn(B) ≥ k/n.

For the last bound, we rearrange f(B) − fn(B) ≥ −(βn² + βn √(f(B))) to get

f(B) < k/n − βn² − βn √(k/n) ⟹ fn(B) < k/n. □

Lemma 7 now follows immediately, by taking k ≥ d log n. Since the uniform convergence bounds have error bars of magnitude (d log n)/n, it doesn't make sense to take k any smaller than this.

9 Appendix: proof of Lemma 10

Consider any x, x′ ∈ A ∩ Xn. Since A is connected, there is a path P in A with x ⇝P x′. Fix any 0 < γ < 1. Because the density of A is lower bounded away from zero, it follows by a volume and packing-covering argument that A, and thus P, can be covered by a finite number of balls of diameter γr. Thus we can choose finitely many points z1, z2, ..., zk ∈ P such that x = z0, x′ = zk and

‖zi+1 − zi‖ ≤ γr.

By Lemma 7, any ball centered in A with radius (α − γ)r/2 contains at least one data point if

vd ((α − γ)r/2)^d λ ≥ (Cδ d log n)/n.   (1)

Assume for the moment that this holds. Then, every ball B(zi, (α − γ)r/2) contains at least one point; call it xi.

By the upper bound on r, each such xi lies in Aσ−r; therefore, by Lemma 8, the xi are all active in Gr. Moreover, consecutive points xi are close together:

‖xi+1 − xi‖ ≤ ‖xi+1 − zi+1‖ + ‖zi+1 − zi‖ + ‖zi − xi‖ ≤ αr.

Therefore, all edges (xi, xi+1) exist in Gr, whereby x is connected to x′ in Gr.

All this assumes that equation (1) holds for some γ > 0. Taking γ → 0 gives the lemma.

10 Appendix: Fano's inequality

Consider the following game played with a predefined, finite class of distributions Θ = {θ1, ..., θℓ}, defined on a common space X:

• Nature picks I ∈ {1, 2, ..., ℓ}.

• Player is given n i.i.d. samples X1, ..., Xn from θI.

• Player then guesses the identity of I.

Fano's inequality [2, 15] gives a lower bound on the number of samples n needed to achieve a certain success probability. It depends on how similar the distributions θi are: the more similar, the more samples are needed. Define

β = (1/ℓ²) Σ_{i,j=1}^{ℓ} K(θi, θj),

where K(·, ·) is KL divergence. Then n needs to be Ω((log ℓ)/β). Here's the formal statement.

Theorem 17 (Fano) Let g : Xn → {1, 2, ..., ℓ} denote Player's computation. If Nature chooses I uniformly at random from {1, 2, ..., ℓ}, then for any 0 < δ < 1,

n ≤ ((1 − δ)(log ℓ) − 1)/β ⟹ Pr(g(X1, ..., Xn) ≠ I) ≥ δ,

where the logarithm is base two.
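Theorem 17 can be turned around to give the sample-size threshold below which any estimator must fail with probability at least δ. A tiny helper to compute that threshold (our sketch, not from the paper):

```python
import math

def fano_threshold(ell, beta, delta=0.5):
    """Largest n ruled out by Theorem 17: any procedure using at most
    ((1 - delta) * log2(ell) - 1) / beta samples errs w.p. >= delta."""
    return ((1 - delta) * math.log2(ell) - 1) / beta
```

For instance, with ℓ = 16 hypotheses, β = 0.1, and δ = 1/2, the threshold is (0.5 · 4 − 1)/0.1 = 10 samples; more hypotheses (larger ℓ) or more similar distributions (smaller β) push the threshold up.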

11 Appendix: proof details for Theorem 12

Once the various structures are defined, the remainder of the proof is broken into two phases. The first will establish that if n is at least a small constant (say, 100), then it must be the case that n = Ω(1/(vd σ^d λ ǫ² d^{1/2})). The second part of the proof will then extend this to show that if n is at least this latter quantity, then in fact it must be even larger, yielding the lower bound of the theorem statement.

To start with, choose c to be a small constant, such as 5. Then, even a small sample Xnis likely

to contain all of the c point masses on the x1-axis (each of which has mass 1/2c). Suppose the

algorithm is promised in advance that the underlying density is one of the c − 1 choices θI, and is

subsequently able (with probability at least 1/2) to separate AIfrom A′

all the point masses within AI, and all the point masses within A′

I. To do this, it must connect

I, and yet keep these two groups

12

Page 13

apart. In short, this algorithm must be able to determine (with probability at least 1/2) the segment

(4σI + σ,4σI + 3σ) of lower density, and hence the identity of I.

We can thus apply Fano’s inequality to conclude that we need

n = Ω((log c)/β) = Ω(1/(v_{d−1} σ^d λ ǫ²)) = Ω(1/(v_d σ^d λ ǫ² d^{1/2})).

The last equality comes from the formula v_d = π^{d/2}/Γ((d/2) + 1), whereupon v_{d−1} = O(v_d d^{1/2}).

Now consider a larger value of c:

c = 1/(8 v_{d−1} σ^d λ) − 1,

and apply the same construction. We have already established that we need n = Ω(c/ǫ²) samples, so assume n is at least this large. Then it is very likely that when the underlying density is θ_I, the sample X_n will contain the four point masses at 4σ, 4σI, 4σ(I + 1), and 4(c + 1)σ. Therefore, the clustering algorithm must connect the point at 4σ to that at 4σI and the point at 4σ(I + 1) to that at 4(c + 1)σ, while keeping the two groups apart. Therefore, this algorithm can determine I. Applying Fano’s inequality gives n = Ω((log c)/β), which is the bound in the theorem statement.
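The step v_{d−1} = O(v_d d^{1/2}) can be checked numerically from the volume formula; a minimal sketch:

```python
import math

def unit_ball_volume(d):
    """v_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# The ratio v_{d-1} / (v_d * sqrt(d)) stays bounded (it tends to 1/sqrt(2*pi)),
# which is exactly the statement v_{d-1} = O(v_d * d^(1/2)).
ratios = [unit_ball_volume(d - 1) / (unit_ball_volume(d) * math.sqrt(d))
          for d in range(2, 61)]
```

For instance, v_2 = π and v_3 = 4π/3, and the ratio above never exceeds 1/2 over this range.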

12 Appendix: better convergence rates in some instances

Our convergence rates (Theorem 6) contain a condition (*) for clusters of density ≥ λ that are (σ, ǫ)-separated:

λ ≥ (1/(v_d (σ/2)^d)) · (k/n) · (1 + ǫ/2).

Paraphrased, this means that in order for such clusters to be distinguished, it is sufficient that the number of data points be at least

n ≥ (d 2^d)/(v_d σ^d λ ǫ²),

ignoring logarithmic factors. There is a 2^d factor here that does not appear in the lower bound (Theorem 12). Can this term be removed?

We now show that this term can be improved in two particular cases.

• When the separation ǫ is not too small, in particular when ǫ > (9/10)^d, the 2^d term can be improved to (1 + 1/√2)^d, which is roughly 1.7^d.

• If the density f is Lipschitz with parameter ℓ, that is, if

|f(x) − f(x′)| ≤ ℓ‖x − x′‖ for all x, x′ ∈ X,

and if ǫ ≥ 3ℓσ/λ, then the 2^d term can be removed altogether.
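The effect of the three regimes on the sufficient sample size can be compared directly; the sketch below is illustrative (function name and parameter choices are ours), plugging each ball factor into n ~ factor · d/(v_d σ^d λ ǫ²) and ignoring logarithmic terms.

```python
import math

def sufficient_n(d, sigma, lam, eps, ball_factor):
    """Sufficient sample size ~ ball_factor * d / (v_d * sigma^d * lam * eps^2),
    ignoring log factors. ball_factor is 2^d in general, roughly (1 + 1/sqrt(2))^d
    when eps is not too small, and 1 in the Lipschitz case."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return ball_factor * d / (v_d * sigma ** d * lam * eps ** 2)

d, sigma, lam, eps = 8, 0.5, 1.0, 0.3
general   = sufficient_n(d, sigma, lam, eps, 2 ** d)
improved  = sufficient_n(d, sigma, lam, eps, (1 + 1 / math.sqrt(2)) ** d)
lipschitz = sufficient_n(d, sigma, lam, eps, 1)
```

Already at d = 8, the general bound exceeds the Lipschitz bound by a factor of 2^8 = 256, while the intermediate regime costs only about 72×.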

12.1 Better rates when the separation is not too small

Theorem 18  Theorem 6 holds if ǫ > ((7 + 4√2)/16)^{d/2}, and if the condition (*) is replaced by:

λ := inf_{x ∈ A_σ ∪ A′_σ} f(x) ≥ (1/(v_d (2σ/(α + 2))^d)) · (k/n) · (1 + ǫ/2).

The lower bound on ǫ is at most (9/10)^d.


Figure 5: Balls B(0, σ) and B(y, r), where ‖y‖ = q. (Marked in the figure: the hyperplane H through the intersection of the two spherical surfaces, the distance h from y to H, the angle θ, and the region of volume v(q, r, σ).)

Analysis overview

The main idea is to bound the volume of B(x, r) \ A_σ for a point x that lies in A_σ but not in A_{σ−r}. If this volume is small, then most of B(x, r) is inside A_σ, and thus has density λ or more. To establish this bound, we begin with some notation.

Definition 19  For r, q ≤ σ, define v(q, r, σ) as the volume of the region B(y, r) \ B(0, σ), for any y such that ‖y‖ = q. (By symmetry this volume is the same for all such y.) For an illustration, see Figure 5.

Lemma 20  Let y ∈ A_q and r, q ≤ σ. Then vol(B(y, r) \ A_σ) ≤ v(q, r, σ).

PROOF: As y ∈ A_q, there exists some y′ ∈ A such that ‖y − y′‖ ≤ q. Let q′ = ‖y − y′‖. As A_σ contains B(y′, σ),

v(q′, r, σ) = vol(B(y, r) \ B(y′, σ)) ≥ vol(B(y, r) \ A_σ).

The observation that v(q′, r, σ) ≤ v(q, r, σ) for q′ ≤ q concludes the lemma. □

Lemma 21  Let r ≤ q = σ/(1 + 1/√2). Then

v(q, r, σ) ≤ (1/2) ((7 + 4√2)/16)^{d/2} v_d r^d.

PROOF: WLOG, suppose that y lies along the first coordinate axis. If r ≤ σ − q, then the ball B(0, σ) ⊇ B(y, r), and thus v(q, r, σ) = 0. Thus, for the rest of the proof, we assume that r > σ − q.

Let H be the (d − 1)-dimensional hyperplane that contains the intersection of the spherical surfaces of B(0, σ) and B(y, r); see Figure 5. By spherical symmetry, H is orthogonal to the first coordinate axis. Let h be the distance between y and H, and let θ = arccos(h/r).

Any x ∈ B(y, r) \ B(0, σ) also lies to the right of the hyperplane H in Figure 5. v(q, r, σ) is thus at most the volume of a spherical cap of B(y, r) that subtends an angle θ at the center y. If 0 < θ < π/2, that is, if the center y lies to the left of H, then we can upper-bound this volume by that of the smallest enclosing hemisphere, a half-ball of radius r sin θ. A simple calculation shows that the latter volume is (1/2) v_d r^d sin^d θ.


We first calculate sin θ for r = q. Let x1 be the first coordinate of any point on the intersection of the spherical surfaces of B(0, σ) and B(y, r). Then x1² − (x1 − q)² = σ² − r². Plugging in the values of q and r in terms of σ, x1 = σ²/2q = (1 + √2)σ/(2√2); since q < x1, we verify that y lies to the left of H. Simple algebra now shows that sin θ = √(7 + 4√2)/4, from which v(q, r, σ) ≤ (1/2) ((7 + 4√2)/16)^{d/2} v_d r^d.

For smaller r, observe that for a fixed q, the distance h increases with decreasing r, along with decreasing θ. As the volume of the spherical cap also decreases with decreasing θ, the lemma follows. □
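As a quick sanity check on Lemma 21 (illustrative only), we can estimate the fraction of B(y, r) lying outside B(0, σ) by Monte Carlo in d = 2 with σ = 1 and the worst case r = q, and compare it against the bound (1/2)((7 + 4√2)/16)^{d/2}:

```python
import math
import random

random.seed(0)
sigma = 1.0
q = sigma / (1 + 1 / math.sqrt(2))   # center distance, as in Lemma 21
r = q                                # worst case r = q
y = (q, 0.0)

# Estimate v(q, r, sigma) / (v_d r^d): fraction of B(y, r) outside B(0, sigma).
trials, outside = 200_000, 0
for _ in range(trials):
    # rejection-sample a uniform point in the disc B(y, r)
    while True:
        u, v = random.uniform(-r, r), random.uniform(-r, r)
        if u * u + v * v <= r * r:
            break
    p = (y[0] + u, y[1] + v)
    if p[0] ** 2 + p[1] ** 2 > sigma ** 2:
        outside += 1

estimate = outside / trials
bound = 0.5 * ((7 + 4 * math.sqrt(2)) / 16) ** (2 / 2)   # (1/2) sin^d(theta), d = 2
```

The empirical fraction comes out well under the bound, as the lemma predicts (the bound is not tight in low dimension).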

Analysis: separation

Lemma 22  Let α ≥ √2, and let q = σ/(1 + 1/√2). Pick 0 < r < 2σ/(α + 2) such that

(v_d r^d − v(q, r, σ)) λ ≥ k/n + (C_δ/n) √(k d log n)
v_d r^d λ(1 − ǫ) < k/n − (C_δ/n) √(k d log n).

Then with probability > 1 − δ:

1. G_r contains all points in (A_q ∪ A′_q) ∩ X_n and no points in S_{σ−r} ∩ X_n.

2. A ∩ X_n is disconnected from A′ ∩ X_n in G_r.

PROOF: Notice first of all that r ≤ q. From Lemma 20, for any point x ∈ (A_q ∪ A′_q), the volume of B(x, r) that lies outside A_σ ∪ A′_σ is at most v(q, r, σ); therefore f(B(x, r)) ≥ (v_d r^d − v(q, r, σ))λ, and thus, by Lemma 7, x has at least k neighbors within radius r. Likewise, any point x ∈ S_{σ−r} has f(B(x, r)) < v_d r^d λ(1 − ǫ), and thus, by Lemma 7, has strictly fewer than k neighbors within distance r. This establishes (1).

For (2), since points in S_{σ−r} are absent from G_r, any path from A to A′ in that graph must have an edge across S_{σ−r}. But any such edge has length at least 2(σ − r) > αr and is thus not in G_r. □

Definition 23  Define r(λ) to be the value of r for which (v_d r^d − v(q, r, σ))λ = k/n + (C_δ/n) √(k d log n).

To satisfy the conditions of Lemma 22, recall that if ǫ > ((7 + 4√2)/16)^{d/2}, it suffices to take k ≥ 16C_δ²(d/ǫ²) log n; this is what we use.
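Since the left side of the defining relationship in Definition 23 vanishes at r = 0 and increases with r, r(λ) can be computed by bisection. A minimal sketch (our own function names; v(q, r, σ) is passed in as a callable, since its closed form depends on the geometry):

```python
import math

def r_of_lambda(lam, k, n, C_delta, d, v, q, sigma, r_max):
    """Solve (v_d r^d - v(q, r, sigma)) * lam = k/n + (C_delta/n) * sqrt(k d log n)
    for r by bisection on [0, r_max]; `v` is a callable giving v(q, r, sigma)."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    target = k / n + (C_delta / n) * math.sqrt(k * d * math.log(n))
    lhs = lambda r: (v_d * r ** d - v(q, r, sigma)) * lam
    lo, hi = 0.0, r_max
    for _ in range(100):                   # bisection to machine precision
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if lhs(mid) < target else (lo, mid)
    return (lo + hi) / 2
```

As a check, with the cap correction v ≡ 0 the solution reduces to the familiar r(λ) = (target/(v_d λ))^{1/d}.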

Analysis: connectedness

To show that points in A (and similarly A′) are connected in G_{r(λ)}, we observe that as all x ∈ A_q ∪ A′_q are active, the arguments of Theorem 11 follow exactly as before, provided r ≤ σ/(1 + 1/√2). Since α ≥ √2, this condition holds for any r ≤ 2σ/(α + 2).

To complete the proof of Theorem 18, take k = 16C_δ²(d/ǫ²) log n, which satisfies the requirements of Lemma 22 as well as those of Theorem 11. The relationship that defines r(λ) (Definition 23) then translates into

(v_d r^d − v(q, r, σ))λ = (k/n)(1 + ǫ/2).

This shows that clusters at density level λ emerge when the growing radius r of the cluster tree algorithm reaches roughly (k/(λ v_d n))^{1/d}. In order for (σ, ǫ)-separated clusters to be distinguished, we need this radius to be at most 2σ/(2 + α); this is what yields the final lower bound on λ.


12.2 Better rates when the density is Lipschitz

In this section, we establish an even sharper upper bound on the rate of convergence, provided the density f is smooth. In particular, we assume that f is Lipschitz with constant ℓ.

Theorem 24  Theorem 6 holds if the density f has Lipschitz constant ℓ and if the condition (*) is replaced by:

λ := inf_{x ∈ A_σ ∪ A′_σ} f(x) ≥ (1/(v_d σ̃^d)) · (k/n) · (1 + ǫ/2),

where σ̃ = min(σ, λǫ/(3ℓ)).

As usual, for the analysis we first treat separation, then connectedness.

Lemma 25  Pick 0 < r ≤ σ̃ and α < 2 such that

v_d r^d (λ − σ̃ℓ) ≥ k/n + (C_δ/n) √(k d log n)
v_d r^d (λ(1 − ǫ) + σ̃ℓ) < k/n − (C_δ/n) √(k d log n)

(recall that v_d is the volume of the unit ball in R^d). Then with probability > 1 − δ:

1. G_r contains all points in (A_σ ∪ A′_σ) ∩ X_n and no points in S_σ ∩ X_n.

2. A ∩ X_n is disconnected from A′ ∩ X_n in G_r.

PROOF: For (1), if r ≤ σ̃, then for any point x ∈ (A_σ ∪ A′_σ), the density at any y ∈ B(x, r) is at least λ − σ̃ℓ. Therefore f(B(x, r)) ≥ v_d r^d (λ − σ̃ℓ), and thus, by Lemma 7, x has at least k neighbors within radius r. Likewise, for r ≤ σ̃, any point x ∈ S_σ has f(B(x, r)) < v_d r^d (λ(1 − ǫ) + σ̃ℓ), and thus, by Lemma 7, has strictly fewer than k neighbors within distance r.

For (2), since points in S_σ are absent from G_r, any path from A to A′ in that graph must have an edge across S_σ. But any such edge has length at least 2σ > αr (as r ≤ σ̃ ≤ σ and α < 2) and is thus not in G_r. □

Definition 26  Define r(λ) to be the value of r for which v_d r^d (λ − σ̃ℓ) = k/n + (C_δ/n) √(k d log n).

As σ̃ ≤ λǫ/(3ℓ), to satisfy the conditions of Lemma 25, it suffices to take k ≥ 36C_δ²(d/ǫ²) log n; this is what we use.

We now need to show that points in A (and similarly A′) are connected in G_{r(λ)}. To show this, note that under the conditions on ℓ, the proof of Theorem 11 applies for r = σ̃ and α = √2.

To complete the proof of Theorem 24, take k = 36C_δ²(d/ǫ²) log n, which satisfies the requirements of Lemma 25 as well as those of Theorem 11. The relationship that defines r(λ) (Definition 26) then translates into

v_d r^d (λ − σ̃ℓ) = (k/n)(1 + ǫ/2).

This shows that clusters at density level λ emerge when the growing radius r of the cluster tree algorithm reaches roughly (k/(λ v_d n))^{1/d}. In order for (σ, ǫ)-separated clusters to be distinguished, we need this radius to be at most σ̃; this is what yields the final lower bound on λ.
