
Rates of convergence for the cluster tree

Kamalika Chaudhuri

UC San Diego

kchaudhuri@ucsd.edu

Sanjoy Dasgupta

UC San Diego

dasgupta@cs.ucsd.edu

Abstract

For a density f on Rd, a high-density cluster is any connected component of {x :

f(x) ≥ λ}, for some λ > 0. The set of all high-density clusters forms a hierarchy

called the cluster tree of f. We present a procedure for estimating the cluster tree

given samples from f. We give finite-sample convergence rates for our algorithm,

as well as lower bounds on the sample complexity of this estimation problem.

1 Introduction

A central preoccupation of learning theory is to understand what statistical estimation based on a

finite data set reveals about the underlying distribution from which the data were sampled. For

classification problems, there is now a well-developed theory of generalization. For clustering,

however, this kind of analysis has proved more elusive.

Consider for instance k-means, possibly the most popular clustering procedure in use today. If

this procedure is run on points X1, ..., Xn from distribution f, and is told to find k clusters, what

do these clusters reveal about f? Pollard [8] proved a basic consistency result: if the algorithm

always finds the global minimum of the k-means cost function (which is NP-hard, see Theorem 3

of [3]), then as n → ∞, the clustering is the globally optimal k-means solution for f. This result,

however impressive, leaves the fundamental question unanswered: is the best k-means solution to f

an interesting or desirable quantity, in settings outside of vector quantization?

In this paper, we are interested in clustering procedures whose output on a finite sample converges

to “natural clusters” of the underlying distribution f. There are doubtless many meaningful ways

to define natural clusters. Here we follow some early work on clustering (for instance, [5]) by

associating clusters with high-density connected regions. Specifically, a cluster of density f is any

connected component of {x : f(x) ≥ λ}, for any λ > 0. The collection of all such clusters forms

an (infinite) hierarchy called the cluster tree (Figure 1).

Are there hierarchical clustering algorithms which converge to the cluster tree? Previous theory

work [5, 7] has provided weak consistency results for the single-linkage clustering algorithm, while

other work [13] has suggested ways to overcome the deficiencies of this algorithm by making it

more robust, but without proofs of convergence. In this paper, we propose a novel way to make

single-linkage more robust, while retaining most of its elegance and simplicity (see Figure 3). We

establish its finite-sample rate of convergence (Theorem 6); the centerpiece of our argument is a

result on continuum percolation (Theorem 11). We also give a lower bound on the problem of

cluster tree estimation (Theorem 12), which matches our upper bound in its dependence on most of

the parameters of interest.

2 Definitions and previous work

Let X be a subset of Rd. We exclusively consider Euclidean distance on X, denoted ‖ · ‖. Let

B(x, r) be the closed ball of radius r around x.


Figure 1: A probability density f on R, and three of its clusters: C1, C2, and C3.

2.1 The cluster tree

We start with notions of connectivity. A path P in S ⊂ X is a continuous 1 − 1 function P : [0, 1] → S. If x = P(0) and y = P(1), we write x ⇝P y, and we say that x and y are connected in S. This relation – “connected in S” – is an equivalence relation that partitions S into its connected components. We say S ⊂ X is connected if it has a single connected component.

The cluster tree is a hierarchy each of whose levels is a partition of a subset of X, which we will

occasionally call a subpartition of X. Write Π(X) = {subpartitions of X}.

Definition 1 For any f : X → R, the cluster tree of f is a function Cf : R → Π(X) given by

Cf(λ) = connected components of {x ∈ X : f(x) ≥ λ}.

Any element of Cf(λ), for any λ, is called a cluster of f.
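On a discretized domain this definition can be computed directly: threshold the density at λ and take maximal runs of grid points above the threshold as the connected components. A minimal one-dimensional sketch (the grid values and function name are ours, for illustration only):

```python
# Clusters of a 1D density, discretized on a grid: each cluster is a
# maximal run of consecutive grid indices with f(x) >= lambda.
def clusters_1d(f_vals, lam):
    comps, current = [], []
    for i, fx in enumerate(f_vals):
        if fx >= lam:
            current.append(i)
        elif current:
            comps.append(current)
            current = []
    if current:
        comps.append(current)
    return comps

# A bimodal density: two clusters at a high threshold, merging into
# one as lambda decreases -- the hierarchy described in Lemma 2.
f_vals = [0.1, 0.8, 0.9, 0.3, 0.7, 0.6, 0.1]
print(clusters_1d(f_vals, 0.65))  # two components
print(clusters_1d(f_vals, 0.25))  # one component
```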


For any λ, Cf(λ) is a set of disjoint clusters of X. They form a hierarchy in the following sense.

Lemma 2 Pick any λ′ ≤ λ. Then:

1. For any C ∈ Cf(λ), there exists C′ ∈ Cf(λ′) such that C ⊆ C′.

2. For any C ∈ Cf(λ) and C′ ∈ Cf(λ′), either C ⊆ C′ or C ∩ C′ = ∅.

We will sometimes deal with the restriction of the cluster tree to a finite set of points x1,...,xn.

Formally, the restriction of a subpartition C ∈ Π(X) to these points is defined to be C[x1,...,xn] =

{C ∩ {x1,...,xn} : C ∈ C}. Likewise, the restriction of the cluster tree is Cf[x1,...,xn] : R →

Π({x1,...,xn}), where Cf[x1,...,xn](λ) = Cf(λ)[x1,...,xn]. See Figure 2 for an example.
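The restriction operation is a straightforward intersection. A minimal sketch (the function name is ours; sample points can be any hashable objects):

```python
def restrict(subpartition, sample):
    """Restriction C[x1,...,xn] of a subpartition to a finite sample:
    intersect each cluster with the sample set (Section 2.1)."""
    sample = set(sample)
    return [set(C) & sample for C in subpartition]

# Two clusters restricted to three sample points; the second cluster
# contains no sample points, so its restriction is empty.
parts = [{"a", "b", "c"}, {"x", "y"}]
print(restrict(parts, ["b", "c", "z"]))
```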

2.2 Notion of convergence and previous work

Suppose a sample Xn ⊂ X of size n is used to construct a tree Cn that is an estimate of Cf. Hartigan

[5] provided a very natural notion of consistency for this setting.

Definition 3 For any sets A, A′ ⊂ X, let An (resp, A′n) denote the smallest cluster of Cn containing A ∩ Xn (resp, A′ ∩ Xn). We say Cn is consistent if, whenever A and A′ are different connected components of {x : f(x) ≥ λ} (for some λ > 0), P(An is disjoint from A′n) → 1 as n → ∞.

It is well known that if Xn is used to build a uniformly consistent density estimate fn (that is, supx |fn(x) − f(x)| → 0), then the cluster tree Cfn is consistent; see the appendix for details.

The big problem is that Cfn is not easy to compute for typical density estimates fn: imagine, for instance, how one might go about trying to find level sets of a mixture of Gaussians! Wong and


Figure 2: A probability density f, and the restriction of Cf to a finite set of eight points.

Lane [14] have an efficient procedure that tries to approximate Cfn when fn is a k-nearest neighbor density estimate, but they have not shown that it preserves the consistency property of Cfn.

There is a simple and elegant algorithm that is a plausible estimator of the cluster tree: single

linkage (or Kruskal’s algorithm); see the appendix for pseudocode. Hartigan [5] has shown that it is

consistent in one dimension (d = 1). But he also demonstrates, by a lovely reduction to continuum

percolation, that this consistency fails in higher dimension d ≥ 2. The problem is the requirement

that A ∩ Xn ⊂ An: by the time the clusters are large enough that one of them contains all of A,

there is a reasonable chance that this cluster will be so big as to also contain part of A′.

With this insight, Hartigan defines a weaker notion of fractional consistency, under which An (resp, A′n) need not contain all of A ∩ Xn (resp, A′ ∩ Xn), but merely a sizeable chunk of it – and ought to be very close (at distance → 0 as n → ∞) to the remainder. He then shows that single linkage has this weaker consistency property for any pair A, A′ for which the ratio of inf{f(x) : x ∈ A ∪ A′} to sup{inf{f(x) : x ∈ P} : paths P from A to A′} is sufficiently large. More recent work by Penrose [7] closes the gap and shows fractional consistency whenever this ratio is > 1.

A more robust version of single linkage has been proposed by Wishart [13]: when connecting points

at distance r from each other, only consider points that have at least k neighbors within distance r

(for some k > 2). Thus initially, when r is small, only the regions of highest density are available for

linkage, while the rest of the data set is ignored. As r gets larger, more and more of the data points

become candidates for linkage. This scheme is intuitively sensible, but Wishart does not provide a

proof of convergence. Thus it is unclear how to set k, for instance.

Stuetzle and Nugent [12] have an appealing top-down scheme for estimating the cluster tree, along

with a post-processing step (called runt pruning) that helps identify modes of the distribution. The

consistency of this method has not yet been established.

Several recent papers [6, 10, 9, 11] have considered the problem of recovering the connected components of {x : f(x) ≥ λ} for a user-specified λ: the flat version of our problem. In particular,

the algorithm of [6] is intuitively similar to ours, though they use a single graph in which each point

is connected to its k nearest neighbors, whereas we have a hierarchy of graphs in which each point

is connected to other points at distance ≤ r (for various r). Interestingly, k-nn graphs are valuable

for flat clustering because they can adapt to clusters of different scales (different average interpoint

distances). But they are challenging to analyze and seem to require various regularity assumptions

on the data. A pleasant feature of the hierarchical setting is that different scales appear at different

levels of the tree, rather than being collapsed together. This allows the use of r-neighbor graphs, and

makes possible an analysis that has minimal assumptions on the data.

3 Algorithm and results

In this paper, we consider a generalization of Wishart’s scheme and of single linkage, shown in

Figure 3. It has two free parameters: k and α. For practical reasons, it is of interest to keep these as


1. For each xi set rk(xi) = inf{r : B(xi, r) contains k data points}.

2. As r grows from 0 to ∞:

(a) Construct a graph Gr with nodes {xi : rk(xi) ≤ r}.

Include edge (xi, xj) if ‖xi − xj‖ ≤ αr.

(b) Let Ĉ(r) be the connected components of Gr.

Figure 3: Algorithm for hierarchical clustering. The input is a sample Xn = {x1, ..., xn} from density f on X. Parameters k and α need to be set. Single linkage is (α = 1, k = 2). Wishart suggested α = 1 and larger k.
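The procedure in Figure 3 is short enough to implement directly: rk(xi) is the distance from xi to its k-th nearest data point (counting xi itself), and Gr is obtained by thresholding pairwise distances. A minimal sketch in Python, assuming numpy/scipy and Euclidean data; the function and variable names are ours, not the paper's:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_components(X, r, k=2, alpha=1.4):
    """Connected components of the graph G_r from Figure 3.

    Nodes: points with r_k(x_i) <= r.  Edges: node pairs within
    distance alpha * r.  Returns a list of components, each a list
    of indices into X.
    """
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # r_k(x_i): distance to the k-th nearest point, counting x_i itself
    # (the self-distance 0 occupies the first sorted slot).
    r_k = np.sort(dist, axis=1)[:, k - 1]
    nodes = np.flatnonzero(r_k <= r)
    adj = dist[np.ix_(nodes, nodes)] <= alpha * r
    n_comp, labels = connected_components(csr_matrix(adj), directed=False)
    return [nodes[labels == c].tolist() for c in range(n_comp)]
```

Sweeping r upward and recording when components appear and merge recovers the hierarchy Ĉ(r); the settings (α = 1, k = 2) reduce this to single linkage.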

small as possible. We provide finite-sample convergence rates for all 1 ≤ α ≤ 2 and we can achieve k ∼ d log n, which we conjecture to be the best possible, if α ≥ √2. Our rates for α = 1 force k to be much larger, exponential in d. It is a fascinating open problem to determine whether the setting (α = 1, k ∼ d log n) yields consistency.

3.1 A notion of cluster salience

Suppose density f is supported on some subset X of Rd. We will show that the hierarchical clustering procedure is consistent in the sense of Definition 3. But the more interesting question is, what

clusters will be identified from a finite sample? To answer this, we introduce a notion of salience.

The first consideration is that a cluster is hard to identify if it contains a thin “bridge” that would

make it look disconnected in a small sample. To control this, we consider a “buffer zone” of width

σ around the clusters.

Definition 4 For Z ⊂ Rd and σ > 0, write Zσ = Z + B(0, σ) = {y ∈ Rd : infz∈Z ‖y − z‖ ≤ σ}.

An important technical point is that Zσ is a full-dimensional set, even if Z itself is not.

Second, the ease of distinguishing two clusters A and A′ depends inevitably upon the separation

between them. To keep things simple, we’ll use the same σ as a separation parameter.

Definition 5 Let f be a density on X ⊂ Rd. We say that A, A′ ⊂ X are (σ, ǫ)-separated if there exists S ⊂ X (separator set) such that:

• Any path in X from A to A′ intersects S.

• sup{f(x) : x ∈ Sσ} < (1 − ǫ) inf{f(x) : x ∈ Aσ ∪ A′σ}.

Under this definition, Aσ and A′σ must lie within X, otherwise the right-hand side of the inequality is zero. However, Sσ need not be contained in X.
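The first condition of Definition 5 is geometric (every path from A to A′ crosses S), but the second is a plain density inequality that can be checked numerically. A sketch that verifies only that inequality on a finite grid; all names and the test density below are our own:

```python
import numpy as np

def density_gap_holds(f, A, Aprime, S, sigma, eps, grid):
    """Check the second condition of Definition 5 on a finite grid:
    sup of f over S_sigma  <  (1 - eps) * inf of f over A_sigma U A'_sigma.
    A, Aprime, S, grid: arrays of points (rows); f: vectorized density."""
    def thicken(Z):
        # Grid points within distance sigma of Z: a discretized Z_sigma.
        d = np.linalg.norm(grid[:, None, :] - Z[None, :, :], axis=2)
        return grid[d.min(axis=1) <= sigma]
    sup_S = f(thicken(S)).max()
    inf_AA = f(np.vstack([thicken(A), thicken(Aprime)])).min()
    return bool(sup_S < (1 - eps) * inf_AA)
```

With a bimodal density on the line, a separator placed in the low-density valley satisfies the inequality, while one placed near a mode does not.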

3.2 Consistency and finite-sample rate of convergence

Here we state the result for α ≥ √2 and k ∼ d log n. The analysis section also has results for 1 ≤ α ≤ 2 and k ∼ (2/α)^d d log n.

Theorem 6 There is an absolute constant C such that the following holds. Pick any δ, ǫ > 0, and run the algorithm on a sample Xn of size n drawn from f, with settings

√2 ≤ α ≤ 2 and k = C · (d log n)/ǫ² · log²(1/δ).

Then there is a mapping r : [0, ∞) → [0, ∞) such that with probability at least 1 − δ, the following holds uniformly for all pairs of connected subsets A, A′ ⊂ X: If A, A′ are (σ, ǫ)-separated (for ǫ and some σ > 0), and if

λ := inf{f(x) : x ∈ Aσ ∪ A′σ} ≥ (1 / (vd (σ/2)^d)) · (k/n) · (1 + ǫ/2)    (*)

where vd is the volume of the unit ball in Rd, then:


1. Separation. A ∩ Xn is disconnected from A′ ∩ Xn in Gr(λ).

2. Connectedness. A ∩ Xn and A′ ∩ Xn are each individually connected in Gr(λ).

The two parts of this theorem – separation and connectedness – are proved in Sections 3.3 and 3.4.

We mention in passing that this finite-sample result implies consistency (Definition 3): as n → ∞, take kn = (d log n)/ǫn² with any schedule of (ǫn : n = 1, 2, ...) such that ǫn → 0 and kn/n → 0. Under mild conditions, any two connected components A, A′ of {f ≥ λ} are (σ, ǫ)-separated for some σ, ǫ > 0 (see appendix); thus they will get distinguished for sufficiently large n.

3.3 Analysis: separation

The cluster tree algorithm depends heavily on the radii rk(x): the distance within which x’s nearest k neighbors lie (including x itself). Thus the empirical probability mass of B(x, rk(x)) is k/n. To show that rk(x) is meaningful, we need to establish that the mass of this ball under density f is also, very approximately, k/n. The uniform convergence of these empirical counts follows from the fact that balls in Rd have finite VC dimension, d + 1. Using uniform Bernstein-type bounds, we get a set of basic inequalities that we use repeatedly.

Lemma 7 Assume k ≥ d log n, and fix some δ > 0. Then there exists a constant Cδ such that with probability > 1 − δ, every ball B ⊂ Rd satisfies the following conditions:

f(B) ≥ (Cδ d log n)/n  =⇒  fn(B) > 0

f(B) ≥ k/n + (Cδ/n)√(k d log n)  =⇒  fn(B) ≥ k/n

f(B) ≤ k/n − (Cδ/n)√(k d log n)  =⇒  fn(B) < k/n

Here fn(B) = |Xn ∩ B|/n is the empirical mass of B, while f(B) = ∫B f(x) dx is its true mass.

PROOF: See appendix. Cδ = 2Co log(2/δ), where Co is the absolute constant from Lemma 16. □

We will henceforth think of δ as fixed, so that we do not have to repeatedly quantify over it.

Lemma 8 Pick 0 < r < 2σ/(α + 2) such that

vd r^d λ ≥ k/n + (Cδ/n)√(k d log n)

vd r^d λ(1 − ǫ) < k/n − (Cδ/n)√(k d log n)

(recall that vd is the volume of the unit ball in Rd). Then with probability > 1 − δ:

1. Gr contains all points in (Aσ−r ∪ A′σ−r) ∩ Xn and no points in Sσ−r ∩ Xn.

2. A ∩ Xn is disconnected from A′ ∩ Xn in Gr.

PROOF: For (1), any point x ∈ (Aσ−r ∪ A′σ−r) has f(B(x, r)) ≥ vd r^d λ; and thus, by Lemma 7, has at least k neighbors within radius r. Likewise, any point x ∈ Sσ−r has f(B(x, r)) < vd r^d λ(1 − ǫ); and thus, by Lemma 7, has strictly fewer than k neighbors within distance r.

For (2), since points in Sσ−r are absent from Gr, any path from A to A′ in that graph must have an edge across Sσ−r. But any such edge has length at least 2(σ − r) > αr and is thus not in Gr. □
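The two displayed conditions on r in Lemma 8, together with the constraint r < 2σ/(α + 2), can be checked mechanically for candidate values. A small sketch, with the constant Cδ taken as an input and the function name our own:

```python
import math

def lemma8_radius_ok(r, lam, eps, sigma, alpha, n, k, d, C_delta):
    """Check the hypotheses of Lemma 8 for a candidate radius r:
    0 < r < 2*sigma/(alpha+2), and
    v_d r^d lam        >= k/n + (C_delta/n) sqrt(k d log n),
    v_d r^d lam (1-eps) < k/n - (C_delta/n) sqrt(k d log n)."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # unit-ball volume
    dev = (C_delta / n) * math.sqrt(k * d * math.log(n))
    return (0 < r < 2 * sigma / (alpha + 2)
            and v_d * r ** d * lam >= k / n + dev
            and v_d * r ** d * lam * (1 - eps) < k / n - dev)
```

The window of admissible r is narrow: too small and the ball mass falls below the upper inequality's threshold, too large and the lower inequality fails.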

Definition 9 Define r(λ) to be the value of r for which vd r^d λ = k/n + (Cδ/n)√(k d log n).
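Since vd r^d λ is strictly increasing in r, Definition 9 determines r(λ) in closed form: r(λ) = ((k/n + (Cδ/n)√(k d log n)) / (vd λ))^{1/d}. A small sketch (Cδ is taken as an input; the function name is ours):

```python
import math

def r_of_lambda(lam, n, k, d, C_delta):
    """Solve v_d r^d lam = k/n + (C_delta/n) sqrt(k d log n) for r
    (Definition 9), where v_d is the volume of the unit ball in R^d."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    rhs = k / n + (C_delta / n) * math.sqrt(k * d * math.log(n))
    return (rhs / (v_d * lam)) ** (1.0 / d)
```

Note that r(λ) is decreasing in λ: higher-density clusters are resolved at smaller radii.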

To satisfy the conditions of Lemma 8, it suffices to take k ≥ 4Cδ² (d/ǫ²) log n; this is what we use.
