
An information-theoretic derivation of min-cut based clustering

Anil Raj

Department of Applied Physics and Applied Mathematics

Columbia University, New York∗

Chris H. Wiggins

Department of Applied Physics and Applied Mathematics

Center for Computational Biology and Bioinformatics

Columbia University, New York†

(Dated: November 26, 2008)

Min-cut clustering, based on minimizing one of two heuristic cost-functions proposed by Shi and

Malik, has spawned tremendous research, both analytic and algorithmic, in the graph partitioning

and image segmentation communities over the last decade. It is however unclear if these heuristics

can be derived from a more general principle facilitating generalization to new problem settings.

Motivated by an existing graph partitioning framework, we derive relationships between optimizing

relevance information, as defined in the Information Bottleneck method, and the regularized cut in

a K-partitioned graph. For fast mixing graphs, we show that the cost functions introduced by Shi

and Malik can be well approximated as the rate of loss of predictive information about the location

of random walkers on the graph. For graphs generated from a stochastic algorithm designed to

model community structure, the optimal information theoretic partition and the optimal min-cut

partition are shown to be the same with high probability.

Keywords: graphs, clustering, information theory, min-cut, information bottleneck, graph diffusion

1. INTRODUCTION

Min-cut based graph partitioning has been used suc-

cessfully to find clusters in networks, with applications in

image segmentation as well as clustering biological and

sociological networks. The central idea is to develop fast

and efficient algorithms that optimally cut the edges be-

tween graph nodes, resulting in a separation of graph

nodes into clusters. Particularly, since Shi and Malik

successfully showed [1] that the average cut and the nor-

malized cut (defined below) were useful heuristics to be

optimized, there has been tremendous research in con-

structing the best normalized-cut-based cost function in

the image segmentation community.

The Information Bottleneck (IB) method [2, 3] is a

clustering technique, based on rate-distortion theory [4],

that has been successfully applied in a wide variety of

contexts including clustering word documents and gene-

expression profiles [5]. The IB method is also capable of

learning clusters in graphs and has been used successfully

for synthetic and actual networks [6]. In the hard clus-

tering case, given the diffusive probability distribution

over a graph, IB optimally assigns probability distribu-

tions, associated with nodes, into distinct groups. These

assignment rules define a separation of the graph nodes

into clusters.

We here illustrate how the two cut-based heuristics introduced by Shi and Malik can be well approximated by the rate of loss of relevance information, defined in the IB method applied to clustering graphs. To establish these relations, we must first define the graphs to be partitioned; we assume hard-clustering and the cluster cardinality to be K. We show, numerically, that maximizing mutual information and minimizing regularized cut amount to the same partition with high probability, for more modular 32-node graphs, where modularity is defined by the probability of inter-cluster edge connections in the Stochastic Block Model for graphs (see Numerical Experiments). We also show that the optimization goal of maximizing relevance information is equivalent to minimizing the regularized cut for 16-node graphs.[12]

∗Electronic address: ar2384@columbia.edu

†Electronic address: chris.wiggins@columbia.edu

2. THE MIN-CUT PROBLEM

Following [7], for an undirected, unweighted graph $G = (V,E)$ with $n$ nodes and $m$ edges, represented[13] by its adjacency matrix $A := \{A_{xy} = 1 \iff x \sim y\}$, we define for two not necessarily disjoint sets of nodes $V_+, V_- \subseteq V$, the association

$$W(V_+,V_-) = \sum_{x \in V_+,\, y \in V_-} A_{xy}. \tag{2.1}$$

We define a bisection of $V$ into $V_\pm$ if $V_+ \cup V_- = V$ and $V_+ \cap V_- = \emptyset$. For a bisection of $V$ into $V_+$ and $V_-$, the ‘cut’ is defined as $c(V_+,V_-) = W(V_+,V_-)$.

arXiv:0811.4208v1 [stat.ML] 26 Nov 2008

We also quantify the size of a set $V_+ \subseteq V$ in terms of the number of nodes in the set $V_+$ or the number of edges with at least one node in the set $V_+$:

$$\omega(V_+) = \sum_{x \in V_+} 1, \qquad \Omega(V_+) = \sum_{x \in V_+} d_x, \tag{2.2}$$

where $d_x$ is the degree of node $x$.

Shi and Malik [1] defined a pair of regularized cuts, for a bisection of $V$ into $V_+$ and $V_-$; the average cut was defined as

$$\mathcal{A} = \frac{W(V_+,V_-)}{\omega(V_+)} + \frac{W(V_+,V_-)}{\omega(V_-)} \tag{2.3}$$

and the normalized cut was defined as

$$\mathcal{N} = \frac{W(V_+,V_-)}{\Omega(V_+)} + \frac{W(V_+,V_-)}{\Omega(V_-)}. \tag{2.4}$$

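This definition can be generalized, for a K-partition of $V$ into $V_1, V_2, \ldots, V_K$ [7], to

$$\mathcal{A} = \sum_j \frac{W(V_j,\bar{V}_j)}{\omega(V_j)} \tag{2.5}$$

$$\mathcal{N} = \sum_j \frac{W(V_j,\bar{V}_j)}{\Omega(V_j)} \tag{2.6}$$

where $\bar{V}_j = V \setminus V_j$.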
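The definitions in Eqs. (2.1), (2.2), (2.5) and (2.6) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names and the toy graph (two triangles joined by one edge) are assumptions made for the example.

```python
import numpy as np

def association(A, S, T):
    """W(S, T): sum of A[x, y] over x in S and y in T, as in Eq. (2.1)."""
    return A[np.ix_(S, T)].sum()

def regularized_cuts(A, clusters):
    """Average and normalized cut of a K-partition, as in Eqs. (2.5) and (2.6)."""
    n = A.shape[0]
    degrees = A.sum(axis=1)
    average_cut, normalized_cut = 0.0, 0.0
    for Vj in clusters:
        members = set(Vj)
        complement = [x for x in range(n) if x not in members]
        w = association(A, Vj, complement)
        average_cut += w / len(Vj)                # omega(Vj): number of nodes
        normalized_cut += w / degrees[Vj].sum()   # Omega(Vj): sum of degrees
    return average_cut, normalized_cut

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

average_cut, normalized_cut = regularized_cuts(A, [[0, 1, 2], [3, 4, 5]])
print(average_cut, normalized_cut)
```

Splitting the toy graph into its two triangles cuts one edge, so the average cut is 1/3 + 1/3 and the normalized cut is 1/7 + 1/7 (each triangle has total degree 7).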

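For the graph $G$, we can define the graph Laplacian $\Delta = D - A$ where $D$ is a diagonal matrix of vertex degrees. For a bisection of $V$, we also define the partition indicator vector $h$ [8]

$$h_x = \begin{cases} +1 & \forall x \in V_+ \\ -1 & \forall x \in V_-. \end{cases} \tag{2.7}$$

Specifying two ‘prior’ probability distributions over the set of nodes $V$: (i) $p(x) \propto 1$ and (ii) $p(x) \propto d_x$, we then define the average of $h$ to be

$$\langle h \rangle = \begin{cases} \bar{h} = \dfrac{\sum_{x \in V} h_x}{n} & ;\ p(x) \propto 1 \\[2ex] \dfrac{\sum_{x \in V} d_x h_x}{2m} & ;\ p(x) \propto d_x. \end{cases} \tag{2.8}$$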

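The cut, as defined by Fiedler [8], and the regularized cuts, as defined by Shi and Malik [1], can then be written in terms of $h$ as (see Appendix)

$$c = \frac{1}{4} h^T \Delta h, \qquad \mathcal{A} = \frac{1}{n}\,\frac{h^T \Delta h}{1 - \bar{h}^2}, \qquad \mathcal{N} = \frac{1}{2m}\,\frac{h^T \Delta h}{1 - \langle h \rangle^2}. \tag{2.9}$$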
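As a quick numerical check of Eq. (2.9), the quadratic form $h^T \Delta h$ can be evaluated for a small example. The sketch below assumes a hypothetical six-node graph (two triangles joined by one edge) and an unbalanced bisection, so that both $\bar{h}$ and the degree-weighted $\langle h \rangle$ are non-zero.

```python
import numpy as np

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

d = A.sum(axis=1)                  # node degrees
n, m = len(d), d.sum() / 2         # number of nodes and number of edges
laplacian = np.diag(d) - A         # graph Laplacian, Delta = D - A

# Unbalanced bisection: V+ = {0, 1}, V- = {2, 3, 4, 5}.
h = np.array([1, 1, -1, -1, -1, -1], dtype=float)

quad = h @ laplacian @ h
cut = quad / 4                                     # c in Eq. (2.9)
h_bar = h.mean()                                   # <h> under p(x) ∝ 1
h_deg = (d * h).sum() / (2 * m)                    # <h> under p(x) ∝ d_x
average_cut = quad / n / (1 - h_bar**2)            # A in Eq. (2.9)
normalized_cut = quad / (2 * m) / (1 - h_deg**2)   # N in Eq. (2.9)
print(cut, average_cut, normalized_cut)
```

For this bisection two edges are cut, and the set sizes are 2 and 4 nodes with total degrees 4 and 10, so the direct definitions (2.3) and (2.4) give 2/2 + 2/4 and 2/4 + 2/10, which the Laplacian forms reproduce.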

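More generally, for a K-partition, we define the partition indicator matrix $Q$ as

$$Q_{zx} \equiv p(z|x) = 1 \quad \forall x \in z \tag{2.10}$$

(and $Q_{zx} = 0$ otherwise), where $z \in \{V_1, V_2, \ldots, V_K\}$, and define $P$ as a diagonal matrix of the ‘prior’ probability distribution over the nodes. Since $Q$ has one row per cluster and one column per node, the regularized cut can then be generalized as

$$\mathcal{C} = \sum_j \frac{[Q \Delta Q^T]_{jj}}{[Q P Q^T]_{jj}} \tag{2.11}$$

where for $p(x) \propto 1$, $\mathcal{C} = \mathcal{A}$; and for $p(x) \propto d_x$, $\mathcal{C} = \mathcal{N}$.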
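The generalized cut of Eq. (2.11) can be sketched with an explicit indicator matrix. In this illustration $Q$ is stored with one row per cluster and one column per node, and the prior is passed unnormalized (an assumption of the sketch: normalizing $P$ only rescales the result by a constant). The function name and the toy graph are hypothetical.

```python
import numpy as np

def generalized_cut(A, labels, prior):
    """Sum over clusters of [Q L Q^T]_jj / [Q P Q^T]_jj, with Q[z, x] = 1 iff node x is in cluster z."""
    n = A.shape[0]
    K = int(labels.max()) + 1
    Q = np.zeros((K, n))
    Q[labels, np.arange(n)] = 1.0
    laplacian = np.diag(A.sum(axis=1)) - A
    P = np.diag(prior)
    return (np.diag(Q @ laplacian @ Q.T) / np.diag(Q @ P @ Q.T)).sum()

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

labels = np.array([0, 0, 1, 1, 2, 2])                    # a 3-partition
cut_uniform = generalized_cut(A, labels, np.ones(6))     # average cut
cut_degree = generalized_cut(A, labels, A.sum(axis=1))   # normalized cut
print(cut_uniform, cut_degree)
```

Here the three clusters have 2, 4 and 2 outgoing edges, two nodes each, and total degrees 4, 6 and 4, so the two values agree with Eqs. (2.5) and (2.6) evaluated by hand.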

Inferring the optimal h (or Q), however, has been

shown to be an NP-hard combinatorial optimization

problem [9].

3. INFORMATION BOTTLENECK

Rate-distortion theory, which provides the foundations

for lossy data compression, formulates clustering in terms

of a compression problem; it determines the code with

minimum average length such that information can be

transmitted without exceeding some specified distortion.

Here, the model-complexity, or rate, is measured by the

mutual information between the data and their represen-

tative codewords (average number of bits used to store a

data point). Simpler models correspond to smaller rates

but they typically suffer from relatively high distortion.

The distortion measure, which can be identified with loss

functions, usually depends on the problem; in the sim-

plest of cases, it is the variance of the difference between

data and their representatives.

The Information Bottleneck (IB) method [3] proposes

the use of mutual information as a natural distortion

measure. In this method, the data are compressed into

clusters while maximizing the amount of information that

the ‘cluster representation’ preserves about some speci-

fied relevance variable. For example, in clustering word

documents, one could use the ‘topic’ of a document as

the relevance variable.

For a graph G, let X be a random variable over graph

nodes, Y be the relevance variable and Z be the random

variable over clusters. Graph partitioning using the IB

method [6] learns a probabilistic cluster assignment func-

tion p(z|x) which gives the probability that a given node

x belongs to cluster z. The optimal p(z|x) minimizes the

mutual information between X and Z, while minimizing

the loss of predictive information between Z and Y . This

complexity–fidelity trade-off can be expressed in terms of

a functional to be minimized

$$F[p(z|x)] = -I[Y;Z] + T\, I[X;Z] \tag{3.1}$$

where the temperature $T$ parameterizes the relative importance of precision over complexity. As $T \to 0$, we reach the ‘hard clustering’ limit where each node is assigned with unit probability to one cluster (i.e. $p(z|x) \in \{0,1\}$).

Graph clustering, as formulated in terms of the IB method, requires a joint distribution $p(y,x)$ to be defined on the graph; we use the distribution given by continuous graph diffusion as it naturally captures topological information about the network [6]. The relevance variable $Y$ then ranges over the nodes of the graph and is defined as the node at which a random walker ends at time $t$ if the random walker starts at node $x$ at time 0. For continuous time diffusion, the conditional distribution $p^t(y|x)$ is given as

$$G^t = p^t(y|x) = e^{-t \Delta P^{-1}} \tag{3.2}$$

where $\Delta$ is the graph Laplacian and $P$ a diagonal matrix of the prior distribution over the graph nodes, as described earlier. The characteristic diffusion time scale $\tau$ of the system is given by the inverse of the smallest non-zero eigenvalue of the diffusion operator exponent $\Delta P^{-1}$ and characterizes the slowest decaying mode in the system. To calculate the joint distribution $p(y,x)$ from the conditional $G^t$, we must specify an initial or prior distribution[14]; we use the two different priors $p(x)$, used earlier to calculate the expected value of $h$: (i) $p(x) \propto 1$ and (ii) $p(x) \propto d_x$.
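The diffusion kernel of Eq. (3.2) and the time scale $\tau$ can be sketched with NumPy alone. The approach below is one possible choice, not necessarily the authors': it uses the fact that $\Delta P^{-1}$ is similar to the symmetric matrix $P^{-1/2} \Delta P^{-1/2}$, so a symmetric eigendecomposition suffices. The function name, the toy graph and the assumption of a connected graph are all illustrative.

```python
import numpy as np

def diffusion_kernel(A, prior, t):
    """Return (G, p, tau): G = exp(-t L P^{-1}), the normalized prior p, and the time scale tau.

    Column x of G is the distribution p^t(y|x). A connected graph is assumed,
    so that exactly one eigenvalue of the diffusion operator is zero.
    """
    p = prior / prior.sum()
    laplacian = np.diag(A.sum(axis=1)) - A
    s = np.sqrt(p)
    sym = laplacian / np.outer(s, s)              # P^{-1/2} L P^{-1/2}, symmetric
    evals, evecs = np.linalg.eigh(sym)
    exp_sym = (evecs * np.exp(-t * evals)) @ evecs.T
    G = s[:, None] * exp_sym / s[None, :]         # P^{1/2} exp(-t sym) P^{-1/2}
    tau = 1.0 / evals[evals > 1e-9].min()         # inverse of smallest non-zero eigenvalue
    return G, p, tau

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

G, p, tau = diffusion_kernel(A, A.sum(axis=1), t=0.1)   # prior p(x) ∝ d_x
print(G.sum(axis=0), tau)
```

Each column of `G` sums to one, the prior is stationary under the diffusion (`G @ p` returns `p`), and `G * p` is symmetric, which is the Bayes-rule identity used later in the derivation.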

4. RATE OF INFORMATION LOSS IN GRAPH DIFFUSION

We analyze here the rate of loss of predictive infor-

mation between the relevance variable Y and the cluster

variable Z, during diffusion on a graph G, after the graph

nodes have been hard-partitioned into K clusters.

A. Well-mixed limit of graph diffusion

For a given partition Q of the graph, defined in Eqn.

(2.10), we approximate the mutual information I [Y ;Z]

when diffusion on the graph reaches its well-mixed limit.

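We introduce the dependence $\eta(y,z)$ such that

$$p(y,z) = p(y)\,p(z)\,(1 + \eta). \tag{4.1}$$

This implies $\langle \eta \rangle_y = \langle \eta \rangle_z = 0$ and $\langle \langle \eta^2 \rangle_z \rangle_y = \langle \eta \rangle$, where $\langle \cdot \rangle$ denotes expectation over the joint distribution and $\langle \cdot \rangle_y$ and $\langle \cdot \rangle_z$ denote expectation over the corresponding marginals.

In the well-mixed limit, we have $\eta \ll 1$. The predictive information (expressed in nats) can then be approximated as:

$$\begin{aligned}
I[Y;Z] &= \left\langle \ln \frac{p(z,y)}{p(z)\,p(y)} \right\rangle \\
&= \left\langle \left\langle (1 + \eta) \ln(1 + \eta) \right\rangle_y \right\rangle_z \\
&\approx \left\langle \left\langle (1 + \eta)\left(\eta - \tfrac{1}{2}\eta^2\right) \right\rangle_y \right\rangle_z \\
&\approx \left\langle \left\langle \eta + \tfrac{1}{2}\eta^2 \right\rangle_y \right\rangle_z \\
&= \tfrac{1}{2} \left\langle \left\langle \eta^2 \right\rangle_y \right\rangle_z
\end{aligned} \tag{4.2}$$

$$\begin{aligned}
\tfrac{1}{2} \left\langle \left\langle \eta^2 \right\rangle_y \right\rangle_z &= \frac{1}{2} \sum_{y,z} p(y)\,p(z) \left( \frac{p(z,y)}{p(z)\,p(y)} - 1 \right)^2 \\
&= \frac{1}{2} \left( \sum_{y,z} \frac{p(y,z)^2}{p(y)\,p(z)} - 1 \right) \equiv \iota.
\end{aligned} \tag{4.3}$$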

Here, we define ι as a first-order approximation to I [Y ;Z]

in the well-mixed limit of graph diffusion.
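To see how close $\iota$ is to the exact mutual information when $\eta$ is small, one can compare the two on a joint distribution that is a small perturbation of a product of marginals. The sketch below is only an illustration: the marginals and the perturbation are made-up numbers, chosen so that the perturbation leaves both marginals unchanged.

```python
import numpy as np

def mutual_information(p_yz):
    """Exact I[Y;Z] in nats for a joint distribution p(y, z) with positive entries."""
    p_y = p_yz.sum(axis=1, keepdims=True)
    p_z = p_yz.sum(axis=0, keepdims=True)
    return (p_yz * np.log(p_yz / (p_y @ p_z))).sum()

def iota(p_yz):
    """Approximation of Eq. (4.3): 0.5 * (sum_{y,z} p(y,z)^2 / (p(y) p(z)) - 1)."""
    p_y = p_yz.sum(axis=1, keepdims=True)
    p_z = p_yz.sum(axis=0, keepdims=True)
    return 0.5 * ((p_yz**2 / (p_y @ p_z)).sum() - 1.0)

# Made-up marginals and a small perturbation whose rows and columns sum to zero.
p_y = np.array([0.2, 0.3, 0.5])
p_z = np.array([0.4, 0.6])
delta = 0.005 * np.array([[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]])
p_yz = np.outer(p_y, p_z) + delta

exact = mutual_information(p_yz)
approx = iota(p_yz)
relative_error = abs(exact - approx) / exact
print(exact, approx, relative_error)
```

The difference between the two quantities is of third order in $\eta$, so the relative error shrinks as the perturbation is made smaller.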

1. Well-mixed K-partitioned graph

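As in the IB method, the Markov condition $Z - X - Y$ allows us to make several simplifications for the conditional distributions and associated information theoretic measures. For a K-partition $Q$ of the graph, we have

$$\begin{aligned}
p(y,z) &= \sum_x p(x,y,z) \\
&= \sum_x p(z|y,x)\,p(y|x)\,p(x) \\
&= \sum_x p(z|x)\,p(y|x)\,p(x) \equiv Q P {G^t}^T.
\end{aligned} \tag{4.4}$$

$$\begin{aligned}
p(y,z)^2 &= \left( \sum_x p(z|x)\,p(y|x)\,p(x) \right)^2 \\
&= \sum_{x,x'=1}^{n} p(z|x)\,p(y|x)\,p(x)\,p(z|x')\,p(y|x')\,p(x') \\
&= \sum_{x,x'=1}^{n} Q_{zx}\, G^t_{yx}\, P_x\, Q_{zx'}\, G^t_{yx'}\, P_{x'}.
\end{aligned} \tag{4.5}$$

$$\begin{aligned}
p(z) &= \sum_x p(z|x)\,p(x) \\
&= \sum_x Q_{zx} P_x.
\end{aligned} \tag{4.6}$$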

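Graph diffusion being a Markov process, we have $\sum_{y=1}^{n} G^t_{x'y} G^t_{yx} = G^{2t}_{x'x}$. Using this and Bayes rule $G^t_{yx} P_x = G^t_{xy} P_y$, we have

$$\begin{aligned}
\iota &= \frac{1}{2} \left( \sum_{y,z} \frac{\sum_{x,x'=1}^{n} Q_{zx}\, G^t_{yx}\, P_x\, Q_{zx'}\, G^t_{yx'}\, P_{x'}}{\left( \sum_{x''} Q_{zx''} P_{x''} \right) P_y} - 1 \right) \\
&= \frac{1}{2} \left( \sum_{y,z} \frac{\sum_{x,x'=1}^{n} Q_{zx}\, Q_{zx'}\, P_y\, G^t_{x'y}\, G^t_{yx}\, P_x}{\left( \sum_{x''} Q_{zx''} P_{x''} \right) P_y} - 1 \right) \\
&= \frac{1}{2} \left( \sum_{z=1}^{K} \frac{\sum_{x,x'=1}^{n} Q_{zx}\, Q_{zx'} \left( \sum_{y=1}^{n} G^t_{x'y}\, G^t_{yx} \right) P_x}{\sum_{x''} Q_{zx''} P_{x''}} - 1 \right) \\
&= \frac{1}{2} \left( \sum_{z=1}^{K} \frac{\sum_{x,x'=1}^{n} Q_{zx}\, Q_{zx'}\, G^{2t}_{x'x}\, P_x}{\sum_{x''} Q_{zx''} P_{x''}} - 1 \right).
\end{aligned} \tag{4.7}$$

In the hard clustering case, $\sum_x Q_{zx} P_x = p(z) = [Q P Q^T]_{zz}$ and we have

$$\iota = \frac{1}{2} \left( \sum_{z=1}^{K} \frac{[Q (G^{2t} P) Q^T]_{zz}}{[Q P Q^T]_{zz}} - 1 \right). \tag{4.8}$$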
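The step from Eq. (4.3) to Eq. (4.8) can be checked numerically: $\iota$ computed from the joint distribution $p(y,z)$ at time $t$ should equal the partition form built from $G^{2t}$. The sketch below assumes a hypothetical toy graph, the degree prior, and a helper `diffusion_kernel` that is an illustrative implementation of Eq. (3.2), not the authors' code.

```python
import numpy as np

def diffusion_kernel(laplacian, p, t):
    """exp(-t L P^{-1}), computed through the symmetric matrix P^{-1/2} L P^{-1/2}."""
    s = np.sqrt(p)
    evals, evecs = np.linalg.eigh(laplacian / np.outer(s, s))
    return s[:, None] * ((evecs * np.exp(-t * evals)) @ evecs.T) / s[None, :]

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

d = A.sum(axis=1)
laplacian = np.diag(d) - A
p = d / d.sum()                          # prior p(x) ∝ d_x, also the marginal p(y)

labels = np.array([0, 0, 1, 1, 2, 2])    # hard 3-partition
Q = np.zeros((3, 6))
Q[labels, np.arange(6)] = 1.0
p_z = Q @ p                              # p(z) = [Q P Q^T]_zz for hard Q

t = 0.05
G_t = diffusion_kernel(laplacian, p, t)
G_2t = diffusion_kernel(laplacian, p, 2 * t)

# Eq. (4.3): iota from the joint p(y, z) = sum_x G[y, x] p[x] Q[z, x].
p_yz = (G_t * p) @ Q.T
iota_joint = 0.5 * ((p_yz**2 / np.outer(p, p_z)).sum() - 1.0)

# Eq. (4.8): iota from the partition indicator matrix and G^{2t}.
ratio = np.diag(Q @ (G_2t * p) @ Q.T) / p_z
iota_partition = 0.5 * (ratio.sum() - 1.0)
print(iota_joint, iota_partition)
```

The two numbers agree to floating-point precision, which reflects the Markov property and the Bayes-rule identity used in Eq. (4.7).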

2. Well-mixed 2-partitioned graph

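We can re-write $\iota$ as

$$\iota = \frac{1}{2} \left\langle \left\langle \eta^2 \right\rangle_y \right\rangle_z = \frac{1}{2} \left\langle \left\langle \frac{(p(z|y) - p(z))^2}{p(z)^2} \right\rangle_z \right\rangle_y. \tag{4.9}$$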

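For a bisection $h$ of the graph, $z \in \{+1,-1\}$ and we have

$$p(z|x) = \tfrac{1}{2}(1 \pm h_x) \equiv \tfrac{1}{2}(1 + z h_x). \tag{4.10}$$

$$\begin{aligned}
p(z|y) &= \frac{1}{p(y)} \sum_x p(z,y,x) \\
&= \frac{1}{p(y)} \sum_x p(z|x)\,p(y|x)\,p(x) \\
&= \frac{1}{2} \sum_x (1 + z h_x)\,p(x|y) \\
&= \tfrac{1}{2}\left(1 + z \langle h|y \rangle\right).
\end{aligned} \tag{4.11}$$

$$\begin{aligned}
p(z) &= \sum_x p(z,x) = \sum_x p(z|x)\,p(x) \\
&= \frac{1}{2} \sum_x (1 + z h_x)\,p(x) \\
&= \tfrac{1}{2}\left(1 + z \langle h \rangle\right).
\end{aligned} \tag{4.12}$$

$$\begin{aligned}
p(z|y) - p(z) &= \tfrac{1}{2}\left(1 + z \langle h|y \rangle\right) - \tfrac{1}{2}\left(1 + z \langle h \rangle\right) \\
&= \tfrac{1}{2}\, z \left( \langle h|y \rangle - \langle h \rangle \right).
\end{aligned} \tag{4.13}$$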

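We then have

$$\begin{aligned}
\left\langle \frac{(p(z|y) - p(z))^2}{p(z)^2} \right\rangle_z &= \sum_{z = \pm 1} \frac{\tfrac{1}{4}\left( \langle h|y \rangle - \langle h \rangle \right)^2}{\tfrac{1}{2}\left(1 + z \langle h \rangle\right)} \\
&= \frac{\left( \langle h|y \rangle - \langle h \rangle \right)^2}{2} \sum_{z = \pm 1} \frac{1}{1 + z \langle h \rangle} \\
&= \frac{\left( \langle h|y \rangle - \langle h \rangle \right)^2}{1 - \langle h \rangle^2}.
\end{aligned} \tag{4.14}$$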

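The mutual information $I[Y;Z]$ can then be approximated as

$$\iota = \frac{1}{2}\, \frac{\left\langle \left( \langle h|y \rangle - \langle h \rangle \right)^2 \right\rangle_y}{1 - \langle h \rangle^2} = \frac{1}{2}\, \frac{\sigma_y^2\left( \langle h|y \rangle \right)}{1 - \langle h \rangle^2}. \tag{4.15}$$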

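Using Bayes rule $p^t(x|y)\,p(y) = p^t(y|x)\,p(x)$, we have

$$\langle h|y \rangle = \sum_x h_x\, p^t(x|y) = \sum_x h_x\, \frac{p^t(y|x)\,p(x)}{p(y)}. \tag{4.16}$$

$$\begin{aligned}
\left\langle \langle h|y \rangle^2 \right\rangle_y &= \sum_{y=1}^{n} p(y) \sum_{x,x'=1}^{n} h_x h_{x'}\, \frac{p^t(y|x)\,p(x)\,p^t(x'|y)}{p(y)} \\
&= \sum_{y=1}^{n} \sum_{x,x'=1}^{n} h_x h_{x'}\, p^t(y|x)\,p^t(x'|y)\,p(x).
\end{aligned} \tag{4.17}$$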

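Again, graph diffusion being a Markov process,

$$\left\langle \langle h|y \rangle^2 \right\rangle_y = \sum_{x,x'=1}^{n} h_x h_{x'}\, p^{2t}(x'|x)\,p(x) = \langle h_x h_{x'} \rangle_{2t}. \tag{4.18}$$

$$\sigma^2\left( \langle h|y \rangle \right) = \left\langle \langle h|y \rangle^2 \right\rangle_y - \langle h \rangle^2 = \langle h_x h_{x'} \rangle_{2t} - \langle h \rangle^2. \tag{4.19}$$

$$\iota = \frac{1}{2}\, \frac{\langle h_x h_{x'} \rangle_{2t} - \langle h \rangle^2}{1 - \langle h \rangle^2}. \tag{4.20}$$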
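For K = 2 the bisection form (4.20) should coincide with the general partition form (4.8). The following sketch checks this on a hypothetical toy graph with the degree prior; `diffusion_kernel` is an illustrative implementation of Eq. (3.2) and the variable names are assumptions of the example.

```python
import numpy as np

def diffusion_kernel(laplacian, p, t):
    """exp(-t L P^{-1}), computed through the symmetric matrix P^{-1/2} L P^{-1/2}."""
    s = np.sqrt(p)
    evals, evecs = np.linalg.eigh(laplacian / np.outer(s, s))
    return s[:, None] * ((evecs * np.exp(-t * evals)) @ evecs.T) / s[None, :]

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

d = A.sum(axis=1)
laplacian = np.diag(d) - A
p = d / d.sum()                                       # prior p(x) ∝ d_x

h = np.array([1, 1, -1, -1, -1, -1], dtype=float)     # unbalanced bisection
Q = np.vstack([(1 + h) / 2, (1 - h) / 2])             # rows: z = +1 and z = -1

t = 0.05
G_2t = diffusion_kernel(laplacian, p, 2 * t)

# Eq. (4.20): iota from the two-time correlation <h_x h_x'>_{2t}.
correlation = h @ (G_2t * p) @ h
h_mean = p @ h
iota_bisection = 0.5 * (correlation - h_mean**2) / (1 - h_mean**2)

# Eq. (4.8): iota from the partition indicator matrix.
ratio = np.diag(Q @ (G_2t * p) @ Q.T) / (Q @ p)
iota_partition = 0.5 * (ratio.sum() - 1.0)
print(iota_bisection, iota_partition)
```

At t = 0 the correlation equals 1 and $\iota$ = 1/2; for t > 0 the correlation decays towards $\langle h \rangle^2$ and $\iota$ decays towards zero.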

B. Fast-mixing graphs

When diffusion on a graph reaches its well-mixed limit in short times, we have $G^{2t} \approx I - 2t \Delta P^{-1}$. Thus, for a K-partition of a graph

$$\begin{aligned}
Q (G^{2t} P) Q^T &\approx Q (P - 2t \Delta) Q^T \\
&= Q P Q^T - 2t\, Q \Delta Q^T.
\end{aligned} \tag{4.21}$$

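For bisections, the short-time approximation of $\langle h_x h_{x'} \rangle_{2t}$ can be written as

$$\begin{aligned}
\langle h_x h_{x'} \rangle_{2t} &= \sum_{x,x'=1}^{n} h_{x'}\, p^{2t}(x',x)\, h_x \\
&= h^T G^{2t} P h \\
&\approx h^T (I - 2t \Delta P^{-1}) P h \\
&= h^T P h - 2t\, h^T \Delta h \\
&= 1 - 2t\, h^T \Delta h.
\end{aligned} \tag{4.22}$$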

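For fast-mixing graphs, the long-time and short-time approximations for $I[Y;Z]$ and $\langle h_x h_{x'} \rangle_{2t}$, respectively, hold simultaneously.

$$\begin{aligned}
I[Y;Z] \approx \iota &\approx \frac{1}{2} - t\, \frac{h^T \Delta h}{1 - \langle h \rangle^2} \\
\Rightarrow \frac{dI[Y;Z]}{dt} \approx \frac{d\iota}{dt} &\propto \begin{cases} \mathcal{A} & ;\ p(x) \propto 1 \\ \mathcal{N} & ;\ p(x) \propto d_x. \end{cases}
\end{aligned} \tag{4.23}$$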
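The proportionality in Eq. (4.23) can be checked with a finite-difference estimate of the slope of $\iota$ at very short times. In the sketch below the prior is normalized, so that $h^T P h = 1$ as in Eq. (4.22); under that assumption the slope works out to minus $n$ times the average cut for the uniform prior and minus $2m$ times the normalized cut for the degree prior. The toy graph, the step size and the function names are illustrative.

```python
import numpy as np

def diffusion_kernel(laplacian, p, t):
    """exp(-t L P^{-1}), computed through the symmetric matrix P^{-1/2} L P^{-1/2}."""
    s = np.sqrt(p)
    evals, evecs = np.linalg.eigh(laplacian / np.outer(s, s))
    return s[:, None] * ((evecs * np.exp(-t * evals)) @ evecs.T) / s[None, :]

def iota(laplacian, p, h, t):
    """Eq. (4.20): 0.5 * (<h h'>_{2t} - <h>^2) / (1 - <h>^2)."""
    correlation = h @ (diffusion_kernel(laplacian, p, 2 * t) * p) @ h
    h_mean = p @ h
    return 0.5 * (correlation - h_mean**2) / (1 - h_mean**2)

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

d = A.sum(axis=1)
n, two_m = len(d), d.sum()
laplacian = np.diag(d) - A
h = np.array([1, 1, -1, -1, -1, -1], dtype=float)

# Regularized cuts from Eq. (2.9).
quad = h @ laplacian @ h
average_cut = quad / n / (1 - h.mean()**2)
normalized_cut = quad / two_m / (1 - ((d * h).sum() / two_m)**2)

dt = 1e-6   # short time step for the finite difference
p_uniform, p_degree = np.ones(n) / n, d / two_m
slope_uniform = (iota(laplacian, p_uniform, h, dt) - iota(laplacian, p_uniform, h, 0.0)) / dt
slope_degree = (iota(laplacian, p_degree, h, dt) - iota(laplacian, p_degree, h, 0.0)) / dt
print(slope_uniform, -n * average_cut)
print(slope_degree, -two_m * normalized_cut)
```

The slopes are negative because information is lost as the walkers diffuse; the magnitude of the loss rate is what is proportional to the regularized cut.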

We have shown analytically that, for fast mixing

graphs, the heuristics introduced by Shi and Malik are

proportional to the rate of loss of relevance information.

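The error incurred in the approximations $I[Y;Z] \approx \iota$ and $\langle h_x h_{x'} \rangle_{2t} \approx 1 - 2t\, h^T \Delta h$ can be defined as

$$E_1(t) = \left| \frac{I[Y;Z](t) - \iota(t)}{I[Y;Z](t)} \right| \tag{4.24}$$

$$E_0(t) = \left| \frac{\langle h_x h_{x'} \rangle_{2t} - (1 - 2t\, h^T \Delta h)}{\langle h_x h_{x'} \rangle_{2t}} \right|. \tag{4.25}$$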

5. NUMERICAL EXPERIMENTS

The validity of the two approximations can be seen in a typical plot of $E_1(t)$ and $E_0(t)$ as a function of normalized diffusion time $\tilde{t} = t/\tau$, for the two different choices of prior distributions over the nodes. $E_1$, as seen in Fig. 1, is often found to be non-monotonic and sometimes exhibits oscillations. This suggests defining $E_\infty$, a modified monotonic ‘$E_1$’:

$$E_\infty(t) \equiv \max_{t' \geq t} E_1(t'). \tag{5.1}$$

We don’t need to define a monotonic form for $E_0$ since this error is always found to be monotonically increasing in time.

By fast-mixing graphs, we mean graphs which become well-mixed in short times, i.e. graphs for which both the long-time and short-time approximations hold simultaneously within a certain range of time $\tilde{t}^*_- \leq \tilde{t} \leq \tilde{t}^*_+$, as illustrated in Fig. 1, where we define

$$E(t) = \max\left( E_\infty(t), E_0(t) \right) \tag{5.2}$$

$$E^* = \min_t E(t) \tag{5.3}$$

$$\tilde{t}^*_- = \min\left( \operatorname*{argmin}_{\tilde{t}} E(\tilde{t}) \right) \tag{5.4}$$

$$\tilde{t}^*_+ = \max\left( \operatorname*{argmin}_{\tilde{t}} E(\tilde{t}) \right). \tag{5.5}$$
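The quantities in Eqs. (4.24), (4.25) and (5.1) to (5.5) can be sketched on a discrete grid of normalized times. The code below is an illustration under stated assumptions, not the procedure used to produce the figures: the toy graph, the time grid, the degree prior and the function name `error_curves` are all hypothetical, and the minimum in Eq. (5.3) is taken over the grid only.

```python
import numpy as np

def error_curves(A, h, prior, t_norm):
    """Return (E_star, t_minus, t_plus, E_inf, E0) on the grid t_norm of normalized times."""
    p = prior / prior.sum()
    laplacian = np.diag(A.sum(axis=1)) - A
    s = np.sqrt(p)
    evals, evecs = np.linalg.eigh(laplacian / np.outer(s, s))
    tau = 1.0 / evals[evals > 1e-9].min()          # characteristic diffusion time
    Q = np.vstack([(1 + h) / 2, (1 - h) / 2])      # hard bisection as indicator matrix
    p_z, h_mean, quad = Q @ p, p @ h, h @ laplacian @ h
    E1, E0 = [], []
    for t in t_norm * tau:
        G = s[:, None] * ((evecs * np.exp(-t * evals)) @ evecs.T) / s[None, :]
        p_yz = (G * p) @ Q.T                       # joint p(y, z) at time t
        exact = (p_yz * np.log(p_yz / np.outer(p, p_z))).sum()
        correlation = h @ ((G @ G) * p) @ h        # <h_x h_x'>_{2t}
        iota = 0.5 * (correlation - h_mean**2) / (1 - h_mean**2)
        E1.append(abs((exact - iota) / exact))                           # Eq. (4.24)
        E0.append(abs((correlation - (1 - 2 * t * quad)) / correlation)) # Eq. (4.25)
    E1, E0 = np.array(E1), np.array(E0)
    E_inf = np.maximum.accumulate(E1[::-1])[::-1]  # Eq. (5.1): max over t' >= t
    E = np.maximum(E_inf, E0)                      # Eq. (5.2)
    E_star = E.min()                               # Eq. (5.3)
    minimizers = np.flatnonzero(E == E_star)
    return E_star, t_norm[minimizers.min()], t_norm[minimizers.max()], E_inf, E0

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

h = np.array([1, 1, 1, -1, -1, -1], dtype=float)
t_norm = np.linspace(0.02, 3.0, 150)
E_star, t_minus, t_plus, E_inf, E0 = error_curves(A, h, A.sum(axis=1), t_norm)
print(E_star, t_minus, t_plus)
```

The reversed cumulative maximum makes `E_inf` non-increasing by construction, which is the monotonic envelope of $E_1$ described in the text.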

FIG. 1: $E_1$ and $E_0$ vs normalized diffusion time for two choices of priors over the graph nodes. $E_1$ (red) typically tends to have a non-monotonic behavior which motivates defining a monotonic $E_\infty$ (green).

Note that the use of $E_\infty$ instead of $E_1$ over-estimates the value of $E^*$; the calculated $E^*$ is therefore an upper bound.

Graphs were drawn randomly from a Stochastic Block Model (SBM) distribution [10], with block cardinality 2, to analyze the distribution of $E^*$, $\tilde{t}^*_-$ and $\tilde{t}^*_+$. As is commonly done in community detection [11], for a graph of $n$ nodes, the average degree per node is fixed at $n/4$ for graphs drawn from the SBM distribution: two nodes are connected with probability $p_+$ if they belong to the same block, but with probability $p_- < p_+$, if they belong to different blocks. The two probabilities are, thus, constrained by the relation

$$p_+ \left( \frac{n}{2} - 1 \right) + p_- \left( \frac{n}{2} \right) = \frac{n}{4} \tag{5.6}$$

leaving only one free parameter $p_-$ that tunes the ‘modularity’ of graphs in the distribution. Starting with a graph drawn from a distribution specified by a $p_-$ value and specifying an initial cluster assignment as given by the SBM distribution, we make local moves (adding or deleting an edge in the graph and/or reassigning a node’s cluster label) and search exhaustively over this move-set for local minima of $E^*$. Fig. 2 compares the values of $E^*$ and $[\tilde{t}^*_-, \tilde{t}^*_+]$ for graphs obtained in this systematic search, starting with a graph drawn from a distribution with $p_- = 0.02$ and $n = \{16, 32, 64\}$. We note that the scatter plots for graphs of different sizes collapse on one another when $E^*$ is plotted against normalized time, confirming the Fiedler value $1/\tau$ to be an appropriate characteristic diffusion time-scale as used in [6]. A plot of $E^*$ against actual diffusion time shows that the scatter plots of graphs of different sizes no longer collapse.

Having shown analytically that for fast mixing graphs, the regularized mincut is approximately the rate of loss of relevance information, it would be instructive to compare