An information-theoretic derivation of min-cut based clustering
Anil Raj
Department of Applied Physics and Applied Mathematics
Columbia University, New York∗
Chris H. Wiggins
Department of Applied Physics and Applied Mathematics
Center for Computational Biology and Bioinformatics
Columbia University, New York†
(Dated: November 26, 2008)
Min-cut clustering, based on minimizing one of two heuristic cost-functions proposed by Shi and
Malik, has spawned tremendous research, both analytic and algorithmic, in the graph partitioning
and image segmentation communities over the last decade. It is however unclear if these heuristics
can be derived from a more general principle facilitating generalization to new problem settings.
Motivated by an existing graph partitioning framework, we derive relationships between optimizing
relevance information, as defined in the Information Bottleneck method, and the regularized cut in
a K-partitioned graph. For fast mixing graphs, we show that the cost functions introduced by Shi
and Malik can be well approximated as the rate of loss of predictive information about the location
of random walkers on the graph. For graphs generated from a stochastic algorithm designed to
model community structure, the optimal information theoretic partition and the optimal min-cut
partition are shown to be the same with high probability.
Keywords: graphs, clustering, information theory, min-cut, information bottleneck, graph diffusion
1. INTRODUCTION
Min-cut based graph partitioning has been used successfully to find clusters in networks, with applications in image segmentation as well as clustering biological and sociological networks. The central idea is to develop fast and efficient algorithms that optimally cut the edges between graph nodes, resulting in a separation of graph nodes into clusters. In particular, since Shi and Malik showed [1] that the average cut and the normalized cut (defined below) were useful heuristics to be optimized, there has been tremendous research in constructing the best normalized-cut-based cost function in the image segmentation community.
The Information Bottleneck (IB) method [2, 3] is a clustering technique, based on rate-distortion theory [4], that has been successfully applied in a wide variety of contexts including clustering word documents and gene-expression profiles [5]. The IB method is also capable of learning clusters in graphs and has been used successfully for synthetic and actual networks [6]. In the hard clustering case, given the diffusive probability distribution over a graph, IB optimally assigns probability distributions, associated with nodes, into distinct groups. These assignment rules define a separation of the graph nodes into clusters.
We here illustrate how minimizing the two cut-based heuristics introduced by Shi and Malik can be well approximated by the rate of loss of relevance information, defined in the IB method applied to clustering graphs. To establish these relations, we must first define the graphs to be partitioned; we assume hard-clustering and the cluster cardinality to be K. We show, numerically, that maximizing mutual information and minimizing regularized cut amount to the same partition with high probability, for more modular 32-node graphs, where modularity is defined by the probability of inter-cluster edge connections in the Stochastic Block Model for graphs (see Numerical Experiments). We also show that the optimization goal of maximizing relevance information is equivalent to minimizing the regularized cut for 16-node graphs.[12]
∗Electronic address: ar2384@columbia.edu
†Electronic address: chris.wiggins@columbia.edu
2. THE MIN-CUT PROBLEM
Following [7], for an undirected, unweighted graph $G = (V,E)$ with $n$ nodes and $m$ edges, represented[13] by its adjacency matrix $A := \{A_{xy} = 1 \iff x \sim y\}$, we define for two not necessarily disjoint sets of nodes $V_+, V_- \subseteq V$, the association
$$W(V_+,V_-) = \sum_{x \in V_+,\, y \in V_-} A_{xy}. \tag{2.1}$$
We define a bisection of $V$ into $V_\pm$ if $V_+ \cup V_- = V$ and $V_+ \cap V_- = \emptyset$. For a bisection of $V$ into $V_+$ and $V_-$, the ‘cut’ is defined as $c(V_+,V_-) = W(V_+,V_-)$.
We also quantify the size of a set $V_+ \subseteq V$ in terms of the number of nodes in the set $V_+$ or the number of edges with at least one node in the set $V_+$:
$$\omega(V_+) = \sum_{x \in V_+} 1, \qquad \Omega(V_+) = \sum_{x \in V_+} d_x, \tag{2.2}$$
where $d_x$ is the degree of node $x$.
arXiv:0811.4208v1 [stat.ML] 26 Nov 2008
Shi and Malik [1] defined a pair of regularized cuts, for a bisection of $V$ into $V_+$ and $V_-$; the average cut was defined as
$$\mathcal{A} = \frac{W(V_+,V_-)}{\omega(V_+)} + \frac{W(V_+,V_-)}{\omega(V_-)} \tag{2.3}$$
and the normalized cut was defined as
$$\mathcal{N} = \frac{W(V_+,V_-)}{\Omega(V_+)} + \frac{W(V_+,V_-)}{\Omega(V_-)}. \tag{2.4}$$
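This definition can be generalized, for a $K$-partition of $V$ into $V_1, V_2, \ldots, V_K$ [7], to
$$\mathcal{A} = \sum_j \frac{W(V_j,\bar{V}_j)}{\omega(V_j)} \tag{2.5}$$
$$\mathcal{N} = \sum_j \frac{W(V_j,\bar{V}_j)}{\Omega(V_j)} \tag{2.6}$$
where $\bar{V}_j = V \setminus V_j$.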
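The definitions (2.1), (2.5) and (2.6) translate directly into a few lines of code. The following is a minimal sketch using plain Python lists; the function names and the toy graph (two triangles joined by a single edge) are assumptions made for illustration and do not come from the paper.

```python
def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected, unweighted graph."""
    A = [[0] * n for _ in range(n)]
    for x, y in edges:
        A[x][y] = A[y][x] = 1
    return A


def association(A, S, T):
    """W(S, T): sum of A[x][y] over x in S and y in T, as in Eq. (2.1)."""
    return sum(A[x][y] for x in S for y in T)


def regularized_cuts(A, parts):
    """Average cut and normalized cut of a K-partition, Eqs. (2.5) and (2.6)."""
    nodes = set(range(len(A)))
    degree = [sum(row) for row in A]
    average_cut = normalized_cut = 0.0
    for part in parts:
        w = association(A, part, nodes - set(part))
        average_cut += w / len(part)                        # omega(V_j): number of nodes
        normalized_cut += w / sum(degree[x] for x in part)  # Omega(V_j): sum of degrees
    return average_cut, normalized_cut


# Toy graph (an assumption for illustration): two triangles joined by one edge.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = adjacency(6, EDGES)
average_cut, normalized_cut = regularized_cuts(A, [{0, 1, 2}, {3, 4, 5}])
print(average_cut, normalized_cut)
```

For the bisection that separates the two triangles, only the bridging edge is cut, so the average cut is 1/3 + 1/3 and the normalized cut is 1/7 + 1/7.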
For the graph $G$, we can define the graph Laplacian $\Delta = D - A$ where $D$ is a diagonal matrix of vertex degrees. For a bisection of $V$, we also define the partition indicator vector $h$ [8]
$$h_x = \begin{cases} +1 & \forall x \in V_+ \\ -1 & \forall x \in V_-. \end{cases} \tag{2.7}$$
Specifying two ‘prior’ probability distributions over the set of nodes $V$: (i) $p(x) \propto 1$ and (ii) $p(x) \propto d_x$, we then define the average of $h$ to be
$$\bar{h} = \frac{\sum_{x \in V} h_x}{n}, \qquad \langle h \rangle = \frac{\sum_{x \in V} d_x h_x}{2m}. \tag{2.8}$$
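The cut, as defined by Fiedler [8], and the regularized cuts, as defined by Shi and Malik [1], can then be written in terms of $h$ as (see Appendix)
$$c = \frac{1}{4} h^T \Delta h, \qquad \mathcal{A} = \frac{1}{n}\,\frac{h^T \Delta h}{1 - \bar{h}^2}, \qquad \mathcal{N} = \frac{1}{2m}\,\frac{h^T \Delta h}{1 - \langle h \rangle^2}. \tag{2.9}$$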
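The indicator-vector expressions in Eq. (2.9) can be checked numerically against the set-based definitions (2.1) to (2.4). The sketch below does this for one unbalanced bisection; the helper names and the toy adjacency matrix are assumptions for illustration only.

```python
def indicator_cuts(A, h):
    """Cut, average cut and normalized cut from the indicator vector h, Eq. (2.9)."""
    n = len(A)
    degree = [sum(row) for row in A]
    two_m = sum(degree)
    # h^T (D - A) h, with D the diagonal degree matrix
    quad = sum(degree[x] * h[x] ** 2 for x in range(n)) - sum(
        h[x] * A[x][y] * h[y] for x in range(n) for y in range(n))
    h_bar = sum(h) / n                                      # average under p(x) ∝ 1
    h_avg = sum(d * v for d, v in zip(degree, h)) / two_m   # average under p(x) ∝ d_x
    return quad / 4, quad / (n * (1 - h_bar ** 2)), quad / (two_m * (1 - h_avg ** 2))


def direct_cuts(A, h):
    """The same three quantities from the set-based definitions (2.1)-(2.4)."""
    n = len(A)
    degree = [sum(row) for row in A]
    plus = [x for x in range(n) if h[x] > 0]
    minus = [x for x in range(n) if h[x] < 0]
    cut = sum(A[x][y] for x in plus for y in minus)
    average_cut = cut / len(plus) + cut / len(minus)
    normalized_cut = (cut / sum(degree[x] for x in plus)
                      + cut / sum(degree[x] for x in minus))
    return cut, average_cut, normalized_cut


# Toy graph (an assumption for illustration): two triangles joined by the edge (2, 3).
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
h = [1, 1, -1, -1, -1, -1]   # an unbalanced bisection: V+ = {0, 1}
print(indicator_cuts(A, h))
print(direct_cuts(A, h))
```

Both functions should agree up to floating-point rounding for any bisection in which neither side is empty.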
More generally, for a $K$-partition, we define the partition indicator matrix $Q$ as
$$Q_{zx} \equiv p(z|x) = 1 \quad \forall x \in z \tag{2.10}$$
(and $Q_{zx} = 0$ otherwise), where $z \in \{V_1, V_2, \ldots, V_K\}$, and define $P$ as a diagonal matrix of the ‘prior’ probability distribution over the nodes. The regularized cut can then be generalized as
$$\mathcal{C} = \sum_j \frac{[Q \Delta Q^T]_{jj}}{[Q P Q^T]_{jj}} \tag{2.11}$$
where for $p(x) \propto 1$, $\mathcal{C} = \mathcal{A}$; and for $p(x) \propto d_x$, $\mathcal{C} = \mathcal{N}$.
Inferring the optimal $h$ (or $Q$), however, has been shown to be an NP-hard combinatorial optimization problem [9].
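To make the combinatorial nature of the problem concrete, the sketch below enumerates every bisection of a small graph and keeps the one with the smallest normalized cut. It is an illustrative brute-force search, not a method proposed in the paper; the function names and the toy graph are assumptions, and the search visits $2^{n-1} - 1$ bisections, so it is only usable for very small $n$.

```python
from itertools import product


def normalized_cut(A, h):
    """Normalized cut of the bisection encoded by h (entries +1 or -1), Eq. (2.4)."""
    n = len(A)
    degree = [sum(row) for row in A]
    cut = sum(A[x][y] for x in range(n) for y in range(n) if h[x] > 0 > h[y])
    vol_plus = sum(d for d, v in zip(degree, h) if v > 0)
    vol_minus = sum(d for d, v in zip(degree, h) if v < 0)
    return cut / vol_plus + cut / vol_minus


def best_bisection(A):
    """Exhaustive search over bisections; assumes a graph without isolated nodes."""
    n = len(A)
    best_h, best_val = None, float("inf")
    for tail in product((1, -1), repeat=n - 1):
        h = (1,) + tail        # fix h_0 = +1: h and -h describe the same bisection
        if -1 not in h:        # skip the trivial case with an empty V-
            continue
        val = normalized_cut(A, h)
        if val < best_val:
            best_h, best_val = h, val
    return best_h, best_val


# Toy graph (an assumption for illustration): two triangles joined by the edge (2, 3).
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
best_h, best_val = best_bisection(A)
print(best_h, best_val)
```

On this graph the search selects the bisection that cuts only the bridging edge, separating the two triangles.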
3. INFORMATION BOTTLENECK
Rate-distortion theory, which provides the foundations for lossy data compression, formulates clustering in terms of a compression problem; it determines the code with minimum average length such that information can be transmitted without exceeding some specified distortion. Here, the model-complexity, or rate, is measured by the mutual information between the data and their representative codewords (average number of bits used to store a data point). Simpler models correspond to smaller rates but they typically suffer from relatively high distortion. The distortion measure, which can be identified with loss functions, usually depends on the problem; in the simplest of cases, it is the variance of the difference between data and their representatives.
The Information Bottleneck (IB) method [3] proposes the use of mutual information as a natural distortion measure. In this method, the data are compressed into clusters while maximizing the amount of information that the ‘cluster representation’ preserves about some specified relevance variable. For example, in clustering word documents, one could use the ‘topic’ of a document as the relevance variable.
For a graph $G$, let $X$ be a random variable over graph nodes, $Y$ be the relevance variable and $Z$ be the random variable over clusters. Graph partitioning using the IB method [6] learns a probabilistic cluster assignment function $p(z|x)$ which gives the probability that a given node $x$ belongs to cluster $z$. The optimal $p(z|x)$ minimizes the mutual information between $X$ and $Z$, while minimizing the loss of predictive information between $Z$ and $Y$. This complexity–fidelity trade-off can be expressed in terms of a functional to be minimized
$$F[p(z|x)] = -I[Y;Z] + T\,I[X;Z] \tag{3.1}$$
where the temperature $T$ parameterizes the relative importance of precision over complexity. As $T \to 0$, we reach the ‘hard clustering’ limit where each node is assigned with unit probability to one cluster (i.e., $p(z|x) \in \{0,1\}$).
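As a concrete reading of Eq. (3.1), the sketch below evaluates the functional for a given joint distribution $p(x,y)$ and assignment $p(z|x)$, using the Markov condition $Z - X - Y$ to build $p(z,y)$. The function names and the toy distribution are assumptions for illustration; the code only evaluates $F$ and does not perform the IB optimization itself.

```python
from math import log


def mutual_information(joint):
    """I[A;B] in nats for a joint distribution stored as joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * log(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)


def ib_functional(p_xy, q_zx, T):
    """F = -I[Y;Z] + T * I[X;Z] for p_xy[x][y] = p(x, y) and q_zx[z][x] = p(z|x)."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(q_zx)
    p_x = [sum(row) for row in p_xy]
    # Markov chain Z - X - Y: p(z, y) = sum_x p(z|x) p(x, y)
    p_zy = [[sum(q_zx[z][x] * p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
            for z in range(n_z)]
    # p(z, x) = p(z|x) p(x)
    p_zx = [[q_zx[z][x] * p_x[x] for x in range(n_x)] for z in range(n_z)]
    return -mutual_information(p_zy) + T * mutual_information(p_zx)


# Toy joint distribution (an assumption for illustration): nodes 0 and 1 always
# lead to y = 0, nodes 2 and 3 always lead to y = 1.
p_xy = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
two_clusters = [[1, 1, 0, 0], [0, 0, 1, 1]]   # hard assignment p(z|x)
one_cluster = [[1, 1, 1, 1]]                  # everything in a single cluster
print(ib_functional(p_xy, two_clusters, T=0.5))
print(ib_functional(p_xy, one_cluster, T=0.5))
```

In this toy case the two-cluster assignment preserves all of the relevance information ($\ln 2$ nats) at a complexity cost of $T \ln 2$, while the single cluster has $F = 0$.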
Graph clustering, as formulated in terms of the IB method, requires a joint distribution $p(y,x)$ to be defined on the graph; we use the distribution given by continuous graph diffusion as it naturally captures topological information about the network [6]. The relevance variable $Y$ then ranges over the nodes of the graph and is defined as the node at which a random walker ends at time $t$ if the random walker starts at node $x$ at time 0. For continuous time diffusion, the conditional distribution $p_t(y|x)$ is given as
$$G^t = p_t(y|x) = e^{-t \Delta P^{-1}} \tag{3.2}$$
where $\Delta$ is the graph Laplacian and $P$ a diagonal matrix of the prior distribution over the graph nodes, as described earlier. The characteristic diffusion time scale $\tau$ of the system is given by the inverse of the smallest non-zero eigenvalue of the diffusion operator exponent $\Delta P^{-1}$ and characterizes the slowest decaying mode in the system. To calculate the joint distribution $p(y,x)$ from the conditional $G^t$, we must specify an initial or prior distribution[14]; we use the two different priors $p(x)$, used earlier to calculate the expected value of $h$: (i) $p(x) \propto 1$ and (ii) $p(x) \propto d_x$.
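The diffusion kernel of Eq. (3.2) can be sketched without external libraries by computing the matrix exponential with scaling and squaring of a truncated Taylor series. This numerical scheme, the helper names and the toy graph are assumptions made for illustration (the paper does not say how the exponential was evaluated); in practice a library routine for the matrix exponential would normally be preferred.

```python
def mat_mul(X, Y):
    """Product of two square matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def mat_exp(M, squarings=10, terms=20):
    """exp(M) by scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scaled = [[v / 2 ** squarings for v in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + s for r, s in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def diffusion_kernel(A, prior, t):
    """G[y][x] = p_t(y|x) = [exp(-t * Laplacian * P^{-1})]_{yx}, Eq. (3.2)."""
    n = len(A)
    degree = [sum(row) for row in A]
    exponent = [[-t * ((degree[y] if x == y else 0) - A[y][x]) / prior[x]
                 for x in range(n)] for y in range(n)]
    return mat_exp(exponent)


# Toy graph (an assumption for illustration): two triangles joined by the edge (2, 3).
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
degree = [sum(row) for row in A]
uniform_prior = [1 / len(A)] * len(A)              # p(x) ∝ 1
degree_prior = [d / sum(degree) for d in degree]   # p(x) ∝ d_x
G = diffusion_kernel(A, degree_prior, t=0.05)
print([round(sum(G[y][x] for y in range(len(A))), 6) for x in range(len(A))])
```

Each column of `G` is a conditional distribution over end nodes and should therefore sum to one, and the chosen prior should be left unchanged by the kernel, which is the steady-state property noted in footnote [14].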
4. RATE OF INFORMATION LOSS IN GRAPH DIFFUSION
We analyze here the rate of loss of predictive information between the relevance variable $Y$ and the cluster variable $Z$, during diffusion on a graph $G$, after the graph nodes have been hard-partitioned into $K$ clusters.
A. Well-mixed limit of graph diffusion
For a given partition $Q$ of the graph, defined in Eqn. (2.10), we approximate the mutual information $I[Y;Z]$ when diffusion on the graph reaches its well-mixed limit.
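We introduce the dependence $\eta(y,z)$ such that
$$p(y,z) = p(y)p(z)(1 + \eta). \tag{4.1}$$
This implies $\langle \eta \rangle_y = \langle \eta \rangle_z = 0$ and $\langle \langle \eta^2 \rangle_z \rangle_y = \langle \eta \rangle$ where $\langle \cdot \rangle$ denotes expectation over the joint distribution and $\langle \cdot \rangle_y$ and $\langle \cdot \rangle_z$ denote expectation over the corresponding marginals.
In the well-mixed limit, we have $\eta \ll 1$. The predictive information (expressed in nats) can then be approximated as:
$$\begin{aligned}
I[Y;Z] &= \left\langle \ln \frac{p(z,y)}{p(z)p(y)} \right\rangle \\
&= \left\langle \left\langle (1+\eta)\ln(1+\eta) \right\rangle_y \right\rangle_z \\
&\approx \left\langle \left\langle (1+\eta)\left(\eta - \tfrac{1}{2}\eta^2\right) \right\rangle_y \right\rangle_z \\
&\approx \left\langle \left\langle \eta + \tfrac{1}{2}\eta^2 \right\rangle_y \right\rangle_z \\
&= \frac{1}{2} \left\langle \left\langle \eta^2 \right\rangle_y \right\rangle_z
\end{aligned} \tag{4.2}$$
$$\begin{aligned}
\phantom{I[Y;Z]} &= \frac{1}{2} \sum_{y,z} p(y)p(z) \left( \frac{p(z,y)}{p(z)p(y)} - 1 \right)^2 \\
&= \frac{1}{2} \left( \sum_{y,z} \frac{p(y,z)^2}{p(y)p(z)} - 1 \right) \equiv \iota.
\end{aligned} \tag{4.3}$$
Here, we define $\iota$ as a first-order approximation to $I[Y;Z]$ in the well-mixed limit of graph diffusion.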
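The quality of the approximation $I[Y;Z] \approx \iota$ can be illustrated on a small joint distribution of the form (4.1). The sketch below uses a hypothetical 2×2 table with uniform marginals and $\eta = \pm\epsilon$; both the table and the function names are assumptions chosen for illustration.

```python
from math import log


def exact_and_approximate_information(joint):
    """Return (I[Y;Z] in nats, iota) for a joint distribution stored as joint[y][z]."""
    py = [sum(row) for row in joint]
    pz = [sum(col) for col in zip(*joint)]
    exact = sum(p * log(p / (py[y] * pz[z]))
                for y, row in enumerate(joint)
                for z, p in enumerate(row) if p > 0)
    # Eq. (4.3): iota = (1/2) * (sum_{y,z} p(y,z)^2 / (p(y) p(z)) - 1)
    iota = 0.5 * (sum(p * p / (py[y] * pz[z])
                      for y, row in enumerate(joint)
                      for z, p in enumerate(row)) - 1)
    return exact, iota


def perturbed_joint(eps):
    """p(y,z) = p(y) p(z) (1 + eta) with uniform marginals and eta = +eps or -eps."""
    return [[0.25 * (1 + eps), 0.25 * (1 - eps)],
            [0.25 * (1 - eps), 0.25 * (1 + eps)]]


for eps in (0.5, 0.1, 0.01):
    exact, iota = exact_and_approximate_information(perturbed_joint(eps))
    print(eps, exact, iota, abs(exact - iota) / exact)
```

For this table $\iota = \epsilon^2/2$ exactly, and the relative error of the approximation shrinks as $\epsilon$ decreases, which is the behaviour expected in the well-mixed limit.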
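1. Well-mixed K-partitioned graph
As in the IB method, the Markov condition $Z - X - Y$ allows us to make several simplifications for the conditional distributions and associated information theoretic measures. For a $K$-partition $Q$ of the graph, we have
$$\begin{aligned}
p(y,z) &= \sum_x p(x,y,z) \\
&= \sum_x p(z|y,x)\,p(y|x)\,p(x) \\
&= \sum_x p(z|x)\,p(y|x)\,p(x) \equiv Q P G^{tT}.
\end{aligned} \tag{4.4}$$
$$\begin{aligned}
p(y,z)^2 &= \left( \sum_x p(z|x)\,p(y|x)\,p(x) \right)^2 \\
&= \sum_{x,x'=1}^{n} p(z|x)\,p(y|x)\,p(x)\,p(z|x')\,p(y|x')\,p(x') \\
&= \sum_{x,x'=1}^{n} Q_{zx} G^t_{yx} P_x\, Q_{zx'} G^t_{yx'} P_{x'}.
\end{aligned} \tag{4.5}$$
$$\begin{aligned}
p(z) &= \sum_x p(z|x)\,p(x) \\
&= \sum_x Q_{zx} P_x.
\end{aligned} \tag{4.6}$$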
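Graph diffusion being a Markov process, we have $\sum_{y=1}^{n} G^t_{x'y} G^t_{yx} = G^{2t}_{x'x}$. Using this and Bayes rule $G^t_{yx} P_x = G^t_{xy} P_y$, we have
$$\begin{aligned}
\iota &= \frac{1}{2}\left[ \sum_{y,z} \frac{\sum_{x,x'=1}^{n} Q_{zx} G^t_{yx} P_x\, Q_{zx'} G^t_{yx'} P_{x'}}{\left( \sum_{x''} Q_{zx''} P_{x''} \right) P_y} - 1 \right] \\
&= \frac{1}{2}\left[ \sum_{y,z} \frac{\sum_{x,x'=1}^{n} Q_{zx} Q_{zx'} P_y\, G^t_{x'y} G^t_{yx} P_x}{\left( \sum_{x''} Q_{zx''} P_{x''} \right) P_y} - 1 \right] \\
&= \frac{1}{2}\left[ \sum_{z=1}^{K} \frac{\sum_{x,x'=1}^{n} Q_{zx} Q_{zx'} \left( \sum_{y=1}^{n} G^t_{x'y} G^t_{yx} \right) P_x}{\sum_{x''} Q_{zx''} P_{x''}} - 1 \right] \\
&= \frac{1}{2}\left[ \sum_{z=1}^{K} \frac{\sum_{x,x'=1}^{n} Q_{zx} Q_{zx'} G^{2t}_{x'x} P_x}{\sum_{x''} Q_{zx''} P_{x''}} - 1 \right].
\end{aligned} \tag{4.7}$$
In the hard clustering case, $\sum_x Q_{zx} P_x = p(z) = [QPQ^T]_{zz}$ and we have
$$\iota = \frac{1}{2}\left[ \sum_{z=1}^{K} \frac{[Q(G^{2t}P)Q^T]_{zz}}{[QPQ^T]_{zz}} - 1 \right]. \tag{4.8}$$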
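2. Well-mixed 2-partitioned graph
We can re-write $\iota$ as
$$\iota = \frac{1}{2} \left\langle \left\langle \eta^2 \right\rangle_y \right\rangle_z = \frac{1}{2} \left\langle \left\langle \frac{(p(z|y) - p(z))^2}{p(z)^2} \right\rangle_z \right\rangle_y. \tag{4.9}$$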
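For a bisection $h$ of the graph, $z \in \{+1,-1\}$ and we have
$$p(z|x) = \frac{1}{2}(1 \pm h_x) \equiv \frac{1}{2}(1 + z h_x). \tag{4.10}$$
$$\begin{aligned}
p(z|y) &= \frac{1}{p(y)} \sum_x p(z,y,x) \\
&= \frac{1}{p(y)} \sum_x p(z|x)\,p(y|x)\,p(x) \\
&= \frac{1}{2} \sum_x (1 + z h_x)\,p(x|y) \\
&= \frac{1}{2}\left(1 + z \langle h|y \rangle\right).
\end{aligned} \tag{4.11}$$
$$\begin{aligned}
p(z) &= \sum_x p(z,x) = \sum_x p(z|x)\,p(x) \\
&= \frac{1}{2} \sum_x (1 + z h_x)\,p(x) \\
&= \frac{1}{2}\left(1 + z \langle h \rangle\right).
\end{aligned} \tag{4.12}$$
$$\begin{aligned}
p(z|y) - p(z) &= \frac{1}{2}\left(1 + z \langle h|y \rangle\right) - \frac{1}{2}\left(1 + z \langle h \rangle\right) \\
&= \frac{1}{2} z \left( \langle h|y \rangle - \langle h \rangle \right).
\end{aligned} \tag{4.13}$$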
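We then have
$$\begin{aligned}
\left\langle \frac{(p(z|y) - p(z))^2}{p(z)^2} \right\rangle_z &= \sum_{z \in \{+1,-1\}} \frac{\tfrac{1}{4}\left( \langle h|y \rangle - \langle h \rangle \right)^2}{\tfrac{1}{2}\left(1 + z \langle h \rangle\right)} \\
&= \frac{\left( \langle h|y \rangle - \langle h \rangle \right)^2}{2} \sum_{z \in \{+1,-1\}} \frac{1}{1 + z \langle h \rangle} \\
&= \frac{\left( \langle h|y \rangle - \langle h \rangle \right)^2}{1 - \langle h \rangle^2}.
\end{aligned} \tag{4.14}$$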
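The mutual information $I[Y;Z]$ can then be approximated as
$$\iota = \frac{1}{2}\,\frac{\left\langle \left( \langle h|y \rangle - \langle h \rangle \right)^2 \right\rangle_y}{1 - \langle h \rangle^2} = \frac{1}{2}\,\frac{\sigma^2_y(\langle h|y \rangle)}{1 - \langle h \rangle^2}. \tag{4.15}$$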
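Using Bayes rule $p_t(x|y)\,p(y) = p_t(y|x)\,p(x)$, we have
$$\langle h|y \rangle = \sum_x h_x\, p_t(x|y) = \sum_x h_x\, \frac{p_t(y|x)\,p(x)}{p(y)}. \tag{4.16}$$
$$\begin{aligned}
\left\langle \langle h|y \rangle^2 \right\rangle_y &= \sum_{y=1}^{n} p(y) \sum_{x,x'=1}^{n} h_x h_{x'}\, \frac{p_t(y|x)\,p(x)\,p_t(x'|y)}{p(y)} \\
&= \sum_{y=1}^{n} \sum_{x,x'=1}^{n} h_x h_{x'}\, p_t(y|x)\,p_t(x'|y)\,p(x).
\end{aligned} \tag{4.17}$$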
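Again, graph diffusion being a Markov process,
$$\left\langle \langle h|y \rangle^2 \right\rangle_y = \sum_{x,x'=1}^{n} h_x h_{x'}\, p_{2t}(x'|x)\,p(x) = \langle h_x h_{x'} \rangle_{2t}. \tag{4.18}$$
$$\sigma^2(\langle h|y \rangle) = \left\langle \langle h|y \rangle^2 \right\rangle_y - \langle h \rangle^2 = \langle h_x h_{x'} \rangle_{2t} - \langle h \rangle^2. \tag{4.19}$$
$$\iota = \frac{1}{2}\,\frac{\langle h_x h_{x'} \rangle_{2t} - \langle h \rangle^2}{1 - \langle h \rangle^2}. \tag{4.20}$$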
B. Fast-mixing graphs
When diffusion on a graph reaches its well-mixed limit in short times, we have $G^{2t} \approx I - 2t\Delta P^{-1}$. Thus, for a $K$-partition of a graph
$$Q(G^{2t}P)Q^T \approx Q(P - 2t\Delta)Q^T = QPQ^T - 2t\,Q\Delta Q^T. \tag{4.21}$$
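For bisections, the short-time approximation of $\langle h_x h_{x'} \rangle_{2t}$ can be written as
$$\begin{aligned}
\langle h_x h_{x'} \rangle_{2t} &= \sum_{x,x'=1}^{n} h_{x'}\, p_{2t}(x',x)\, h_x \\
&= h^T G^{2t} P h \\
&\approx h^T (I - 2t\Delta P^{-1}) P h \\
&= h^T P h - 2t\, h^T \Delta h \\
&= 1 - 2t\, h^T \Delta h.
\end{aligned} \tag{4.22}$$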
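For fast-mixing graphs, the long-time and short-time approximations for $I[Y;Z]$ and $\langle h_x h_{x'} \rangle_{2t}$, respectively, hold simultaneously.
$$I[Y;Z] \approx \iota \approx \frac{1}{2} - t\,\frac{h^T \Delta h}{1 - \langle h \rangle^2}
\;\Rightarrow\;
\frac{dI[Y;Z]}{dt} \approx \frac{d\iota}{dt} \propto
\begin{cases} \mathcal{A} \,; & p(x) \propto 1 \\ \mathcal{N} \,; & p(x) \propto d_x. \end{cases} \tag{4.23}$$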
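Substituting Eq. (2.9) into the time derivative makes the constants of proportionality explicit. The following worked step is a reading of the equations above rather than a statement from the paper; it assumes the priors are normalized, $p(x) = 1/n$ and $p(x) = d_x/2m$, so that $h^T P h = 1$ as used in Eq. (4.22), and that $\langle h \rangle$ denotes the average of $h$ under whichever prior is in use.

```latex
\frac{d\iota}{dt} \;\approx\; -\frac{h^T \Delta h}{1 - \langle h \rangle^2}
= \begin{cases}
-\,n\,\mathcal{A}, & p(x) = 1/n \quad (\langle h \rangle = \bar{h}), \\
-\,2m\,\mathcal{N}, & p(x) = d_x / 2m .
\end{cases}
```

The negative sign reflects that relevance information is lost as diffusion proceeds, so the partition that loses information most slowly is the one with the smallest regularized cut.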
We have shown analytically that, for fast mixing graphs, the heuristics introduced by Shi and Malik are proportional to the rate of loss of relevance information. The error incurred in the approximations $I[Y;Z] \approx \iota$ and $\langle h_x h_{x'} \rangle_{2t} \approx 1 - 2t\, h^T \Delta h$ can be defined as
$$E_1(t) = \left| \frac{I[Y;Z](t) - \iota(t)}{I[Y;Z](t)} \right| \tag{4.24}$$
$$E_0(t) = \left| \frac{\langle h_x h_{x'} \rangle_{2t} - (1 - 2t\, h^T \Delta h)}{\langle h_x h_{x'} \rangle_{2t}} \right|. \tag{4.25}$$
5. NUMERICAL EXPERIMENTS
The validity of the two approximations can be seen in a typical plot of $E_1(t)$ and $E_0(t)$ as a function of normalized diffusion time $\tilde{t} = t/\tau$, for the two different choices of prior distributions over the nodes. $E_1$, as seen in Fig. 1, is often found to be non-monotonic and sometimes exhibits oscillations. This suggests defining $E_\infty$, a modified monotonic ‘$E_1$’:
$$E_\infty(t) \equiv \max_{t' \geq t} E_1(t'). \tag{5.1}$$
We don’t need to define a monotonic form for $E_0$ since this error is always found to be monotonically increasing in time.
By fast-mixing graphs, we mean graphs which become well-mixed in short times, i.e. graphs for which both the long-time and short-time approximations hold simultaneously within a certain range of time $\tilde{t}^*_- \leq \tilde{t} \leq \tilde{t}^*_+$, as illustrated in Fig. 1, where we define
$$E(t) = \max(E_\infty(t), E_0(t)) \tag{5.2}$$
$$E^* = \min_t E(t) \tag{5.3}$$
$$\tilde{t}^*_- = \min\left(\operatorname{argmin}_{\tilde{t}} E(\tilde{t})\right) \tag{5.4}$$
$$\tilde{t}^*_+ = \max\left(\operatorname{argmin}_{\tilde{t}} E(\tilde{t})\right). \tag{5.5}$$
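Given the two error curves sampled on a common time grid, Eqs. (5.1) to (5.5) reduce to a running maximum taken from the right followed by a search for the minimizers. The sketch below is one possible implementation; the function name and the sample values are hypothetical, and exact equality is used to detect ties at the minimum, which may need to be replaced by a tolerance on real data.

```python
def fast_mixing_window(times, e1, e0):
    """Return (E_star, t_minus, t_plus) from errors E1(t), E0(t) sampled at `times`."""
    # Eq. (5.1): E_inf(t) = max over t' >= t of E1(t'), a running maximum from the right
    e_inf = list(e1)
    for i in range(len(e_inf) - 2, -1, -1):
        e_inf[i] = max(e_inf[i], e_inf[i + 1])
    # Eq. (5.2): the combined error is the larger of the two at each time
    e = [max(a, b) for a, b in zip(e_inf, e0)]
    # Eqs. (5.3)-(5.5): minimum error and the extent of the set of minimizers
    e_star = min(e)
    minimizers = [t for t, v in zip(times, e) if v == e_star]
    return e_star, min(minimizers), max(minimizers)


# Hypothetical sampled errors: E1 is non-monotonic, E0 increases with time.
times = [0.1, 0.2, 0.3, 0.4, 0.5]
e1 = [0.9, 0.3, 0.5, 0.2, 0.1]
e0 = [0.05, 0.1, 0.2, 0.6, 0.8]
print(fast_mixing_window(times, e1, e0))
```

In this example the bump in $E_1$ at $t = 0.3$ raises $E_\infty$ at $t = 0.2$ as well, so the minimum of $E$ is attained on a plateau rather than at a single point, and $\tilde{t}^*_-$ and $\tilde{t}^*_+$ differ.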
FIG. 1: $E_1$ and $E_0$ vs normalized diffusion time for two choices of priors over the graph nodes. $E_1$ (red) typically tends to have a non-monotonic behavior which motivates defining a monotonic $E_\infty$ (green).
Note that the use of $E_\infty$ instead of $E_1$ over-estimates the value of $E^*$; the calculated $E^*$ is an upper bound.
Graphs were drawn randomly from a Stochastic Block Model (SBM) distribution [10], with block cardinality 2, to analyze the distribution of $E^*$, $\tilde{t}^*_-$ and $\tilde{t}^*_+$. As is commonly done in community detection [11], for a graph of $n$ nodes, the average degree per node is fixed at $n/4$ for graphs drawn from the SBM distribution: two nodes are connected with probability $p_+$ if they belong to the same block, but with probability $p_- < p_+$ if they belong to different blocks. The two probabilities are, thus, constrained by the relation
$$p_+\left(\frac{n}{2} - 1\right) + p_-\left(\frac{n}{2}\right) = \frac{n}{4} \tag{5.6}$$
leaving only one free parameter $p_-$ that tunes the ‘modularity’ of graphs in the distribution. Starting with a graph drawn from a distribution specified by a $p_-$ value and specifying an initial cluster assignment as given by the SBM distribution, we make local moves (adding or deleting an edge in the graph and/or reassigning a node’s cluster label) and search exhaustively over this move-set for local minima of $E^*$. Fig. 2 compares the values of $E^*$ and $(\tilde{t}^*_-, \tilde{t}^*_+)$ for graphs obtained in this systematic search, starting with a graph drawn from a distribution with $p_- = 0.02$ and $n = \{16, 32, 64\}$. We note that the scatter plots for graphs of different sizes collapse on one another when $E^*$ is plotted against normalized time, confirming the Fiedler value $1/\tau$ to be an appropriate characteristic diffusion time-scale as used in [6]. A plot of $E^*$ against actual diffusion time shows that the scatter plots of graphs of different sizes no longer collapse.
Having shown analytically that for fast mixing graphs, the regularized mincut is approximately the rate of loss of relevance information, it would be instructive to compare the actual partitions that optimize these goals. Graphs of size $n = 32$ were drawn from the SBM distribution with $p_- = \{0.1, 0.12, 0.14, 0.16\}$. Starting with an equal-sized partition specified by the model itself, we performed iterative coordinate descent to search (independently) for the partition that minimized the regularized cut ($h_{cut}$) and one that maximized the relevance information ($h_{inf}(t)$); i.e. we reassigned each node’s cluster label and searched for the reassignment that gave the new lowest value for the cost function being optimized. Plots comparing the partitions $h_{inf}(t)$ and $h_{cut}$, learnt by optimizing the two goals (averaged over 500 graphs drawn from each distribution), are shown in Fig. 3.
FIG. 2: $E^*$ vs $\tilde{t}^*$ for graphs of different sizes and different prior distributions over the graph nodes. In the above plot, $\tilde{t}^*_-$ and $\tilde{t}^*_+$ are represented by · and ◦, respectively.
FIG. 3: $p(h_{inf}(t) \neq h_{cut})$ vs normalized diffusion time, averaged over 500 graphs drawn from a distribution parameterized by a given $p_-$ value, is plotted for different graph distributions.
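The sampling and search procedure described above can be sketched as follows for the regularized-cut objective. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the descent shown is a best-improvement variant over single-node relabelings, and only the normalized cut is implemented (the information-based objective would be passed in as a different `cost` function).

```python
import random


def p_plus_from_constraint(n, p_minus):
    """Solve p_plus * (n/2 - 1) + p_minus * (n/2) = n/4, Eq. (5.6), for p_plus."""
    return (n / 4 - p_minus * n / 2) / (n / 2 - 1)


def sample_sbm(n, p_minus, seed=None):
    """Draw a two-block SBM graph; returns (adjacency matrix, planted labels h)."""
    rng = random.Random(seed)
    p_plus = p_plus_from_constraint(n, p_minus)
    h = [1] * (n // 2) + [-1] * (n - n // 2)
    A = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            if rng.random() < (p_plus if h[x] == h[y] else p_minus):
                A[x][y] = A[y][x] = 1
    return A, h


def normalized_cut(A, h):
    """Normalized cut of a bisection; infinite if one side has zero total degree."""
    n = len(A)
    degree = [sum(row) for row in A]
    cut = sum(A[x][y] for x in range(n) for y in range(x + 1, n) if h[x] != h[y])
    vol_plus = sum(d for d, v in zip(degree, h) if v > 0)
    vol_minus = sum(d for d, v in zip(degree, h) if v < 0)
    if vol_plus == 0 or vol_minus == 0:
        return float("inf")
    return cut / vol_plus + cut / vol_minus


def coordinate_descent(A, h, cost):
    """Repeatedly apply the single-node relabeling that lowers `cost` the most."""
    h = list(h)
    best = cost(A, h)
    while True:
        trials = []
        for x in range(len(h)):
            h[x] = -h[x]                  # try relabeling node x
            trials.append((cost(A, h), x))
            h[x] = -h[x]                  # undo the trial move
        value, x = min(trials)
        if value >= best:                 # no relabeling improves the cost
            return h, best
        h[x], best = -h[x], value


A, h_model = sample_sbm(32, p_minus=0.1, seed=0)
h_cut, cut_value = coordinate_descent(A, h_model, normalized_cut)
print(cut_value, sum(a != b for a, b in zip(h_cut, h_model)))
```

Because every accepted move strictly lowers the cost and the number of bisections is finite, the loop terminates at a local minimum with respect to single-node relabelings.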
6. CONCLUDING REMARKS
We have shown that the normalized cut and average cut, introduced by Shi and Malik as useful heuristics to be minimized when partitioning graphs, are well approximated by the rate of loss of predictive information for fast-mixing graphs. Deriving these cut-based cost functions from rate-distortion theory gives them a more principled setting, makes them interpretable, and facilitates generalization to appropriate cut-based cost functions in new problem settings. We have also shown (see Fig. 2) that the inverse Fiedler value is an appropriate normalization for diffusion time, justifying its use in [6] to capture long-time behaviors on the network.
Absent from this manuscript is a discussion of how not to overpartition a graph, i.e. a criterion for selecting K. It is hoped that by showing how these heuristics can be derived from a more general problem setting, lessons learnt by investigating stability, cross-validation or other approaches may benefit those using min-cut based approaches as well. Similarly, by showing how these heuristics approximate cost functions from a separate optimization problem, it is hoped that algorithms employed for rate distortion theory, e.g. Blahut-Arimoto, may be brought to bear on min-cut minimization.
APPENDIX
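Using the definition of $\Delta$, for any general vector $f$ over the graph nodes, we have
$$\begin{aligned}
f^T \Delta f &= f^T D f - f^T A f \\
&= \sum_x d_x f_x^2 - \sum_{x,y=1}^{n} f_x f_y A_{xy} \\
&= \sum_x \left( \sum_{y=1}^{n} A_{xy} \right) f_x^2 - \sum_{x,y=1}^{n} f_x f_y A_{xy} \\
&= \frac{1}{2} \left( \sum_{x,y=1}^{n} f_x^2 A_{xy} - 2 \sum_{x,y=1}^{n} f_x f_y A_{xy} + \sum_{x,y=1}^{n} f_y^2 A_{xy} \right) \\
&= \frac{1}{2} \sum_{x,y=1}^{n} A_{xy} (f_x - f_y)^2.
\end{aligned} \tag{A.1}$$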
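Identity (A.1) holds for any real vector over the nodes, not only for indicator vectors, and this is easy to confirm numerically. The sketch below compares both sides for an arbitrary test vector; the function name, the toy graph and the vector are assumptions chosen for illustration.

```python
def laplacian_quadratic_forms(A, f):
    """Return both sides of Eq. (A.1): f^T (D - A) f and (1/2) sum A_xy (f_x - f_y)^2."""
    n = len(A)
    degree = [sum(row) for row in A]
    lhs = sum(degree[x] * f[x] ** 2 for x in range(n)) - sum(
        f[x] * f[y] * A[x][y] for x in range(n) for y in range(n))
    rhs = 0.5 * sum(A[x][y] * (f[x] - f[y]) ** 2 for x in range(n) for y in range(n))
    return lhs, rhs


# Toy graph (an assumption for illustration): two triangles joined by the edge (2, 3).
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
f = [0.5, -1.0, 2.0, 0.0, 3.0, -2.5]   # arbitrary real-valued test vector
print(laplacian_quadratic_forms(A, f))
```

The double sum on the right-hand side runs over ordered pairs, so each edge contributes twice, which is what the factor of one half compensates for.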
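Now, when $f = h$, we have
$$h^T \Delta h = \frac{1}{2} \sum_{h_x \times h_y = -1} 4 A_{xy} = 4 \times c. \tag{A.2}$$
The factor $\frac{1}{2}$ disappears because summation over all nodes counts each adjacent pair of nodes twice.
Using the definitions of $\mathcal{A}$ and $\mathcal{N}$, we have
$$\begin{aligned}
\mathcal{A} &= c \times \left( \frac{1}{\sum_{h_x=+1} 1} + \frac{1}{\sum_{h_x=-1} 1} \right) \\
&= c \times \left( \frac{1}{\sum_x \frac{1+h_x}{2}} + \frac{1}{\sum_x \frac{1-h_x}{2}} \right) \\
&= 2c \times \frac{\sum_x (1 - h_x + 1 + h_x)}{\sum_x (1 + h_x)\, \sum_x (1 - h_x)} \\
&= 2c \times \frac{2n}{\left(n + \sum_x h_x\right)\left(n - \sum_x h_x\right)} \\
&= 2c \times \frac{2}{n(1 + \bar{h})(1 - \bar{h})} \\
&= \frac{4}{n}\,\frac{c}{1 - \bar{h}^2}.
\end{aligned} \tag{A.3}$$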
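$$\begin{aligned}
\mathcal{N} &= c \times \left( \frac{1}{\sum_{h_x=+1} d_x} + \frac{1}{\sum_{h_x=-1} d_x} \right) \\
&= c \times \left( \frac{1}{\sum_x d_x \frac{1+h_x}{2}} + \frac{1}{\sum_x d_x \frac{1-h_x}{2}} \right) \\
&= 2c \times \frac{\sum_x d_x (1 - h_x + 1 + h_x)}{\sum_x d_x (1 + h_x)\, \sum_x d_x (1 - h_x)} \\
&= 2c \times \frac{4m}{\left(2m + \sum_x h_x d_x\right)\left(2m - \sum_x h_x d_x\right)} \\
&= 2c \times \frac{1}{m(1 + \langle h \rangle)(1 - \langle h \rangle)} \\
&= \frac{2}{m}\,\frac{c}{1 - \langle h \rangle^2}.
\end{aligned} \tag{A.4}$$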
[1] J Shi and J Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Jan 2000.
[2] N Tishby, F C Pereira, and W Bialek. The information bottleneck method. arXiv preprint physics, Jan 2000.
[3] N Tishby and N Slonim. Data clustering by markovian relaxation and the information bottleneck method. Advances in Neural Information Processing Systems, Jan 2000.
[4] C Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), Jan 2001.
[5] N Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University of Jerusalem, 2002.
[6] E Ziv, M Middendorf, and C H Wiggins. Information-theoretic approach to network modularity. Physical Review E, Jan 2005.
[7] U von Luxburg. A tutorial on spectral clustering. arXiv, cs.DS, Nov 2007.
[8] M Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 1973.
[9] Dorothea Wagner and Frank Wagner. Between min cut and graph bisection. In MFCS ’93: Proceedings of the 18th International Symposium on Mathematical Foundations of Computer Science, pages 744–750, London, UK, 1993. Springer-Verlag.
[10] P W Holland and S Leinhardt. Local structure in social networks. Sociological Methodology, Jan 1976.
[11] L Danon, A Diaz-Guilera, J Duch, and A Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, Jan 2005.
[12] We chose 16-node graphs so the network and its partitions could be parsed visually with ease.
[13] We use the shorthand x ∼ y to mean x is adjacent to y.
[14] Strictly speaking, any diagonal matrix P that we specify determines the steady-state distribution. Since we are modeling the distribution of random walkers at statistical equilibrium, we always use this distribution as our initial or prior distribution.