Predicting graph labellings in online learning
MSc Machine Learning
Giulio Filippi
Supervisor : Prof. Mark Herbster
This report is submitted as part requirement for the MSc in Machine Learning at University
College London. It is substantially the result of my own work except where explicitly indicated in
the text. The report may be freely copied and distributed provided the source is explicitly
acknowledged.
Department of Computer Science
UCL
August 2020
Abstract
The goal of this master's project is to study the prediction of binary graph labellings in online fashion. The problem is formulated in the online learning framework as follows: we are sequentially asked to predict the labelling (±1) of a vertex v; if the prediction is incorrect we incur one mistake; we are then revealed the true labelling of the vertex and can use this information to change the prediction function for future vertices. The goal is to minimize the cumulative number of mistakes. It is clear that if nature is adversarial we will make a mistake on each trial; the assumption is that the labels are distributed on the graph in a way that makes sense topologically, meaning that most often neighbouring vertices will have the same label, or equivalently that the cut of the labelling is small. We specifically focus on two types of prediction algorithms: the first is prediction via p-Laplacian interpolation, and the second is prediction with the Ising model marginals.
We rst study the interplay between connectivity and geodesic distance for both these al-
gorithms, as a function of graph structure and the parameter of the model (pfor p-Laplacian
interpolation and temperature Tfor Ising). We nd a graph where p-Laplacian interpolation
with p=is not geodesic minimising (meaning it will not predict with the label of a nearest
labelled neighbour) disproving a conjecture made prior. In [10], Herbster and Pontil show that
there exists a graph on nvertices for which Lapalcian interpolation incurs nmistakes even
though the cut is 1, we prove an analogous result in the setting of p-Laplacian interpolation.
We prove a formula for reducing a path of arbitrary length in the Ising model, generalising
a formula found in Georey Grimmett’s book on the random cluster model [8], and use it to
prove a heuristic for the Ising prediction to be geodesic minimising in the innite temperature
limit.
Next we study the Ising prediction algorithm from the point of view of the halving algorithm. We prove a bound on the number of mistakes the algorithm can make, which depends on the partition function of the Ising model on the given graph. We also show an equivalence between the partition function of the Ising model and the Tutte polynomial, which is a fundamental object in graph theory (following the method of [2]). We use this formula to find an explicit bound on trees for an optimally tuned temperature and show this bound is essentially optimal on paths. We also study how the bound can vary as we add or remove edges from a given graph. We finally state and explore a link between the Ising partition function and a network reliability polynomial, giving a probabilistic interpretation of the latter.
Contents

1 Introduction
1.1 Overview
1.2 Notation and conventions
1.3 Motivations
1.4 p-Laplacian interpolation and prediction
1.5 Ising distribution and prediction
1.6 Distance between nodes from p-resistances and Ising distribution
1.6.1 p-resistances
1.6.2 Ising distances
1.6.3 Example
1.7 Phase transitions in p-resistances
1.8 Laplacian interpolation isn't perfect (Octopus graphs)
1.9 Important properties of algorithms
2 Connectivity versus geodesic distance
2.1 Introduction
2.2 Preliminaries
2.3 Geodesic minimising counterexample for ∞-Laplacian
2.4 Case study: p-Laplacian vs Ising
2.4.1 p-resistance reductions and p-Laplacian case study
2.4.2 Octopus graphs revisited in p-Laplacian interpolation setting
2.4.3 Ising reductions
2.4.4 Ising case study
2.5 Geodesic minimising heuristic for Ising as T → ∞
3 Understanding the algorithms with guarantees
3.1 Halving bound for Ising Model
3.2 Reformulating partition function as Tutte polynomial
3.3 Working out the bound for trees
3.4 Bound is tight for path
3.5 Clique
3.6 Bound variations as we add or remove edges
3.7 Exploring link between Ising partition function and network reliability
4 Conclusion
5 Appendix A: max-norm interpolation on a graph
6 Appendix B: proof of Theorem 9
1 Introduction
1.1 Overview
In this section we provide motivation, definitions and discussion of some related literature that will be useful for the rest of the thesis. We discuss some of the motivations for graph based semi-supervised learning and define the two algorithms that will provide the focus of the thesis: p-Laplacian interpolation and predicting with Ising marginals. We also discuss three important properties that are shared by these two algorithms: Markov, label monotone and permutation independence.
Section 2 will study the interplay between connectivity and geodesic distance. We better understand the behaviour of p-Laplacian interpolation at p = ∞: in particular we show that the prediction is not always geodesic minimising (i.e. it does not always predict with the label of the nearest labelled neighbour). We find a family of graphs on which we can force a certain number of mistakes with p-Laplacian interpolation, extending the result for Laplacian interpolation of [10] by Herbster and Pontil. We prove a formula for reducing a path of arbitrary length in the Ising model, generalising a formula found in Geoffrey Grimmett's book on the random cluster model [8]. This allows us to prove that in the infinite temperature limit, in the simplified scenario of a star graph, predicting with Ising is geodesic minimising.

Section 3 will study the Ising prediction from the point of view of the halving algorithm. A bound on the number of mistakes the algorithm can make is proven, which depends on the cut of the true labelling and the partition function of the Ising model. We show an equivalence between the Ising partition function and the Tutte polynomial, which is an active research area at the intersection of graph theory and statistical physics. We use that link to find a formula for the bound on trees and prove it essentially optimal on paths. We then discuss how the bound might change as we increase or decrease connectivity. We also discuss a link between the Ising partition function and a network reliability polynomial, providing a probabilistic interpretation of the Ising partition function that could provide insight into its behaviour.
1.2 Notation and conventions
G will often denote an arbitrary network graph, E(G) is the set of edges of G and V(G) is the set of vertices of G. We will always colour the nodes labelled −1 red and the nodes labelled +1 blue. We will often write |E(G)| = m and |V(G)| = n. A binary labelling u on G is a vector u ∈ {−1, +1}^{|V(G)|}. We denote the cut of a binary labelling u on G as φ_G(u) = (1/2) Σ_{e∈E(G)} |u(e⁺) − u(e⁻)|, where e⁺ and e⁻ denote the two endpoints of edge e. Distances on a graph G are denoted d_G(i, j), and G is omitted if clear from context.
1.3 Motivations
Methods for graph labelling have proven useful and have been used extensively in semi-supervised learning: see for instance [3] or [6]. Since many datasets have large amounts of unlabelled data and small amounts of labelled data, these methods are all the more relevant. We usually proceed by building a k-nearest-neighbour graph or an ϵ-ball graph from the dataset, and use the topological structure of the graph to predict unlabelled data. The idea behind these methods is that the graph holds information about the labels, and this inductive bias is represented mathematically by the assumption that the cut of the labelling isn't too large. Graph labelling methods have been used in survey sampling, information retrieval, pattern recognition, etc.
A common example used to motivate graph labelling methods is the two moons example. In the two moons example we have data in the shape of two moons intertwined with each other, with one labelled data point from each class at opposite ends of the moons. Figure 1 shows that many natural classification algorithms will misinterpret this dataset. On the other hand, building a graph and predicting with a graph labelling method can result in the correct solution.
Figure 1: Two moons toy example (image from [4])
In [4], Belkin and Niyogi lay down the theory behind approximating manifolds using Laplacians.
The idea is that if we wish to predict data that lies in a lower dimensional manifold, then using
topological graphs can help encapsulate the structure in the data.
1.4 p-Laplacian interpolation and prediction
One approach to graph labelling is to use the p-Laplacian as a regularizer. The theoretical properties of graph labellings based on minimum p-Laplacian interpolation were studied by Herbster and Lever in [9]. When p = 2 the theory of p-resistances reduces to electrical network theory and the interpolation is called harmonic energy minimisation [14].

Let edge ij have weight A_ij; we often use r_ij = 1/A_ij to denote the inverse edge weight. Then p-Laplacian interpolation works by minimising the seminorm below (which can be seen as an energy)

$$\|u\|_{G,p} = \Big(\sum_{ij \in E(G)} \frac{|u_i - u_j|^p}{r_{ij}}\Big)^{1/p}$$

subject to the constraints on the already observed nodes. We then predict the label of vertex i using the sign of u_i,

$$\text{pred} = \operatorname{sign}(u_i)$$
For p-Laplacian interpolation, there is a second type of prediction that is useful to consider. We will call it a forward prediction,

$$\text{pred} = \operatorname{argmin}_{y \in \{\pm 1\}} \; \min_{u : u_i = y} \|u\|_{G,p}$$

As we will see in the later section on connectivity, as p → ∞ the first type of prediction is not necessarily geodesic minimizing. However, for the second type of prediction, as p → ∞ the prediction is geodesic minimizing or ill defined, as we prove in appendix A under the section corollaries.
1.5 Ising distribution and prediction
In [11], Herbster, Pasteris and Gosh study prediction with the marginal of the Ising distribution, in the limit of 0 temperature. Since the computation of the marginals of the Ising distribution is #P-hard, the above paper uses heuristics to design a tractable prediction algorithm with a guarantee. On the other hand, in this master's project we study the algorithm for its theoretical properties, hence we do not worry about the fact that it is intractable and consider the full temperature parameter range.
We dene below the Ising distribution on a graph Gwith inverse temperature parameter β=1
T.
pG(β;u) = 1
ZeβϕG(u)(1)
In some cases we will need a more general denition of the Ising distribution allowing for dierent
coupling strengths J(e)on dierent edges. In that case the Ising distribution is
pG(β;u) = 1
Zeβe=ij J(e)δui,uj
Where we use the delta function notation δui,uj= 1 if ui=ujand 0 otherwise. If J(e) = 1 for all
edges, then this reduces to the denition we gave previously. Indeed
pG(β;u) = 1
Zeβe=ij δui,uj=1
Zeβ(nϕ(u)) =1
Zeβϕ(u)
We denote Z(β, G)the partition function of the above model, so
Z(β, G) =
u
pG(β;u)
The Ising model will give more mass to labellings with low cut, and how much so will depend on the temperature. When predicting a given vertex i with the Ising distribution (given some observed data), we predict with the marginal of the Ising distribution as follows

$$\text{pred} = \operatorname{argmax}_{y \in \{\pm 1\}} \; p(u_i = y \mid \text{data})$$

If we were instead to predict with the maximum likelihood estimator of u,

$$\hat{u}_{ML} = \operatorname{argmax}_u \; p(\text{data} \mid u)$$

we would always predict with a label consistent min-cut, for any value of the parameter. This would not lead to an interesting prediction.
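For very small graphs the marginal can be computed exactly by brute force. The sketch below (our own illustrative helper) does exactly that; it is exponential in the number of vertices, in line with the #P-hardness noted above.

```python
import itertools
import numpy as np

def ising_marginal_predict(edges, n, labelled, beta):
    """Predict vertex labels with the argmax of the Ising marginal,
    p_G(beta; u) proportional to exp(-beta * cut(u)), conditioned on the observed
    labels. Brute force over all 2^n labellings, so only usable for small n."""
    mass_plus = np.zeros(n)   # unnormalised mass of u_i = +1 given the data
    mass_minus = np.zeros(n)
    for u in itertools.product([-1, +1], repeat=n):
        if any(u[v] != y for v, y in labelled.items()):
            continue  # inconsistent with the observed data
        cut = sum(1 for i, j in edges if u[i] != u[j])
        w = np.exp(-beta * cut)
        for i in range(n):
            if u[i] == +1:
                mass_plus[i] += w
            else:
                mass_minus[i] += w
    return np.where(mass_plus >= mass_minus, +1, -1)

# Example: a 4-cycle with one vertex of each label observed.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(ising_marginal_predict(edges, 4, {0: +1, 2: -1}, beta=1.0))
```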
1.6 Distance between nodes from p-resistances and Ising distribution
1.6.1 p-resistances
In the p-Laplacian scenario, there is a natural distance measure between nodes given by the p-resistance, which is shown to be a metric in [12]. We will use Herbster's definition of p-resistance

$$r_p(s,t)^{-1} = \min_u \{ \|u\|^p_{G,p} : u_s - u_t = 1 \}$$

As is seen from Herbster's definition of p-resistance, we are minimising a p-seminorm over unit potential differences between s and t. It turns out that we can also think of p-resistances in terms of flows. A unit flow from s to t on a graph G is a function i : E → R on the edges of G satisfying

$$\sum_{e \ni x} i(e) = 0 \quad \text{for all } x \neq s, t$$

and

$$\sum_{e \ni s} i(e) = -\sum_{e \ni t} i(e) = 1$$

Herbster's definition of p-resistance reformulated as a flow minimisation problem would be (see [1])

$$r_p(s,t) = \min_i \Big\{ \sum_{e \in E(G)} r_e^{\frac{1}{p-1}} |i(e)|^{1 + \frac{1}{p-1}} : i \text{ unit flow from } s \text{ to } t \Big\}$$

Figure 2 shows the flows that p-resistances give rise to for varying p. We observe that for larger p the minimising flow is concentrated over shorter paths, and for smaller p the minimising flow can use more paths.
Figure 2: p-resistance induced ows on grid with increasing pfrom left to right (image from [1])
1.6.2 Ising distances
In the case of the Ising model we must think of how to define a distance between nodes on the graph. One fairly natural definition is the probability under the Ising model (with no constraints) that the two nodes have the same label. This distance function will capture both the traditional distance and connectivity information.

$$d(s,t) = p_\beta(u_s = u_t)$$

If we have reduced the graph to a single edge between s and t with a coupling strength J(s,t), we can rewrite the above as

$$p_\beta(u_s = u_t) = p_\beta(u_s = u_t = 1) + p_\beta(u_s = u_t = -1) = \frac{2}{2 + 2e^{-\beta J(s,t)}}$$
1.6.3 Example
To better understand the behaviour of these two distance functions we will study a simple example.
Figure 3: Example graph: an edge from s to an intermediate vertex, which is joined to t by two parallel paths of length two

Using series and parallel laws for p-resistances [9], we get a formula for the p-resistance between s and t,

$$r_p(s,t) = \big(1 + 2^{\frac{p-2}{p-1}}\big)^{p-1}$$

Similarly we can use series and parallel laws for the coupling strength in the Ising model to find the effective coupling strength between s and t (see section 2.4.3 for a discussion of these reductions). After some computation

$$J(s,t) = \frac{1}{\beta} \log \frac{1 + 2e^{-2\beta} + 4e^{-3\beta} + e^{-4\beta}}{e^{-\beta} + 4e^{-2\beta} + 2e^{-3\beta} + e^{-5\beta}}$$

Then the distance is

$$p_\beta(u_s = u_t) = \frac{2}{2 + 2e^{-\beta J(s,t)}} = \frac{1 + 2e^{-2\beta} + 4e^{-3\beta} + e^{-4\beta}}{1 + e^{-\beta} + 6e^{-2\beta} + 6e^{-3\beta} + e^{-4\beta} + e^{-5\beta}}$$
Figure 4: The red curve is r_p(s,t) as a function of p. The blue curve is p_β(u_s = u_t) as a function of β.

We have a few observations. The first is that lim_{p→1} r_p(s,t) = 1 = 1/mincut(s,t), as discussed in section 2.2. The second observation is that lim_{p→∞} r_p(s,t) = ∞, however lim_{p→∞} r_p(s,t)^{1/p} = 3 = d_G(s,t), as discussed in 2.2. The final observation is that lim_{β→∞} p_β(u_s = u_t) = 1; this makes sense because in the 0 temperature limit only labellings with minimum cut will have mass.
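A short numerical check of the two closed forms above and of the stated limits (a sketch; the functions simply evaluate the formulas derived in this example):

```python
import numpy as np

def r_p(p):
    # p-resistance between s and t in the example graph of Figure 3
    return (1 + 2 ** ((p - 2) / (p - 1))) ** (p - 1)

def p_same(beta):
    # probability that u_s = u_t under the unconstrained Ising model
    e = np.exp(-beta)
    num = 1 + 2 * e**2 + 4 * e**3 + e**4
    den = 1 + e + 6 * e**2 + 6 * e**3 + e**4 + e**5
    return num / den

print(r_p(1.001))             # close to 1  (= 1/mincut(s,t))
print(r_p(50.0) ** (1 / 50))  # close to 3  (= d_G(s,t))
print(p_same(20.0))           # close to 1  (zero temperature limit)
print(p_same(1e-6))           # close to 0.5 (infinite temperature limit)
```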
1.7 Phase transitions in p-resistances
In [1], Alamgir and von Luxburg showed that a phase transition occurs in p-resistances (in many random graph models and in the limit of infinite data). In that paper, the resistance distance R_p(s,t) is separated into a local and a global contribution: the local contribution corresponds to the part of the resistance coming from a neighbourhood of s and t, and the global contribution comes from anything outside this neighbourhood. The interesting result is that for a certain range of the parameter p the global contribution is negligible, meaning that the distance function does not take into account global properties of the graph and hence is not interesting. In particular the paper shows the p-resistance is governed by meaningful global properties as long as p < p* = 1 + 1/(d−1), whereas it converges to a trivial local quantity if p > p** = 1 + 1/(d−2) (d is the dimension of the underlying space).

It is important to note that Alamgir and von Luxburg use a different definition of p-resistance than the one due to Herbster (that we use throughout this MSc project). Alamgir and von Luxburg use the following

$$R_p(s,t) = \min_i \Big\{ \sum_{e} r_e |i(e)|^p : i \text{ unit flow from } s \text{ to } t \Big\}$$

On the other hand Herbster's p-resistance as a flow minimisation problem is

$$r_p(s,t) = \min_i \Big\{ \sum_{e \in E(G)} r_e^{\frac{1}{p-1}} |i(e)|^{1 + \frac{1}{p-1}} : i \text{ unit flow from } s \text{ to } t \Big\}$$

Hence we see that there are two differences between Alamgir's and Herbster's definitions. The first difference is that Alamgir's p and Herbster's p are related by p_Alamgir − 1 = 1/(p_Herbster − 1). The second is that in Herbster's definition the edge weights have a power of 1/(p−1) while in Alamgir's they don't. The main point to note is that increasing Herbster's p to infinity is equivalent to letting Alamgir and von Luxburg's p tend to 1, and vice versa.
1.8 Laplacian interpolation isn’t perfect (Octopus graphs)
In [10], Herbster and Pontil show that for a certain type of graph on n vertices (with a cut of 1) and a given trial sequence, 2-Laplacian interpolation incurs Θ(√n) mistakes. Ideally we would like bounds that grow with log n, so this is an important limitation.

An octopus graph O_{1,r,s} is defined to be r path graphs (the tentacles) of length s, all adjoined at a common end vertex (which we call the central vertex), to which a further single head vertex is attached, so that n = |V| = rs + 2. We label the octopus graph as follows. The head vertex is labelled +1 (blue) and all the tentacles end in a −1 (red).

Figure 5: O_{1,2,2} graph

In [10], Herbster and Pontil show that for Laplacian interpolation (the harmonic solution) and a specific trial sequence, each tentacle of O_{1,d,d} results in one mistake, so at least d mistakes in total.

In section 2.4.2, we generalise the above to the setting of p-Laplacian interpolation predictions. In particular we show that as long as p ≥ 1 + log r / log s, the octopus graph O_{1,r,s} with a head labelled +1 and each tentacle ending in a −1 (queried in the order: head, and then ends of tentacles) will make r mistakes. By choosing r = s = d we recover the result of [10].
1.9 Important properties of algorithms
We are interested in three properties of prediction algorithms:

Markov: the label of a node depends only on the labelling of a Markov blanket of the node.

Permutation independent: it does not matter in which order the previous nodes were labelled for the labelling of a given node.

Label monotone: labelling a node +1 will only increase the probability of subsequent nodes being labelled +1, meaning either the label doesn't change, or if it does it changes to +1.

These three natural properties are shared by both p-Laplacian interpolation and prediction with the Ising marginals, and hence give rise to similarities between the two prediction algorithms. Any algorithm sharing these properties is called regular, and in [11] a bound is proven for any regular algorithm. Below we provide proofs of some of these properties for both algorithms.
Ising is Markov.

If we are given the labels of a Markov blanket of some node x, then the Ising distribution can be factored into an inner distribution over nodes within the Markov blanket and an outer distribution over nodes outside the Markov blanket,

$$p(u) = p_{\text{inner}}(U_{\text{inner}}) \, p_{\text{outer}}(U_{\text{outer}})$$

The outer nodes then have no impact on the marginal at node x, as they can be marginalised out of the expression,

$$p(u_x) \propto \sum_{U_{\text{inner}} \setminus \{u_x\}} p_{\text{inner}}(U_{\text{inner}})$$

This depends only on the labelling of the Markov blanket. So the Ising prediction is Markov.
Ising is permutation independent.

The Ising prediction is as follows

$$\text{pred} = \operatorname{argmax}_{y \in \{\pm 1\}} \; p(u_i = y \mid \text{data})$$

The ordering of the data doesn't matter for the final prediction. So Ising is permutation independent.
p-Laplacian interpolation is Markov.

p-Laplacian interpolation works by minimising

$$\|u\|_{G,p} = \Big(\sum_{ij \in E(G)} \frac{|u_i - u_j|^p}{r_{ij}}\Big)^{1/p}$$

subject to constraints. If we are given the labels of a Markov blanket of some node x, then the minimisation problem can be split in two: one inner minimisation problem concerning everything within the Markov blanket, and one outer minimisation problem concerning everything outside the Markov blanket,

$$\|u\|_{G,p} = \Big(\sum_{ij \in \text{inner edges}} \frac{|u_i - u_j|^p}{r_{ij}} + \sum_{ij \in \text{outer edges}} \frac{|u_i - u_j|^p}{r_{ij}}\Big)^{1/p}$$

Since the two sums don't interact, they can be minimised independently, hence the sum over outer edges is irrelevant to the minimisation at node x. So the label at node x depends only on the labelling of the Markov blanket.
p-Laplacian interpolation is permutation independent.

p-Laplacian interpolation works by minimising

$$\|u\|_{G,p} = \Big(\sum_{ij \in E(G)} \frac{|u_i - u_j|^p}{r_{ij}}\Big)^{1/p}$$

subject to the constraints on the already observed nodes. Changing the order of the observed nodes does not change the minimisation problem. So p-Laplacian interpolation is permutation independent.

We are missing the proofs that Ising and p-Laplacian interpolation are label monotone. Intuitively it makes sense, as labelling a node +1 should increase the positive influence on other nodes, making them more likely to be labelled +1. However it is necessary to provide formal proofs, so these two properties remain conjectures.
2 Connectivity versus geodesic distance
2.1 Introduction
We are interested in how the algorithms behave at the extremes of the parameter spectrum, that is T → 0, ∞ for Ising and p → 1, ∞ for p-Laplacian interpolation. We also study the trade-off between connectivity and geodesic distance: as we run through the parameter spectrum, either connectivity or geodesic distance will be the overwhelming factor, and we aim to understand when the shift happens. As p → 1 and T → 0, it seems that p-Laplacian interpolation and prediction with the Ising distribution become min-cut preserving algorithms, so the prediction is based on connectivity. On the other hand as p → ∞ and T → ∞, it seems that both algorithms become geodesic minimising, so a geodesic distance based prediction. In this section we discuss this in more detail and prove/disprove some of the conjectures made prior.
2.2 Preliminaries
The T → 0 prediction of the Ising model is defined as follows. For each T > 0, we define u_i(T) to be the prediction of the Ising model with parameter T; the T → 0 prediction is

$$u_i = \lim_{T \to 0} u_i(T)$$

In [11], this is shown to be well defined, and important properties of this type of prediction are proven (for instance optimality on trees). The fact that this type of prediction is min-cut preserving in the 0-temperature limit is a simple consequence of the density of the Ising model,

$$p_G(T; u) \propto e^{-\frac{1}{T}\phi_G(u)}$$

As T → 0, labellings with minimal cut will start carrying all the mass, hence a label consistent with a minimal cut labelling will be predicted.

In analogy to the above, we wish to define the p → 1 p-Laplacian interpolation prediction. For each p > 1 define w_i(p) to be the prediction of p-Laplacian interpolation with parameter p; the p → 1 prediction is

$$w_i = \lim_{p \to 1} w_i(p)$$

The above is not yet shown to be well defined or useful. Instead it is proven (for instance in [9]) that for any two nodes s and t on a graph G, we have the following

$$\lim_{p \to 1} r_p(s,t) = \frac{1}{\text{mincut}(s,t)}$$

where mincut(s,t) is the minimum size of an edge set that cuts s from t.

It is also unclear whether it makes sense to talk about the p → ∞ prediction for p-Laplacian interpolation, as this is not shown to be well defined or useful. However it is shown ([9]) that

$$\lim_{p \to \infty} r_p(s,t)^{\frac{1}{p}} = d(s,t)$$

where d(s,t) denotes the shortest path distance between s and t.

The above leads to the conjecture that p-Laplacian interpolation becomes geodesic minimising as p → ∞. We show in 2.3 that this isn't quite true in general by giving a counterexample. However, it can still be considered as a heuristic. Indeed we show in appendix A that for the forward type of prediction for p-Laplacian interpolation (described in 1.4), the algorithm is indeed geodesic minimising or ill defined.

As T → ∞, we only managed to prove a heuristic for the geodesic minimising property by restricting our study to star graphs, that is graphs that are composed of a central unlabelled vertex and a certain number of paths of varying lengths emanating from this vertex (see 2.5). It remains to prove or disprove that as T → ∞ the Ising prediction is geodesic minimising on a general graph.
2.3 Geodesic minimising counterexample for ∞-Laplacian

We first define what we mean by ∞-Laplacian. The p-norm of a vector x is given by

$$\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{\frac{1}{p}}$$

When p → ∞ the p-norm becomes the max norm,

$$\|x\|_\infty = \max_i |x_i|$$

Max norm interpolation consists in finding a labelling that minimizes the max norm on a graph,

$$u^* = \operatorname{argmin}_u \max_{ij \in E(G)} |u_i - u_j| \qquad (2)$$

subject to the constraints on the already labelled nodes.

We note that the max norm minimization is usually not unique, as the objective is convex but not strictly convex. We solve this by adding a condition to the minimisation, which we will refer to as the labelling being natural, meaning all changes are as gradual as they can be; mathematically we want

$$\sum_{ij \in E(G)} |u_i - u_j|^2$$

to be minimal amongst all labellings solving (2).

We found an example of a graph where the most natural max-norm interpolating labelling does not satisfy the geodesic minimising property. Furthermore we claim that this example is minimal in terms of number of vertices and edges (this is proven, but we do not provide the proof in this write-up for simplicity).
Figure 6: Example graph; labels are ±1 and the vertex we are trying to label is denoted ?
Below we describe the most natural max norm interpolating labelling of this graph, showing that the unknown vertex denoted (?) is labelled negative even though it is closer to a +1 than to a −1. To understand why this is the interpolating labelling, one can read appendix A. The quick intuition for how the labelling below is obtained is that to minimize the max norm on the graph, we need to make any change as gradual as possible: this means we first interpolate along the shortest path between a −1 and a +1, and then interpolate again between the central node and the second −1. The method described very briefly here is shown in appendix A to provably provide a max norm interpolating labelling.
Figure 7: Counterexample, showing the most natural max norm interpolating labelling; the queried vertex receives a negative value even though it is geodesically closer to the +1
A natural question that arises is whether it makes sense to speak of ∞-Laplacian interpolation as a limit of p-Laplacian interpolation. Clearly ∞-Laplacian interpolation does not have a unique solution and hence is ill defined as stated. However the solution we choose is arguably the most natural, since each change is as gradual as it can be. Nevertheless it is still unclear whether a formal limit can be defined for the solution of p-Laplacian interpolation as p → ∞; this remains an interesting open question.
2.4 Case study: p-Laplacian vs Ising
We will study the interplay between connectivity and geodesic distance in a specific case for both Ising and p-Laplacian interpolation. The graph we will consider is shown in Figure 8. It consists of one edge between vertex 0 and vertex 1, and then r paths of length s between vertex 1 and vertex 2. Vertex 0 is labelled +1 (blue), and vertex 2 is labelled −1 (red).

Figure 8: One edge between vertex 0 and vertex 1, and r paths of length s in parallel between vertex 1 and vertex 2

The question is what the label of vertex 1 is as a function of the parameters r, s, p for p-Laplacian and r, s, β for Ising. We note that to minimise the cut, the central vertex should be red; on the other hand, to be geodesic distance minimising, the central vertex should be blue. Hence this is a perfect simplified scenario for studying the trade-off between connectivity and geodesic distance.
2.4.1 p-resistance reductions and p-Laplacian case study
As proved in [9] by Herbster and Lever, p-resistances satisfy series and parallel laws. The series law for a path of s edges with resistances π_{i,i+1} is

$$R_p = \Big(\sum_{i=0}^{s-1} \pi_{i,i+1}^{\frac{1}{p-1}}\Big)^{p-1}$$

and for r edges in parallel with resistances π_k we have

$$R_p = \Big(\sum_{k=1}^{r} \frac{1}{\pi_k}\Big)^{-1}$$

So the paths of length s (with unit resistances) have resistance

$$R = s^{p-1}$$

and we have r of them in parallel, so the total effective resistance between vertex 1 and vertex 2 is

$$R_{\text{eff}}(1,2) = \frac{s^{p-1}}{r}$$

So the effective conductance is

$$C_{\text{eff}}(1,2) = \frac{r}{s^{p-1}}$$

It follows that the labelling at node 1 will be of the sign of the node having closest geodesic distance (we refer to this as geodesic distance 'wins') iff

$$C_{\text{eff}}(1,2) \leq 1$$

or

$$p \geq 1 + \frac{\log r}{\log s}$$

In the next section we use the above analysis to prove a result about the number of mistakes p-Laplacian interpolation makes on Octopus graphs.
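The condition above is easy to evaluate; the following small sketch (our own helper name) checks which side wins for given r, s and p:

```python
def geodesic_wins_p_laplacian(r, s, p):
    """For the graph of Figure 8: vertex 1 takes the label of its geodesically
    nearest labelled neighbour iff the effective conductance to vertex 2 is <= 1,
    i.e. iff p >= 1 + log(r)/log(s)."""
    c_eff = r / s ** (p - 1)
    return c_eff <= 1

# With r = 4 parallel paths of length s = 2 the threshold is p = 1 + log 4 / log 2 = 3.
for p in (2.0, 3.0, 4.0):
    print(p, geodesic_wins_p_laplacian(4, 2, p))
```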
2.4.2 Octopus graphs revisited in p-Laplacian interpolation setting
We remind ourselves that an octopus graph O_{1,r,s} is defined to be r path graphs (the tentacles) of length s, all adjoined at a common end vertex (which we call the central vertex), to which a further single head vertex is attached, so that n = |V| = rs + 2.

Figure 9: O_{1,2,2} graph

Figure 10 shows the transformation called merging symmetrical nodes. We can always merge nodes that play a symmetrical role in a graph without changing any of the predictions of the algorithms. This means that octopus graphs, once labelled, are actually the same as the graphs we consider in Figure 8.
Figure 10: Transformation of merging symmetrical nodes
Theorem 1 (p-Laplacian interpolation limitation). As long as p ≥ 1 + log r / log s, the Octopus graph O_{1,r,s} with a head labelled +1 and each tentacle ending in a −1 (queried in the order: head, and then ends of tentacles) will make r mistakes.

Proof. The octopus graph O_{1,r,s} is composed of a head (+1), a body and r tentacles of length s ending in a (−1). We note that each labelled end of tentacle plays a symmetric role, so they can be merged into a single vertex analogously to the transformation of Figure 10. If geodesic distance wins (meaning the body vertex is labelled +1), then the next end of tentacle will be labelled +1 as well, resulting in an extra mistake. As shown in 2.4.1, if we choose p ≥ 1 + log r / log s then geodesic distance will win. This remains true if only r′ < r tentacles have been labelled. So the process will make a mistake on each tentacle.
The following corollaries are a simple consequence of Theorem 1. They show a limitation of Laplacian interpolation for different ranges of the parameter spectrum, starting from the harmonic solution of p = 2.

Corollary 2 (2-Laplacian interpolation limitation). By using p = 2 in the above theorem, we recover the result of [10] proved by Herbster and Pontil, that there exists a graph on n vertices and a trial sequence such that Laplacian interpolation makes Θ(√n) mistakes.

Corollary 3 (k-Laplacian interpolation limitation). We let k and d be integers, and we consider the O_{1,d^{k-1},d} Octopus graph on n = Θ(d^k) vertices. Then for

$$p = 1 + \frac{\log d^{k-1}}{\log d} = k$$

and the previously specified trial sequence, k-Laplacian interpolation will make Θ(d^{k-1}) = Θ(n^{(k-1)/k}) mistakes.

Corollary 4 ((1 + a/b)-Laplacian interpolation limitation). We consider the O_{1,d^a,d^b} Octopus graph on n = Θ(d^{a+b}) vertices. Then for

$$p = 1 + \frac{\log d^a}{\log d^b} = 1 + \frac{a}{b}$$

and the previously specified trial sequence, (1 + a/b)-Laplacian interpolation will make Θ(d^a) = Θ(n^{a/(a+b)}) mistakes.
Using Corollary 4: a family of high mistake Octopus graphs in the limit p → 1

Consider the sequence of O_{1,d^{a_i},d^{b_i}} Octopus graphs from Corollary 4 with

$$a_i = i \quad \text{and} \quad b_i = i^2$$

Then

$$p_i = 1 + \frac{a_i}{b_i} = 1 + \frac{1}{i} \to 1$$

and for any large enough value of the index i,

$$d^{a_i} = d^i > i^2 = b_i$$

Since the number of mistakes is d^{a_i} and the diameter of the O_{1,d^{a_i},d^{b_i}} graph is of order d^{b_i}, we have a family of trees (with given trial sequence) such that the number of mistakes M satisfies

$$M > \phi \log D$$

And the gap between M and φ log D is of order

$$d^{a_i} - b_i = d^i - i^2 \to \infty$$

which is far from negligible.

In the case of prediction with the Ising model in the 0 temperature limit, it is proven in [11] that the algorithm obtains an optimal bound on trees, that is φ log D (where D is the diameter of the tree). An open problem is to prove or disprove whether the p-Laplacian prediction in the limit p → 1 obtains the same bound (if that is a properly defined concept).
2.4.3 Ising reductions
We let each edge have a coupling strength J(e) and use the general definition of the Ising model density,

$$p(u) = \frac{1}{Z} e^{\beta \sum_{e=ij} J(e)\, \delta_{u_i,u_j}}$$

For edges in parallel the reduction is very easy: we can join the terms corresponding to each edge into one term, and we get that the effective coupling strength is

$$J = J(e_1) + J(e_2)$$

The case of edges in series is slightly more complex. We note that this is the opposite of electrical network theory, where edges in series are the simpler case. On page 352 of Geoffrey Grimmett's book on the random cluster model [8] we find a formula for the reduction of two edges in series with respective coupling strengths J_1 and J_2. The effective coupling strength (after adjusting to our notation) is given by

$$J = \frac{1}{\beta} \log \frac{1 + e^{\beta(J_1 + J_2)}}{e^{\beta J_1} + e^{\beta J_2}}$$

We provide a proof of this below.

Proof. We consider a path of length 2 with coupling strengths J_1 and J_2 respectively. The 3 vertices have labels u_1, u_2, u_3. The Ising distribution is

$$p(u) \propto e^{\beta(J_1 \delta_{u_1,u_2} + J_2 \delta_{u_2,u_3})}$$

We now use $e^{\lambda \delta_{u_i,u_j}} = 1 + \delta_{u_i,u_j}(e^\lambda - 1)$ to get

$$p(u) \propto (1 + \delta_{u_1,u_2}(e^{\beta J_1} - 1))(1 + \delta_{u_2,u_3}(e^{\beta J_2} - 1))$$

We now wish to marginalise out u_2 to get

$$p(u_1, u_3) \propto \sum_{u_2} (1 + \delta_{u_1,u_2}(e^{\beta J_1} - 1))(1 + \delta_{u_2,u_3}(e^{\beta J_2} - 1))$$

By checking cases we get

$$p(u_1, u_3) \propto \begin{cases} e^{\beta J_1} + e^{\beta J_2} & \text{if } u_1 \neq u_3 \\ e^{\beta(J_1+J_2)} + 1 & \text{if } u_1 = u_3 \end{cases}$$

We rewrite this as

$$p(u_1, u_3) \propto (e^{\beta J_1} + e^{\beta J_2})^{1 - \delta_{u_1,u_3}} (e^{\beta(J_1+J_2)} + 1)^{\delta_{u_1,u_3}} \propto \Big(\frac{e^{\beta(J_1+J_2)} + 1}{e^{\beta J_1} + e^{\beta J_2}}\Big)^{\delta_{u_1,u_3}}$$

Now we simply have

$$e^{\beta J} = \frac{e^{\beta(J_1+J_2)} + 1}{e^{\beta J_1} + e^{\beta J_2}}$$

so

$$J = \frac{1}{\beta} \log \frac{e^{\beta(J_1+J_2)} + 1}{e^{\beta J_1} + e^{\beta J_2}} = \frac{1}{\beta} \log \frac{1 + e^{\beta(J_1+J_2)}}{e^{\beta J_1} + e^{\beta J_2}}$$
Theorem 5 provides a generalisation of the above formula to a path of length n. The proof starts off similarly to the above but is inevitably more complex, as we are dealing with a bigger marginalisation.

Theorem 5 (Ising path reduction). Consider a path of length n (n edges), each edge having a coupling strength J_i = 1. The path can be reduced to a single edge with an equivalent coupling strength J_eff given by

$$J_{\text{eff}} = \frac{1}{\beta} \log \Big( 1 + \frac{2(e^\beta - 1)^n}{(e^\beta + 1)^n - (e^\beta - 1)^n} \Big)$$

Proof. The Ising distribution on the path of length n with a binary vector u ∈ {−1, 1}^{n+1} is

$$p(u) \propto e^{\beta \sum_{i=1}^{n} J_i \delta_{u_i,u_{i+1}}} = \prod_{i=1}^{n} e^{\beta J_i \delta_{u_i,u_{i+1}}}$$

By using $e^{\lambda \delta_{u_i,u_{i+1}}} = 1 + \delta_{u_i,u_{i+1}}(e^\lambda - 1)$, we can rewrite this as

$$\prod_{i=1}^{n} \big(1 + \delta_{u_i,u_{i+1}}(e^\beta - 1)\big)$$

This can be rewritten as a sum over all subsets S of {1, ..., n} as follows

$$= \sum_{S} \prod_{i \in S} \delta_{u_i,u_{i+1}}(e^\beta - 1)$$

We now wish to marginalise over nodes u_2, ..., u_n to get the distribution on u_1, u_{n+1}:

$$p(u_1, u_{n+1}) \propto \sum_{u_2,...,u_n} \sum_{S} \prod_{i \in S} \delta_{u_i,u_{i+1}}(e^\beta - 1)$$

We note that the term corresponding to S containing all edges is a special case, since it depends on the boundary values u_1 and u_{n+1}. More specifically, if u_1 ≠ u_{n+1} then that product is always 0, so we treat it separately:

$$p(u_1, u_{n+1}) \propto \delta_{u_1,u_{n+1}} \prod_{i=1}^{n} (e^\beta - 1) + \sum_{u_2,...,u_n} \sum_{S : |S| \neq n} \prod_{i \in S} \delta_{u_i,u_{i+1}}(e^\beta - 1)$$

$$= \delta_{u_1,u_{n+1}} (e^\beta - 1)^n + \sum_{u_2,...,u_n} \sum_{S : |S| \neq n} \prod_{i \in S} \delta_{u_i,u_{i+1}}(e^\beta - 1)$$

We rewrite the second term as

$$\sum_{S : |S| \neq n} \sum_{u_2,...,u_n} \prod_{i \in S} \delta_{u_i,u_{i+1}}(e^\beta - 1)$$

We now notice that $\prod_{i \in S} \delta_{u_i,u_{i+1}}(e^\beta - 1)$ is non-zero for exactly 2^{n−1−|S|} vectors u_2, ..., u_n. This is because each delta function removes exactly one degree of freedom. So the sum is exactly

$$\sum_{S : |S| \neq n} 2^{n-1-|S|} (e^\beta - 1)^{|S|}$$

Since there are $\binom{n}{k}$ sets of k edges, we can rewrite the above as a sum over k:

$$\sum_{k=0}^{n-1} \binom{n}{k} 2^{n-1-k} (e^\beta - 1)^k = 2^{n-1} \sum_{k=0}^{n-1} \binom{n}{k} \Big(\frac{e^\beta - 1}{2}\Big)^k$$

$$= 2^{n-1}\Big[\Big(1 + \frac{e^\beta - 1}{2}\Big)^n - \Big(\frac{e^\beta - 1}{2}\Big)^n\Big] = 2^{n-1}\Big[\Big(\frac{e^\beta + 1}{2}\Big)^n - \Big(\frac{e^\beta - 1}{2}\Big)^n\Big] = \frac{1}{2}\big[(e^\beta + 1)^n - (e^\beta - 1)^n\big]$$

Now we divide p(u_1, u_{n+1}) by the above to get

$$p(u_1, u_{n+1}) \propto \frac{(e^\beta - 1)^n}{\frac{1}{2}\big[(e^\beta + 1)^n - (e^\beta - 1)^n\big]} \, \delta_{u_1,u_{n+1}} + 1$$

Now we recognise the form $e^{\lambda \delta_{u_1,u_{n+1}}} = 1 + \delta_{u_1,u_{n+1}}(e^\lambda - 1)$, which we now use in the opposite direction to get

$$e^{\beta J_{\text{eff}}} - 1 = \frac{2(e^\beta - 1)^n}{(e^\beta + 1)^n - (e^\beta - 1)^n}$$

And finally this yields

$$J_{\text{eff}} = \frac{1}{\beta} \log \Big( 1 + \frac{2(e^\beta - 1)^n}{(e^\beta + 1)^n - (e^\beta - 1)^n} \Big)$$
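Theorem 5 is easy to check numerically on short paths by comparing the closed form against a brute-force marginalisation (a sketch with our own helper names):

```python
import itertools
import math

def j_eff_formula(n, beta):
    # closed form of Theorem 5 for a path with n unit-strength edges
    a, b = math.exp(beta) + 1, math.exp(beta) - 1
    return (1 / beta) * math.log(1 + 2 * b**n / (a**n - b**n))

def j_eff_bruteforce(n, beta):
    # marginalise the inner vertices of the path and read off the effective coupling
    # from p(u_1 = u_{n+1}) / p(u_1 != u_{n+1}) = exp(beta * J_eff)
    same = diff = 0.0
    for u in itertools.product([-1, 1], repeat=n + 1):
        w = math.exp(beta * sum(u[i] == u[i + 1] for i in range(n)))
        if u[0] == u[-1]:
            same += w
        else:
            diff += w
    return (1 / beta) * math.log(same / diff)

for n in (2, 3, 5):
    print(n, j_eff_formula(n, 0.7), j_eff_bruteforce(n, 0.7))  # should agree
```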
2.4.4 Ising case study
We consider again the graph of Figure 8. The effective coupling strength between vertices 1 and 2 in the Ising model is

$$J(1,2) = \frac{r}{\beta} \log \Big( 1 + \frac{2(e^\beta - 1)^s}{(e^\beta + 1)^s - (e^\beta - 1)^s} \Big)$$

It follows that geodesic distance 'wins' iff

$$\frac{r}{\beta} \log \Big( 1 + \frac{2(e^\beta - 1)^s}{(e^\beta + 1)^s - (e^\beta - 1)^s} \Big) < 1$$

It would be hard to invert this and solve for β the way we did for p-Laplacian interpolation. However, given β, r and s it is easy to compute the formula above and determine whether geodesic distance takes over or connectivity takes over. The main difference from the formula for p-Laplacian interpolation is that the coupling strength falls off exponentially with path length, while for p-Laplacian interpolation it falls off polynomially. Indeed

$$J(1,2) = \frac{r}{\beta} \log \Big( 1 + \frac{2(e^\beta - 1)^s}{(e^\beta + 1)^s - (e^\beta - 1)^s} \Big) \approx \frac{r}{\beta} \cdot \frac{2(e^\beta - 1)^s}{(e^\beta + 1)^s - (e^\beta - 1)^s}$$

The (e^β + 1)^s in the denominator is the dominating factor as s gets larger. So it seems that the Ising prediction is more sensitive to path length.
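This too is easy to evaluate numerically; the sketch below reuses j_eff_formula from the previous snippet (our own helper) and checks, for given r, s and β, whether the effective coupling to vertex 2 stays below the unit coupling to vertex 0:

```python
def geodesic_wins_ising(r, s, beta):
    # r parallel length-s paths between vertices 1 and 2: couplings add,
    # so compare r * J_eff(s) against the single unit edge to vertex 0
    return r * j_eff_formula(s, beta) < 1.0

# With r = 4, s = 2: at high temperature (small beta) geodesic distance wins,
# at lower temperature (larger beta) connectivity takes over.
for beta in (0.1, 1.0, 3.0):
    print(beta, geodesic_wins_ising(4, 2, beta))
```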
2.5 Geodesic minimising heuristic for Ising as T → ∞
What we would like to prove is that as T → ∞ (β → 0), the Ising prediction is geodesic minimising (meaning it will predict with the label of a nearest labelled neighbour). This is quite hard to show in general, so we will consider the simplified setting of a star graph. A star graph has a central vertex, and a number m of paths emanating from it of varying lengths. We sketch single edges emanating from the central vertex 0 for convenience, but the paths could have arbitrary lengths.

Figure 11: Star graph

We suppose that the central vertex is unlabelled and all paths lead to vertices labelled either +1 or −1. We assume that all k paths leading to a −1 have length l and all k′ paths leading to a +1 have length l′ > l. The above simplification does not restrict generality, because if path lengths are different we are proving a stronger result by taking the minimum path length leading to a +1 and the maximum path length leading to a −1. What we are trying to show is that for β small enough, the following holds

$$J(0, -1) > J(0, +1)$$

where J(0, −1) is the total coupling strength to vertices labelled −1 and J(0, +1) is the total coupling strength to vertices labelled +1. So using the formula of 2.4.3,

$$k \log \Big( 1 + \frac{2(e^\beta - 1)^l}{(e^\beta + 1)^l - (e^\beta - 1)^l} \Big) > k' \log \Big( 1 + \frac{2(e^\beta - 1)^{l'}}{(e^\beta + 1)^{l'} - (e^\beta - 1)^{l'}} \Big)$$

We take a first order Taylor expansion of log(1 + x), since in the limit as β → 0 second order terms are negligible. It suffices to prove

$$k \, \frac{2(e^\beta - 1)^l}{(e^\beta + 1)^l - (e^\beta - 1)^l} > k' \, \frac{2(e^\beta - 1)^{l'}}{(e^\beta + 1)^{l'} - (e^\beta - 1)^{l'}}$$

So it suffices to prove that for any constant C > 0,

$$\lim_{\beta \to 0} (e^\beta - 1)^{l' - l} < C$$

And this is trivially true since

$$\lim_{\beta \to 0} (e^\beta - 1)^{l' - l} = 0$$

This shows that in the infinite temperature limit, no matter how many paths lead to a +1, if there is a path leading to a −1 that is shorter than all of them, the central vertex will be labelled negative. This is a strong heuristic for the geodesic minimising property.
3 Understanding the algorithms with guarantees
3.1 Halving bound for Ising Model
Theorem 6 (halving bound). Suppose we predict online with the argmax of the marginal of a distribution p, given the observed labels. Then for any u consistent with the trial sequence we have the following bound on the number of mistakes M:

$$M \leq \log_2 \Big( \frac{1}{p(u)} \Big)$$

Proof. We start with a distribution on a set of states. At each time step we predict with the argmax of the marginal at the queried state. We let W_t denote the total available mass at time t, where the available mass is defined as

$$\sum_{u_{i_1},...,u_{i_k} \in \{\pm 1\}} p(u_{i_1},...,u_{i_k} \mid \text{Data}) \, p(\text{Data})$$

where the sum is over the non-observed states and Data contains the information on the observed states. On each trial where a mistake occurs, the wrong label carries at least half of the available mass, so when the constraint is added the total available mass at least halves. So at the end of the trial sequence the total available mass satisfies

$$W_T \leq \Big(\frac{1}{2}\Big)^M$$

Since the total available mass still contains the mass of any label consistent configuration, for any u consistent with the trial sequence we have

$$W_T \geq p(u)$$

hence

$$p(u) \leq \frac{1}{2^M} \quad \text{or} \quad M \leq \log_2 \Big( \frac{1}{p(u)} \Big)$$

Corollary 7. If we predict the graph labellings with the argmax of the marginal of the Ising distribution at the given vertex, then the total number of mistakes M satisfies

$$M \leq \log_2(Z(\beta, G)) + \beta \phi_G(u) \log_2 e$$

Proof. For the Ising model

$$p(u) = \frac{1}{Z(\beta, G)} e^{-\beta \phi_G(u)}$$

So applying Theorem 6, we get that predicting with Ising model marginals can make at most M mistakes where

$$M \leq \log_2 \Big( \frac{1}{p(u)} \Big) = \log_2(Z(\beta, G)) + \beta \phi_G(u) \log_2 e$$

as required.
3.2 Reformulating partition function as Tutte polynomial
The Tutte polynomial (denition 8) is one of the most important and fundamental objects in
graph theory. It happens to be intricately linked to the partition function of the Ising distribution
(Theorem 9), creating a bridge between graph theory and statistical physics which is an active
research area. Theorem 9 and it’s proof were inspired (but not fully identical) to the analysis in
the following paper [2].
18
Denition 8. The Tutte polynomial for a graph Gis a polynomial in two variables dened by
TG(x, y) =
AE(G)
(x1)k(A)k(E)(y1)k(A)+|A|−|V|
where k(A)denotes the number of connected components of the graph (V, A).
Theorem 9. We may rewrite the partition function of the Ising model as a Tutte polynomial
Z(β, G) = 2eβ(mn+1) (1 eβ)n1TG(1 + eβ
1eβ, eβ)
Theorem 9 will help us nd explicit expressions for the halving bound on trees. The proof is
provided in appendix B. In general we expect this link to the Tutte polynomial to be useful to
predict theoretical properties of the partition function, and hence of the halving bound. We also
observe that the parameters x, y =1+eβ
1eβ, eβof the Tutte polynomial satisfy
(x1)(y1) = (1 + eβ
1eβ1)(eβ1) = 2
Hence studying the Ising partition function is equivalent to studying the Tutte polynomial on the
hyperbola (x1)(y1) = 2.
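The correspondence can be checked directly on a small example (a sketch with our own helper names): evaluate the Tutte polynomial from Definition 8 by summing over edge subsets, plug it into Theorem 9, and compare with the brute-force partition function.

```python
import itertools
import math

def components(n, edge_subset):
    # number of connected components of (V, A) via union-find
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for i, j in edge_subset:
        parent[find(i)] = find(j)
    return len({find(v) for v in range(n)})

def tutte(n, edges, x, y):
    # direct sum over edge subsets (Definition 8); fine for tiny graphs
    k_E = components(n, edges)
    total = 0.0
    for size in range(len(edges) + 1):
        for A in itertools.combinations(edges, size):
            k_A = components(n, A)
            total += (x - 1) ** (k_A - k_E) * (y - 1) ** (k_A + len(A) - n)
    return total

def z_bruteforce(n, edges, beta):
    return sum(math.exp(-beta * sum(u[i] != u[j] for i, j in edges))
               for u in itertools.product([-1, 1], repeat=n))

# a small connected graph: a triangle with a pendant vertex
n, edges, beta = 4, [(0, 1), (1, 2), (2, 0), (2, 3)], 0.8
m, e = len(edges), math.exp(-beta)
z_tutte = 2 * e ** (m - n + 1) * (1 - e) ** (n - 1) * tutte(n, edges, (1 + e) / (1 - e), math.exp(beta))
print(z_tutte, z_bruteforce(n, edges, beta))  # should agree
```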
3.3 Working out the bound for trees
The Tutte polynomial of a tree with n vertices evaluates to x^{n−1}. This is a simple consequence of the deletion-contraction relations for Tutte polynomials (see for instance [5]): for an edge e that is neither a bridge nor a loop,

$$T_G(x, y) = T_{G-e}(x, y) + T_{G/e}(x, y)$$

where G−e is the graph G with edge e removed and G/e is the graph G with edge e contracted, while for a bridge T_G(x, y) = x·T_{G/e}(x, y). Hence, using Theorem 9, the partition function of a tree is

$$Z(\beta, G) = 2 e^{-\beta(m - n + 1)} (1 - e^{-\beta})^{n-1} \, T\Big(G; \frac{1 + e^{-\beta}}{1 - e^{-\beta}}, e^{\beta}\Big) = 2 (1 - e^{-\beta})^{n-1} \Big(\frac{1 + e^{-\beta}}{1 - e^{-\beta}}\Big)^{n-1} = 2 (1 + e^{-\beta})^{n-1}$$

We can then work out the bound of the halving algorithm,

$$M \leq 1 + (n-1) \log_2(1 + e^{-\beta}) + \beta \phi_G(u) \log_2 e$$

We take a derivative with respect to β and set it equal to 0,

$$\frac{(n-1)}{1 + e^{-\beta}} e^{-\beta} = \phi_G(u)$$

Solving for β,

$$\beta = \log \Big( \frac{n-1}{\phi_G(u)} - 1 \Big)$$

Choosing the tuning β = log((n−1)/φ(u)) (near optimal), the bound simplifies to

$$M \leq 1 + (n-1) \log_2\Big(1 + \frac{\phi(u)}{n-1}\Big) + \log_2\Big(\frac{n-1}{\phi(u)}\Big) \phi(u) \leq 1 + \phi(u) \log_2 e + \log_2\Big(\frac{n-1}{\phi(u)}\Big) \phi(u)$$

The family of trees includes paths and star graphs, so we immediately get the bound for these graphs as well.
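For concreteness, the tuned tree bound is easy to evaluate (a sketch; the helper simply plugs β = log((n−1)/φ) into the expression above):

```python
import math

def tree_halving_bound(n, phi):
    """Halving-algorithm mistake bound on a tree with n vertices and cut phi,
    at the near-optimal tuning beta = log((n-1)/phi)."""
    beta = math.log((n - 1) / phi)
    return 1 + (n - 1) * math.log2(1 + math.exp(-beta)) + beta * phi * math.log2(math.e)

# Example: a tree on 1025 vertices whose true labelling has cut 4.
print(tree_halving_bound(1025, 4))  # roughly 1 + 4*log2(e) + 4*log2(256), i.e. about 39
```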
3.4 Bound is tight for path
The above bound is near-tight for a path. To see why this is true, we play an adversarial game. We are given a path of length n with unlabelled vertices. The unknown labelling u is chosen to have cut k, with +1 from position 1 to n/k, then −1 from position n/k to 2n/k, and so on.

Figure 12: Example of labelling with n = 9, k = 2

We first force k + 1 mistakes by querying the labels at the positions in/k for i = 0, ..., k. Then for each of the sections [in/k, (i+1)n/k] we can force log_2(n/k) mistakes by a binary search, always querying a vertex closer to the right hand vertex of the section. So this is a total of approximately

$$M = 1 + k + k \log_2 \frac{n}{k}$$

mistakes, which matches the dominant φ log((n−1)/φ) term of the upper bound found above, so the bound is near-tight.
3.5 Clique
For a clique on n vertices, we can write the partition function of the Ising model as follows

$$Z(\beta, G) = \sum_u e^{-\beta \phi(u)} = \sum_{k=0}^{n} \binom{n}{k} e^{-\beta k(n-k)}$$

The behaviour of the above formula is complex. It can also be seen as the partition function of a fully connected Boltzmann machine. So the bound on the number of mistakes for a clique is

$$M \leq \log_2 \Big( \sum_{k=0}^{n} \binom{n}{k} e^{-\beta k(n-k)} \Big) + \beta \phi_G(u) \log_2 e$$

It remains open to find an optimal tuning for β in the bound given above.
3.6 Bound variations as we add or remove edges
An interesting question is whether the bound gets better or worse as we add more edges (or possibly this depends intricately on the topology). We consider first the act of adding one edge to a graph G. This could increase the cut of the true labelling (φ_G(u)) by 1 or by 0. On the other hand the partition function will change: it can only decrease, as any labelling has a cut at least as high in the new graph with an extra edge. In particular it cannot decrease more than in the scenario where each labelling has its cut increase by 1. So

$$Z(\beta, G) \geq Z(\beta, G+e) = \sum_u e^{-\beta \phi_{G+e}(u)} \geq \sum_u e^{-\beta(\phi_G(u) + 1)} = e^{-\beta} \sum_u e^{-\beta \phi_G(u)} = Z(\beta, G) e^{-\beta}$$

and hence

$$Z(\beta, G) \geq Z(\beta, G+e) \geq Z(\beta, G) e^{-\beta}$$

This in turn shows that

$$\log_2 Z(\beta, G) \geq \log_2 Z(\beta, G+e) \geq \log_2 Z(\beta, G) - \beta \log_2 e$$

So

$$\log_2 Z(\beta, G) + \beta \phi_G(u) \log_2 e + \beta \log_2 e \;\geq\; \log_2 Z(\beta, G+e) + \beta \phi_{G+e}(u) \log_2 e \;\geq\; \log_2 Z(\beta, G) + \beta \phi_G(u) \log_2 e - \beta \log_2 e$$

To make the above easier to read we introduce the following notation for the bound given by the halving algorithm,

$$B(\beta, G, u) = \log_2 Z(\beta, G) + \beta \phi_G(u) \log_2 e$$

Then we can rewrite the result as

$$B(\beta, G, u) + \beta \log_2 e \;\geq\; B(\beta, G+e, u) \;\geq\; B(\beta, G, u) - \beta \log_2 e$$

What this means is that adding an edge can only increase or decrease the bound by a fixed amount, β log_2 e, at most. So at very high temperatures adding an edge has little importance, while at low temperatures adding an edge can change the bound drastically.
3.7 Exploring link between Ising partition function and network reliability
Network reliability is the probability that a dynamical system composed of discrete elements interacting on a network will be found in a configuration that satisfies a particular property. The analysis of Theorem 11 was first done in [13]; it reformulates the partition function of the Ising model as a network reliability for a property called Ising feasibility (defined in the theorem statement). This offers a different, probabilistic way to think about the Ising partition function, which could provide insight into its behaviour. In particular we use this theorem to discover a fun fact about graphs.

Definition 10. The network reliability polynomial of a graph G with property r and parameter x ∈ [0,1] is the probability that a subgraph of G, obtained by keeping each edge of G with probability 1−x, has property r:

$$R_G(r; x) = \sum_{j=0}^{|E(G)|} a_j (1-x)^j x^{|E(G)| - j}$$

where a_j denotes the number of subgraphs of G with j edges that have property r.

Theorem 11 (Partition function of the Ising model as a network reliability). We may rewrite the partition function of the Ising model (1) as a network reliability,

$$Z(\beta, G) = 2 R_G(r; x) \, x^{-|E(G)|}$$

where $x = \frac{1}{1 + e^{-\beta}}$, and the property r is called Ising feasibility: a subgraph s of G is Ising feasible if and only if there is a labelling of the vertices of G such that the cut edges are exactly the edges of s.

Proof.

$$Z(\beta, G) = \sum_u e^{-\beta \phi(u)}$$

Since each Ising feasible configuration with j edges is associated to 2 (opposite) binary labellings with cut j, we can rewrite the above as

$$= \sum_{j=0}^{|E(G)|} 2 a_j e^{-\beta j}$$

where a_j is the number of Ising feasible configurations with j edges. We let $\frac{1-x}{x} = e^{-\beta}$, so $x = \frac{1}{1 + e^{-\beta}}$. Then

$$= \sum_{j=0}^{|E(G)|} 2 a_j (1-x)^j x^{-j} = x^{-|E(G)|} \sum_{j=0}^{|E(G)|} 2 a_j (1-x)^j x^{|E(G)| - j} = 2 R_G(r; x) \, x^{-|E(G)|}$$
Figure 13: Image taken from [13] : (a) An Ising-feasible conguration with three spin-up vertices
(red dots) and eight discordant spin pairs or edges (solid line segments). (b) A conguration
that is not Ising-feasible because of the inconsistent edges (red line segments). Independently
choosing edges and spin-up vertices will rarely produce an Ising-feasible conguration, but any set
of randomly chosen spin-up vertices uniquely determines a set of edges.
Theorem 11 allows us to discover an interesting fun fact about graphs. Suppose we set β = 0; then x = 1/2. The partition function for β = 0 is just 2^n, so Theorem 11 states

$$2^n = 2 R_G(r; \tfrac{1}{2}) \, 2^{|E(G)|}$$

or equivalently

$$R_G(r; \tfrac{1}{2}) = \frac{2^{n-1}}{2^{|E(G)|}}$$

We remind ourselves that R_G(r; 1/2) is the probability that a subgraph of G obtained by keeping edges with probability 1/2 is Ising feasible. So we get the interesting fact that the probability P that a configuration is Ising feasible, when the probability of keeping edges is 1/2, depends only on the number of vertices and edges, and not on any other topological properties of the graph, which is fairly surprising:

$$P = \frac{2^{n-1}}{2^m}$$
4 Conclusion
In this thesis we studied p-Laplacian interpolation and predicting with the Ising model with temperature T. We proved some results about the behaviour of these algorithms at the extremes of the parameter spectrum (T → ∞ and p = ∞). For instance we proved that p-Laplacian interpolation with p = ∞ is not geodesic minimising in itself; however, if we use the forward type of prediction it is geodesic minimising (or ill defined). We also showed a heuristic for Ising predictions to be geodesic minimising as T → ∞ by using the simplified setting of star graphs. We also studied the interplay between connectivity and geodesic distance in a simplified scenario, and found formulas to determine which one will take over as a function of the parameter and topological properties.

It remains open to understand whether the limits of p-Laplacian interpolation as p → ∞ and p → 1 are well defined and useful. It would be interesting to prove or disprove that p-Laplacian interpolation as p → 1 achieves a φ log D bound. It is also not proven in the general case that the Ising prediction is geodesic minimising as T → ∞.

One interesting direction for future research is in understanding the implications of the link between the Ising partition function and the Tutte polynomial for the halving bound. The Tutte polynomial is a very fundamental concept in graph theory and has been studied extensively. Hence thinking of the bound in terms of the Tutte polynomial might help us discover interesting properties. Similarly, the link between the Ising partition function and the Ising feasibility network reliability offers a probabilistic way to think of the Ising partition function that could also provide insight.
5 Appendix A: max-norm interpolation on a graph

Introduction

This section is made to better understand what happens to p-Laplacian interpolation as p → ∞. The p-norm of a vector x is given by

$$\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{\frac{1}{p}}$$

When p → ∞ the p-norm becomes the max norm,

$$\|x\|_\infty = \max_i |x_i|$$
Stating the problem

We are interested in what happens to p-Laplacian interpolation when p = ∞. That is, given observed labels (a dataset D = {(i, y_i) : i ∈ L}, where L is the set of labelled vertices), we wish to find a labelling u* given by

$$u^* = \operatorname{argmin}_u \max_{ij \in E(G)} |u_i - u_j|$$

subject to

$$u_i = y_i \quad \text{for } i \in L$$

The value of the labelling is then

$$\text{val}(G, D) = \max_{ij \in E(G)} |u^*_i - u^*_j|$$

In most cases u* is not unique, as the minimization is not strictly convex.
Some illustrative examples

Intuitively, to minimize the max norm we want all changes to be as gradual as possible. To illustrate some properties of this type of interpolation, we will study a few simple but enlightening examples.

Example 1: the path graph

We consider a path of length l, from vertex 0 to vertex l, with respective labels u_0 and u_1.

Figure 14: Path graph

It is clear that the labelling u with minimum max-norm is given by linearly interpolating from u_0 to u_1 in equal steps. This means that vertex k will have label

$$u_k = u_0 + \frac{(u_1 - u_0)}{l} k = \frac{l-k}{l} u_0 + \frac{k}{l} u_1 = \frac{d(k, l)}{d(0, l)} u_0 + \frac{d(k, 0)}{d(0, l)} u_1$$

where d denotes distance between vertices. The last formulation is the most interesting, as it shows that the labellings are weighted averages of the end vertices. The value of the path graph is

$$\text{val}(P; u_0, u_1) = \frac{|u_1 - u_0|}{d(0, l)}$$

This result is very simple but will be useful later when we find an algorithm to find a labelling for a general tree.
Example 2: the star graph

A star graph has a central vertex, and a number m of paths emanating from it of varying lengths. We sketch single edges emanating from the central vertex 0 for convenience, but the paths could have arbitrary lengths.

Figure 15: Star graph

We suppose that the central vertex is unlabelled and all paths lead to vertices labelled either +1 or −1. We claim that the unknown label of the central vertex in max norm minimisation depends only on the shortest path ending in a +1 and the shortest path ending in a −1. The proof is simple: no matter what the central label is, there is more room to linearly interpolate on the longer paths, so the only paths that matter are the shortest paths ending in +1 and −1. The label is then found using the path graph solution of Example 1.
An algorithm for max-norm interpolation on trees and proof of correctness

To my knowledge the following algorithm and its proof are novel.

Statement of the algorithm

We work on a tree G that is partially labelled with real numbers (not necessarily binary). We denote by L the set of labelled vertices. For two vertices i, j in L connected by a non-labelled path, we define the value of that connecting path by

$$\text{val}(i, j) = \frac{|u_i - u_j|}{d(i, j)}$$

This is essentially the value of the path when labelled optimally as in Example 1.

Algorithm (Max norm interpolation):

Step 1: Find the fully unlabelled path between vertices i, j in L that has maximum value (choose at random if not unique). That is, i, j = argmax_{i,j ∈ L} val(i, j).

Step 2: Label the path between i and j using linear interpolation, as in Example 1.

Step 3: Add all newly labelled vertices to the set L of labelled vertices.

Step 4: Return to Step 1.
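Below is a small sketch of this procedure in code (our own implementation under the stated assumptions: G is a tree with unit-length edges, the labelled vertices carry real values, and helper names are ours):

```python
from collections import deque

def max_norm_interpolate_tree(adj, labels):
    """adj: dict vertex -> list of neighbours (a tree); labels: dict vertex -> real value.
    Repeatedly pick the fully unlabelled path between two labelled vertices with the
    largest value |u_i - u_j| / d(i, j) and linearly interpolate along it."""
    labels = dict(labels)

    def unlabelled_path(src, dst):
        # BFS path from src to dst whose interior vertices are all unlabelled
        prev = {src: None}
        q = deque([src])
        while q:
            v = q.popleft()
            if v == dst:
                break
            for w in adj[v]:
                if w not in prev and (w == dst or w not in labels):
                    prev[w] = v
                    q.append(w)
        if dst not in prev:
            return None
        path = [dst]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        return path[::-1]

    while True:
        best = None
        for i in labels:
            for j in labels:
                if i >= j:
                    continue
                path = unlabelled_path(i, j)
                if path is None or len(path) < 3:
                    continue  # no fully unlabelled connecting path
                value = abs(labels[i] - labels[j]) / (len(path) - 1)
                if best is None or value > best[0]:
                    best = (value, path)
        if best is None:
            break
        _, path = best
        a, b = labels[path[0]], labels[path[-1]]
        for k, v in enumerate(path):
            labels[v] = a + (b - a) * k / (len(path) - 1)
    return labels

# Example: a path 0-1-2-3-4 with the endpoints labelled.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(max_norm_interpolate_tree(adj, {0: -1.0, 4: 1.0}))
```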
Proof of correctness

We will show that when we label a maximum value path, the newly labelled vertices can only create paths whose value is at most that of the previously labelled maximum value path. Suppose that vertices 0 and 1, labelled u_0 and u_1 respectively, form a path of maximum value in G. We assume without loss of generality that u_1 ≥ u_0 (if not, interchange the labels).
Figure 16: Sketch for proof

Let x be any vertex on the newly labelled path, with new label u_x. Let k, with label u_k, be another labelled vertex connected to x through an unlabelled path. We wish to show that

$$\text{val}(x, k) \leq \text{val}(0, 1)$$

Since 0, 1 is a maximum value path in G, we have the following useful information,

$$\text{val}(0, k) \leq \text{val}(0, 1) \quad \text{and} \quad \text{val}(1, k) \leq \text{val}(0, 1)$$
Proof.

$$\text{val}(x, k) = \frac{|u_x - u_k|}{d(x, k)}$$

By the formula of Example 1, we can rewrite u_x as follows

$$u_x = \frac{d(x, 1)}{d(0, 1)} u_0 + \frac{d(x, 0)}{d(0, 1)} u_1$$

So, using d(0,1) = d(x,1) + d(x,0),

$$\text{val}(x, k) = \frac{\big|\frac{d(x,1)}{d(0,1)} u_0 + \frac{d(x,0)}{d(0,1)} u_1 - u_k\big|}{d(x, k)} = \frac{|d(x, 1) u_0 + d(x, 0) u_1 - d(0,1) u_k|}{d(x, k)\, d(0,1)} = \frac{|d(x, 1)(u_0 - u_k) + d(x, 0)(u_1 - u_k)|}{d(x, k)\, d(0,1)}$$

We now use d(x, 0) = d(0,1) − d(x, 1) to rewrite this as

$$= \frac{|d(x, 1)(u_0 - u_1) + d(0,1)(u_1 - u_k)|}{d(x, k)\, d(0,1)} \qquad (3)$$

or, symmetrically,

$$= \frac{|d(x, 0)(u_1 - u_0) + d(0,1)(u_0 - u_k)|}{d(x, k)\, d(0,1)} \qquad (4)$$

We need to do some case checking to remove the absolute values without losing valuable sign information. There will be four cases to check. This is tedious but necessary to get the correct inequalities. We remind ourselves that by assumption u_0 ≤ u_1.
We will also need the following: the inequalities val(1, k) ≤ val(0,1) and val(0, k) ≤ val(0,1) imply that

$$|u_1 - u_k| \leq |u_1 - u_0| \frac{d(1, k)}{d(0,1)} \qquad (5)$$

and

$$|u_0 - u_k| \leq |u_1 - u_0| \frac{d(0, k)}{d(0,1)} \qquad (6)$$

Case 1: if u_k ≤ u_0, then we use equation (3):

$$\frac{|d(x, 1)(u_0 - u_1) + d(0,1)(u_1 - u_k)|}{d(x, k)\, d(0,1)} = \frac{d(0,1)|u_1 - u_k| - d(x, 1)|u_0 - u_1|}{d(x, k)\, d(0,1)}$$

We now use equation (5),

$$\leq \frac{d(1, k)|u_1 - u_0| - d(x, 1)|u_1 - u_0|}{d(x, k)\, d(0,1)} = \frac{|u_1 - u_0|}{d(0,1)}$$

because d(1, k) − d(x, 1) = d(x, k).

Case 2: if u_1 ≤ u_k, then we use equation (4):

$$\frac{|d(x, 0)(u_1 - u_0) + d(0,1)(u_0 - u_k)|}{d(x, k)\, d(0,1)} = \frac{d(0,1)|u_k - u_0| - d(x, 0)|u_1 - u_0|}{d(x, k)\, d(0,1)}$$

We now use equation (6),

$$\leq \frac{d(0, k)|u_1 - u_0| - d(x, 0)|u_1 - u_0|}{d(x, k)\, d(0,1)} = \frac{|u_1 - u_0|}{d(0,1)}$$

Case 3: if u_0 ≤ u_k ≤ u_1 and d(0,1)(u_1 − u_k) ≥ d(x, 1)|u_0 − u_1|, then we use equation (3):

$$\frac{|d(x, 1)(u_0 - u_1) + d(0,1)(u_1 - u_k)|}{d(x, k)\, d(0,1)} = \frac{d(0,1)|u_1 - u_k| - d(x, 1)|u_0 - u_1|}{d(x, k)\, d(0,1)}$$

We now use equation (5),

$$\leq \frac{d(1, k)|u_1 - u_0| - d(x, 1)|u_1 - u_0|}{d(x, k)\, d(0,1)} = \frac{|u_1 - u_0|}{d(0,1)}$$

Case 4: if u_0 ≤ u_k ≤ u_1 and d(0,1)(u_1 − u_k) ≤ d(x, 1)|u_0 − u_1|, then we use equation (4). We first need to note that

$$d(0,1)(u_1 - u_k) \leq d(x, 1)|u_0 - u_1|$$

is equivalent to

$$d(0,1)(u_k - u_0) \geq d(x, 0)|u_1 - u_0|$$

so

$$\frac{|d(x, 0)(u_1 - u_0) + d(0,1)(u_0 - u_k)|}{d(x, k)\, d(0,1)} = \frac{d(0,1)|u_0 - u_k| - d(x, 0)|u_1 - u_0|}{d(x, k)\, d(0,1)}$$

We now use equation (6),

$$\leq \frac{d(0, k)|u_1 - u_0| - d(x, 0)|u_1 - u_0|}{d(x, k)\, d(0,1)} = \frac{|u_1 - u_0|}{d(0,1)}$$

This covers all possible cases: in each case val(x, k) ≤ |u_1 − u_0| / d(0,1) = val(0, 1). So we are done with the proof.
Generalisation to arbitrary graphs

We note that the above analysis does not heavily use the fact that we are on a tree. That assumption is only used because it makes the definition of the values easier, as there is only one path between any two vertices on a tree. That being said, we can generalise the value function to

$$\text{val}(i, j) = \max_{P} \frac{|u_i - u_j|}{\text{len}(P)}$$

where P denotes a path from i to j and len(P) its length. With this new definition, the algorithm above and its proof still hold. The only difference in the proof is that now we must show

$$\text{val}(x, k) = \max_{P} \frac{|u_x - u_k|}{\text{len}(P)} \leq \text{val}(0, 1)$$

So we must consider every possible path from x to k. However, for every single one of these paths P the analysis above can be applied to show

$$\frac{|u_x - u_k|}{\text{len}(P)} \leq \text{val}(0, 1)$$

showing

$$\text{val}(x, k) = \max_{P} \frac{|u_x - u_k|}{\text{len}(P)} \leq \text{val}(0, 1)$$

as required.
Corollaries

The above result has two immediate and enlightening corollaries explaining how this kind of interpolation behaves. The first is that the value of an interpolating labelling depends only on the smallest distance between a −1 and a +1 in the given data.

A further conclusion is that when predicting with

$$u_i = \operatorname{argmin}_{a \in \{\pm 1\}} \text{val}\big(u \mid \text{Data} + (u_i, a)\big)$$

the prediction is indeed geodesic minimising (or ill defined).
6 Appendix B: proof of Theorem 9
We remind ourselves the formula we wish to prove
Z(β, G) = 2eβ(mn+1) (1 eβ)n1TG(1 + eβ
1eβ, eβ)
By looking at the partition function Zβ(G)as a function of the graph G, we observe that it satises
an interesting recursive property
Zβ(G) = (1 eβ)Zβ(G/e) + eβZβ(Ge)
where e=ij denotes any edge (not loop or bridge), G/eis a new graph where edge ewas contracted
and Geis new graph with edge eremoved. We show this below
\[
Z_\beta(G) = \sum_u e^{-\beta\,\phi_G(u)} = \sum_{u:\,u_i=u_j} e^{-\beta\,\phi_G(u)} + \sum_{u:\,u_i\neq u_j} e^{-\beta\,\phi_G(u)}
\]
If $u_i = u_j$ then the cut doesn't change when we contract edge $e$. However, if $u_i \neq u_j$ the cut
decreases by 1 when removing edge $e$. So
\[
= \sum_{u:\,u_i=u_j} e^{-\beta\,\phi_{G/e}(u)} + e^{-\beta}\sum_{u:\,u_i\neq u_j} e^{-\beta\,\phi_{G\setminus e}(u)}
\]
We are missing some terms, so we add and subtract them:
\[
= \sum_{u:\,u_i=u_j} e^{-\beta\,\phi_{G/e}(u)} - e^{-\beta}\sum_{u:\,u_i=u_j} e^{-\beta\,\phi_{G\setminus e}(u)} + e^{-\beta}\sum_{u:\,u_i=u_j} e^{-\beta\,\phi_{G\setminus e}(u)} + e^{-\beta}\sum_{u:\,u_i\neq u_j} e^{-\beta\,\phi_{G\setminus e}(u)}
\]
We have $\phi_{G/e}(u) = \phi_{G\setminus e}(u)$ if $u_i = u_j$. So
\[
= (1-e^{-\beta})\sum_u e^{-\beta\,\phi_{G/e}(u)} + e^{-\beta}\sum_u e^{-\beta\,\phi_{G\setminus e}(u)}
\]
As desired.
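The following short Python sketch is a brute-force numerical check of this deletion–contraction recursion on a small example graph (the example graph and the helper names are our own choices for illustration).

```python
import math
from itertools import product

def cut_size(edges, labels):
    # number of edges whose endpoints carry different labels
    return sum(1 for u, v in edges if labels[u] != labels[v])

def Z(vertices, edges, beta):
    # brute-force partition function: sum over all +-1 labellings of exp(-beta * cut)
    total = 0.0
    for spins in product([-1, 1], repeat=len(vertices)):
        labels = dict(zip(vertices, spins))
        total += math.exp(-beta * cut_size(edges, labels))
    return total

# example: a 4-cycle; the edge e = (0, 1) is neither a loop nor a bridge
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
beta = 0.5

deleted = edges[1:]                              # G \ e, same vertex set
contracted = [(0 if a == 1 else a, 0 if b == 1 else b) for a, b in deleted]  # G / e: identify 1 with 0
lhs = Z(vertices, edges, beta)
rhs = (1 - math.exp(-beta)) * Z([0, 2, 3], contracted, beta) + math.exp(-beta) * Z(vertices, deleted, beta)
print(lhs, rhs)  # the two values agree up to floating point error
```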
Theorem 12 (Tutte Polynomial Universality Theorem [7]). Let $f$ be a function on graphs satisfying
the three following properties.
Property 1 : $f(G) = 1$ if $G$ is a vertex.
Property 2 : $f(G \cup H) = f(G)f(H)$ if $G \cup H$ denotes either the disjoint union of $G$ and $H$ or a
union where $G$ and $H$ share exactly one vertex.
Property 3 : $f(G) = a\,f(G\setminus e) + b\,f(G/e)$ with $ab \neq 0$ whenever $e$ is neither a loop nor a bridge.
Then $f$ is an evaluation of the Tutte polynomial of the form
\[
f(G) = a^{n(G)}\,b^{r(G)}\,T\!\left(G;\ \frac{x_0}{b},\ \frac{y_0}{a}\right)
\]
with $n(G)$ the corank of $G$, $r(G)$ the rank of $G$, $x_0 = f(K_2)$ and $y_0 = f(L)$. $K_2$ is the graph
with one edge, $L$ is just a loop (one vertex with a loop edge), and $T$ is the Tutte polynomial.
We can now apply this theorem to our specific case. $Z_\beta(G)$ does not satisfy the three properties,
but if we define
\[
\tilde Z_\beta(G) = 2^{-k(G)}\,Z_\beta(G)
\]
then we can check that $\tilde Z$ satisfies the conditions for the theorem with
\[
a = e^{-\beta},\qquad b = (1-e^{-\beta}),\qquad x_0 = 1 + e^{-\beta},\qquad y_0 = 1.
\]
We then get the formula
\[
Z_\beta(G) = 2^{k(G)}\,e^{-\beta(|E(G)|-n+k(G))}\,(1-e^{-\beta})^{n-k(G)}\,T\!\left(G;\ \frac{1+e^{-\beta}}{1-e^{-\beta}},\ e^{\beta}\right)
\]
For a connected graph with $m$ edges and $n$ vertices this is
\[
Z_\beta(G) = 2\,e^{-\beta(m-n+1)}\,(1-e^{-\beta})^{n-1}\,T\!\left(G;\ \frac{1+e^{-\beta}}{1-e^{-\beta}},\ e^{\beta}\right)
\]
The parameters $x, y$ of the Tutte polynomial range over the hyperbola
\[
(x-1)(y-1) = 2.
\]
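The connected-graph formula can also be checked numerically on a small example. The sketch below is our own illustrative code, with a naive deletion–contraction evaluation of the Tutte polynomial and the same brute-force partition function as in the previous sketch, run here on a triangle with a pendant edge.

```python
import math
from itertools import product

def Z(vertices, edges, beta):
    # brute-force Ising partition function: sum over +-1 labellings of exp(-beta * cut)
    total = 0.0
    for spins in product([-1, 1], repeat=len(vertices)):
        labels = dict(zip(vertices, spins))
        total += math.exp(-beta * sum(1 for u, v in edges if labels[u] != labels[v]))
    return total

def connected(u, v, edges):
    # is v reachable from u using the given edge list?
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {u}, [u]
    while stack:
        w = stack.pop()
        if w == v:
            return True
        for nb in adj.get(w, ()):
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return False

def tutte(edges, x, y):
    # naive deletion-contraction evaluation of T(G; x, y); parallel edges and loops allowed
    if not edges:
        return 1.0
    (a, b), rest = edges[0], edges[1:]
    if a == b:                                   # loop
        return y * tutte(rest, x, y)
    contracted = [(a if p == b else p, a if q == b else q) for p, q in rest]
    if not connected(a, b, rest):                # bridge
        return x * tutte(contracted, x, y)
    return tutte(rest, x, y) + tutte(contracted, x, y)

# example: a triangle with a pendant edge (connected, m = 4 edges, n = 4 vertices)
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
m, n, beta = len(edges), len(vertices), 0.7
x = (1 + math.exp(-beta)) / (1 - math.exp(-beta))
y = math.exp(beta)
lhs = Z(vertices, edges, beta)
rhs = 2 * math.exp(-beta * (m - n + 1)) * (1 - math.exp(-beta)) ** (n - 1) * tutte(edges, x, y)
print(lhs, rhs)  # the two values agree up to floating point error
```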
References
[1] Morteza Alamgir and Ulrike von Luxburg. “Phase transition in the family of p-resistances”. In: Advances in Neural Information Processing Systems 24 (Jan. 2011).
[2] Laura Beaudin et al. “A little statistical mechanics for the graph theorist”. In: Discrete Mathematics 310.13 (2010), pp. 2037–2053. issn: 0012-365X. doi: 10.1016/j.disc.2010.03.011. url: http://www.sciencedirect.com/science/article/pii/S0012365X10000890.
[3] M. Belkin, I. Matveeva, and P. Niyogi. “Tikhonov regularization and semi-supervised learning on large graphs”. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 3. 2004, pp. iii–1000.
[4] Mikhail Belkin and Partha Niyogi. “Semi-Supervised Learning on Riemannian Manifolds”. In: Mach. Learn. 56.1–3 (June 2004), pp. 209–239. issn: 0885-6125. doi: 10.1023/B:MACH.0000033120.25363.1e. url: https://doi.org/10.1023/B:MACH.0000033120.25363.1e.
[5] Thomas Brylawski. “The Tutte Polynomial Part I: General Theory”. In: (Jan. 2011). doi: 10.1007/978-3-642-11110-5_3.
[6] Avrim Blum and Shuchi Chawla. “Learning from Labeled and Unlabeled Data using Graph Mincuts”. In: (Jan. 2002).
[7] Joanna A. Ellis-Monaghan and C. Merino. “Graph Polynomials and Their Applications I: The Tutte Polynomial”. In: Structural Analysis of Complex Networks. 2011.
[8] Geoffrey Grimmett. The Random-Cluster Model. 2006.
[9] Mark Herbster and Guy Lever. “Predicting the Labelling of a Graph via Minimum p-Seminorm Interpolation”. In: Dec. 2009.
[10] Mark Herbster, Guy Lever, and Massimiliano Pontil. “Online Prediction on Large Diameter Graphs”. In: Jan. 2008, pp. 649–656.
[11] Mark Herbster, Stephen Pasteris, and Shaona Ghosh. “Online Prediction at the Limit of Zero Temperature”. In: Advances in Neural Information Processing Systems 28. Ed. by C. Cortes et al. Curran Associates, Inc., 2015, pp. 2935–2943. url: http://papers.nips.cc/paper/5690-online-prediction-at-the-limit-of-zero-temperature.pdf.
[12] Douglas Klein and Milan Randic. “Resistance Distance”. In: Journal of Mathematical Chemistry 12 (Dec. 1993), pp. 81–95. doi: 10.1007/BF01164627.
[13] Yihui Ren, Stephen Eubank, and Madhurima Nath. “From network reliability to the Ising model: A parallel scheme for estimating the joint density of states”. In: Physical Review E 94.4 (Oct. 2016). issn: 2470-0053. doi: 10.1103/physreve.94.042125. url: http://dx.doi.org/10.1103/PhysRevE.94.042125.
[14] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. “Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions”. In: ICML. 2003.