Neural Sheaf Diffusion:
A Topological Perspective on Heterophily and Oversmoothing in GNNs
Cristian Bodnar 1, Francesco Di Giovanni 2, Benjamin P. Chamberlain 2, Pietro Liò 1, Michael M. Bronstein 2 3
Abstract
Cellular sheaves equip graphs with "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain state-of-the-art results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
1. Introduction
Graph Neural Networks (GNNs) (Sperduti, 1994; Goller & Kuchler, 1996; Gori et al., 2005; Scarselli et al., 2008; Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017; Gilmer et al., 2017) have recently become very popular in the ML community as a model of choice to deal with relational and interaction data due to their multiple successful applications in domains ranging from social science and particle physics to structural biology and drug design.

Footnotes: Work done as a research intern at Twitter. Proved the results in Section 3. 1 Department of Computer Science, University of Cambridge, Cambridge, UK. 2 Twitter, UK. 3 Department of Computer Science, University of Oxford, Oxford, UK. Correspondence to: Cristian Bodnar <cb2015@cam.ac.uk>.

Figure 1. A sheaf (G, F) illustrated for a single edge of the graph. The stalks F(v), F(u), F(e) are isomorphic to R^2. The restriction maps F_{v⊴e}, F_{u⊴e} and their adjoints move the vector features between these spaces. In practice, we learn the sheaf from data via a parametric function Φ, as indicated by the red dotted lines.
We focus on two main problems often observed in GNNs:
their poor performance in heterophilic graphs (Zhu et al.,
2020) and their oversmoothing behaviour (Nt & Maehara,
2019;Oono & Suzuki,2020). The former arises from the
fact that GNNs are usually built on the strong assumption of
homophily, i.e., that nodes tend to connect to other similar
nodes. The latter refers to a phenomenon of deeper GNNs
producing features that are too smooth to be useful.
In this work, we show that these two fundamental problems
are linked by a common cause: the underlying “geometry”
of the graph (used here in a very loose sense). When this ge-
ometry is trivial, as is typically the case, the two phenomena
described above emerge. We make these statements precise
through the lens of cellular sheaf theory (Curry,2014), a
subfield of algebraic topology (Hatcher,2000). Intuitively,
a cellular sheaf associates a vector space to each node and
edge of a graph, and a linear map between these spaces for
each incident node-edge pair (Figure 1).
In Section 4 we analyse how, by considering a hierarchy of
increasingly general sheaves, starting from a trivial one, a
diffusion equation based on the sheaf Laplacian (Hansen
& Ghrist,2019) can solve increasingly more complicated
node-classification tasks in the infinite time limit. In this
regime, we show that oversmoothing can be avoided by
equipping the graph with the right sheaf structure for a task.
In Section 5, we study the behaviour of a non-linear, parametric, and discrete version of this process. This results in a Sheaf Convolutional Network (Hansen & Gebhart, 2020) that generalises the GCN of Kipf & Welling (2017). We prove
that this discrete diffusion process does not necessarily con-
verge to the harmonic space of the sheaf Laplacian, even
when the norm of the weights is controlled from above,
in contrast to GCNs (Cai & Wang,2020;Oono & Suzuki,
2019). In other words, the weights can steer the diffusion
process to a larger extent, which is important when the
underlying sheaf is only approximately correct.
All these results are based on the properties of the harmonic
space of the sheaf Laplacian. We begin by studying these
properties from a spectral perspective in Section 3. We
provide a new Cheeger-type inequality for the spectral gap
of the sheaf Laplacian and note that these results might be
of independent interest for spectral sheaf theory.
Finally, in Section 6, we apply our theory to design simple and practical GNN models. We describe how to construct such models by learning sheaves from data. The resulting sheaf models obtain state-of-the-art results on heterophilic graphs and show strong performance on homophilic ones.
To summarise our contributions: We (1) introduce a
novel Graph ML framework based on cellular sheaves;
(2) Prove new results about the spectral properties of the
sheaf Laplacian that improve our understanding of its
harmonic space; (3) Study the linear separation power
of sheaf diffusion equations in node classification tasks
and demonstrate it can avoid oversmoothing; (4) Prove
that a nonlinear sheaf-based GCN-like operator can
steer the diffusion process, unlike in the classical case;
(5) Show how to learn sheaves from data and construct
sheaf diffusion-based GNN models that obtain state-of-
the-art results on heterophilic datasets.
2. Background
A cellular sheaf (Curry, 2014) over a graph (Figure 1) is a
mathematical object associating a space with each node and
edge in the graph and a map between these spaces for each
incident node-edge pair. We define this formally below:
Definition 1. A cellular sheaf (G, F) on an undirected graph G = (V, E) consists of:
- A vector space F(v) for each v ∈ V;
- A vector space F(e) for each e ∈ E;
- A linear map F_{v⊴e} : F(v) → F(e) for each incident node-edge pair v ⊴ e.
The vector spaces of the nodes and edges are called stalks,
while the linear maps are referred to as restriction maps.
The space formed by all the spaces associated to the nodes of the graph is called the space of 0-cochains and is denoted by C^0(G, F). Similarly, C^1(G, F), the space of 1-cochains, contains the data associated with all the edges of the graph.
Definition 2. For a sheaf (F, G) we define the space of 0-cochains C^0(G; F) := ⊕_{v∈V} F(v) and the space of 1-cochains C^1(G; F) := ⊕_{e∈E} F(e).
For a 0-cochain x ∈ C^0(G; F), we use x_v to refer to the vector in F(v) of node v, and similarly for 1-cochains. From an opinion dynamics perspective (Hansen & Ghrist, 2021), x_v can be thought of as the private opinion of node v, while F_{v⊴e} expresses how that opinion manifests publicly in a discourse space formed by F(e). It is natural to define a linear co-boundary map δ between C^0(G, F) and C^1(G, F), which measures the disagreement between all nodes in the discourse space.
Definition 3. For some arbitrary choice of orientation for each edge e := (u, v) ∈ E, the co-boundary map δ : C^0(G, F) → C^1(G, F) is given by δ(x)_e := F_{v⊴e} x_v − F_{u⊴e} x_u.
Given a cellular sheaf (G, F), one can use the co-boundary operator δ to define a Sheaf Laplacian operator associated with the sheaf (Hansen & Ghrist, 2019).
Definition 4. The sheaf Laplacian of a sheaf (G, F) is the map L_F : C^0(G, F) → C^0(G, F) given by L_F := δ^⊤ δ. Node-wise,
L_F(x)_v := Σ_{v,u⊴e} F_{v⊴e}^⊤ (F_{v⊴e} x_v − F_{u⊴e} x_u).
The sheaf Laplacian is a positive semi-definite block matrix. The diagonal blocks are L_{F,vv} = Σ_{v⊴e} F_{v⊴e}^⊤ F_{v⊴e}, while the off-diagonal blocks are L_{F,vu} = −F_{v⊴e}^⊤ F_{u⊴e}. A normalised version of this Laplacian can also be defined. Let D be the block-diagonal of L_F. Then the normalised sheaf Laplacian is Δ_F := D^{−1/2} L_F D^{−1/2}.
For simplicity, we assume from now on that all the stalks have dimension d. In that case, the sheaf Laplacian is an nd × nd real matrix, where n is the number of nodes of G. When the vector spaces are set to R and the linear maps to the identity, the underlying sheaf is trivial, d = 1, and one recovers the well-known n × n (normalised) graph Laplacian matrix.
The harmonic space of L_F is characterised by the global sections of the sheaf.
Definition 5. The global sections of a sheaf form the set H^0(G; F) := {x ∈ C^0(G; F) | F_{v⊴e} x_v = F_{u⊴e} x_u for all edges e}.
This set corresponds to the signals that agree with the restriction maps globally (i.e. along all edges of the graph). The central theorem of discrete Hodge theory says that the space formed by all these 0-cochains is isomorphic to the kernel of the sheaf Laplacian.
Theorem 6. H^0(G; F) and ker(L_F) are isomorphic.
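To make Definitions 3-5 concrete, the following minimal NumPy sketch (an illustration added here, not the authors' implementation) assembles the sheaf Laplacian of a small graph from arbitrary restriction maps, checks that it is positive semi-definite, verifies that the trivial sheaf (d = 1, identity maps) recovers the ordinary graph Laplacian, and reads off the dimension of the harmonic space described by Theorem 6.

```python
import numpy as np

def sheaf_laplacian(n, d, edges, F):
    """Assemble L_F = delta^T delta for a sheaf with d-dimensional stalks.
    edges is a list of (u, v) pairs; F[(v, e)] is the d x d restriction map F_{v <| e}."""
    L = np.zeros((n * d, n * d))
    for e, (u, v) in enumerate(edges):
        Fu, Fv = F[(u, e)], F[(v, e)]
        # Diagonal blocks accumulate F^T F; off-diagonal blocks are -F_u^T F_v (Definition 4).
        L[u*d:(u+1)*d, u*d:(u+1)*d] += Fu.T @ Fu
        L[v*d:(v+1)*d, v*d:(v+1)*d] += Fv.T @ Fv
        L[u*d:(u+1)*d, v*d:(v+1)*d] -= Fu.T @ Fv
        L[v*d:(v+1)*d, u*d:(u+1)*d] -= Fv.T @ Fu
    return L

rng = np.random.default_rng(0)
n, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle

# A random sheaf: one restriction map per incident node-edge pair.
F = {(w, e): rng.standard_normal((d, d)) for e, (u, v) in enumerate(edges) for w in (u, v)}
L = sheaf_laplacian(n, d, edges, F)
assert np.allclose(L, L.T) and np.linalg.eigvalsh(L).min() > -1e-9  # positive semi-definite

# The trivial sheaf (d = 1, identity maps) recovers the usual graph Laplacian D - A.
F1 = {(w, e): np.eye(1) for e, (u, v) in enumerate(edges) for w in (u, v)}
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
assert np.allclose(sheaf_laplacian(n, 1, edges, F1), np.diag(A.sum(1)) - A)

# dim ker(L_F) equals the dimension of the space of global sections (Theorem 6).
print("dim ker(L_F) =", int(np.sum(np.linalg.eigvalsh(L) < 1e-9)))
```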
Figure 2. Analogy between parallel transport on a sphere and transport on a discrete vector bundle. A tangent vector is moved from F(w) to F(v) to F(u) and back. In this particular case, the transport is not path-independent because the vector arrives back in a different position from where it started. In the continuous case, this happens because of the curvature of the surface. In the sheaf case, this is purely a consequence of the restriction maps.
The sheaves (G, F) with orthogonal restriction maps (i.e. F_{v⊴e} ∈ O(d), the Lie group of d × d orthogonal matrices) will play an important role in our analysis. Such sheaves are called discrete O(d)-bundles since they can be seen as a discrete equivalent of vector bundles from differential geometry (Tu, 2011). Intuitively, these are objects describing how vector spaces are attached to the points of a manifold. In our case, the role of the manifold is played by the graph. The sheaf Laplacian of a discrete O(d)-bundle is equivalent to a connection Laplacian (Singer & Wu, 2012), describing how the elements of a vector space are transported via rotations in another neighbouring vector space. This is analogous to how tangent vectors are transported across a manifold (see Figure 2).
3. Harmonic Space of Sheaf Laplacians
In this section we study general properties of the harmonic space ker(Δ_F) of the normalised sheaf Laplacian. As with the normalised graph Laplacian, the normalised sheaf Laplacian is preferred for most practical purposes because its spectrum is bounded. Since a major role in the analysis below is played by discrete vector bundles, we focus on this case. We note, though, that our results below generalise to the general linear group F_{v⊴e} ∈ GL(d), the Lie group of d × d invertible matrices, provided we can also control the norm of the restriction maps from below.
Given a discrete O(d)-bundle, the diagonal blocks of L_F take the simple form L_{F,vv} = d_v I_d, where d_v is the degree of node v. Accordingly, if a signal x̃ ∈ ker(L_F), then the signal x : v ↦ √d_v x̃_v belongs to ker(Δ_F), and similarly for the inverse transformation. In general the harmonic space may be trivial: this happens when the constraints F_{v⊴e} x_v = F_{u⊴e} x_u are not compatible with each other. Since the harmonic space will be related to the linear separation power of sheaf diffusion, we investigate below when this space is non-trivial and derive to what extent the graph structure affects that. Key to our analysis is studying transport operators induced by the restriction maps of the sheaf.
Given nodes v, u ∈ V and a path γ_{vu} = (v, v_1, ..., v_ℓ, u) from v to u, we consider a notion of transport from the stalk F(v) to the stalk F(u) via map composition:
P^γ_{vu} := (F_{u⊴e}^⊤ F_{v_ℓ⊴e}) ⋯ (F_{v_1⊴e}^⊤ F_{v⊴e}) : F(v) → F(u),
where each factor is built from the edge e joining the corresponding pair of consecutive nodes on γ. In general, then, transport maps act on node stalks and are constructed by composing single restriction maps (and their transposes) along edges.
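To illustrate the definition above (an added sketch, not from the paper), the transport along a path is just a composition of the per-edge factors F_{u⊴e}^⊤ F_{v⊴e}; comparing two different paths between the same pair of nodes directly exhibits the path dependence discussed next, and the size of such differences is exactly the quantity r appearing in Proposition 7.

```python
import numpy as np

def random_rotation(rng):
    # A 2x2 rotation matrix: a simple example of an O(d) restriction map.
    t = rng.uniform(0, 2 * np.pi)
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def transport(path, restriction):
    """Compose (F_{next <| e})^T F_{prev <| e} along consecutive nodes of `path`.
    restriction[(w, z)] holds F_{w <| e} for the edge e between w and z."""
    P = np.eye(2)
    for prev, nxt in zip(path[:-1], path[1:]):
        P = restriction[(nxt, prev)].T @ restriction[(prev, nxt)] @ P
    return P

rng = np.random.default_rng(1)
# A triangle on nodes 0, 1, 2 with an independent O(2) map per node-edge pair.
restriction = {}
for u, v in [(0, 1), (1, 2), (0, 2)]:
    restriction[(u, v)] = random_rotation(rng)
    restriction[(v, u)] = random_rotation(rng)

P_direct = transport([0, 2], restriction)          # along the edge (0, 2)
P_around = transport([0, 1, 2], restriction)       # via node 1
print("path dependence:", np.linalg.norm(P_direct - P_around))  # non-zero in general
```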
For general sheaf structures, the graph transport is path-dependent, meaning that how the vectors are transported across two nodes depends on the path between them (see Figure 2). In fact, we show that this property characterises the spectral gap of a sheaf Laplacian (we note again that, differently from the classical case, the kernel of Δ_F may be trivial, so by 'spectral gap' we refer here to the value λ^F_0 of the smallest eigenvalue of Δ_F).
Proposition 7. If F is a discrete O(d)-bundle over a connected graph and r := max_{v,u, γ_{vu}, γ'_{vu}} ||P^γ_{vu} − P^{γ'}_{vu}||, then λ^F_0 ≤ r²/2.
A consequence of the previous result is that there is always a non-trivial harmonic space (i.e. λ^F_0 = 0) if the transport maps generated by an orthogonal sheaf are path-independent. Next, we address the opposite direction.
Proposition 8. If F is a discrete O(d)-bundle over a connected graph and x ∈ H^0(G, F), then for any cycle γ based at v ∈ V we have x_v ∈ ker(P^γ_{vv} − I).
The previous proposition highlights the interplay between the graph and the sheaf structure. In fact, a simple consequence of this result is that for any cycle-free subset S ⊆ V, any connection Laplacian restricted to S always admits a non-trivial harmonic space.
A natural question connected to the previous result is
whether a Cheeger-like inequality holds in the other direc-
tion. This turns out to be the case.
Proposition 9. Let F be a discrete O(d)-bundle over a connected graph G with n nodes and let ||(P^γ_{vv} − I)x|| ≥ ε||x|| for all cycles γ_{vv}. Then λ^F_0 ≥ ε² (2 diam(G) n d_max)^{−1}.
While the bound above is of little use in practice, it shows how the spectral gap of a sheaf Laplacian is indeed related to the deviation of the transport maps from being path-independent. We note that the Cheeger-like inequality presented here is not unique, and other types of bounds on λ^F_0 have been derived by Bandeira et al. (2013).
We conclude this section by further analysing the dimen-
sionality of the harmonic space of discrete O(d)-bundles.
Lemma 10. Let F be a discrete O(d)-bundle over a connected graph G. Then dim(H^0) ≤ d, and dim(H^0) = d iff the transport is path-independent.
4. The Separation Power of Sheaf Diffusion
Preliminaries. Let G = (V, E) be a graph and consider that all nodes have features that are d-dimensional vectors x_v ∈ F(v). The features of all nodes are represented as a single vector x ∈ C^0(G; F) stacking all the individual d-dimensional vectors. Additionally, if we allow for f feature channels, everything can be represented as a matrix X ∈ R^{(nd)×f}, whose columns are vectors in C^0(G; F).
In this section, we are interested in the spatially discretised sheaf diffusion process governed by the PDE
X(0) = X,   Ẋ(t) = −Δ_F X(t).   (1)
It can be shown that in the time limit, each feature channel is projected into the harmonic space of the sheaf Laplacian ker(Δ_F) (Hansen & Ghrist, 2019). Up to a D^{1/2} normalisation, this space contains the signals that agree with the restriction maps of the sheaf along all the edges. Thus, sheaf diffusion can be seen as a global 'synchronisation' process.
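The sketch below (added for illustration, using a tiny hypothetical d = 1 sheaf with signed restriction maps) integrates Equation 1 with explicit Euler steps and confirms numerically that the features converge to the projection of X(0) onto ker(Δ_F).

```python
import numpy as np

# A toy d = 1 sheaf on a path graph 0 - 1 - 2 with signed restriction maps
# (hypothetical values chosen so that the harmonic space is non-trivial).
edges = [(0, 1), (1, 2)]
Fmaps = {(0, 0): 1.0, (1, 0): -1.0, (1, 1): 1.0, (2, 1): 1.0}  # (node, edge) -> F_{v <| e}
n = 3
L = np.zeros((n, n))
for e, (u, v) in enumerate(edges):
    L[u, u] += Fmaps[(u, e)] ** 2
    L[v, v] += Fmaps[(v, e)] ** 2
    L[u, v] -= Fmaps[(u, e)] * Fmaps[(v, e)]
    L[v, u] -= Fmaps[(v, e)] * Fmaps[(u, e)]
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(L)))
Delta = D_inv_sqrt @ L @ D_inv_sqrt                # normalised sheaf Laplacian

# Explicit Euler integration of X'(t) = -Delta X(t) (Equation 1).
rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, 2))
X = X0.copy()
for _ in range(5000):
    X = X - 0.01 * Delta @ X

# The limit is the orthogonal projection of X(0) onto ker(Delta).
eigvals, eigvecs = np.linalg.eigh(Delta)
kernel = eigvecs[:, eigvals < 1e-9]
print(np.allclose(X, kernel @ kernel.T @ X0, atol=1e-4))  # True
```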
We now analyse the ability of certain classes of sheaf Lapla-
cian operators to linearly separate the features in the limit
of the diffusion processes they induce. We consider this a
proxy for the capacity of certain diffusion processes to avoid
oversmoothing.
Definition 11. A hypothesis class H^d of sheaves with d-dimensional stalks has linear separation power over a set of graphs G if for any labelled graph G = (V, E) ∈ G, there is a sheaf (F, G) ∈ H^d that can linearly separate the classes of G in the time limit of Equation 1 for a dense subset X_F ⊆ R^{nd×f} of initial conditions.
The restriction to a set of initial conditions that is dense in
the ambient space is necessary because, in the limit, diffu-
sion behaves like a projection in the harmonic space and
there will always be degenerate initial conditions that will
yield a zero projection.
Definition 12. Consider the class of sheaves with symmetric and invertible transport maps and d-dimensional stalks:
H^d_sym := {(F, G) | F_{v⊴e} = F_{u⊴e}, det(F_{v⊴e}) ≠ 0}.
We note that for d = 1, the sheaf Laplacians induced by this class of sheaves coincide with the set of the well-known weighted graph Laplacians with strictly positive weights, which also includes the usual graph Laplacian (see proof in Appendix B). Therefore, this hypothesis class is of particular interest since it includes those graph Laplacians typically used by graph convolutional models such as GCN (Kipf & Welling, 2017) and ChebNet (Defferrard et al., 2016).
We first show that this class of sheaf Laplacians can linearly
separate the classes in binary classification settings under
certain homophily assumptions:
Proposition 13. Let G be the set of connected graphs G = (V, E) with two classes A, B ⊆ V such that for each v ∈ A, there exists u ∈ A and an edge (v, u) ∈ E. Then H^1_sym has linear separation power over G.
In contrast, under certain heterophilic conditions, this hy-
pothesis class is not powerful enough to linearly separate
the two classes no matter what the initial conditions are:
Proposition 14. Let G be the set of connected bipartite graphs G = (A, B, E), with the partitions A, B forming two classes and |A| = |B|. Then H^1_sym cannot linearly separate any graph in G for any initial conditions X(0) ∈ R^{n×f}.
We now consider a larger hypothesis class that also includes
non-symmetric relations.
Definition 15. The class of sheaves with invertible maps and d-dimensional stalks:
H^d := {(F, G) | det(F_{v⊴e}) ≠ 0}.
Proposition 16. Let G contain all the connected graphs with two classes. Then H^1 has linear separation power over G.
These results show that heterophilic settings require an
asymmetric transport of the features between neighbouring
nodes belonging to different classes. This provides a sheaf-
theoretic explanation for why a recent body of work (Yan
et al.,2021;Chien et al.,2021;Bo et al.,2021) has found
negatively-weighted edges to help in heterophilic settings.
From this perspective, negatively-signed edges constrain the product F_{v⊴e} F_{u⊴e} of an implicit underlying sheaf to be negative and hence F_{v⊴e} ≠ F_{u⊴e}. As the next result shows, in d = 1, signed relations are a particularly important way to achieve this asymmetry:
Definition 17. The class of sheaves over G with non-zero maps, one-dimensional stalks, and similarly-signed restriction maps:
H^1_+ := {(F, G) | F_{v⊴e} F_{u⊴e} > 0}.
Proposition 18. Let G be the connected graph with two nodes belonging to two different classes. Then H^1_+ cannot linearly separate the two nodes for any initial conditions X ∈ R^{2×f}.
This shows that even the simplest heterophilic graph still cannot be distinguished in this case, even with asymmetric transport maps, unless F_{v⊴e} and F_{u⊴e} have opposite signs.
So far we have only studied the effects of changing the type
of sheaves in dimension one. We now consider the effects
of adjusting the dimension of the stalks and begin by stating
a fundamental limitation of (sheaf) diffusion when d= 1.
Proposition 19. Let G be a connected graph with C ≥ 3 classes. Then H^1 cannot linearly separate any X ∈ R^{n×f}.
This is essentially a consequence of dim ker(Δ_F) being at most one in this case, as described by Lemma 10. From a GNN perspective, this means that in the infinite depth setting, sufficient stalk width (i.e., dimension) is needed in order to solve tasks involving more than two classes. Note that this notion of width given by d is different from the classical notion given by the number of feature channels f. As the result above shows, the latter has no effect on the linear separability of the classes. Next, we will see that the former does.
Definition 20. Consider the class of sheaves with diagonal invertible maps and d-dimensional stalks:
H^d_diag := {(F, G) | F_{v⊴e} is an invertible diagonal matrix}.
Proposition 21. Let G be the set of connected graphs with nodes belonging to C ≥ 3 classes. Then for d ≥ C, H^d_diag has linear separation power over G.
This proposition illustrates the benefits of using higher-
dimensional stalks, while maintaining a simple and compu-
tationally convenient class of diagonal restriction maps.
By using more complex restriction maps, we can show that
lower-dimensional stalks can be used to achieve linear sepa-
ration in the presence of even more classes.
Definition 22. The class of discrete O(d)-bundles:
H^d_orth := {(F, G) | F_{v⊴e} ∈ O(d)}.
We show that for stalks of dimension d ∈ {2, 4}, one can classify at least C = 2d classes.
Theorem 23. Let G be the class of connected graphs with C ≤ 2d classes. Then, for all d ∈ {2, 4}, H^d_orth has linear separation power over G.
This shows that orthogonal maps are able to make more
efficient use of the space available to them than diagonal
restriction maps. Our proof relies on the algebra of quater-
nions and their generalisations and also shows the result
cannot be extended to other dimensions. We note that with
the additional assumption that the graph is regular, this hy-
pothesis class can distinguish an arbitrary number of classes
(see Appendix B).
We conclude this section by highlighting that the limitations
of symmetric relations persist even in higher dimensions
and that Proposition 14 also extends to H^d_orth,sym, the class of O(d)-bundles where F_{v⊴e} = F_{u⊴e} for all v, u ⊴ e.
In summary, we showed that: (1) Sheaf diffusion on a
trivial sheaf with symmetric (scalar) restriction maps
(as implicitly used in standard graph convolutions) has
linear separation power only in certain (homophilic)
settings. (2) Dealing with heterophilic data and over-
smoothing requires non-symmetric restriction maps. (3)
Higher-dimensional stalks and more complex restric-
tion maps lead to stronger separation power.
5. Asymptotics of Sheaf Convolutions
In Section 4 we analysed the ability of sheaf diffusion to lin-
early separate the node-classes in the limit. However, when
considering a discrete, parametric and non-linear version of
this process, it is important to know how much the weights
can steer it. This is particularly relevant if the underlying
sheaf is only approximately correct for the task to be solved.
The continuous diffusion process from Equation 1 has the following Euler discretisation with unit step size:
X(t + 1) = X(t) − Δ_F X(t) = (I_{nd} − Δ_F) X(t).   (2)
Assuming X ∈ R^{nd×f1}, we can equip the right-hand side with weight matrices W1 ∈ R^{d×d}, W2 ∈ R^{f1×f2} and a non-linearity σ to arrive at the following model, originally proposed by Hansen & Gebhart (2020):
Y = σ((I_{nd} − Δ_F)(I_n ⊗ W1) X W2) ∈ R^{nd×f2},   (3)
where f1, f2 are the number of input and output feature channels, and ⊗ denotes the Kronecker product. Here, W1 multiplies from the left the vector feature of all the nodes in all channels (i.e. W1 x^i_v for all v and channels i), while W2 multiplies the features from the right and can adjust the number of feature channels, just like in GCNs.
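For concreteness, a minimal NumPy version of the layer in Equation 3 is given below (an added sketch with arbitrary placeholder shapes and a random stand-in for Δ_F, not the authors' code); it makes explicit that W1 acts on every d-dimensional stalk through a Kronecker product while W2 mixes feature channels.

```python
import numpy as np

def sheaf_conv_layer(Delta, X, W1, W2, sigma=lambda z: np.maximum(z, 0.0)):
    """One layer of Equation 3: Y = sigma((I - Delta)(I_n kron W1) X W2).
    Delta: (n*d, n*d) normalised sheaf Laplacian, X: (n*d, f1), W1: (d, d), W2: (f1, f2)."""
    nd = Delta.shape[0]
    n = nd // W1.shape[0]
    H = np.kron(np.eye(n), W1) @ X            # apply W1 to every d-dimensional stalk vector
    H = (np.eye(nd) - Delta) @ H @ W2         # diffuse and mix the feature channels
    return sigma(H)

# Placeholder sizes: n = 5 nodes, d = 2, f1 = 3 input and f2 = 4 output channels.
rng = np.random.default_rng(0)
n, d, f1, f2 = 5, 2, 3, 4
B = rng.standard_normal((n * d, n * d))
Delta = B @ B.T
Delta /= np.linalg.eigvalsh(Delta).max()      # PSD stand-in operator, not a real sheaf Laplacian
Y = sheaf_conv_layer(Delta, rng.standard_normal((n * d, f1)),
                     rng.standard_normal((d, d)), rng.standard_normal((f1, f2)))
print(Y.shape)  # (10, 4)
```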
It is natural to call this model a Sheaf Convolutional Network (SCN) since when Δ_F is the usual normalised graph Laplacian, W1 becomes a scalar and one recovers the GCN of Kipf & Welling (2017). To study this prototypical nonlinear and parametric discrete diffusion model in the limit of an infinite number of layers, we adapt the proof technique of Cai & Wang (2020) to track the sheaf Dirichlet energy as the signal is passed through multiple SCN layers.
Definition 24. The sheaf Dirichlet energy E_F(x) of a signal x ∈ C^0(G, F) is defined as
E_F(x) := x^⊤ Δ_F x = (1/2) Σ_{e:=(v,u)} ||F_{v⊴e} D_v^{−1/2} x_v − F_{u⊴e} D_u^{−1/2} x_u||²_2.
Similarly, for multiple channels the energy is trace(X^⊤ Δ_F X). It is easy to see that x ∈ ker(Δ_F) ⟺ E_F(x) = 0. Therefore, we can use the energy of a signal as a metric to measure its distance from the harmonic space.
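The quadratic form in Definition 24 can be evaluated directly from the normalised sheaf Laplacian; a short helper (added here as an illustration) is:

```python
import numpy as np

def sheaf_dirichlet_energy(Delta, X):
    """E_F(X) = trace(X^T Delta X); for a single channel this is x^T Delta x.
    Signals in ker(Delta) have zero energy, so the value measures how far a
    signal is from the harmonic space."""
    X = X.reshape(Delta.shape[0], -1)   # accept either a (nd,) vector or an (nd, f) matrix
    return float(np.trace(X.T @ Delta @ X))
```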
We begin by studying the sheaves for which the features end up asymptotically in ker(Δ_F). The first such example is that of O(d)-bundles with symmetric relations. Let λ := max((λ_min − 1)², (λ_max − 1)²), where λ_min, λ_max are the smallest and largest non-zero eigenvalues of Δ_F.
Theorem 25. Let (F, G) be an O(d)-bundle in H^d_orth,sym and assume σ is ReLU or LeakyReLU. Then E_F(Y) ≤ λ ||W1||²_2 ||W2^⊤||²_2 E_F(X).
In particular, this means that if λ ||W1||²_2 ||W2^⊤||²_2 < 1, the signal converges exponentially fast to ker(Δ_F). In some sense, this is not surprising because when we have symmetric relations along all edges (i.e. F_{v⊴e} = F_{u⊴e}), and hence the conditions in Theorem 25 are satisfied, the harmonic space ker(Δ_F) contains the same information as the kernel of the classical normalised Laplacian Δ_0.
Proposition 26. If F is an O(d)-bundle in H^d_orth,sym, then x ∈ ker(Δ_F) if and only if x_k ∈ ker(Δ_0) for all 1 ≤ k ≤ d.
Importantly, the symmetry of the relations along all edges is a necessary condition in Theorem 25. As soon as we have an asymmetric transport map, we can find an arbitrarily small linear transformation W that increases the energy.
Proposition 27. For any connected graph G and ε > 0, there exists a sheaf (G, F), a matrix W with ||W||_2 < ε and a feature vector x such that E_F((I ⊗ W)x) > E_F(x).
Beyond O(d)-bundles, we also have the following result for general sheaves with stalks of dimension d = 1, which generalises that of Cai & Wang (2020); Oono & Suzuki (2019) for GCNs:
Theorem 28. Let (F, G) be a sheaf in H^1_+ and assume σ is ReLU or LeakyReLU. Then E_F(Y) ≤ λ ||W1||²_2 ||W2^⊤||²_2 E_F(X).
As before, having positively-signed relations is a necessary condition in the non-linear case to ensure oversmoothing happens. However, in this case, the proof also holds for negatively-signed relations when using a linear model (i.e. when σ is the identity). Due to this result, we note that Propositions 14 and 19 also generalise to GCNs. Any GCN using a weighted graph Laplacian that oversmooths as in Theorem 28 cannot linearly separate more than two classes. Furthermore, such a GCN cannot separate the classes of a bipartite graph with equally-sized partitions.
We have two main takeaways: (1) Discrete sheaf diffusion models can avoid (in general) the kernel of the diffusion operator even in the deep linear case. (2) The asymmetry of the transport maps plays (again) an important role in that.
6. Sheaf Learning
In the previous sections we discussed the various advan-
tages one obtains from using the right sheaf-structure for a
particular node classification task. However, in general, this
ground truth sheaf is unknown or unspecified. Therefore,
we aim to learn the underlying sheaf from data.
6.1. Sheaf-based GNN Model
We consider the following diffusion-type equation, which contains the sheaf diffusion equation as a particular case:
Ẋ(t) = −σ(Δ_{F(t)} (I_n ⊗ W1) X(t) W2).   (4)
Crucially, the sheaf Laplacian Δ_{F(t)} is that of a sheaf (G, F(t)) that evolves over time. More specifically, the evolution of the sheaf structure is described by a learnable function of the data: (G, F(t)) = g(G, X(t); θ).
We also consider a discrete version of this equation, using a new set of weights at each layer t:
X_{t+1} = X_t − σ(Δ_{F(t)} (I ⊗ W1^t) X_t W2^t).   (5)
For both models, we use an initial MLP to compute X(0) from the raw features and a final linear layer to perform the node classification.
Overall, this represents an entirely new framework for
learning on graphs, which does not only evolve the fea-
tures at each new layer, but also evolves the underlying
‘geometry’ of the graph (i.e., the sheaf structure).
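A schematic forward pass for the discrete model in Equation 5 is sketched below in PyTorch (our illustration; `build_sheaf_laplacian` is a hypothetical stand-in for the learnable map g(G, X_t; θ), and the dummy builder simply returns an unnormalised block Laplacian with identity maps):

```python
import torch

def sheaf_diffusion_forward(X, edge_index, layers, build_sheaf_laplacian):
    """Discrete sheaf diffusion (Equation 5): the sheaf, and hence its Laplacian,
    is re-computed from the current features at every layer.
    X: (n*d, f) stalk-stacked features; layers: list of (W1, W2) with shapes (d, d) and (f, f)."""
    for W1, W2 in layers:
        d = W1.shape[0]
        n = X.shape[0] // d
        Delta = build_sheaf_laplacian(X, edge_index)   # (n*d, n*d), built from the data
        H = torch.kron(torch.eye(n), W1) @ X @ W2      # (I_n kron W1) X W2
        X = X - torch.relu(Delta @ H)                  # X_{t+1} = X_t - sigma(Delta_{F(t)} ...)
    return X

def dummy_laplacian(X, edge_index, d=2):
    # Stand-in builder for illustration only: identity maps give a block graph Laplacian.
    n = X.shape[0] // d
    L = torch.zeros(n * d, n * d)
    I = torch.eye(d)
    for u, v in edge_index:
        L[u*d:(u+1)*d, u*d:(u+1)*d] += I
        L[v*d:(v+1)*d, v*d:(v+1)*d] += I
        L[u*d:(u+1)*d, v*d:(v+1)*d] -= I
        L[v*d:(v+1)*d, u*d:(u+1)*d] -= I
    return L

X = torch.randn(6, 3)                                  # n = 3 nodes, d = 2, f = 3
layers = [(0.1 * torch.randn(2, 2), 0.1 * torch.randn(3, 3)) for _ in range(4)]
print(sheaf_diffusion_forward(X, [(0, 1), (1, 2)], layers, dummy_laplacian).shape)  # (6, 3)
```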
6.2. Learning the restriction maps
The advantage of learning a sheaf is that one does not require any sort of embedding of the nodes in an ambient space (as done e.g. in Chamberlain et al. (2021a)). Instead, everything can be learned locally. Each d × d matrix F_{v⊴e} is learned via a parametric function Φ : R^{d×2} → R^{d×d}:
F_{v⊴e:=(v,u)} = Φ(x_v, x_u).   (6)
For simplicity, the equation above uses a single feature channel, but in practice, all channels are supplied as input. This function retains the inductive bias of locality specific to GNNs since it only utilises the features of the nodes forming the edge. At the same time, it is important that this function is non-symmetric in order to be able to learn asymmetric transport maps along each edge. In what follows, we distinguish between several types of functions Φ depending on the type of matrix they learn.
Diagonal
The main advantage of this parametrization is
that fewer parameters need to be learned per edge and the
sheaf Laplacian ends up being a matrix with diagonal blocks,
which also results in fewer operations in sparse matrix mul-
tiplications. The main disadvantage is that the
d
dimensions
of the stalks do not interact.
Orthogonal
In this case, the model effectively learns a discrete vector bundle. Orthogonal matrices provide several advantages: (1) they are able to mix the various dimensions of the stalks, (2) the orthogonality constraint prevents overfitting while reducing the number of parameters, (3) they have better understood theoretical properties, and (4) the resulting Laplacians are easier to normalise numerically since the diagonal entries correspond to the degrees of the nodes. In our model, we build orthogonal matrices from a composition of Householder reflections (Mhammedi et al., 2017).
General
Finally, we consider the most general option of learning arbitrary matrices. The maximal flexibility provided by these maps can be useful, but it also comes with the danger of overfitting. At the same time, the sheaf Laplacian is more challenging to normalise numerically since one has to compute D^{−1/2} for a positive semi-definite matrix D. To perform this at scale, one has to rely on SVD, whose gradients can be infinite if D has repeated eigenvalues. Therefore, this model is more challenging to train.
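One possible parametrisation of Φ is sketched below in PyTorch (our illustration; module names and hidden sizes are arbitrary, and the orthogonal option uses a product of Householder reflections in the spirit of Mhammedi et al. (2017) rather than the exact implementation used in the experiments):

```python
import torch
import torch.nn as nn

class RestrictionMapLearner(nn.Module):
    """Phi(x_v, x_u) -> a d x d restriction map; kind is one of "diag", "orth", "general"."""
    def __init__(self, f, d, kind="orth"):
        super().__init__()
        self.d, self.kind = d, kind
        out = d if kind == "diag" else d * d
        self.mlp = nn.Sequential(nn.Linear(2 * f, 32), nn.Tanh(), nn.Linear(32, out))

    def forward(self, x_v, x_u):
        z = self.mlp(torch.cat([x_v, x_u], dim=-1))    # non-symmetric in (x_v, x_u)
        if self.kind == "diag":
            return torch.diag_embed(z)                 # diagonal maps: cheap, no mixing of dims
        M = z.view(*z.shape[:-1], self.d, self.d)
        if self.kind == "general":
            return M                                   # unconstrained maps
        # Orthogonal maps: product of d Householder reflections I - 2 v v^T / ||v||^2.
        Q = torch.eye(self.d).expand(*M.shape).clone()
        for k in range(self.d):
            v = M[..., k, :].unsqueeze(-1)
            H = torch.eye(self.d) - 2 * (v @ v.transpose(-1, -2)) / (v.transpose(-1, -2) @ v + 1e-8)
            Q = H @ Q
        return Q

phi = RestrictionMapLearner(f=3, d=2, kind="orth")
F_ve = phi(torch.randn(8, 3), torch.randn(8, 3))       # restriction maps for a batch of 8 pairs
print(torch.allclose(F_ve @ F_ve.transpose(-1, -2), torch.eye(2).expand(8, 2, 2), atol=1e-4))
```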
Computational Complexity
Treating the number of
channels and layers as constant, a typical message pass-
ing GNN scales with
O(n+m)
, where
n
is the number
of nodes and
m
is the number of edges. A sheaf diffusion
model with diagonal maps has complexity
O(d(n+m))
because the matrix multiplication required to compute the
transport maps reduces to a scalar multiplication. When
learning orthogonal or general
d×d
matrices, the complex-
ity becomes
O(d3(n+m))
because matrix multiplication
is
O(d3)
. We note that additional costs for the latter meth-
ods such as that of SVD for computing
D1/2
for general
matrices and that of parametrising the orthogonal group can
be done in
O(d3)
, so the overall complexity is not affected.
In practice, we use
1d5
and all the operations above
benefit from batched GPU computations in PyTorch (Paszke
et al.,2019) which effectively results in a constant over-
head. For learning orthogonal matrices, we rely on the
library Torch Householder (Obukhov,2021) which provides
support for fast transformations with very large batch sizes.
7. Experiments
Synthetic experiments
We first consider a simple syn-
thetic setup given by a connected bipartite graph, where the
two partitions form two equally sized classes. We sample
the features from two overlapping isotropic Gaussian distributions in order to make the classes linearly non-separable at initialisation time. From Proposition 14 we know that diffusion models using symmetric restriction maps cannot separate the classes in the limit, while a diffusion process using asymmetric maps should be able to.

Figure 3. (Left) Test accuracy as a function of diffusion time. (Right) Histogram of the learned scalar transport maps. The performance of the sheaf diffusion model is superior to that of weighted-graph diffusion. The model correctly learns to invert the features of the two classes via the transport maps.
Therefore, we consider two simple versions of the model from Equation 4, where we set W1 = I_d, W2 = I_f and σ = id. In the first, the maps F_{v⊴e:=(v,u)} are learned by a simple layer of the form σ(w^⊤[x_v || x_u]), where σ = tanh. For the second model, we use a similar layer but constrain F_{v⊴e} = F_{u⊴e} > 0, which results in a weighted graph Laplacian.
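For reference, the two sheaf-learning layers of this toy experiment can be sketched as follows (our illustration of the description above; the particular way the symmetric variant enforces F_{v⊴e} = F_{u⊴e} > 0 is one choice among many):

```python
import torch
import torch.nn as nn

class ScalarSheafLayer(nn.Module):
    """Learns scalar restriction maps F_{v <| e} = tanh(w^T [x_v || x_u]) per node-edge pair.
    With symmetric=True the two endpoints share a single strictly positive map, which
    corresponds to a weighted graph Laplacian (the symmetric baseline above)."""
    def __init__(self, f, symmetric=False):
        super().__init__()
        self.w = nn.Linear(2 * f, 1, bias=False)
        self.symmetric = symmetric

    def forward(self, x_v, x_u):
        if self.symmetric:
            # Order-invariant input and a positive output enforce F_{v <| e} = F_{u <| e} > 0.
            s = torch.sigmoid(self.w(torch.cat([x_v + x_u, x_v * x_u], dim=-1)))
            return s, s
        f_v = torch.tanh(self.w(torch.cat([x_v, x_u], dim=-1)))
        f_u = torch.tanh(self.w(torch.cat([x_u, x_v], dim=-1)))
        return f_v, f_u

x_v, x_u = torch.randn(16, 4), torch.randn(16, 4)
f_v, f_u = ScalarSheafLayer(f=4)(x_v, x_u)
print((f_v * f_u < 0).any().item())  # asymmetric maps can realise negative transport f_v * f_u
```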
Figure 3 presents the results of this experiment across five seeds. As expected, for diffusion time zero (i.e. no diffusion), we see that a linear classifier cannot separate the classes. Also as expected, the diffusion process using symmetric maps cannot perfectly fit the data. In contrast, with the more general sheaf diffusion, as more diffusion is performed and the signal approaches the harmonic space, the model gets better and the features become linearly separable. In the second subfigure we take a closer look at the sheaf that the model learns in the time limit by plotting a histogram of all the transport (scalar) maps F_{v⊴e}^⊤ F_{u⊴e}. In accordance with Propositions 16 and 18, the model learns a negative transport map for all edges. We discuss this in more detail, together with additional synthetic experiments in higher dimensions, in Appendix E.
Real-world experiments
We test our models on real-
world datasets proposed by Rozemberczki et al. (2021);
Pei et al. (2020) to evaluate heterophilic learning. These
datasets have an edge homophily coefficient
h
ranging from
h= 0.11
(very heterophilic) to
h= 0.81
(very homophilic)
and therefore offer a view of how a model performs in both
regimes. We evaluate our models on the 10 fixed splits pro-
vided by Pei et al. (2020) and report the mean accuracy and
standard deviation. Each split contains
48%/32%/20%
of
nodes per class for training, validation and testing, respec-
tively.
Texas Wisconsin Film Squirrel Chameleon Cornell Citeseer Pubmed Cora
Hom level 0.11 0.21 0.22 0.22 0.23 0.30 0.74 0.80 0.81
#Nodes 183 251 7,600 5,201 2,277 183 3,327 18,717 2,708
#Edges 295 466 26,752 198,493 31,421 280 4,676 44,327 5,278
#Classes 5 5 5 5 5 5 7 3 6
Diag-SD 85.67±6.95 88.63±2.75 37.79±1.01 54.78±1.81 68.68±1.73 86.49±7.35 77.14±1.85 89.42±0.43 87.14±1.06
O(d)-SD 85.95±5.51 89.41±4.74 37.81±1.15 56.34±1.32 68.04±1.58 84.86±4.71 76.70±1.57 89.49±0.40 86.90±1.13
Gen-SD 82.97±5.13 89.21±3.84 37.80±1.22 53.17±1.31 67.93±1.58 85.68±6.51 76.32±1.65 89.33±0.35 87.30±1.15
GGCN 84.86±4.55 86.86±3.29 37.54±1.56 55.17±1.58 71.14±1.84 85.68±6.63 77.14±1.45 89.15±0.37 87.95±1.05
H2GCN 84.86±7.23 87.65±4.98 35.70±1.00 36.48±1.86 60.11±2.15 82.70±5.28 77.11±1.57 89.49±0.38 87.87±1.20
GPRGNN 78.38±4.36 82.94±4.21 34.63±1.22 31.61±1.24 46.58±1.71 80.27±8.11 77.13±1.67 87.54±0.38 87.95±1.18
FAGCN 82.43±6.89 82.94±7.95 34.87±1.25 42.59±0.79 55.22±3.19 79.19±9.79 N/A N/A N/A
MixHop 77.84±7.73 75.88±4.90 32.22±2.34 43.80±1.48 60.50±2.53 73.51±6.34 76.26±1.33 85.31±0.61 87.61±0.85
GCNII 77.57±3.83 80.39±3.40 37.44±1.30 38.47±1.58 63.86±3.04 77.86±3.79 77.33±1.48 90.15±0.43 88.37±1.25
Geom-GCN 66.76±2.72 64.51±3.66 31.59±1.15 38.15±0.92 60.00±2.81 60.54±3.67 78.02±1.15 89.95±0.47 85.35±1.57
PairNorm 60.27±4.34 48.43±6.14 27.40±1.24 50.44±2.04 62.74±2.82 58.92±3.15 73.59±1.47 87.53±0.44 85.79±1.01
GraphSAGE 82.43±6.14 81.18±5.56 34.23±0.99 41.61±0.74 58.73±1.68 75.95±5.01 76.04±1.30 88.45±0.50 86.90±1.04
GCN 55.14±5.16 51.76±3.06 27.32±1.10 53.43±2.01 64.82±2.24 60.54±5.30 76.50±1.36 88.42±0.50 86.98±1.27
GAT 52.16±6.63 49.41±4.09 27.44±0.89 40.72±1.55 60.26±2.50 61.89±5.05 76.55±1.23 87.30±1.10 86.33±0.48
MLP 80.81±4.75 85.29±3.31 36.53±0.70 28.77±1.56 46.21±2.99 81.89±6.40 74.02±1.90 75.69±2.00 87.16±0.37
Cont Diag-SD 82.97±4.37 86.47±2.55 36.85±1.21 38.17±9.29 62.06±3.84 80.00±6.07 76.56±1.19 89.47±0.42 86.88±1.21
Cont O(d)-SD 82.43±5.95 84.50±4.34 36.39±1.37 40.40±2.01 63.18±1.69 72.16±10.40 75.19±1.67 89.12±0.30 86.70±1.24
Cont Gen-SD 83.78±6.62 85.29±3.31 37.28±0.74 52.57±2.76 66.40±2.28 84.60±4.69 77.54±1.72 89.67±0.40 87.45±0.99
BLEND 83.24±4.65 84.12±3.56 35.63±0.89 43.06±1.39 60.11±2.09 85.95±6.82 76.63±1.60 89.24±0.42 88.09±1.22
GRAND 75.68±7.25 79.41±3.64 35.62±1.01 40.05±1.50 54.67±2.54 82.16±7.09 76.46±1.77 89.02±0.51 87.36±0.96
CGNN 71.35±4.05 74.31±7.26 35.95±0.86 29.24±1.09 46.89±1.66 66.22±7.69 76.91±1.81 87.70±0.49 87.10±1.35
Table 1. Results on node classification datasets sorted by their homophily level. The first section includes discrete GNN models, while the second section includes continuous models. The top three models are coloured First, Second, Third. Our models are marked -SD.
Baselines
As baselines, we use an ample set of GNN
models that can be placed in four categories: (1) classical:
GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017); (2) models specifically
designed for heterophilic settings: GGCN (Yan et al.,2021),
Geom-GCN (Pei et al.,2020), H2GCN (Zhu et al.,2020),
GPRGNN (Chien et al.,2021), FAGCN (Bo et al.,2021),
MixHop (Abu-El-Haija et al.,2019); (3) models addressing
oversmoothing: GCNII (Chen et al.,2020), PairNorm (Zhao
& Akoglu,2020); (4) continuous models: CGNNs (Xhon-
neux et al.,2020), GRAND (Chamberlain et al.,2021b),
and BLEND (Chamberlain et al.,2021a). For (4) we fine-
tune and evaluate the models ourselves. The other results
are taken from Yan et al. (2021), except for FAGCN and
MixHop, which come from Lingam et al. (2021) and Zhu
et al. (2020), respectively. All of these were evaluated on
the same set of splits as ours.
Results
From the results in Table 1 we see that the discrete versions of our models are first in 5/6 benchmarks with high heterophily (h < 0.3) and second-ranked in the other 1/6. At the same time, the models exhibit strong performance on the homophilic graphs (Cora, Pubmed, Citeseer), being within approximately 1% of the top-performing model. The second part of Table 1 includes continuous models, which our model outperforms on 7/9 benchmarks, with particularly large improvements on Chameleon and Squirrel. Among the discrete diffusion models, the O(d) vector bundle diffusion model performs best overall, confirming the intuition that it can better avoid overfitting while also transforming the vectors in sufficiently complex ways. We also remark on the strong performance of the model learning diagonal maps.
8. Discussion and Conclusion
Sheaf Neural Networks
Sheaf Neural Networks with a
fixed sheaf Laplacian were originally introduced by Hansen
& Gebhart (2020) in a toy experimental setting. Since then,
they have remained completely unexplored. We hope that
this paper will fill this lacuna. In contrast to their work, we
provide an ample theoretical analysis justifying the use of
sheaves in Graph ML. On the practical side, we study for the
first time how sheaves can be learned from data using neural
networks. Finally, this is the first successful application of
sheaf-based neural networks in a real-world setting.
Limitations
One of the main limitations of our theoreti-
cal analysis is that it does not discuss the learnability and
generalisation properties of sheaves, only their existence.
Nonetheless, this setting was sufficient to produce many
valuable insights about heterophily and oversmoothing and
a basic understanding of what various types of sheaves can
do. Much more theoretical work remains to be done in this
direction, and we expect to see further cross-fertilization
between ML and algebraic topology research in the future.
Conclusion
In this work, we used cellular sheaf theory to
provide a novel topological perspective on heterophily and
oversmoothing in GNNs. We showed that the underlying
sheaf structure of the graph is intimately connected with
both of these important factors affecting the performance
of GNNs. To mitigate this, we proposed a new paradigm
for graph representation learning where models do not only
evolve the features at each layer but also the underlying
geometry of the graph. In practice, we demonstrated that
this framework can achieve state-of-the-art results in het-
erophilic settings.
Acknowledgements
We are grateful to Iulia Duta, Dobrik Georgiev and Jacob
Deasy for valuable comments on an earlier version of this
manuscript. CB would also like to thank the Twitter Cortex
team for making the research internship a fantastic experi-
ence. This research was supported in part by ERC Consol-
idator grant No. 724228 (LEMAN).
References
Abu-El-Haija, S., Perozzi, B., Kapoor, A., Harutyunyan, H., Alipourfard, N., Lerman, K., Steeg, G. V., and Galstyan, A. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In The Thirty-sixth International Conference on Machine Learning (ICML), 2019. URL http://proceedings.mlr.press/v97/abu-el-haija19a/abu-el-haija19a.pdf.
Bandeira, A. S., Singer, A., and Spielman, D. A. A Cheeger
inequality for the graph connection laplacian. SIAM Jour-
nal on Matrix Analysis and Applications, 34(4):1611–
1630, 2013.
Bishop, C. M. Pattern Recognition and Machine Learning
(Information Science and Statistics). Springer-Verlag,
Berlin, Heidelberg, 2006. ISBN 0387310738.
Bo, D., Wang, X., Shi, C., and Shen, H. Beyond low-
frequency information in graph convolutional networks.
In AAAI. AAAI Press, 2021.
Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral
networks and locally connected networks on graphs. In
ICLR, 2014.
Cai, C. and Wang, Y. A note on over-smoothing for graph
neural networks. arXiv:2006.13318, 2020.
Chamberlain, B. P., Rowbottom, J., Eynard, D., Di Giovanni,
F., Xiaowen, D., and Bronstein, M. M. Beltrami flow and
neural diffusion on graphs. In NeurIPS, 2021a.
Chamberlain, B. P., Rowbottom, J., Goronova, M., Webb,
S., Rossi, E., and Bronstein, M. M. Grand: Graph neural
diffusion. In ICML, 2021b.
Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and deep graph convolutional networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1725–1735. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/chen20v.html.
Chien, E., Peng, J., Li, P., and Milenkovic, O. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=n6jl7fLxrP.
Curry, J. M. Sheaves, cosheaves and applications. Univer-
sity of Pennsylvania, 2014.
Defferrard, M., Bresson, X., and Vandergheynst, P. Con-
volutional neural networks on graphs with fast localized
spectral filtering. In NIPS, 2016.
Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y.,
and Bresson, X. Benchmarking graph neural networks.
arXiv:2003.00982, 2020.
Ghrist, R. and Riess, H. Cellular sheaves of lattices and the
tarski laplacian. arXiv:2007.04099, 2020.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and
Dahl, G. E. Neural message passing for quantum chem-
istry. In ICML, 2017.
Goller, C. and Kuchler, A. Learning task-dependent
distributed representations by backpropagation through
structure. In ICNN, 1996.
Gori, M., Monfardini, G., and Scarselli, F. A new model for
learning in graph domains. In IJCNN, 2005.
Hamilton, W. L., Ying, R., and Leskovec, J. Representation
learning on graphs: Methods and applications. IEEE
Data Engineering Bulletin, 2017.
Hansen, J. and Gebhart, T. Sheaf neural networks. In
NeurIPS 2020 Workshop on Topological Data Analysis
and Beyond, 2020.
Hansen, J. and Ghrist, R. Toward a spectral theory of cellular
sheaves. Journal of Applied and Computational Topology,
3(4):315–358, 2019.
Hansen, J. and Ghrist, R. Opinion dynamics on discourse
sheaves. SIAM Journal on Applied Mathematics, 81(5):
2033–2060, 2021.
Hatcher, A. Algebraic topology. Cambridge Univ. Press,
Cambridge, 2000.
Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. In ICLR, 2015.
Kipf, T. N. and Welling, M. Semi-supervised classification
with graph convolutional networks. In ICLR, 2017.
Lee, H. D. On some matrix inequalities. Korean Journal of
Mathematics, 16(4):565–571, 2008.
Lingam, V., Ragesh, R., Iyer, A., and Sellamanickam, S.
Simple truncated svd based model for node classification
on heterophilic graphs. arXiv preprint arXiv:2106.12807,
2021.
Mhammedi, Z., Hellicar, A., Rahman, A., and Bailey, J.
Efficient orthogonal parametrisation of recurrent neu-
ral networks using householder reflections. In Interna-
tional Conference on Machine Learning, pp. 2401–2409.
PMLR, 2017.
Nt, H. and Maehara, T. Revisiting graph neural net-
works: all we have is low pass filters. arXiv preprint
arXiv:1812.08434v4, 2019.
Obukhov, A. Efficient householder transformation in pytorch, 2021. URL www.github.com/toshas/torch-householder. Version: 1.0.1, DOI: 10.5281/zenodo.5068733.
Oono, K. and Suzuki, T. Graph neural networks expo-
nentially lose expressive power for node classification.
arXiv:1905.10947, 2019.
Oono, K. and Suzuki, T. Graph neural networks exponen-
tially lose expressive power for node classification. In
International Conference on Learning Representations,
2020.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B.
Geom-gcn: Geometric graph convolutional networks.
arXiv preprint arXiv:2002.05287, 2020.
Rozemberczki, B., Allen, C., and Sarkar, R. Multi-scale at-
tributed node embedding. Journal of Complex Networks,
9(2):cnab014, 2021.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and
Monfardini, G. The graph neural network model. IEEE
Trans. Neural Networks, 20(1):61–80, 2008.
Schafer, R. D. An introduction to nonassociative algebras.
Courier Dover Publications, 2017.
Singer, A. and Wu, H.-T. Vector diffusion maps and the con-
nection laplacian. Communications on pure and applied
mathematics, 65(8):1067–1144, 2012.
Sperduti, A. Encoding labeled graphs by labeling RAAM.
In NIPS, 1994.
Tu, L. W. Manifolds. In An Introduction to Manifolds, pp.
47–83. Springer, 2011.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
Xhonneux, L.-P., Qu, M., and Tang, J. Continuous graph
neural networks. In ICML, 2020.
Yan, Y., Hashemi, M., Swersky, K., Yang, Y., and Koutra,
D. Two sides of the same coin: Heterophily and
oversmoothing in graph convolutional neural networks.
arXiv:2102.06462, 2021.
Zhao, L. and Akoglu, L. Pairnorm: Tackling oversmoothing in gnns. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkecl1rtwB.
Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., and
Koutra, D. Beyond homophily in graph neural networks:
Current limitations and effective designs. Advances in
Neural Information Processing Systems, 33, 2020.
A. Harmonic Space Proofs
Proof of Proposition 7. We first note that on a discrete O(d)-bundle the degree operator is D_v = d_v I, since by orthogonality F_{v⊴e}^⊤ F_{v⊴e} = I. We can use the Rayleigh quotient to characterise λ^F_0 as
λ^F_0 = min_{x ∈ R^{nd}} ⟨x, Δ_F x⟩ / ||x||².
Fix v ∈ V and choose a minimal path γ_{vu} for all u ∈ V. For an arbitrary non-zero z̃_v, consider the signal z̃_u = P^γ_{vu} z̃_v and set z_u = √d_u z̃_u. Then
||F_{u⊴e} z_u/√d_u − F_{w⊴e} z_w/√d_w||² = ||z̃_u − (F_{u⊴e}^⊤ F_{w⊴e}) z̃_w||² = ||P^γ_{vu} z̃_v − (F_{u⊴e}^⊤ F_{w⊴e}) P^γ_{vw} z̃_v||²,
where we have again used that the maps are orthogonal. Since (F_{u⊴e}^⊤ F_{w⊴e}) P^γ_{vw} = P^{γ'}_{vu}, we find that the right-hand side can be bounded from above by r² ||z̃_v||². Therefore, by using Definition 24 we finally obtain
λ^F_0 = min_{x ∈ R^{nd}} ⟨x, Δ_F x⟩ / ||x||² ≤ ⟨z, Δ_F z⟩ / ||z||² = (1/2) Σ_{u∼w} ||F_{u⊴e} z_u/√d_u − F_{w⊴e} z_w/√d_w||² / ||z||² ≤ (r²/2) Σ_{u∼w} ||z̃_v||² / ||z||².
Since the transport maps are all orthogonal we get
||z||² = Σ_u d_u ||P^γ_{vu} z̃_v||² = Σ_u d_u ||z̃_v||² = Σ_{u∼w} ||z̃_v||².
We conclude that
λ^F_0 ≤ (r²/2) Σ_{u∼w} ||z̃_v||² / ||z||² = r²/2.
Proof of Proposition 8. Assume that x ∈ H^0(G, F) and consider v ∈ V and any cycle based at v, denoted by γ_{vv} = (v_0 = v, v_1, ..., v_L = v). According to Theorem 6 we have that
F_{v_{i+1}⊴e} x_{v_{i+1}} = F_{v_i⊴e} x_{v_i}  ⟹  x_{v_{i+1}} = (F_{v_{i+1}⊴e}^⊤ F_{v_i⊴e}) x_{v_i} := ρ_{v_i v_{i+1}} x_{v_i}.
By composing all the maps we find
x_v = ρ_{v_{L−1} v_L} ⋯ ρ_{v_0 v_1} x_v = P^γ_{vv} x_v,
which completes the proof.
Proof of Proposition 9. If ε = 0 there is nothing to prove. Assume that ε > 0. By Proposition 8 we derive that the harmonic space is trivial and hence λ^F_0 > 0. Consider a unit eigenvector x ∈ ker(Δ_F − λ^F_0 I) and let v ∈ V be such that ||x_v|| ≥ ||x_u|| for all u ≠ v. There exists a cycle γ based at v such that P^γ_{vv} x_v ≠ x_v, for otherwise we could extend x_v ≠ 0 to any other node independently of the path choice and hence find a non-trivial harmonic signal. In particular, we can assume this cycle to be non-degenerate: if there existed a non-trivial degenerate loop contained in γ that does not fix x, we could consider this loop instead of γ for our argument. Let us write this path as (v_0 = v, v_1, ..., v_L = v) and consider the rescaled signal x̃ with x̃_v √d_v = x_v. By assumption we have
ε||x̃_v|| ≤ ||(P^γ_{vv} − I) x̃_v|| = ||(ρ_{v_{L−1}v_L} ⋯ ρ_{v_0v_1} − I) x̃_v||
= ||F_{v_{L−1}⊴e} ρ_{v_{L−2}v_{L−1}} ⋯ ρ_{v_0v_1} x̃_v − F_{v_L=v⊴e} x̃_v||
= ||F_{v_{L−1}⊴e} ρ_{v_{L−2}v_{L−1}} ⋯ ρ_{v_0v_1} x̃_v − F_{v_{L−1}⊴e} x̃_{v_{L−1}} + F_{v_{L−1}⊴e} x̃_{v_{L−1}} − F_{v_L=v⊴e} x̃_v||
≤ ||ρ_{v_{L−2}v_{L−1}} ⋯ ρ_{v_0v_1} x̃_v − x̃_{v_{L−1}}|| + ||F_{v_{L−1}⊴e} x̃_{v_{L−1}} − F_{v_L=v⊴e} x̃_v||.
By iterating the approach above we find:
ε||x̃_v|| ≤ Σ_{i=0}^{L−1} ||F_{v_i⊴e} x̃_{v_i} − F_{v_{i+1}⊴e} x̃_{v_{i+1}}|| ≤ √L (Σ_{i=0}^{L−1} ||F_{v_i⊴e} x̃_{v_i} − F_{v_{i+1}⊴e} x̃_{v_{i+1}}||²)^{1/2} = √L (Σ_{i=0}^{L−1} ||F_{v_i⊴e} x_{v_i}/√d_{v_i} − F_{v_{i+1}⊴e} x_{v_{i+1}}/√d_{v_{i+1}}||²)^{1/2}.
From Definition 24 we derive that the last term can be bounded from above by √(2L E_F(x)) = √(2L ⟨x, Δ_F x⟩). Therefore, we conclude:
ε ||x_v||/√d_v ≤ √(2L ⟨x, Δ_F x⟩) = √(2L λ^F_0) ||x|| ≤ 2 √(diam(G) λ^F_0).
By construction we get ||x_v|| ≥ 1/√n, meaning that
λ^F_0 ≥ (ε²/(2 diam(G))) · (1/(n d_max)).
Proof of Lemma 10. We first note that the argument below extends to weighted O(d)-bundles as well. Let x ∈ H^0(G, F). According to Proposition 8, given v, u ∈ V, we see that x_u = P^γ_{vu} x_v for any path γ_{vu}. It means that a harmonic signal is uniquely determined by the choice of x_v ∈ F(v). Explicitly, given any cycle γ based at v, we know that x_v ∈ ker(P^γ_{vv} − I). If the transport is everywhere path-independent, then this kernel coincides with the whole stalk F(v) ≅ R^d, and hence we can extend any basis {x^i_v} of F(v) to a basis of H^0(G, F) via the transport maps, i.e. dim(H^0(G, F)) = d. If instead there exists a transport map over a cycle γ_{vv} that does not fix the whole stalk, then ker(P^γ_{vv} − I) ⊊ F(v) ≅ R^d and hence dim(H^0(G, F)) < d.
B. Proofs for the Power of Sheaf Diffusion
Definition 29. Let G = (V, W) be a weighted graph, where W is a matrix with w_{vu} = w_{uv} ≥ 0 for all v ≠ u ∈ V, w_{vv} = 0 for all v ∈ V, and (v, u) is an edge if and only if w_{vu} > 0.
The graph Laplacian of a weighted graph is L = D − W, where D is the diagonal matrix of weighted degrees (i.e. d_v = Σ_u w_{vu}). Its normalised version is L̃ = D^{−1/2} L D^{−1/2}.
Proposition 30. Let G be a graph. The set {L_F | (G, F) ∈ H^1_sym} is isomorphic to the set of all possible weighted graph Laplacians over G.
Proof. We prove only one direction. Let W be a choice of valid weight matrix for the graph G. We can construct a sheaf (G, F) ∈ H^1_sym such that for all edges v, u ⊴ e we have F_{v⊴e} = F_{u⊴e} = ±√w_{vu}. Then L_{vu} = −w_{vu} and L_{vv} = Σ_e ||F_{v⊴e}||² = Σ_u w_{vu}. The equality for the normalised versions of the Laplacians follows directly.
We state the following Lemma without proof, based on Theorem 3.1 in Hansen & Ghrist (2021).
Lemma 31. Solutions X(t) of the diffusion in Equation 1 converge as t → ∞ to the orthogonal projection of X(0) onto ker(Δ_F).
Due to this Lemma, the proofs below rely entirely on the structure of ker(Δ_F) that one obtains for certain (G, F).
Proof of Proposition 13. Let G = (V, E) be a graph with two classes A, B ⊆ V such that for each v ∈ A, there exists u ∈ A and an edge (v, u) ∈ E. Additionally, let x(0) be any channel of the feature matrix X(0) ∈ R^{n×f}.
We can construct a sheaf (F, G) ∈ H^1_sym as follows. For all nodes v ∈ V and edges e ∈ E, F(v) = F(e) = R. For all v, u ∈ A and edges (v, u) ∈ E, set F_{v⊴e} = F_{u⊴e} = α > 0. Otherwise, set F_{v⊴e} = 1.
Denote by h_v the number of neighbours of node v in the same class as v. Note that, based on the assumptions, h_v ≥ 1 if v ∈ A. Then the only harmonic eigenvector of Δ_F is:
a_v = √(d_v + h_v(α² − 1)) for v ∈ A, and a_v = √d_v for v ∈ B.   (7)
Denote its unit-normalised version ã := a/||a||. In the limit of the diffusion process, the features converge to h = ⟨x(0), ã⟩ ã by Lemma 31. Assuming x(0) ∉ ker(Δ_F)^⊥, which is nowhere dense in R^n, and, without loss of generality, that ⟨x(0), ã⟩ > 0, for sufficiently large α we have ã_v > ã_u for all v ∈ A, u ∈ B.
Proof of Proposition 14. Let G = (A, B, E) be a bipartite graph with |A| = |B| and let x(0) ∈ R^n be any channel of the feature matrix X(0) ∈ R^{n×f}.
Consider an arbitrary sheaf (G, F) ∈ H^1_sym. Since the graph is connected, the only harmonic eigenvector of Δ_F is y ∈ R^n with y_v = √(Σ_{v⊴e} ||F_{v⊴e}||²) (i.e. the square root of the weighted degree). Based on Lemma 31, the diffusion process converges in the limit (up to a scaling) to ⟨x, y⟩ y. For the features to be linearly separable we require that ⟨x, y⟩ ≠ 0 and, without loss of generality, for all v ∈ A, u ∈ B, that y_v < y_u ⟺ Σ_{v⊴e} ||F_{v⊴e}||² < Σ_{u⊴e} ||F_{u⊴e}||².
Suppose for the sake of contradiction that there exists a sheaf in H^1_sym with such a harmonic eigenvector. Then, because |A| = |B|:
Σ_{v∈A} Σ_{v⊴e} ||F_{v⊴e}||² < Σ_{u∈B} Σ_{u⊴e} ||F_{u⊴e}||²  ⟺  Σ_{v∈A} Σ_{v⊴e} ||F_{v⊴e}||² − Σ_{u∈B} Σ_{u⊴e} ||F_{u⊴e}||² < 0  ⟺  Σ_{e∈E} (||F_{v⊴e}||² − ||F_{u⊴e}||²) < 0.
However, because (F, G) ∈ H^1_sym, we have F_{v⊴e} = F_{u⊴e} and the sum above is zero.
Proof of Proposition 16. Let G = (V, E) be a connected graph with two classes A, B ⊆ V. Additionally, let x(0) be any channel of the feature matrix X(0) ∈ R^{n×f}.
We can construct a sheaf (F, G) ∈ H^1(G) as follows. For all nodes v and edges e, F(v) = F(e) = R. For all v ∈ A, set F_{v⊴e} = α. For all u ∈ B, set F_{u⊴e} = β. Additionally, let α < 0 < β.
Since the graph is connected, by Lemma 10, the only harmonic eigenvector of Δ_F is:
y_v = β √(Σ_{v⊴e} α²) = β|α|√d_v for v ∈ A,  and  y_v = α √(Σ_{v⊴e} β²) = α|β|√d_v for v ∈ B.   (8)
Assume x(0) ∉ ker(Δ_F)^⊥, which is nowhere dense in R^n, and, without loss of generality, that ⟨x(0), y⟩ > 0. Then y_v > 0 > y_u for all v ∈ A, u ∈ B.
Proof of Proposition 18. Let G be the connected graph with two nodes V = {v, u}. Then any sheaf (F, G) ∈ H^1_+(G) has restriction maps of the form F_{v⊴e} = α, F_{u⊴e} = β with (without loss of generality) α, β > 0. As before, the only (unnormalised) harmonic eigenvector for a sheaf of this form is y = (|α|β, α|β|) = (αβ, αβ). Since this is a constant vector, the two nodes are not separable in the diffusion limit.
We state the following result without proof (see Exercise 4.1 in Bishop (2006)).
Lemma 32. Let A and B be two sets of points in R^n. If their convex hulls intersect, the two sets of points cannot be linearly separated.
Proof of Proposition 19. If the sheaf has only a trivial global section, then all features converge to zero in the diffusion limit. Suppose H^0(G, F) is non-trivial. Since G is connected and all the restriction maps are invertible, by Lemma 10, dim(H^0) = 1.
In that case, let h be the unit-normalised harmonic eigenvector of Δ_F. By Lemma 31, for any node v, its scalar feature in channel k ≤ f is given by x^k_v(∞) = ⟨x^k(0), h⟩ h_v. Note that we can always find three nodes v, u, w belonging to three different classes such that h_v ≤ h_u ≤ h_w. Then, there exists a convex combination h_u = α h_v + (1 − α) h_w, with α ∈ [0, 1]. Therefore:
x^k_u(∞) = ⟨x^k(0), h⟩ h_u = α⟨x^k(0), h⟩ h_v + (1 − α)⟨x^k(0), h⟩ h_w = α x^k_v(∞) + (1 − α) x^k_w(∞).   (9)
Since this is true for all channels k ≤ f, it follows that x_u(∞) = α x_v(∞) + (1 − α) x_w(∞). Because x_u(∞) is in the convex hull of the points belonging to the other classes, by Lemma 32, the class of u is not linearly separable from the other classes.
Proof of Proposition 21. Let G = (V, E) be a connected graph with C classes and (F, G) an arbitrary sheaf in H^C_diag(G). Because F has diagonal restriction maps, there is no interaction during diffusion between the different dimensions of the stalks. Therefore, the diffusion process can be written as d independent diffusion processes, where the i-th process uses a sheaf F_i with all stalks isomorphic to R and F_{i,v⊴e} = F_{v⊴e}(i, i) for all v ∈ V and incident edges e. Therefore, we can construct d sheaves F_i ∈ H^1(G) with i ≤ d as in Proposition 16, where (in a one-vs-all fashion) the two classes are given by the nodes in class i and the nodes belonging to the other classes.
It remains to require that the projection of x(0) on any of the harmonic eigenvectors of Δ_F in the standard basis is non-zero. Formally, we require x_i(0) ∉ ker(Δ_{F_i})^⊥ for all positive integers i ≤ d. Since ker(Δ_{F_i})^⊥ is nowhere dense in R^n, x(0) belongs to a direct sum of dense subspaces, which is dense.
Lemma 33. Let $G = (V, E)$ be a graph and $(\mathcal{F}, G)$ a (weighted) orthogonal vector bundle over $G$ with path-independent parallel transport and edge weights $\alpha_e$. Consider an arbitrary node $v^* \in V$ and denote by $e_i$ the $i$-th standard basis vector of $\mathbb{R}^d$. Then $\{h^1, \ldots, h^d\}$ form an orthogonal eigenbasis for the harmonic space of $\Delta_\mathcal{F}$, where:
$$h^i_w = \begin{cases} e_i\sqrt{d^\mathcal{F}_{v^*}} = e_i\sqrt{\sum_{v^*\trianglelefteq e}\alpha^2_e}, & w = v^*\\ P_{v^*\to w}\,e_i\sqrt{d^\mathcal{F}_w} = P_{v^*\to w}\,e_i\sqrt{\sum_{w\trianglelefteq e}\alpha^2_e}, & \text{otherwise}\end{cases}\tag{10}$$
Proof. First, we show that $h^i$ is harmonic:
$$E_\mathcal{F}(h^i) = \tfrac{1}{2}\sum_{v,u,\,e:=(v,u)}\Big\|\tfrac{1}{\sqrt{d^\mathcal{F}_v}}\mathcal{F}_{v\trianglelefteq e}h^i_v - \tfrac{1}{\sqrt{d^\mathcal{F}_u}}\mathcal{F}_{u\trianglelefteq e}h^i_u\Big\|^2_2\tag{11}$$
$$= \tfrac{1}{2}\sum_{v,u,\,e:=(v,u)}\big\|\mathcal{F}_{v\trianglelefteq e}P_{v^*\to v}e_i - \mathcal{F}_{u\trianglelefteq e}P_{v^*\to u}e_i\big\|^2_2\tag{12}$$
$$= \tfrac{1}{2}\sum_{v,u,\,e:=(v,u)}\big\|\mathcal{F}_{v\trianglelefteq e}P_{u\to v}P_{v^*\to u}e_i - \mathcal{F}_{u\trianglelefteq e}P_{v^*\to u}e_i\big\|^2_2\qquad\text{by path independence}\tag{13}$$
$$= \tfrac{1}{2}\sum_{v,u,\,e:=(v,u)}\big\|\mathcal{F}_{v\trianglelefteq e}\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{u\trianglelefteq e}P_{v^*\to u}e_i - \mathcal{F}_{u\trianglelefteq e}P_{v^*\to u}e_i\big\|^2_2\qquad\text{by definition of }P_{u\to v}\tag{14}$$
$$= \tfrac{1}{2}\sum_{v,u,\,e:=(v,u)}\big\|\mathcal{F}_{u\trianglelefteq e}P_{v^*\to u}e_i - \mathcal{F}_{u\trianglelefteq e}P_{v^*\to u}e_i\big\|^2_2 = 0\qquad\text{orthogonality of }\mathcal{F}_{v\trianglelefteq e}\tag{15}$$
For orthogonality, notice that for any $i \neq j \leq d$ and $v \in V$, it holds that:
$$\langle h^i_v, h^j_v\rangle = \Big\langle P_{v^*\to v}e_i\sqrt{d^\mathcal{F}_v},\, P_{v^*\to v}e_j\sqrt{d^\mathcal{F}_v}\Big\rangle = \sqrt{d^\mathcal{F}_v}\sqrt{d^\mathcal{F}_v}\,\langle e_i, e_j\rangle = 0\tag{16}$$
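As a sanity check for Lemma 33, the sketch below (our own illustration, not the authors' code, with unit edge weights $\alpha_e = 1$) builds a small $O(2)$-bundle with path-independent transport by attaching an orthogonal map $O_v$ to every node, so that $\mathcal{F}_{v\trianglelefteq e} = O_v$ on every incident edge and $P_{u\to v} = O_v^\top O_u$ depends only on the endpoints, and verifies numerically that the vectors $h^i$ of Eq. (10) lie in the kernel of the normalised sheaf Laplacian.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n, d = 4, 2

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

O = [rot(rng.uniform(0, 2 * np.pi)) for _ in range(n)]    # F_{v <| e} := O_v on every incident edge
deg = np.zeros(n)
L = np.zeros((n * d, n * d))
for v, u in edges:
    deg[v] += 1
    deg[u] += 1
    L[v*d:(v+1)*d, v*d:(v+1)*d] += O[v].T @ O[v]          # = I
    L[u*d:(u+1)*d, u*d:(u+1)*d] += O[u].T @ O[u]
    L[v*d:(v+1)*d, u*d:(u+1)*d] -= O[v].T @ O[u]
    L[u*d:(u+1)*d, v*d:(v+1)*d] -= O[u].T @ O[v]

D_inv_sqrt = np.kron(np.diag(1.0 / np.sqrt(deg)), np.eye(d))
Delta = D_inv_sqrt @ L @ D_inv_sqrt                        # normalised sheaf Laplacian

v_star, e_basis = 0, np.eye(d)
for i in range(d):
    # h^i_w = P_{v* -> w} e_i sqrt(d_w), with P_{v* -> w} = O_w^T O_{v*}
    h = np.concatenate([O[w].T @ O[v_star] @ e_basis[i] * np.sqrt(deg[w]) for w in range(n)])
    print(np.allclose(Delta @ h, 0.0))                     # True for each i
```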
Lemma 34. Let $R_1, R_2$ be two 2D rotation matrices and $e_1, e_2$ the two standard basis vectors of $\mathbb{R}^2$. Then $\langle R_1e_1, R_2e_2\rangle = -\langle R_1e_2, R_2e_1\rangle$.
Proof. The angle between $e_1$ and $e_2$ is $\tfrac{\pi}{2}$. Letting $\phi, \theta$ be the positive rotation angles of the two matrices, the first inner product is equal to $\cos(\pi/2 + (\phi - \theta))$ while the second is $\cos(\pi/2 - (\phi - \theta))$. The result follows from applying the trigonometric identity $\cos(\pi/2 + x) = -\sin(x)$.
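A one-line numerical check of this identity (illustrative only, with randomly chosen rotation angles):

```python
import numpy as np

rot = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
rng = np.random.default_rng(0)
R1, R2 = rot(rng.uniform(0, 2 * np.pi)), rot(rng.uniform(0, 2 * np.pi))
e1, e2 = np.eye(2)
print(np.isclose(np.dot(R1 @ e1, R2 @ e2), -np.dot(R1 @ e2, R2 @ e1)))  # True
```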
Figure 4. Proof sketch for Lemma 35 and Proposition 36. (a) Aligning the features of each class with the axes of coordinates in a 2D space. Dotted lines indicate linear decision boundaries for each class. (b) Separating an arbitrary number of classes when the graph is regular. The dotted line shows an example decision boundary for one of the classes.
We first prove Theorem 23 in dimension two in the following lemma and then we will look at the general case.
Lemma 35. Let $\mathcal{G}$ be the class of connected graphs with $C \leq 4$ classes. Then, $\mathcal{H}^2_{\mathrm{orth}}(\mathcal{G})$ has linear separation power over $\mathcal{G}$.
Proof. Idea: we can use rotation matrices to align the harmonic features of the classes with the axes of coordinates as in Figure 4a. Then, for each side of each axis, we can find a hyperplane separating each class from all the others.
Let $G$ be a connected graph with $C \leq 4$ classes. Denote by $\mathcal{P}$ the following set of rotation matrices together with their sign-flipped counterparts:
$$R_1 = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\qquad R_2 = \begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\tag{17}$$
and by $\mathcal{C} = \{1, \ldots, C\}$ the set of all class labels. Then, fix a node $v^* \in V$ and construct an injective map $g: \mathcal{C} \to \mathcal{P}$ assigning to each class label one of the signed matrices such that $g(c(v^*)) = R_1$, where $c(v^*)$ denotes the class of node $v^*$.
Then, we can construct a sheaf $(G, \mathcal{F}) \in \mathcal{H}^2_{\mathrm{orth}}(G)$ in terms of certain parallel transport maps along each edge, which will depend on $\mathcal{P}$. For all nodes $v$ and edges $e$, $\mathcal{F}(v) = \mathcal{F}(e) = \mathbb{R}^2$. For each $u \in V$, we set $P_{v^*\to u} = g(c(u))$. Then, for all $v, u \in V$, set $P_{v\to u} = P_{v^*\to u}P^{-1}_{v^*\to v}$. It is easy to see that the resulting parallel transport is path-independent because it depends purely on the classes of the endpoints of the path.
Based on Lemma 33, the $i$-th eigenvector of $\Delta_\mathcal{F}$ is $h^i \in \mathbb{R}^{2\times n}$ with $h^i_u = P_{v^*\to u}e_i\sqrt{d_u}$. Now we will show that the projection of $x(0)$ on this subspace will have a configuration as in Figure 4a, up to a rotation.
Let $u, w$ be two nodes belonging to two different classes. Denote by $\alpha_i = \langle x(0), h^i\rangle$. Then the inner product between the features of nodes $u, w$ in the limit of the diffusion process is:
$$\Big\langle P_{v^*\to u}\sum_i\alpha_ie_i\sqrt{d_u},\; P_{v^*\to w}\sum_j\alpha_je_j\sqrt{d_w}\Big\rangle = \sqrt{d_ud_w}\Big[\sum_{i\neq j}\alpha_i\alpha_j\langle P_{v^*\to u}e_i, P_{v^*\to w}e_j\rangle + \sum_k\alpha^2_k\langle P_{v^*\to u}e_k, P_{v^*\to w}e_k\rangle\Big]$$
$$= \sqrt{d_ud_w}\Big[\sum_{i<j}\alpha_i\alpha_j\big(\langle P_{v^*\to u}e_i, P_{v^*\to w}e_j\rangle + \langle P_{v^*\to u}e_j, P_{v^*\to w}e_i\rangle\big) + \sum_k\alpha^2_k\langle P_{v^*\to u}e_k, P_{v^*\to w}e_k\rangle\Big]\tag{18}$$
$$= \sqrt{d_ud_w}\sum_k\alpha^2_k\langle P_{v^*\to u}e_k, P_{v^*\to w}e_k\rangle\qquad\text{(by Lemma 34)}$$
It can be checked that by substituting the transport maps $P_{v^*\to u}, P_{v^*\to w}$ with any $R_a, R_b$ from $\mathcal{P}$ such that $R_a \neq \pm R_b$, the inner product above is zero. Similarly, substituting any $R_a = R_b$, the inner product is $\sqrt{d_ud_w}\sum_k\alpha^2_k = \sqrt{d_ud_w}\|x(0)\|^2$, which is equal to the product of the norms of the two vectors. Therefore, the diffused features of different classes are positioned at angles $\tfrac{\pi}{2}, \pi, \tfrac{3\pi}{2}$ from each other, as in Figure 4a.
Proof of Theorem 23. To generalise the proof, we need to find a set $\mathcal{P}$ of size $d$ containing rotation matrices that make the projected features of different classes pairwise orthogonal for any projection coefficients $\alpha$. For that, each term in Equation 18 must be zero for any coefficients $\alpha$. Therefore, $\mathcal{P} = \{P_0, \ldots, P_{d-1}\}$ must satisfy the following requirements:
1. $P_0 = I \in \mathcal{P}$, since the transport for neighbours in the same class must be the identity. Therefore, $P_0P_k = P_kP_0 = P_k$ for all $k$.
2. Since $\langle P_0e_i, P_ke_i\rangle = 0$ for all $i$ and $k \neq 0$, it follows that the diagonal elements of $P_k$ are zero.
3. From $\langle P_0e_i, P_ke_j\rangle = -\langle P_0e_j, P_ke_i\rangle$ for all $i \neq j$, $k \neq 0$ and point (2), it follows that $P^{-1}_k = P^\top_k = -P_k$. Therefore, $P_kP_k = -I$ for all $k \neq 0$.
4. We have $\langle P_ke_i, P_le_i\rangle = 0$ for all $i$ and $k \neq l$. Together with (3), it follows that the diagonal elements of $P_kP_l$ are zero.
5. We have $\langle P_ke_i, P_le_j\rangle = -\langle P_ke_j, P_le_i\rangle$ for all $i \neq j$ and $k \neq l$, with $k, l \neq 0$. Together with point (4), it follows that $(P_kP_l)^\top = -P_kP_l$. Similarly, from point (3) we have that $(P_kP_l)^\top = P^\top_lP^\top_k = (-P_l)(-P_k) = P_lP_k$. Therefore, the two matrices are anti-commutative: $P_kP_l = -P_lP_k$.
We remark that points (1), (3), (5) coincide with the defining algebraic properties of the algebra of complex numbers, quaternions, octonions, sedenions and their generalisations based on the Cayley-Dickson construction (Schafer, 2017). Therefore, the matrices in $\mathcal{P}$ must be a representation of one of these algebras. Firstly, such algebras exist only for $d$ that are powers of two. Secondly, matrix representations for these algebras exist only in dimensions two and four. This is because the algebra of octonions and their generalisations, unlike matrix multiplication, is non-associative. As a sanity check, note that the matrices $R_1, R_2$ from Lemma 35 are a classic representation of the unit complex numbers.
We conclude this section by giving the matrices for $d = 4$, which are the real matrix representations of the four unit quaternions:
$$R_1 = \begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix},\quad R_2 = \begin{pmatrix}0&-1&0&0\\1&0&0&0\\0&0&0&-1\\0&0&1&0\end{pmatrix},\quad R_3 = \begin{pmatrix}0&0&-1&0\\0&0&0&1\\1&0&0&0\\0&-1&0&0\end{pmatrix},\quad R_4 = \begin{pmatrix}0&0&0&-1\\0&0&-1&0\\0&1&0&0\\1&0&0&0\end{pmatrix}.$$
It can be checked that these matrices respect the properties outlined above. Thus, in $d = 4$, we can select the transport maps from the set $\{\pm R_1, \pm R_2, \pm R_3, \pm R_4\}$ containing eight matrices, which also form a group. Therefore, following the same procedure as in Lemma 35, we can linearly separate up to eight classes.
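The sketch below (our own check, not part of the paper's released code) verifies that these quaternion matrices satisfy properties (2)-(5) above, so that the cross terms in Equation 18 vanish for any pair of distinct maps.

```python
import numpy as np

I = np.eye(4)
R2 = np.array([[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]], float)
R3 = np.array([[0, 0, -1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, -1, 0, 0]], float)
R4 = np.array([[0, 0, 0, -1], [0, 0, -1, 0], [0, 1, 0, 0], [1, 0, 0, 0]], float)

for R in (R2, R3, R4):
    assert np.allclose(np.diag(R), 0)                       # property (2): zero diagonal
    assert np.allclose(R.T, -R) and np.allclose(R @ R, -I)  # property (3): R^T = -R, R R = -I
for Ra, Rb in [(R2, R3), (R2, R4), (R3, R4)]:
    assert np.allclose(np.diag(Ra @ Rb), 0)                 # property (4)
    assert np.allclose(Ra @ Rb, -(Rb @ Ra))                 # property (5): anti-commutation
print("all properties hold")
```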
Proposition 36. Let $\mathcal{G}$ be the class of connected regular graphs with a finite number of classes. Then, $\mathcal{H}^2_{\mathrm{orth}}(\mathcal{G})$ has linear separation power over $\mathcal{G}$.
Proof of Proposition 36. Idea: since the graph is regular, the harmonic features of the nodes will be uniformly scaled and thus positioned on a circle. The aim is to place different classes at different locations on the circle, which makes the classes linearly separable, as shown in Figure 4b.
Let $G$ be a regular graph with $C$ classes and define $\theta = \frac{2\pi}{C}$. Denote by $R_i$ the 2D rotation matrix:
$$R_i = \begin{pmatrix}\cos(i\theta) & -\sin(i\theta)\\ \sin(i\theta) & \cos(i\theta)\end{pmatrix}\tag{19}$$
Then let $\mathcal{P} = \{R_i \mid 0 \leq i \leq C-1,\ i \in \mathbb{N}\}$ be the set of rotation matrices with an angle that is a multiple of $\theta$. Then we can define a bijection $g: \mathcal{C} \to \mathcal{P}$ and a sheaf $(G, \mathcal{F}) \in \mathcal{H}^2_{\mathrm{orth}}(G)$ as in the proof above. Checking the inner products from Equation 18 between the harmonic features of the nodes, we can verify that the angle between any two classes is different from zero. By Lemma 34, the cross terms of the inner product vanish:
$$\sum_k\alpha^2_k\langle R_ie_k, R_je_k\rangle = \sum_k\alpha^2_k\cos((i-j)\theta) = \cos((i-j)\theta)\|x\|^2\tag{20}$$
Thus, the angle between classes $i, j$ is $(i-j)\theta$.
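A short numerical sketch of Equation 20 (with a hypothetical number of classes and random projection coefficients, not taken from the paper's experiments): rotations by multiples of $\theta = 2\pi/C$ place the limiting features of classes $i$ and $j$ at angle $(i-j)\theta$ from each other.

```python
import numpy as np

C = 5
theta = 2 * np.pi / C
rot = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

rng = np.random.default_rng(0)
alpha = rng.normal(size=2)                               # projection coefficients alpha_k
for i in range(C):
    for j in range(C):
        lhs = sum(a**2 * np.dot(rot(i * theta) @ e, rot(j * theta) @ e)
                  for a, e in zip(alpha, np.eye(2)))
        assert np.isclose(lhs, np.cos((i - j) * theta) * np.sum(alpha**2))
print("angles between classes match Eq. (20)")
```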
C. Proofs for oversmoothing
Proof of Proposition 26. Let $x \in H^0(G, \mathcal{F})$. Then we have
$$0 = E_\mathcal{F}(x) = \tfrac{1}{2}\sum_{(v,u)\in E}\|\mathcal{F}_{v\trianglelefteq e}D^{-\frac12}_vx_v - \mathcal{F}_{u\trianglelefteq e}D^{-\frac12}_ux_u\|^2 = \tfrac{1}{2}\sum_{(v,u)\in E}\|\mathcal{F}_e\big(D^{-\frac12}_vx_v - D^{-\frac12}_ux_u\big)\|^2 = \tfrac{1}{2}\sum_{(v,u)\in E}\|d^{-\frac12}_vx_v - d^{-\frac12}_ux_u\|^2.$$
The last term vanishes if and only if $x^k \in \ker\Delta_0$ for each $1 \leq k \leq d$.
Proposition 37. Let $\mathcal{F}$ be an $O(d)$-bundle over $G$ and $\varepsilon > 0$. Assume that $\mathcal{F}_{v\trianglelefteq e} = \mathcal{F}_{u\trianglelefteq e}$ for each $(u, v) \neq (u_0, v_0)$ and that $\mathcal{F}^\top_{v_0\trianglelefteq e}\mathcal{F}_{u_0\trianglelefteq e} - I := B \neq 0$ with $\dim(\ker B) > 0$. Then there exist a linear map $W \in \mathbb{R}^{d\times d}$ with $\|W\|_2 = \varepsilon$ and $x \in H^0(G, \mathcal{F})$ such that $E_\mathcal{F}((I \otimes W)x) > 0$.
Proof. We sketch the proof. Let $g \in \ker(B)$. Define then $x \in C^0(G, \mathcal{F})$ by $x_v = \sqrt{d_v}\,g$. Then $x \in H^0(G, \mathcal{F})$. If we now take $W = \varepsilon P_{\ker(B)^\perp}$, the rescaled orthogonal projection onto the orthogonal complement of the kernel of $B$, we verify the given claim.
We provide below a proof for the equality in Definition 24.
Proposition 38.
$$x^\top\Delta_\mathcal{F}x = \tfrac{1}{2}\sum_{e:=(v,u)}\|\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v - \mathcal{F}_{u\trianglelefteq e}D^{-1/2}_ux_u\|^2_2$$
Proof. We prove the result for the normalised sheaf Laplacian; other versions can be obtained as particular cases.
$$E(x) = x^\top\Delta_\mathcal{F}x = \sum_vx^\top_v\Delta_{vv}x_v + \sum_{\substack{w\neq z\\(w,z)\in E}}x^\top_w\Delta_{wz}x_z\tag{21}$$
$$= \sum_{v\trianglelefteq e}x^\top_vD^{-1/2}_v\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v + \sum_{\substack{w<z\\(w,z)\in E}}\big(x^\top_w\Delta_{wz}x_z + x^\top_z\Delta_{zw}x_w\big)\tag{22}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\Big(x^\top_vD^{-1/2}_v\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v + x^\top_wD^{-1/2}_w\mathcal{F}^\top_{w\trianglelefteq e}\mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w\tag{23}$$
$$\qquad - x^\top_vD^{-1/2}_v\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w - x^\top_wD^{-1/2}_w\mathcal{F}^\top_{w\trianglelefteq e}\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v\Big)\tag{24}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\Big(x^\top_vD^{-1/2}_v\mathcal{F}^\top_{v\trianglelefteq e}\big(\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w\big)\tag{25}$$
$$\qquad - x^\top_wD^{-1/2}_w\mathcal{F}^\top_{w\trianglelefteq e}\big(\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w\big)\Big)\tag{26}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\big(x^\top_vD^{-1/2}_v\mathcal{F}^\top_{v\trianglelefteq e} - x^\top_wD^{-1/2}_w\mathcal{F}^\top_{w\trianglelefteq e}\big)\big(\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w\big)\tag{27}$$
Note that $D_v$ is symmetric for any node $v$ and so is any $D^{-1/2}_v$. Therefore, the two vectors in the parentheses are the transpose of each other and the result is their inner product. Thus, we have:
$$E_\mathcal{F}(x) = \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w\|^2_2\tag{28}$$
The result follows identically for other types of Laplacian. For the augmented normalised Laplacian, one should simply replace $D$ with $\tilde{D} = D + I$, and for the non-normalised Laplacian, one should simply remove $D$ from the equation.
Proof of Theorem 25. We first prove a couple of lemmas before proving the theorem. The structure of the proof follows that of Cai & Wang (2020), which in turn generalises that of Oono & Suzuki (2019). The latter proof technique is not directly applicable to our setting because it makes some strong assumptions about the harmonic space of the Laplacian (i.e. that the eigenvectors of the harmonic space have positive entries).
Let $\lambda := \max\big((\lambda_{\min} - 1)^2, (\lambda_{\max} - 1)^2\big)$, where $\lambda_{\min}, \lambda_{\max}$ are the smallest and largest non-zero eigenvalues of $\Delta_\mathcal{F}$.
Lemma 39. For $P = I - \Delta_\mathcal{F}$, $E_\mathcal{F}(Px) \leq \lambda E_\mathcal{F}(x)$.
Proof. We can write $x = \sum_ic_ih_i$ as a sum of the eigenvectors $\{h_i\}$ of $\Delta_\mathcal{F}$. Then $x^\top\Delta_\mathcal{F}x = \sum_ic^2_i\lambda_i$, where $\{\lambda_i\}$ are the eigenvalues of $\Delta_\mathcal{F}$. Using this for $E_\mathcal{F}(Px)$:
$$E_\mathcal{F}(Px) = x^\top P^\top\Delta_\mathcal{F}Px = \sum_ic^2_i\lambda_i(1 - \lambda_i)^2 \leq \lambda\sum_ic^2_i\lambda_i = \lambda E_\mathcal{F}(x)\tag{29}$$
The inequality follows from the fact that the eigenvalues of the normalised sheaf Laplacian are in the range $[0, 2]$ (Hansen & Ghrist, 2019, Proposition 5.5). We note that the original proof of Cai & Wang (2020) bounds the expression by $(1 - \lambda_{\min})^2$ instead of $\lambda$, which appears to be an error.
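A quick numerical sanity check of the bound (a sketch using the ordinary normalised graph Laplacian, i.e. the trivial-sheaf case, on a random graph; not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((8, 8)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                                                   # random undirected graph
deg = np.maximum(A.sum(1), 1)
Delta = np.eye(8) - np.diag(deg**-0.5) @ A @ np.diag(deg**-0.5)   # normalised Laplacian, trivial sheaf

evals = np.linalg.eigvalsh(Delta)
nz = evals[evals > 1e-9]                                      # non-zero eigenvalues
lam = max((nz.min() - 1)**2, (nz.max() - 1)**2)

x = rng.normal(size=8)
P = np.eye(8) - Delta
E = lambda z: z @ Delta @ z
print(E(P @ x) <= lam * E(x) + 1e-12)                         # True
```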
Lemma 40. $E_\mathcal{F}(XW) \leq \|W^\top\|^2_2E_\mathcal{F}(X)$
Proof. Following the proof of Cai & Wang (2020) we have:
$$E_\mathcal{F}(XW) = \mathrm{Tr}(W^\top X^\top\Delta_\mathcal{F}XW)\tag{30}$$
$$= \mathrm{Tr}(X^\top\Delta_\mathcal{F}XWW^\top)\qquad\text{trace cyclic property}\tag{31}$$
$$\leq \mathrm{Tr}(X^\top\Delta_\mathcal{F}X)\|WW^\top\|_2\qquad\text{see Lemma 3.1 in Lee (2008)}\tag{32}$$
$$= \mathrm{Tr}(X^\top\Delta_\mathcal{F}X)\|W^\top\|^2_2\tag{33}$$
Lemma 41. For conditions as in Theorem 25, $E_\mathcal{F}((I_n \otimes W)x) \leq \|W\|^2_2E_\mathcal{F}(x)$.
Proof. First, we note that for orthogonal restriction maps, $D_v = I\sum_{v\trianglelefteq e}\alpha^2_e = Id_v$ (Hansen & Ghrist, 2019, Lemma 4.4).
$$E_\mathcal{F}((I \otimes W)x) = \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vWx_v - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wWx_w\|^2_2\tag{34}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_eW\big(d^{-1/2}_vx_v - d^{-1/2}_wx_w\big)\|^2_2\tag{35}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|W\big(d^{-1/2}_vx_v - d^{-1/2}_wx_w\big)\|^2_2\qquad\mathcal{F}_e\text{ is orthogonal}\tag{36}$$
$$\leq \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|W\|^2_2\,\|d^{-1/2}_vx_v - d^{-1/2}_wx_w\|^2_2\qquad\text{property of the operator norm}\tag{37}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|W\|^2_2\,\|\mathcal{F}_e\big(d^{-1/2}_vx_v - d^{-1/2}_wx_w\big)\|^2_2\qquad\mathcal{F}_e\text{ is orthogonal}\tag{38}$$
$$= \tfrac{1}{2}\|W\|^2_2\sum_{v,w\trianglelefteq e}\|\mathcal{F}_e\big(D^{-1/2}_vx_v - D^{-1/2}_wx_w\big)\|^2_2 = \|W\|^2_2E_\mathcal{F}(x)\tag{39}$$
The proof can also be easily extended to vector bundles over weighted graphs (i.e. allowing weighted edges as in Ghrist & Riess (2020)). For the non-normalised Laplacian, the assumption that $\mathcal{F}_e$ is orthogonal can be relaxed to being non-singular, and then the upper bound will also depend on the maximum condition number over all $\mathcal{F}_e$.
Lemma 42. For conditions as in Theorem 25, $E_\mathcal{F}(\sigma(x)) \leq E_\mathcal{F}(x)$.
Proof.
$$E_\mathcal{F}(\sigma(x)) = \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_v\sigma(x_v) - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_w\sigma(x_w)\|^2_2\tag{40}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_e\big(d^{-1/2}_v\sigma(x_v) - d^{-1/2}_w\sigma(x_w)\big)\|^2_2\tag{41}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|d^{-1/2}_v\sigma(x_v) - d^{-1/2}_w\sigma(x_w)\|^2_2\qquad\text{orthogonality of }\mathcal{F}_e\tag{42}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\Big\|\sigma\Big(\tfrac{x_v}{\sqrt{d_v}}\Big) - \sigma\Big(\tfrac{x_w}{\sqrt{d_w}}\Big)\Big\|^2_2\qquad c\,\mathrm{ReLU}(x) = \mathrm{ReLU}(cx)\text{ for }c > 0\tag{43}$$
$$\leq \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\Big\|\tfrac{x_v}{\sqrt{d_v}} - \tfrac{x_w}{\sqrt{d_w}}\Big\|^2_2\qquad\text{Lipschitz continuity of ReLU}\tag{44}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_e\big(d^{-1/2}_vx_v - d^{-1/2}_wx_w\big)\|^2_2\qquad\text{orthogonality of }\mathcal{F}_e\tag{45}$$
$$= E_\mathcal{F}(x)\tag{46}$$
Combining these three lemmas for an entire diffusion layer proves the theorem.
Proof of Theorem 28. If $d = 1$, then Lemma 41 becomes superfluous, as $W_1$ becomes a scalar that can be absorbed into the weights applied on the right. It remains to verify that a version of Lemma 42 holds in this case.
Lemma 43. For conditions as in Theorem 28, $E_\mathcal{F}(\sigma(x)) \leq E_\mathcal{F}(x)$.
Proof.
$$E_\mathcal{F}(\sigma(x)) = \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_v\sigma(x_v) - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_w\sigma(x_w)\|^2_2\tag{47}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\big\||\mathcal{F}_{v\trianglelefteq e}|D^{-1/2}_v\sigma(x_v) - |\mathcal{F}_{w\trianglelefteq e}|D^{-1/2}_w\sigma(x_w)\big\|^2_2\qquad\mathcal{F}_{v\trianglelefteq e}\mathcal{F}_{w\trianglelefteq e} > 0\tag{48}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\big\|\sigma\big(|\mathcal{F}_{v\trianglelefteq e}|D^{-1/2}_vx_v\big) - \sigma\big(|\mathcal{F}_{w\trianglelefteq e}|D^{-1/2}_wx_w\big)\big\|^2_2\qquad c\,\mathrm{ReLU}(x) = \mathrm{ReLU}(cx)\text{ for }c > 0\tag{49}$$
$$\leq \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\big\||\mathcal{F}_{v\trianglelefteq e}|D^{-1/2}_vx_v - |\mathcal{F}_{w\trianglelefteq e}|D^{-1/2}_wx_w\big\|^2_2\qquad\text{Lipschitz continuity of ReLU}\tag{50}$$
$$= \tfrac{1}{2}\sum_{v,w\trianglelefteq e}\|\mathcal{F}_{v\trianglelefteq e}D^{-1/2}_vx_v - \mathcal{F}_{w\trianglelefteq e}D^{-1/2}_wx_w\|^2_2\qquad\mathcal{F}_{v\trianglelefteq e}\mathcal{F}_{w\trianglelefteq e} > 0\tag{51}$$
$$= E_\mathcal{F}(x)\tag{52}$$
We note that if $\mathcal{F}_{v\trianglelefteq e}\mathcal{F}_{w\trianglelefteq e} < 0$ (i.e. the relation is signed), then it is very easy to find counter-examples where ReLU does not work anymore. However, the result still holds in the deep linear case.
D. Additional model details and hyperparameters
Hybrid transport maps. Consider the transport maps $-\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{u\trianglelefteq e}$ appearing in the off-diagonal entries of the sheaf Laplacian $L_\mathcal{F}$. When learning a sheaf Laplacian there is a risk that the features are not sufficiently good in the early layers (or in general) and, therefore, it might be useful to consider a hybrid transport map of the form $-\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{u\trianglelefteq e} \oplus \bar{\mathcal{F}}$, where $\oplus$ denotes the direct sum of two matrices and $\bar{\mathcal{F}}$ represents a fixed (non-learnable) map. In particular, we consider maps of the form $-\mathcal{F}^\top_{v\trianglelefteq e}\mathcal{F}_{u\trianglelefteq e} \oplus I_1 \oplus (-I_1)$, which essentially appends a diagonal matrix with $1$ and $-1$ on the diagonal to the learned matrix. From a signal processing perspective, these correspond to a low-pass and a high-pass filter that could produce generally useful features. We treat the addition of these fixed parts as an additional hyperparameter.
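A minimal sketch of this hybrid construction (the function name and shapes are illustrative, not the released implementation):

```python
import torch

def hybrid_transport(F_v: torch.Tensor, F_u: torch.Tensor) -> torch.Tensor:
    """Learned off-diagonal block -F_v^T F_u, extended by a fixed diag(1, -1) block."""
    learned = -F_v.t() @ F_u                            # learned d x d transport map
    fixed = torch.diag(torch.tensor([1.0, -1.0]))       # fixed low-pass / high-pass part
    d = learned.shape[0]
    out = torch.zeros(d + 2, d + 2)
    out[:d, :d] = learned                               # direct sum = block-diagonal stacking
    out[d:, d:] = fixed
    return out

F_v, F_u = torch.randn(3, 3), torch.randn(3, 3)
print(hybrid_transport(F_v, F_u).shape)                 # torch.Size([5, 5])
```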
Adjusting the activation magnitudes. We note that, in practice, we find it useful to learn an additional parameter $\varepsilon \in [-1, 1]^d$ (i.e. a vector of size $d$) in the discrete version of the models:
$$X_{t+1} = (1 + \varepsilon)X_t - \sigma\big(\Delta_{\mathcal{F}(t)}(I_n \otimes W^t_1)X_tW^t_2\big).\tag{53}$$
This allows the model to adjust the relative magnitude of the features in each stalk dimension. This is used across all of our experiments with the discrete models.
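A sketch of this update rule with hypothetical shapes ($n$ nodes, stalk dimension $d$, $f$ channels); `Delta_F` stands in for the sheaf Laplacian learned at step $t$, and $\sigma$ is instantiated here as ELU, in line with the experimental setup described below.

```python
import torch
from torch.nn.functional import elu

n, d, f = 5, 2, 4                                        # hypothetical sizes
X = torch.randn(n * d, f)                                # node features, stacked stalk-wise
Delta_F = torch.randn(n * d, n * d)                      # placeholder for the learned sheaf Laplacian
W1, W2 = torch.randn(d, d), torch.randn(f, f)
eps = torch.tanh(torch.randn(d))                         # learned vector in (-1, 1), one entry per stalk dimension

update = Delta_F @ torch.kron(torch.eye(n), W1) @ X @ W2
X_next = (1 + eps).repeat(n).unsqueeze(-1) * X - elu(update)   # Eq. (53)
```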
Augmented normalised sheaf Laplacian. Similarly to GCN, which normalises the Laplacian by the augmented degrees (i.e. $(D + I_n)^{-1/2}$, where $D$ is the usual diagonal matrix of node degrees), we use $(D + I_{nd})^{-1/2}$ for normalisation to obtain greater numerical stability. This is particularly helpful when learning general sheaves, as it increases the numerical stability of the SVD.
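A small sketch of this normalisation (assuming, as for orthogonal restriction maps, that $D$ is diagonal; names and shapes are illustrative):

```python
import torch

def augmented_normalise(L_F: torch.Tensor, D_diag: torch.Tensor) -> torch.Tensor:
    """Compute (D + I)^(-1/2) L_F (D + I)^(-1/2) for a diagonal degree matrix given by D_diag."""
    d_inv_sqrt = (D_diag + 1.0).rsqrt()                 # augmented degrees
    return d_inv_sqrt.unsqueeze(-1) * L_F * d_inv_sqrt.unsqueeze(0)
```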
Learning sheaves and the k-WL test. According to our theoretical results, one should learn a pair of non-equal restriction maps along the edges between two classes. Suppose this is done via a local function $\mathcal{F}_{v\trianglelefteq e:=(v,u)} = \phi(x_v, x_u)$. Then, $\phi(x_v, x_u) \neq \phi(x_u, x_v)$ only if $x_u \neq x_v$. This leads to the following remark.
Remark. Let $G$ be a graph with an initial colouring $x$ such that $(v, u) \in E$ and $v$ and $u$ are not $k$-WL distinguishable. Then no model upper bounded by $k$-WL can learn an asymmetric relation along the edge $(v, u)$.
This observation motivates our decision to evolve the geometry at each layer of the model, since at initialisation time many nodes might not be distinguishable. At the same time, it suggests that provably expressive architectures might be able to learn better sheaves (see the ablation study in Appendix E).
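The remark assumes the restriction maps are produced by a local, order-sensitive function of the two endpoint features. A sketch of such a parameterisation is given below (the class and argument names are illustrative, not the paper's API): $\phi$ acts on the concatenation $[x_v \,\|\, x_u]$, so $\phi(x_v, x_u) \neq \phi(x_u, x_v)$ is possible only when $x_v \neq x_u$.

```python
import torch
import torch.nn as nn

class RestrictionMapLearner(nn.Module):
    def __init__(self, in_channels: int, d: int):
        super().__init__()
        self.mlp = nn.Linear(2 * in_channels, d * d)    # phi(x_v, x_u) -> d x d restriction map
        self.d = d

    def forward(self, x_v: torch.Tensor, x_u: torch.Tensor) -> torch.Tensor:
        out = self.mlp(torch.cat([x_v, x_u], dim=-1))   # order of concatenation matters
        return out.view(*out.shape[:-1], self.d, self.d)

phi = RestrictionMapLearner(in_channels=8, d=2)
x_v, x_u = torch.randn(8), torch.randn(8)
print(torch.allclose(phi(x_v, x_u), phi(x_u, x_v)))     # generally False when x_v != x_u
```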
Figure 5. (Left) Train accuracy as a function of diffusion time. (Middle) Test accuracy as a function of diffusion time. (Right) Histogram of the learned rotation angles of the 2D transport maps. The performance of the bundle model is superior to that of the one-dimensional sheaf. The transport maps learned by the model are aligned with our expectations: the model learns to rotate the neighbours belonging to different classes more (i.e. to move them further away) than the neighbours belonging to the same class.
Hyperparameters for discrete models. For the discrete models we searched over the following hyperparameters:
Hidden channels: [8, 16, 32] for the WebKB datasets and [8, 16, 32, 64] for all the other datasets.
d: [1, 2, 3, 4, 5]
Layers: [2, 3, 4, 5, 6, 7, 8]
Learning rate: 0.02 for the WebKB datasets and 0.01 for all the other datasets.
Weight decay for the regular model parameters: searched in a log-uniform range over [4.5, 11.0]
Weight decay for the parameters learning the sheaf: searched in a log-uniform range over [4.5, 11.0]
Dropout on the inputs: searched uniformly over [0, 0.9].
Dropout for the other layers: searched uniformly over [0, 0.9].
Early stopping patience: 100 epochs for the Wiki datasets and 200 for the others.
Maximum training epochs: 1000 epochs for the Wiki datasets and 500 for the others.
Hyperparameters for continuous models. For the continuous models we searched over the following hyperparameters:
Hidden channels: [8, 16, 32, 64]
d: [1, 2, 3, 4, 5]
Learning rate: searched in a log-uniform range over [0.01, 0.1]
Weight decay: searched in a log-uniform range over [6.9, 13.8]
Dropout: searched uniformly over [0, 0.95].
ODE integration method: euler.
Maximum integration time: searched uniformly over [1.0, 9.0].
Training epochs: 50.
We train all models using the Adam optimiser (Kingma & Ba, 2015). All models use ELU activations.
E. Additional Experiments
In this section we provide a series of additional experiments and ablation studies.
Two-dimensional synthetic experiment. In the main text we focused on a synthetic example involving sheaves with one-dimensional stalks. We now consider a graph with three classes and two-dimensional features, with edge homophily level 0.2. We use 80% of the nodes for training and 20% for testing. First, we know from Theorem 23 that a discrete vector bundle with two-dimensional stalks that can solve the task in the limit exists, while, based on Proposition 19, no sheaf with one-dimensional stalks can perfectly solve the task.
Eigenvectors    Texas           Wisconsin       Cornell
Cont Diag-SD
 0              82.97 ± 4.37    86.47 ± 2.55    80.00 ± 6.07
 2              83.51 ± 5.05    85.69 ± 3.73    81.62 ± 8.00
 8              85.41 ± 5.82    86.28 ± 3.40    82.16 ± 5.57
16              82.70 ± 3.86    85.88 ± 2.75    81.08 ± 7.25
Cont O(d)-SD
 0              82.43 ± 5.95    84.50 ± 4.34    72.16 ± 10.40
 2              84.05 ± 5.85    85.88 ± 4.62    83.51 ± 9.70
 8              84.87 ± 4.71    86.86 ± 3.83    84.05 ± 5.85
16              83.78 ± 6.16    85.88 ± 2.88    83.51 ± 6.22
Cont Gen-SD
 0              83.78 ± 6.62    85.29 ± 3.31    84.60 ± 4.69
 2              83.24 ± 4.32    84.12 ± 3.97    81.08 ± 7.35
 8              82.70 ± 5.70    84.71 ± 3.80    83.24 ± 6.82
16              82.16 ± 6.19    86.47 ± 3.09    82.16 ± 6.07
Table 2. Ablation study on the number of Laplacian positional encoding eigenvectors. Positional encodings improve performance on some of our models.
Therefore, similarly to the synthetic experiment in the main text, we compare two similar models learning the sheaf from data: one using 1D stalks and another using 2D stalks. As we see from Figure 5, the discrete vector bundle model has better training and test-time performance than the one-dimensional counterpart. Nonetheless, neither of the two models manages to match the perfect performance of the ideal sheaf on this more challenging dataset. From the final subfigure we also see that the model learns to rotate more across the heterophilic edges in order to push away the nodes belonging to other classes. The prevalent angle of this rotation is 2 radians, which is just under 120° = 360°/C, where C = 3 is the number of classes. Thus, the model learns to position the three classes at approximately equal arc-lengths from each other for maximum linear separability.
Positional encoding ablation. Based on the Remark from the previous section, we proceed to analyse the impact of increasing the expressive power of the model by making the nodes more distinguishable. For that, we equip our datasets with additional features consisting of graph Laplacian positional encodings, as originally done in Dwivedi et al. (2020). In Table 2 we see that positional encodings do indeed improve the performance of the continuous models compared to the numbers reported in the main table. Therefore, we conclude that the interaction between the problem of sheaf learning and that of the expressivity of graph neural networks represents a promising avenue for future research.
Visualising diffusion. To develop a better intuition of the limiting behaviour of sheaf diffusion for node classification tasks, we plot the diffusion process using an oracle discrete vector bundle for two graphs with C = 3 (Figure 7) and C = 4 (Figure 6) classes. The diffusion process converges in the limit to a configuration where the classes are rotated at 2π/C from each other, just like in the cartoon diagrams of Figure 4. Note that in all cases the classes are linearly separable in the limit. We note that this approach generalises to any number of classes, but beyond C = 4 it is not guaranteed that they will be linearly separable in 2D. However, they are still well-separated. We include an example with C = 10 classes in Figure 8.
Figure 6. Sheaf diffusion process disentangling the C = 4 classes over time. The nodes are coloured by their class.
Figure 7. Sheaf diffusion process disentangling the C = 3 classes over time. The nodes are coloured by their class.
Figure 8. Sheaf diffusion process disentangling the C = 10 classes over time. The nodes are coloured by their class.