Knowledge and Information Systems (2024) 66:2393–2416
https://doi.org/10.1007/s10115-023-02024-z
REGULAR PAPER
A unified framework for backpropagation-free soft and hard
gated graph neural networks
Luca Pasa1 · Nicolò Navarin1 · Wolfgang Erb1 · Alessandro Sperduti1,2
Received: 1 March 2023 / Revised: 14 October 2023 / Accepted: 6 November 2023 /
Published online: 26 December 2023
© The Author(s) 2023
Abstract
We propose a framework for the definition of neural models for graphs that do not rely on back-
propagation for training, thus making learning more biologically plausible and amenable to
parallel implementation. Our proposed framework is inspired by Gated Linear Networks and
allows the adoption of multiple graph convolutions. Specifically, each neuron is defined as a
set of graph convolution filters (weight vectors) and a gating mechanism that, given a node and
its topological context, generates the weight vector to use for processing the node’s attributes.
Two different graph processing schemes are studied, i.e., a message-passing aggregation
scheme where the gating mechanism is embedded directly into the graph convolution, and a
multi-resolution one where neighboring nodes at different topological distances are jointly
processed by a single graph convolution layer. We also compare the effectiveness of different
alternatives for defining the context function of a node, i.e., based on hyperplanes or on proto-
types, and using a soft or hard-gating mechanism. We propose a unified theoretical framework
allowing us to theoretically characterize the proposed models’ expressiveness. We experi-
mentally evaluate our backpropagation-free graph convolutional neural models on commonly
adopted node classification datasets and show competitive performances compared to the
backpropagation-based counterparts.
Keywords Graph convolutional networks · Graph neural network · Deep learning ·
Structured data · Machine learning on graphs
This work was partly funded by the SID/BIRD project Deep Graph Memory Networks, Department of
Mathematics, University of Padova. The authors acknowledge the HPC resources of the Department of
Mathematics, University of Padova, made available for conducting the research reported in this paper.
✉ Luca Pasa
luca.pasa@unipd.it
1 Department of Mathematics, University of Padua, Padua, Italy
2 DISI, University of Trento, Trento, Italy
1 Introduction
In recent years, several definitions of neural architectures capable of dealing with data in
structured forms, such as graphs, have been presented [1,2]. The vast majority of graph
neural networks in the literature are based on the idea of message passing [3], in which
the representation of a node at layer $l$ (or time $t$ if the network is recurrent) is defined as a
transformation of the attributes associated with the same node and the ones of its neighbors at
layer $l-1$ ($t-1$).
While many works focused on defining alternative architectures, to the best of the authors’
knowledge, all of them rely on backpropagation to learn the networks’ weights. Backprop-
agation is a powerful and effective method to train deep neural networks (NNs). It has
been successfully applied almost ubiquitously in recent years when the training of NNs was
involved. However, when the amount of available data is not huge, the standard approach of
training a nonlinear NN with backpropagation may quickly lead to overfitting the training
data. This clearly contrasts how humans learn since we do not require nearly the amount of
training data modern NNs do to learn how to generalize [4]. Moreover, the backpropagation
mechanism is not biologically plausible [5,6], suggesting that the brain may use different
learning algorithms.
Some alternative definitions of multilayer neural networks that do not rely on backprop-
agation for their training [4,6] have been proposed. They define local learning rules where
each neuron, given its inputs, is trained independently from the rest of the network exploit-
ing a global error signal. This approach allows these networks to: (i) be more biologically
plausible (i.e., from the current knowledge about the functioning of animal neurons, it seems
implausible for a neuron to have access to the connections in a brain area responsible for
a subsequent processing step); (ii) be more sample efficient/simplify the overall training
procedure since each neuron solves an independent (possibly convex) problem.
The aim of this paper is to explore the cross-fertilization between these two cutting-edge
research fields, studying how to define neural networks for graph processing that do not rely
on backpropagation for their training. The goal of our research is to develop an extremely
efficient model for graphs. The training phase of most of the GNN architectures that have been
proposed in the literature in the last years is considerably computation-demanding. One of the
reasons is that they rely on the backpropagation algorithm. Defining a backpropagation-free
architecture reminiscent of the idea that each neuron can be trained independently allows us
to have a very simple and highly parallelizable model that significantly reduces the temporal
demands of the training phase. Moreover, it opens up the application of GNNs to more
flexible scenarios where the model can be updated continuously during use. This study
aims to investigate the development of a backpropagation-free architecture for graphs and
show how the use of a different training approach affects the performance of the GNN,
without compromising its effectiveness.
Our exploration is based on the recently proposed Gated Linear Networks (GLN) [4], a
family of backpropagation-free neural networks that have been developed for online learning
and that have shown promising results. The main characteristic of such networks is that,
contrary to the mainstream approach, the nonlinearity is achieved via a gating mechanism
instead of element-wise nonlinear functions. More specifically, each neuron receives a context
vector as additional input that is used to select one weight vector in a pre-defined set. The
only nonlinearity lies in such a gating mechanism. In fact, once the weight vector is selected,
each neuron behaves linearly. The resulting network is a piece-wise linear model (similar to
ReLU networks). While the gating mechanism is not trained, each neuron learns to predict
(by modifying the weights) a binary output and can be trained independently from the rest
of the network. In the case of multi-class classification, a one-vs-all approach is exploited.
In order to define neural networks capable of processing graph data, we explore two pos-
sible generalizations of the above mechanism. The first one is based on the approach adopted
by many graph convolutional networks, in which the network architecture reflects the struc-
ture of the input graph, and node representations are refined at each layer according to the
local graph topology via an aggregation operation over neighboring nodes. For this general-
ization, we also provide a theoretical result on the incremental expressiveness of our models.
The second one is inspired by the so-called multi-resolution architecture on which the graph
convolutional layer is defined by exploiting the power series of the diffusion operator (also
known as graph-augmented multi-layer perceptrons (GA-MLPs) [7], or polynomial-based
graph convolutional (PGC) [8]). These models are able to simultaneously and directly con-
sider all topological receptive fields up to k-hops. Moreover, applying the gating mechanism
turns out to be decoupled from the graph topology.
The properties inherited from GLNs ensure that our models are (at least in principle) as
expressive as their backpropagation-based counterparts, with a significantly easier training
phase. Nonetheless, several choices have to be made, such as which neighbor aggregation
mechanism to adopt, how to define the contexts on graphs and how to define the gating
mechanism efficiently on graphs to limit the number of parameters of the network while
obtaining good predictive performances.
We experimentally evaluate our backpropagation-free graph convolutional neural archi-
tectures on commonly adopted node classification benchmarks and verify their competitive
performance. This work paves the way for novel neural approaches for graph learning. This is
an extended version of the work [9]. Compared to the conference version, we explored a novel
gating mechanism for GLNs, namely the soft-gating approach. We extended our theoretical
analysis by incorporating both the soft- and hard-gating mechanisms in a unified framework.
We extended the experimental results, including the new gating mechanism, and we analyzed
the impact of the backpropagation-free models in terms of computational efficiency.
Specifically, we explored the speed-up that the BF-GNNs bring to the training phase and how
their particular structure helps in reducing the cost of the model selection process.
In this paper we explore how to define a GNN that can be trained faster and more efficiently
than using backpropagation. Our proposed solution involves constructing an architecture
where each neuron can be trained independently, resulting in a highly parallelizable model
that reduces the time required for training and inference. The results and analysis show
that using a backpropagation-free model is more efficient while maintaining comparable
performance to backpropagation-based models.
2 Background
This section introduces the background notation and knowledge on which our model hinges.
2.1 Learning on graph nodes
A learning problem on a graph can be formulated as learning a function that maps nodes to
labels. The underlying graph structure is given as $G=(V,E,L)$, where $V=\{v_1,\dots,v_n\}$
is the set of nodes, $E \subseteq V \times V$ is the set of edges connecting the nodes, and $L : V \to \mathbb{R}^s$
is a function associating a vector of attributes to each node. With $N(v)$ we denote the set of
nodes adjacent to $v$, i.e., $N(v)=\{u \mid (v,u) \in E\}$. To simplify the notation, for a fixed graph
$G$ we define the matrix $X=[L(v_1),\dots,L(v_n)]$.

Given a graph $G$, our training set is composed of the target information associated with
some of the graph nodes, i.e., $\{(v, y) \mid v \in W, y \in Y\}$ with $W \subseteq V$. For the sake of simplicity,
in our presentation we will only consider binary values $Y \in \{0,1\}$.
2.2 Graph neural networks
Although learning from graph data is not a new research field [10–12], in the last few years,
graph neural networks have emerged as the machine learning model of choice when dealing
with graph problems. The first definition of neural network for structured data, includ-
ing graphs, has been proposed by Sperduti and Starita [10]. Later, it has been refined by
Micheli [13] and Scarselli et al. [12]. The core idea is to define a neural architecture that is
modeled according to the graph topology. Thanks to weight sharing, the same set of neurons
is applied to each vertex in the graph and computes its output based on the representation of
the vertex and of its neighbors. As usual, the function computed by each layer is parametric.
A graph neural network (GNN) is a neural model that exploits the structure of the graph and
the information embedded in feature vectors of each node to learn a representation $h_v \in \mathbb{R}^m$
for each vertex $v \in V$. In many GNN models, the computation of $h_v$ can be divided into two
main steps: aggregate and combine. We can define aggregation and combination by using
two functions, $A$ and $C$, respectively: $h_v = C\big(L(v),\, A(\{L(u) : u \in N(v)\})\big)$.

The choice of aggregation function $A$ and combination function $C$ defines the type of
graph convolution (GC) adopted by the GNN. In more detail, a general graph neural network
model is built according to the following equations. First, $d$ graph convolution layers are
stacked:
$$h^{(i)}_v = f\big(\mathrm{graphconv}(h^{(i-1)}_v, \{h^{(i-1)}_u \mid u \in N(v)\})\big),$$
where $f(\cdot)$ is an element-wise nonlinear activation function, $\mathrm{graphconv}(\cdot,\cdot)$ is a graph
convolution operator, $h^{(i)}_v$ is the representation of node $v$ at the $i$-th graph convolution layer,
$1 \le i \le d$, and $h^{(0)}_v = x_v$ (i.e., the row of $X$ corresponding to $v$). The mechanism of
communication between neighboring nodes exploited by the convolution operator is dubbed
message passing.
In the last few years, several different GCs have been proposed [14–20].
This work builds on top of two widely adopted graph convolutions. The first one is the
GCN [14]:
$$H^{(i)} = F\left( \tilde{D}^{-\frac{1}{2}} (I+A) \tilde{D}^{-\frac{1}{2}} H^{(i-1)} W^{(i)} \right), \quad i > 1, \qquad (1)$$
where $A$ denotes the standard adjacency matrix of the graph $G$ and $\tilde{D}$ a diagonal degree
matrix with the diagonal elements defined as $\tilde{d}_{ii} = 1 + \sum_j a_{ij}$. Further, $H^{(i)} \in \mathbb{R}^{n \times m_i}$ is a
matrix containing the representation $h^{(i)}_v$ of all nodes in the graph (one per row) at layer $i$,
$W^{(i)} \in \mathbb{R}^{m_{i-1} \times m_i}$ denotes the matrix of the layer's parameters, and $F$ is the element-wise
(usually, nonlinear) activation function.
The second graph convolution we consider is a slight variation of the first model and
commonly referred to as GraphConv [2]:
$$h^{(i)}_v = F\left( h^{(i-1)}_v W^{(i)}_1 + \sum_{u \in N(v)} h^{(i-1)}_u W^{(i)}_2 \right),$$
where $W^{(i)}_1, W^{(i)}_2 \in \mathbb{R}^{m_{i-1} \times m_i}$ (with $m_0 = s$, the input dimensionality) are the network
parameters.
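As an illustration, a minimal dense-matrix sketch of the two convolutions in PyTorch follows. This is our own illustration, not the implementation used in our experiments (which relies on the GCNConv and GraphConv layers of PyTorch Geometric); the class names, the ReLU activation and the dense adjacency representation are assumptions made for brevity.

```python
# Minimal dense sketches of Eq. (1) (GCN) and of the GraphConv formula above.
import torch
import torch.nn as nn


class DenseGCNLayer(nn.Module):
    """Eq. (1): H' = F(D^{-1/2} (I + A) D^{-1/2} H W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)  # the layer parameters W

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))                   # I + A
        d = A_hat.sum(dim=1)                               # d_ii = 1 + sum_j a_ij
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.lin(H))


class DenseGraphConvLayer(nn.Module):
    """h'_v = F(h_v W_1 + sum_{u in N(v)} h_u W_2)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim, bias=False)   # W_1
        self.lin_neigh = nn.Linear(in_dim, out_dim, bias=False)  # W_2

    def forward(self, H, A):
        # A @ H sums the neighbors' representations for every node at once.
        return torch.relu(self.lin_self(H) + A @ self.lin_neigh(H))
```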
2.3 Multi-resolution GNN
Some recent works in the literature exploit the idea of extending graph convolution layers
to increase the receptive field size without increasing the depth of the model. The basic idea
underpinning these methods is to consider the case in which the graph convolution can be
expressed as a polynomial of the powers of a transformation $T(\cdot)$ of the adjacency matrix. The
models based on this idea can simultaneously and directly consider all topological receptive
fields up to $l$ hops, just like the ones that are obtained by a stack of graph convolutional layers
of depth $l$, without incurring the typical limitations related to the complex interactions
among the parameters of the GC layers. Formally, the idea is to define a representation
built from the contribution of all topological receptive fields up to $l$ hops as:
$$H = f\big(T(A)^0 X,\; T(A)^1 X,\; \dots,\; T(A)^l X\big), \qquad (2)$$
where $T(\cdot)$ is a transformation of the adjacency matrix (e.g., the Laplacian matrix), and $f$ is
a function that aggregates and transforms the various components obtained from the powers
of $T(A)$, for instance the concatenation, the summation, or even something more complex,
such as a multi-layer perceptron. The $f$ function can be defined as a parametric function,
depending on a set of parameters $\theta$ whose values can be estimated from data (e.g., when $f$
involves an MLP). Various models that rely on this idea have been proposed in the last few
years [7, 8, 21–25].
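As a concrete illustration of Eq. (2), the following minimal PyTorch sketch builds the stacked representation by iteratively applying $T(A)$ and concatenating the resulting blocks. The code is our own; the function name and the choice of the symmetrically normalized adjacency as $T(A)$ are illustrative assumptions (the Laplacian is another possible choice).

```python
import torch


def multi_resolution_features(X, A, l):
    """Return [X, T(A)X, T(A)^2 X, ..., T(A)^l X] concatenated column-wise."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
    T = d_inv_sqrt @ A_hat @ d_inv_sqrt           # one possible choice of T(A)

    blocks, cur = [X], X
    for _ in range(l):
        cur = T @ cur                             # T(A)^k X computed iteratively
        blocks.append(cur)
    return torch.cat(blocks, dim=1)               # shape: n x (l+1)*s
```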
2.4 Backpropagation-free neural networks
In this paper, we exploit recently defined neurons that can be trained locally and independently
instead of using backpropagation. We consider the recently proposed Gated Linear Networks
[26] (GLNs) where the local optimization problem obtained for each neuron, adopting an
appropriate loss, is convex. Moreover, it has been shown that GLNs can represent any function
that represents a probability arbitrarily well [27]. The aim of GLNs is to define a model
composed of neurons that can be trained locally, independently and relying only on task
supervision. The main difference between GLNs and MLPs trained with backpropagation is
that, in GLNs, the weight update of neurons in a layer does not depend on the following ones.
Each neuron is trained to predict the target value and can be trained independently from the
rest of the network (provided the input). GLNs have been applied to online and continual
learning problems [28].
Each neuron in a GLN is a Gated Geometric Mixer. Geometric mixing [29] is an ensemble
technique that assigns a weight to each weak predictor in input. In GLNs, every unit
produces in output its prediction for the target. Given an input vector of probabilities
$p = [p_1, \dots, p_n]$, geometric mixing is defined as:
$$\sigma\left( w^\top \sigma^{-1}(p) \right), \qquad (3)$$
where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, $\sigma^{-1}(x) = \mathrm{logit}(x) = \log(x) - \log(1-x)$ is
the logit function (that is, the inverse of the sigmoid function), and both of them are applied
element-wise.
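As an illustration, a minimal sketch of geometric mixing follows (our own code; the clamping constant and the variable names are illustrative choices, not part of the original formulation).

```python
import torch


def geometric_mixing(p, w, eps=1e-6):
    """Eq. (3): mix input probabilities p in logit space with weights w."""
    p = p.clamp(eps, 1 - eps)                     # keep the logit finite
    logits = torch.log(p) - torch.log1p(-p)       # sigma^{-1}(p) = log p - log(1 - p)
    return torch.sigmoid(torch.dot(w, logits))


# Example: three weak predictors mixed with equal weights.
print(geometric_mixing(torch.tensor([0.8, 0.6, 0.7]), torch.tensor([1.0, 1.0, 1.0])))
```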
To achieve nonlinearity, specifically piecewise linearity, GLNs employ a gating mechanism
in each neuron. Each neuron divides its input space into regions. A geometric mixing (that
is, a linear model) is associated with each region. The association from examples to regions
is carried out by a region assignment function $c$. GLNs assume that for each example we have a
vectorial representation available, $x \in \mathbb{R}^{d(x)}$, and a vector representing side information (or
context), i.e., $z \in \mathbb{R}^{d(z)}$. The $c$ function is defined depending on the side information associated
with each input (in case no side information is available, it is possible to set $z = x$). Each
neuron in a GLN solves a convex problem and is trained independently to predict the target.
For the sake of simplicity, we omit bias terms in the following formulations. Given a
neuron $j$ at the $i$-th layer, its output is defined as:
$$h^{(i)}_{j,(x,z)} = \sigma\left( \sigma^{-1}\left( h^{(i-1)}_{(x,z)} \right) w^{(i)}_{j,(z)} \right), \quad i > 1, \qquad (4)$$
with $h^{(0)}_{(x,z)} = \sigma(x)$. The vector $w^{(i)}_{j,(z)} \in \mathbb{R}^{m_{i-1}}$ stores the weights associated to the region
activated by the context $z$ for the corresponding neuron. Let us discuss this weight vector in
more detail and the gating mechanism that defines how a specific set of weights is selected.

Given an example $(x,z)$, we can select the weights of a single neuron $j$ at the $i$-th layer
as:
$$w^{(i)}_{j,(z)} = \Theta^{(i)}_j c^{(i)}_{j,(z)}, \qquad (5)$$
where $\Theta^{(i)}_j \in \mathbb{R}^{m_{i-1} \times k}$, $k$ is the number of regions (we assume for simplicity that each neuron
in the network considers the same number of regions), and $c^{(i)}_{j,(z)} \in \mathbb{R}^k$. Notice that the main
characteristic of a Gated Linear Neuron is that, instead of having a single weight vector, each
GL neuron depends on a matrix of parameters $\Theta^{(i)}_j$.
The original paper [26] proposes to implement the gating in the $c$ functions with a
halfspace-gating mechanism. Given a vector $z \in \mathbb{R}^{d(z)}$ and a hyperplane with parameters
$a_i \in \mathbb{R}^{d(z)}$ and $b_i \in \mathbb{R}$, let us define a context function $\tilde{c}_i : \mathbb{R}^{d(z)} \to \{0,1\}$ as:
$$\tilde{c}_i(z) = \begin{cases} 1 & \text{if } a_i^\top z > b_i \\ 0 & \text{otherwise} \end{cases}$$
that divides $\mathbb{R}^{d(z)}$ into two half-spaces, according to the hyperplane $a_i^\top z = b_i$. We can compose
$\log_2(k)$ (assuming $k$ to be a power of 2) context functions of the same kind, obtaining a higher-order
context function $\tilde{c} : \mathbb{R}^{d(z)} \to \{0,1\}^{\log_2(k)}$, $\tilde{c} = [\tilde{c}_1, \dots, \tilde{c}_{\log_2(k)}]$. We can then easily define
a function $f$ mapping from $\{0,1\}^{\log_2(k)}$ to $\{0,\dots,k-1\} \subset \mathbb{N}$, obtaining the function
$\hat{c} : \mathbb{R}^{d(z)} \to \{0,\dots,k-1\}$, $\hat{c} = f \circ \tilde{c} = f(\tilde{c}(z))$. We can exploit the one-hot encoding of
the output of such a function and re-define it as $c : \mathbb{R}^{d(z)} \to \{0,1\}^k$, $c = \mathrm{one\_hot}(\hat{c})$.

Given a layer $i$, each neuron $j$ computes a different function $c^{(i)}_j : \mathbb{R}^{d(z)} \to \{0,1\}^k$.
For the $j$-th neuron at the $i$-th layer, the output of the context function applied to $z$ is thus
the (one-hot) vector $c^{(i)}_{j,(z)}$. Notice that this is a hard-gating mechanism, in the sense that,
fixed a context vector, the gating function will select a single weight from the set of weights
associated with the Gated Linear Neuron.
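The following minimal sketch (ours) illustrates the halfspace hard-gating mechanism: $\log_2(k)$ random hyperplanes, sampled once and never trained, map a context vector $z$ to one of $k$ regions, and the resulting one-hot vector selects a column of the neuron's parameter matrix $\Theta$ as in Eq. (5). The sampling distribution and its standard deviation are illustrative assumptions.

```python
import math
import torch


def make_halfspace_context(d_z, k, std=1.0):
    """Return a hard-gating context function with k regions (k a power of 2)."""
    n_bits = int(math.log2(k))
    a = torch.randn(n_bits, d_z) * std       # hyperplane normals, fixed and never trained
    b = torch.randn(n_bits) * std            # hyperplane offsets

    def context(z):
        bits = (a @ z > b).long()                                  # tilde{c}_i(z) per hyperplane
        region = int((bits * (2 ** torch.arange(n_bits))).sum())   # binary code -> region index
        return torch.nn.functional.one_hot(torch.tensor(region), num_classes=k).float()

    return context


# Eq. (5): the one-hot context vector selects one column of Theta.
context_fn = make_halfspace_context(d_z=16, k=8)
c = context_fn(torch.randn(16))
theta = torch.randn(32, 8)        # Theta: one weight vector (of size m_{i-1}) per region
w = theta @ c                     # weight vector used by the neuron for this context
```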
2.4.1 Layer-wise formulation
We can define a whole GLN layer by exploiting the definition of a single Gated Linear Neuron
in the previous section. This formulation will be exploited in the remainder of the paper. The
output for the $i$-th layer in a GLN (with $m_i$ neurons) for a sample $(x,z)$ is defined as:
$$h^{(i)}_{(x,z)} = \sigma\left( \sigma^{-1}\left( h^{(i-1)}_{(x,z)} \right) W^{(i)}_{(z)} \right), \quad i > 1, \qquad (6)$$
where
$$W^{(i)}_{(z)} = \left[ w^{(i)}_{1,(z)}, \dots, w^{(i)}_{m_i,(z)} \right], \qquad (7)$$
and $W^{(i)}_{(z)} \in \mathbb{R}^{m_{i-1} \times m_i}$.

Several layers can then be stacked. For a binary classification problem, the last layer
will comprise a single neuron, i.e., for a network with $l$ layers we have $W^{(l)}_{(z)} = w^{(l)}_{1,(z)}$,
$W^{(l)}_{(z)} \in \mathbb{R}^{m_{l-1} \times 1}$. The resulting model is, by construction, piecewise linear. Specifically,
given a context $z$, the model is (up to a final activation function) linear and can be written as:
$$y_{(x,z)} = \sigma\left( x\, W^{(1)}_{(z)} \cdots W^{(l-1)}_{(z)} W^{(l)}_{(z)} \right) = \sigma\left( x\, w_{(z)} \right), \qquad (8)$$
with a weight vector $w_{(z)} \in \mathbb{R}^{d(x)}$.
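To make the layer-wise formulation and the local training rule concrete, the following sketch (ours, not the released implementation) implements a GLN layer as in Eqs. (6)-(7) and a per-neuron update that performs a gradient step on each neuron's own logistic loss with respect to its selected weights only, so no error signal flows between layers. The learning rate, the output clamping, the zero initialization, and the dummy context functions in the usage example are illustrative assumptions; any context function (e.g., the halfspace or prototype gating described in this paper) can be plugged in.

```python
import torch


class GLNLayer:
    """One GLN layer (Eqs. (6)-(7)) with locally trained gated neurons."""

    def __init__(self, in_dim, context_fns, k, lr=0.01):
        # One context function per neuron; one parameter matrix Theta per neuron.
        self.contexts = context_fns
        self.theta = torch.zeros(len(context_fns), in_dim, k)
        self.lr = lr

    def forward(self, h_prev, z):
        c = torch.stack([ctx(z) for ctx in self.contexts])       # (m_i, k) gating vectors
        w = torch.einsum('jdk,jk->jd', self.theta, c)            # Eq. (5): one weight vector per neuron
        logits = torch.log(h_prev) - torch.log1p(-h_prev)        # sigma^{-1}(h^{(i-1)})
        out = torch.sigmoid(w @ logits).clamp(1e-4, 1 - 1e-4)    # Eq. (6)
        return out, c, logits

    def local_update(self, h_prev, z, target):
        # Each neuron takes a gradient step on its own logistic loss toward the
        # task target; nothing is backpropagated from the following layers.
        out, c, logits = self.forward(h_prev, z)
        grad = (out - target).unsqueeze(1) * logits.unsqueeze(0)     # d loss / d w_j
        self.theta -= self.lr * torch.einsum('jd,jk->jdk', grad, c)
        return out


# Usage with dummy context functions that always pick region 0.
dummy_ctx = lambda z: torch.eye(4)[0]
layer = GLNLayer(in_dim=8, context_fns=[dummy_ctx] * 6, k=4)
h0 = torch.sigmoid(torch.randn(8))       # h^{(0)} = sigma(x)
layer.local_update(h0, z=torch.randn(8), target=1.0)
```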
3 Backpropagation-free graph neural networks
This section defines our proposed models, which generalize GLNs to graph-structured data.
Firstly, we show how to embed the GLN idea into graph convolutions to build models based
on the message-passing paradigm. Then, we propose a multi-resolution approach that keeps
the features’ propagation through the graph structure separate from the processing of the
resulting node representations.
3.1 Message-passing GLN
The core concept behind several definitions of graph neural networks is the aggregation
function used to obtain information about the local graph structure surrounding a graph node.
The most straightforward aggregation mechanism involves just the summation of neighboring
nodes’ representations. For this simple mechanism, we obtain the following definition for a
single layer in a Gated Linear Graph Neural Network:
$$h^{(i)}_{(v,z)} = \sigma\left( \sigma^{-1}\left( h^{(i-1)}_{(v,z)} \right) W^{(i,1)}_{(z)} + \sum_{(u,z) \in N_v} \sigma^{-1}\left( h^{(i-1)}_{(u,z)} \right) W^{(i,2)}_{(z)} \right), \qquad (9)$$
for $i \geq 1$, and $h^{(0)}_{(v,z)} = \sigma(L(v))$. The weights $W^{(i,2)}_{(z)}$ and $W^{(i,1)}_{(z)}$ are defined as per Eq. (7)
and can be obtained by backpropagation-free training. This model can be considered as
a modification of GraphConv proposed in [2] in which gated geometric mixing has been
applied. For this reason, we refer to this model as Backpropagation-Free-GraphConv (BF-
GraphConv). Similarly to common formulations of graph neural networks, we can express
the hidden representation for all the nodes in the graph as a single matrix. We obtain the
following form of BF-GraphConv:
$$H^{(i)}_{(z)} = \sigma\left( \sigma^{-1}\left( H^{(i-1)}_{(z)} \right) W^{(i,1)}_{(z)} + A\, \sigma^{-1}\left( H^{(i-1)}_{(z)} \right) W^{(i,2)}_{(z)} \right), \qquad (10)$$
where $H^{(i)}_{(z)} \in \mathbb{R}^{n \times m_i}$ and $H^{(0)}_{(z)} = \sigma(X)$. BF-GraphConv can therefore be regarded as a
piecewise linear GNN depending on the neurons' context information $z$.
Fig. 1 A graphical layout of the proposed Message-Passing GLN, with an expanded view of the GLN neuron.
Each neuron receives in input the embedding of a node and the corresponding context $c^{(i)}_{(z)}$ (used to select
the associated weights $W^{(i)}_{(z)}$) and produces in output the value $h^{(i)}_{(v,z)}$, while $y(v)$ represents the target
supervision

Following common definitions of graph neural networks, we can resort to any message-
passing mechanism and define the Gated Linear counterpart. For instance, we can also
consider the GCN presented in Eq. (1) which leads to the following BF-GCN:
$$H^{(i)}_{(z)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} (I+A) \tilde{D}^{-\frac{1}{2}}\, \sigma^{-1}\left( H^{(i-1)}_{(z)} \right) W^{(i)}_{(z)} \right). \qquad (11)$$
The resulting model can be regarded as a piecewise linear GCN. In particular, after $l$ layers,
the output $H^{(l)}_{(z)}$ for the context $z$ can be written as
$$H^{(l)}_{(z)} = \sigma\left( \left( \tilde{D}^{-\frac{1}{2}} (I+A) \tilde{D}^{-\frac{1}{2}} \right)^{l} X\, w_{(z)} \right), \qquad (12)$$
i.e., the BF-GCN model with $l$ layers is a generalization of the simple graph convolutional
network (SGC) introduced in [21] and further investigated in [30], where the vector of
weights $w_{(z)}$ changes based on the input context. Notice that the main differences between
the Gated Linear Graph Neural Networks and commonly adopted GNN formulations are the
local training and the gating mechanism. A graphical layout of the proposed architecture is
reported in Fig. 1. The figure also reports an expanded view of the GLN neuron exploited to
define the BF-GCN (and the other architectures proposed in this paper).
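As an illustration of Eq. (11), the following dense sketch (ours; the helper names and the per-node loop are illustrative, and a practical implementation would batch the gating) computes one BF-GCN layer: the normalized propagation operator is applied to the logits of the previous representation, and each neuron multiplies the result by the weight vector produced by its context function.

```python
import torch


def bf_gcn_layer(H_prev, A, thetas, context_fns, Z, eps=1e-4):
    """H_prev: n x m_{i-1} matrix of probabilities (H^{(0)} = sigma(X)),
    Z: n x d(z) contexts, thetas: (m_i, m_{i-1}, k), context_fns: one per neuron."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
    S = d_inv_sqrt @ A_hat @ d_inv_sqrt                       # D^{-1/2}(I + A)D^{-1/2}
    logits = torch.log(H_prev) - torch.log1p(-H_prev)         # sigma^{-1}(H^{(i-1)})
    M = S @ logits                                            # message passing on the logits

    out = torch.zeros(H_prev.size(0), thetas.size(0))
    for v in range(H_prev.size(0)):                           # each node uses its own context
        C = torch.stack([ctx(Z[v]) for ctx in context_fns])   # (m_i, k) gating vectors
        W = torch.einsum('jdk,jk->jd', thetas, C)             # selected weights, one per neuron
        out[v] = torch.sigmoid(W @ M[v])
    return out.clamp(eps, 1 - eps)
```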
3.2 Multi-resolution GLN
A different definition of GNN is based on the idea of exploiting the power series of the
diffusion operator to obtain a multi-scale representation of the graph features. The obtained
representation is usually fed to an MLP that projects it into the output space. Our proposal is
to substitute the MLP with a GLN architecture.
Considering the explorative purpose of this work, we decide to adopt the most general
multi-scale representation definition proposed in [8]:
$$R_{l,T} = \left[ X,\; T(A)X,\; T(A)^2 X,\; \dots,\; T(A)^l X \right],$$
where $T : \bigcup_{j=1}^{\infty} \left( \mathbb{R}^{j \times j} \to \mathbb{R}^{j \times j} \right)$ is a generic transformation of the adjacency matrix that
preserves its shape, i.e., $T(A) \in \mathbb{R}^{n \times n}$. For instance, $T$ can be defined as the function
returning the Laplacian matrix starting from the adjacency matrix. Then we can apply the
GLN-based classifier. In particular, for each degree of the diffusion operator (up to $l$), we consider
$c$ neurons, where $c$ is the number of classes considered in the classification problem.
We recall that each neuron solves a binary classification problem. Therefore, similar to the
message-passing GLN case, the one-vs-rest approach is exploited:
$$h_i = \left[ h^{(i)}_{j,(R_{i,T},\,X)} \right]_{j \in [0 \dots c-1]}, \qquad h = \left[ \mathrm{GeomMean}\left( h_0[j], \dots, h_l[j] \right) \right]_{j \in [0 \dots c-1]}. \qquad (13)$$
Note that $h_i$ is computed considering only the diffusion operator's multi-resolution representation
up to degree $i$. This allows us to obtain the same effect that the authors have in
[8], where $l$ multi-resolution convolutions of degree ranging from 0 up to $l$ are concatenated.
Finally, to combine the results of each class's neurons, we compute the geometric mean.
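The following short sketch (ours) illustrates how the per-degree, per-class outputs of Eq. (13) can be combined with a geometric mean in the one-vs-rest setting; the clamping constant is an illustrative safeguard against zero probabilities.

```python
import torch


def geometric_mean_combination(per_degree_outputs):
    """per_degree_outputs: list of l+1 tensors of shape (c,), one per degree,
    holding the class-wise probabilities h_0, ..., h_l of Eq. (13)."""
    stacked = torch.stack(per_degree_outputs)                        # (l+1, c)
    h = torch.exp(torch.log(stacked.clamp_min(1e-8)).mean(dim=0))    # GeomMean per class
    return h                                                         # argmax gives the predicted class
```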
3.3 Context functions
The context function presented in Sect. 3 and exploited in (5) is based on random half-space
gating. That definition is suited for online learning, where the training data distribution is not
known beforehand. However, it is not data-driven and may result in the necessity of defining
a high number of context regions to obtain a sufficiently nonlinear model. Notice that the
halfspace gating mechanism depends on some hyperparameters: in addition to the number
of regions $k$, one has to choose the parameters of the distribution from which to sample
the weights corresponding to each hyperplane (e.g., mean and variance, assuming they are
sampled from a normal distribution). Setting these hyperparameters may be challenging since
their choice can strongly affect results.
In this section, we propose an alternative approach that can be exploited in the batch
learning scenario and that does not depend on any parameter but on the number of regions
to consider.
In particular, we propose to define a partition of the context space based on a set of
prototypes [28]. Each point in the space is assigned to its closest prototype, obtaining a
Voronoi tessellation. Note that half-space gating generates a division of the context space
that can be represented as a planar straight-line graph (PSLG) instead. It is possible to show
that any PSLG coincides with the Voronoi diagram of some set $S$ of points (i.e., prototypes)
[31]. Similarly to the half-space gating mechanism, the prototypes are not learned. However,
instead of randomly generating them, we propose sampling them randomly among the training
examples. This ensures that each prototype will lie on the input data manifold. Moreover, as
mentioned before, this approach relieves us from many hyperparameter choices.
Let $P^{(i)}_j \in \mathbb{R}^{k \times d(z)}$ be the matrix of prototypes. We can formally define for every $z \in \mathbb{R}^{d(z)}$
the context vector $c^{(i)}_{j,(z)} \in \{0,1\}^k$ as:
$$c^{(i)}_{j,(z)} = \mathrm{one\_hot}\left( \operatorname{argmin}\left( \mathrm{vecnorm}\left( P^{(i)}_j - \mathbf{1} \otimes z \right) \right) \right),$$
where $\mathbf{1}$ is a vector with all $k$ elements equal to 1, $\mathrm{vecnorm}(M)$ returns the $\ell_2$ norm of each
row of $M$, and $\otimes$ is the Kronecker product.
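A minimal sketch (ours) of the prototype-based hard gating follows: the prototypes are sampled uniformly from the training contexts, and a context $z$ is assigned to the Voronoi cell of its closest prototype. Function names are illustrative.

```python
import torch


def sample_prototypes(Z_train, k):
    """Draw k prototypes uniformly from the available training contexts."""
    idx = torch.randperm(Z_train.size(0))[:k]
    return Z_train[idx]                              # P, shape (k, d(z))


def prototype_context(P, z):
    dists = torch.linalg.norm(P - z, dim=1)          # vecnorm(P - 1 (x) z)
    region = torch.argmin(dists)                     # index of the closest prototype
    return torch.nn.functional.one_hot(region, num_classes=P.size(0)).float()
```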
3.4 A unified framework for soft and hard gating
A one-hot function $c^{(i)}_{j,(z)}$ assigns to every context vector $z$ precisely one region in $\mathbb{R}^{d(z)}$ based
on the closest distance to one of the prototypes. The prototypes correspond to the $k$ centers
of the Voronoi regions. For a one-hot vector $c^{(i)}_{j,(z)}$ exactly one of the entries is one (the one
corresponding to the closest prototype), while all other entries are zero. This implies that only
the closest prototype is considered when computing the output of the neurons, regardless of
the distances of $z$ to the other prototypes. In the following, we refer to this selection of $c^{(i)}_{j,(z)}$
as hard gating. Hard gating can lead to discontinuities in the outputs when the context $z$
is similarly close to several prototypes and the weights of the different prototypes vary. In
this case, a small amount of noise in the context $z$ can lead to considerably different neuron
outputs. To diminish these discontinuities on the borders of the regions, we alternatively
propose a soft-gating procedure.

Soft gating uses a more general probability distribution for the vectors $c^{(i)}_{j,(z)}$ instead of
a one-hot vector. In fact, one-hot vectors can be interpreted as a point mass distribution.
We will construct these general probability distributions in such a way that the entries of the
vectors $c^{(i)}_{j,(z)}$ depend smoothly on the distance to the prototypes and get large if the respective
distance is small. Using a probability distribution for the vectors $c^{(i)}_{j,(z)}$ results, according to
Eq. (5), in a weight $w^{(i)}_{j,(z)}$ which is a weighted sum of the contributions of all the weights
associated with the single prototypes.

For a soft-gating procedure, we therefore use probabilistic assignment vectors with the
following properties for every neuron $j$ and every layer $i$:
$$c^{(i)}_{j,(z)} \in [0,1]^k \quad \text{and} \quad \left\| c^{(i)}_{j,(z)} \right\|_1 = 1 \quad \text{for all } z \in \mathbb{R}^{d(z)}.$$
Remark 1 (i) The one-hot functions considered in the previous part are special cases of this
definition. One-hot functions allow linking the assignment functions to regions with the
prototypes as centers.
(ii) The set of all entries of the vectors $c^{(i)}_{j,(z)}$, considered as functions in $z$, forms a so-called
partition of unity for the Euclidean space $\mathbb{R}^{d(z)}$. This allows us to distribute the contributions
of the single prototypes to the combined weight once the context vector $z$ is given. In
hard gating, this partition of unity corresponds to a segmentation of $\mathbb{R}^{d(z)}$ into $k$ disjoint
regions.
(iii) From a different point of view, we can consider the entries of $c^{(i)}_{j,(z)}$ as function values
that depend on the neuron $j$. In this particular point of view, the entries of $c^{(i)}_{j,(z)}$ form a
partition of unity on the network. In [32], it was shown that partitions of unity based
on overlapping covers of the network have advantages in the reconstruction of smooth
functions compared to partitions of unity that are derived from a pure segmentation of
the network.
Soft-gating assignment vectors $c^{(i)}_{j,(z)}$ can be generally constructed using auxiliary vectors
$\psi^{(i)}_{j,z}$ with positive entries that depend continuously on the Euclidean distance to the
prototypes. Then a probabilistic assignment vector can be easily obtained by normalizing the
auxiliary vectors, i.e., by setting:
$$c^{(i)}_{j,(z)} = \frac{\psi^{(i)}_{j,z}}{\left\| \psi^{(i)}_{j,z} \right\|_1}.$$
Being defined as a probabilistic vector is not enough for $c^{(i)}_{j,(z)}$ to encode a meaningful measure
between an example and the prototypes. Specifically, we are interested in definitions encoding
the similarity between examples and prototypes. There are multiple possibilities to define
such a measure. In the following, we report the one we have used for our experiments.
Aiming to define a similarity and not a distance measure, we can simply subtract the
normalized distance from 1:
$$c^{(i)}_{j,(z)} = 1 - \frac{\psi^{(i)}_{j,z}}{\left\| \psi^{(i)}_{j,z} \right\|_1},$$
where $\psi^{(i)}_{j,z}$ can be instantiated as a function computing the Euclidean distance between the
context vector $z$ and the matrix $P^{(i)}_j$ of the prototypes of the $j$-th neuron of layer $i$, i.e.,
$\psi^{(i)}_{j,z} = \mathrm{vecnorm}\left( P^{(i)}_j - \mathbf{1} \otimes z \right)$.

Using this formulation, and according to Eq. (5), $w^{(i)}_{j,(z)}$ becomes the weighted sum of the
contributions of all input projections obtained considering the weights associated with each
region. The weight of each projection is based on the distance (the closer, the higher). Note
also that this formulation ensures that each element in $c^{(i)}_{j,(z)}$ lies in the interval $[0,1]$,
and $\left\| c^{(i)}_{j,(z)} \right\|_1 = 1$. Since this is the simplest among the options to implement a soft $c^{(i)}_{j,(z)}$, we
consider this formulation in our experiments.
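A minimal sketch (ours) of this soft-gating assignment follows: the distances to the prototypes are normalized and subtracted from 1, so closer prototypes receive larger entries and, through Eq. (5), the selected weight vector becomes a distance-weighted mixture of the columns of $\Theta$. The epsilon constant and the example shapes are illustrative.

```python
import torch


def soft_prototype_context(P, z, eps=1e-8):
    """Soft-gating assignment: 1 - normalized Euclidean distance to each prototype."""
    psi = torch.linalg.norm(P - z, dim=1)            # psi_{j,z}
    return 1.0 - psi / (psi.sum() + eps)             # entries lie in [0, 1]


# With soft gating, Eq. (5) mixes all weight vectors instead of picking one:
theta = torch.randn(32, 4)                           # Theta with k = 4 regions
P = torch.randn(4, 16)                               # prototypes
z = torch.randn(16)
w = theta @ soft_prototype_context(P, z)             # weighted sum of the k columns
```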
3.5 Incremental expressivity of soft Gated Linear Networks
The context functions $c^{(i)}_{j,(z)}$ incorporated in soft gating depend continuously on the Euclidean
distance from the contexts $z \in \mathbb{R}^{d(z)}$ to the prototype vectors encoded in the matrix $P^{(i)}_j$.
Now, if the number of prototypes gets larger, we expect the resulting GLNs to become more
expressive. We formalize and show this expected behavior for the two gated linear networks,
BF-GCN and BF-GraphConv. It is true for the soft and hard gated variants of the networks.

Theorem 1 We consider two Gated Linear Graph Neural Networks (either BF-GCN or BF-
GraphConv) based on a soft-gating assignment function $c_{(z)}$. We assume that the second
network contains more prototypes than the first network, i.e., we assume that the set $P_2 \subseteq \mathbb{R}^{d(z)}$
of prototypes of the second network model contains the prototypes $P_1$ of the first model.
Then, the second Gated Linear GNN is more expressive than the first.
Proof In this proof, we will only consider the BF-GCN model. For simplicity, we will also
assume that $X = x \in \mathbb{R}^{n \times 1}$, i.e., that the dimension of the input variable is $s = 1$ and, thus,
that we have real-valued weights $w(z) \in \mathbb{R}$. Now, for a set $P$ of prototypes let $\mathcal{H}(P)$ denote
the set of all possible functions
$$\mathcal{H}(P) = \left\{ y(z) = \sigma\left( \left( \tilde{D}^{-\frac{1}{2}} (I+A) \tilde{D}^{-\frac{1}{2}} \right)^{l} x\, w(z) \right) \;\middle|\; w(z) = \sum_{p \in P} c^{(P)}_{(z),p}\, w(p) \right\}$$
generated by a BF-GCN network with a given assignment vector $c^{(P)}_{(z)}$ with entries $c^{(P)}_{(z),p}$ and
set $P$ of prototypes. Assume now that the weight $w(z)$ is generated as a positive combination
of the weights of the prototypes in the smaller set $P_1$, i.e.,
$$w(z) = \sum_{p \in P_1} c^{(P_1)}_{(z),p}\, w(p).$$
By setting $\hat{w}(p) = w(p)\left( \sum_{p \in P_1} c^{(P_2)}_{(z),p} \right)^{-1}$ if $p \in P_1$ and $\hat{w}(p) = 0$ for $p \in P_2 \setminus P_1$, we get
$$w(z) = \sum_{p \in P_2} c^{(P_2)}_{(z),p}\, \hat{w}(p),$$
and, therefore, that $\mathcal{H}(P_1) \subseteq \mathcal{H}(P_2)$.

Table 1 Dataset statistics

Dataset    #Classes   #Edges    #Train   #Val   #Test
Citeseer   6          9228      1995     666    666
Cora       7          10,556    1624     542    542
Pubmed     3          88,651    11,829   3944   3944
WikiCS     10         216,123   7021     2340   2340

The columns #Train, #Val and #Test report the number of nodes in the training, validation and test sets,
respectively
Note that, in order to demonstrate the strict inclusion $\mathcal{H}(P_1) \subsetneq \mathcal{H}(P_2)$ in Theorem 1,
assuming that $P_2$ is strictly larger than the set $P_1$, an additional assumption on the entries
$c^{(P_2)}_{(z),p}$ of the context functions is required. If considered as functions of $z$, the entries $c^{(P_2)}_{(z),p}$,
$p \in P_2$, have to be linearly independent. Only in this way can it be guaranteed that additional
prototypes will generate new linear combinations of weights. From Theorem 1, we also get
the following result as an immediate consequence.

Corollary 1 Every BF-GCN with soft gating and more than one prototype is more expressive
than a simplified graph convolutional network (SGC) [21]
$$H = \sigma\left( \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \right)^{l} X W \right).$$
4 Experimental results
We evaluated our proposed backpropagation-free graph neural networks on four benchmark
datasets of graph node classification commonly adopted in the literature: Citeseer, Cora,
Pubmed and WikiCS [33]. Citeseer, Cora and Pubmed are datasets representing relationships
among research papers. The task is to classify each paper into a pre-defined set of topics.
Each dataset is composed of a single large graph. The nodes represent documents and are
enriched by 0/1-valued bag-of-words feature vectors indicating the absence/presence of the
corresponding word from a dictionary. WikiCS is a dataset of Computer Science Wikipedia
articles represented as nodes in a graph. Two nodes are connected if a hyperlink connects the
two corresponding web pages. The node features are derived from the text in the web page,
as the average of pre-trained GloVe word embeddings. For each dataset, we split the nodes
into training (60%), validation (20%), and test (20%) sets. Relevant statistics on the datasets
are reported in Table 1. As performance measure we opted for the classification accuracy,
since it is the most common choice for the benchmark datasets we considered.
We used PyTorch Geometric [34] to develop all the models in our experimental com-
parison. We considered the GCN and the GraphConv convolutions as baselines, for which
we exploited the implementation provided by the library. In addition, we also evaluated a
multi-resolution architecture (MRGNN) as a baseline that exploits the same graph-augmented
representation of the proposed backpropagation-free multi-resolution GLN, followed by an
MLP that projects the node representations to the output space. All the baseline models are
trained end-to-end using the backpropagation algorithm.
We explored two choices for the definition of context of the multilayer backpropagation-
free models (BF-GCN and BF-GraphConv), i.e., the standard definition in which the context
is the same as the input X, and an alternative formulation where we use each node's inputs,
i.e., the hidden representation computed at the preceding layer (H) as a context for all gated
neurons. Moreover, we tested our proposal of a soft-gating mechanism presented in Sect. 3.4.
We used the Adam algorithm to solve the resulting optimization problems. We used early
stopping (with the patience set to 15 epochs) and model checkpoint, monitoring the accu-
racy on the validation set. We set the maximum number of epochs to 250. All the baseline
experiments involved the softmax activation function applied to the last layer. The results
were averaged by performing five runs for each model. Our experiments have been executed
on a machine equipped with: 2x Intel(R) Xeon(R) CPU E5-2630L v3, 192 GB of RAM and
an Nvidia Tesla V100. For more details, please check the publicly available code.1
4.1 Model selection
Before discussing the results of our experiment, it is of utmost importance to stress that dif-
ferent papers in the literature adopt different model selection and error estimation procedures.
Different procedures can produce very different results. A key aspect to consider is the pro-
cedure adopted to select the hyperparameters (such as learning rate, regularization, network
architecture). Many papers report, for each dataset, the best performance (on the test set)
obtained after testing many hyperparameter configurations. This procedure favors complex
methods that depend on many hyperparameters, since they have a larger set of trials to select
from compared to simpler methods. However, the predictive performances computed in this
way are not unbiased estimations of the true error; thus, these results are not comparable
to other model selection methods [35]. Moreover, since neural networks tend to show long
training times, different works propose to fix different hyperparameters depending on the
particular datasets considered. To avoid differences in results due to the model selection con-
ditions, we ran all the experiments using the same fair model selection procedure, where we
selected all the hyperparameters of each method on the validation set. The hyperparameters
of the model (number of hidden units, number of layers, learning rate, weight decay, and
dropout) were selected by using a narrow grid search, where the explored sets of values do
change based on the considered dataset. We performed preliminary tests to select the set of
values considered for each hyperparameter. In Table 2, we report the sets of hyperparameter
values used for the grid search. The selection of the epoch was performed for each fold
independently, based on the accuracy value on the validation set. The
importance of the validation strategy is discussed in [36], where results of a fair comparison
among the considered baseline models are reported. We use the same hyperparameter grid for
all the models to perform a fair comparison between the proposed models and the baselines.
Since our models adopt a one-vs-rest approach, for the baselines we also consider a number
of hidden units corresponding to the values reported in Table 2 multiplied by the number
of classes of the considered dataset. As an evaluation measure to perform model selection,
we used the average accuracy computed on the validation set, while we report in Tables 3, 4
and 5 the average accuracy on the test set along with the hyperparameters selected by the
validation process. We recall that the contribution of this paper is not to present yet another
graph neural network architecture performing slightly better than other alternatives. Instead,
we want to show that it is possible to match the performance of different graph convolutional
neural networks (and thus perform an effective representation learning) even without relying
on backpropagation.

1 https://github.com/lpasa/BF-GNN.

Table 2 Sets of hyperparameter values used for model selection via grid search

Hyperparameter    Values
m                 2, 4, 6, 8, 16, 24, 32, 64
l                 1, 2, 3, 4, 5, 6, 7, 8
k                 1, 2, 4, 8, 16
Learning rate     0.1, 0.2, 0.01, 0.001
Weight decay      0, 5^-4, 5^-3
Drop out          0, 0.2, 0.5
T(A)              A, L

Table 3 Accuracy comparison between the GCN and BF-GCN models

Model     Context function   Context   Citeseer     Cora         Pubmed       WikiCS
GCN       -                  -         76.6 ± 1.0   87.5 ± 0.9   88.5 ± 0.3   81.6 ± 0.7
BF-GCN    Halfspace          H         76.0 ± 1.4   87.7 ± 0.5   88.1 ± 0.3   79.8 ± 0.8
                                       (2,16,2)     (1,16,2)     (1,8,2)      (2,6,2)
          Halfspace          X         76.3 ± 1.8   88.0 ± 0.5   88.1 ± 0.4   80.7 ± 0.7
                                       (2,24,2)     (2,16,2)     (2,4,2)      (2,6,8)
          Proto              H         77.0 ± 0.4   87.8 ± 0.3   88.1 ± 0.7   81.0 ± 0.9
                                       (2,16,4)     (2,8,2)      (2,32,4)     (2,4,16)
          Proto              X         76.9 ± 2.0   88.0 ± 0.4   88.0 ± 0.6   80.4 ± 0.8
                                       (2,8,1)      (2,32,2)     (2,8,2)      (2,6,16)
          Soft Proto         H         76.9 ± 1.9   87.6 ± 0.9   88.1 ± 0.4   75.8 ± 0.5
                                       (2,2,8)      (2,24,8)     (2,2,8)      (2,6,2)
          Soft Proto         X         76.6 ± 2.0   87.5 ± 0.9   88.1 ± 0.4   75.9 ± 0.5
                                       (1,24,2)     (1,24,4)     (1,2,16)     (2,6,2)

The results in bold represent the best accuracy achieved for each dataset. The model selection is performed
considering the results obtained on the validation set. Under each accuracy result, we report the
hyperparameters selected via the validation process: (l, m, k)
4.2 Discussion
The results obtained by the backpropagation-free graph neural networks, namely BF-GCN,
BF-GraphConv and BF-MRGNN, are generally comparable and sometimes higher than
the ones of the baselines trained with backpropagation. We recall that the main goal of
backpropagation-free neural networks is not to obtain higher accuracies but to obtain com-
parable performance in a more parallelizable and biologically plausible setting. In particular,
our proposals are significantly easier to train since they optimize convex optimization prob-
lems (one for each neuron that can be parallelized), compared to the single but significantly
more complex nonlinear problem optimized by the baselines. In the following discussion
we consider the results obtained on the test set by the model selected by validating all the
hyperparameters on the validation set.

Table 4 Accuracy comparison between the GraphConv and BF-GraphConv models

Model           Context function   Context   Citeseer      Cora         Pubmed       WikiCS
GraphConv       -                  -         75.1 ± 1.6    87.1 ± 0.5   88.9 ± 0.4   76.8 ± 1.3
BF-GraphConv    Halfspace          H         76.5 ± 0.9    88.0 ± 1.0   86.8 ± 0.9   80.6 ± 0.4
                                             (1,32,2)      (2,24,2)     (2,2,2)      (4,8,2)
                Halfspace          X         76.3 ± 1.8    88.0 ± 1.1   86.5 ± 0.4   80.5 ± 0.5
                                             (2,24,2)      (1,16,2)     (2,8,4)      (2,6,2)
                Proto              H         76.2 ± 1.2    88.5 ± 1.3   86.1 ± 0.2   81.4 ± 0.8
                                             (2,8,2)       (2,8,1)      (2,32,4)     (2,32,2)
                Proto              X         74.9 ± 0.6    87.8 ± 1.1   86.5 ± 0.3   82.3 ± 0.7
                                             (2,16,2)      (2,8,4)      (2,16,4)     (2,8,16)
                Soft Proto         H         76.9 ± 1.4    88.4 ± 1.2   86.5 ± 0.7   80.3 ± 0.4
                                             (1,2,32)      (2,4,4)      (1,32,16)    (1,16,2)
                Soft Proto         X         76.8 ± 14.7   88.1 ± 1.4   86.1 ± 0.7   80.1 ± 0.2
                                             (1,2,64)      (1,8,8)      (2,32,2)     (2,6,2)

The results in bold represent the best accuracy achieved for each dataset. The model selection is performed
considering the results obtained on the validation set. Under each accuracy result, we report the
hyperparameters selected via the validation process: (l, m, k)

Table 5 Accuracy comparison between the Backpropagation-Free MRGNN (BF-MRGNN) and the standard MRGNN

Model       Context function   Citeseer     Cora         Pubmed       WikiCS
MRGNN       -                  74.2 ± 1.6   86.5 ± 0.8   87.2 ± 0.9   80.6 ± 0.7
BF-MRGNN    Halfspace          76.3 ± 1.5   87.6 ± 0.6   87.6 ± 0.5   78.4 ± 0.7
                               (3,2)        (3,2)        (2,2)        (3,4)
            Proto              77.5 ± 1.0   87.1 ± 0.9   87.9 ± 0.4   79.4 ± 0.6
                               (2,2)        (3,2)        (2,2)        (2,2)
            Soft Proto         76.6 ± 1.6   87.7 ± 0.6   87.6 ± 0.3   77.4 ± 0.6
                               (5,4)        (3,8)        (2,8)        (3,2)

The results in bold represent the best accuracy achieved for each dataset. The model selection is performed
considering the results obtained on the validation set. Under each accuracy result, we report the
hyperparameters selected via the validation process: (l, k)
Let us start discussing the GCN convolution mechanism presented in Table 3, where we
report the results obtained validating all the hyperparameters on the validation set. For each
method and dataset, we report the average accuracy and the standard deviation over five
runs. In the table, we compare the results of backpropagation-based GCN and the proposed
backpropagation-free models (BF-GCN) with the two alternative context definitions, i.e., H
and X. Moreover, for BF-GCN, we experimented with the two context functions based on
random half-space gating and our proposal of defining the partition of the space based on
a set of prototypes (context column in Table 3). Finally, we report the results of the soft-
gating mechanism (Soft Proto), with both choices of context function. We can notice that
the performance of the backpropagation-free models is almost always close to the ones of
GCN. We can conclude that, in this case, all the backpropagation-free models based on GCN
can learn a representation that is comparably expressive to the one of the backpropagation-
based GCN. With this convolution mechanism, there is no clear advantage of using either
H or X as contexts. These results show that the proposed backpropagation-free methods are
pretty resilient and show consistent performance even with significantly different choices of
context space. Moreover, the results suggest that the prototype-based context function allows
reaching slightly better performance in terms of accuracy compared to half-space gating.
Let us now consider the GraphConv convolution mechanism. The results of our BF meth-
ods based on this convolution are reported in Table 4. In this case, backpropagation-free
models show slight but consistent accuracy improvements on the Citeseer (up to 1.8%), Cora
(up to 1.4%) datasets and WikiCS (up to 5.5%). On Pubmed, BF methods perform slightly
lower than the baseline GraphConv model, with a gap of at least 2.1%. Notice that the differ-
ences between BF-GraphConv and GraphConv on WikiCS and Pubmed are greater than one
standard deviation. We can notice that with GraphConv, using Has context tends to provide
slightly higher performances compared to using X. Analyzing the accuracy, no clear advan-
tages can be noticed in using the prototypes-based context function instead of a half-space
gating mechanism.
As for BF-MRGNN, the results obtained on Citeseer, Cora and Pubmed show
an improvement with respect to its counterpart trained via backpropagation, i.e., MRGNN.
More specifically on Citeseer, BF-MRGNN with prototypes as contexts achieved an improve-
ment of 3.3% compared to MRGNN, which is significant, the two results being more than
a standard deviation apart. On WikiCS, the accuracy difference between MRGNN and its
backpropagation-free version is within the standard deviation interval, with the exception of
the Soft Proto version. Note that in the BF-MRGNN, due to its definition (Eq. (13)), only X
can be used as context.

Table 6 Accuracy comparison on the node classification task between the proposed models and seven baselines

Model \ dataset   Citeseer      Cora         Pubmed       WikiCS
GCN               76.1 ± 1.0    87.5 ± 0.9   88.5 ± 0.3   81.6 ± 0.7
GraphConv         75.1 ± 1.6    87.1 ± 0.5   88.9 ± 0.4   76.8 ± 1.3
GAT               76.0 ± 1.4    87.3 ± 1.1   87.2 ± 0.4   79.1 ± 0.3
LGC               75.4 ± 1.3    88.5 ± 1.2   89.4 ± 0.8   80.0 ± 0.5
hExpGC            75.7 ± 0.80   87.5 ± 1.3   88.4 ± 0.3   78.4 ± 0.35
hLGC              77.3 ± 0.7    87.5 ± 0.7   89.9 ± 0.3   80.1 ± 0.5
MRGNN             74.2 ± 1.6    86.5 ± 0.8   87.2 ± 0.9   80.6 ± 0.7
BF-GCN            77.0 ± 0.4    88.0 ± 0.4   88.1 ± 0.4   81.0 ± 0.9
BF-GraphConv      76.9 ± 1.4    88.5 ± 1.3   86.8 ± 0.9   82.3 ± 0.7
BF-MRGNN          77.5 ± 1.0    87.6 ± 0.6   87.9 ± 0.4   79.4 ± 0.6

The results in bold represent the best accuracy achieved for each dataset. The model selection is performed
considering the results obtained on the validation set
We can conclude that the proposed backpropagation-free graph convolutions are compet-
itive with their backpropagation-based counterparts while inheriting all the advantages of
backpropagation-free methods.
As for the comparison between the two considered gating mechanisms (halfs-
pace and prototype), we obtained no strong evidence in terms of accuracy in favor of using one
approach over the other. However, the prototype approach does have an advantage in reduc-
ing the number of hyperparameters. In fact, it is not straightforward to define the half-space
gating hyperparameters, as random initialization of hyperplanes introduces a strong assump-
tion on the data distribution. In our experiments, we decided to keep the same parameters
used in [4] for the distribution from which the weights corresponding to each hyperplane are
sampled, since modifying them even slightly seemed to impact the predictive performance.
On the other hand, the proposed prototype-based context function allows us to initialize the
gating mechanism in a data-driven way, which turns out to be very simple since we can
just uniformly sample them from the training set. The soft-gating approach shows improved
performance in some cases, while in other cases, the hard-gating approach performs better.
We argue that the smoothness inductive bias introduced by the soft gating makes the model
less nonlinear compared to the hard-gating version. Finally, in terms of time complexity, the
backpropagation-free models present a huge advantage with respect to standard GNNs.
Indeed, the computation (both forward and backward step) of each neuron is independent
from all the others. Thus it is possible to perform the computation of each unit in parallel.
Considering the message-passing GLNs, the layerwise construction of the model allows all
neurons in the same layer to be computed in parallel. This constraint is overcome by the
BF-MRGNN, where all the neurons are completely independent, and thus it is possible to
parallelize the computation on all neurons.
Fig. 2 UMAP 2D graphical representations of the embeddings of nodes by BF-GCN (using context X and
the prototype-based gating mechanism) and GCN for all datasets. 1st column: original data; 2nd column: first
layer computed by BF-GCN; 3rd column: second hidden layer computed by BF-GCN; 4th column: last hidden
layer computed by the GCN. The nodes are colored based on their target class

In Table 6 we compare the performance obtained by the best configuration (selected
considering the accuracy of the validation set) of the proposed BF-based models with some
state-of-the-art baselines. Specifically, we consider GCN, GraphConv, GAT [37], LGC, ECG
and hLGC [38]. The comparison shows how the proposed backpropagation-free architectures
achieve higher or comparable results in all the datasets. In particular, they achieved the best
results on three datasets out of four (Citeseer, Cora and WikiCS) and comparable accuracy
on Pubmed.
4.3 Computed representation
In Fig. 2, we used UMAP [39] to generate 2D representations of each node of the four
considered datasets in the input space, of its representations obtained after one and two
layers of BF-GCN using X as context, and of the representation obtained with the backpropagation-
based GCN. UMAP tries to maintain the distances between multi-dimensional points by
projecting them in two dimensions (allowing for human-readable data visualization). The
resulting plots must be interpreted qualitatively. The network settings considered to generate
these plots are selected through the validation phase (i.e., the ones reported in Tables 3, 4
and 5). The color of each node represents the class it belongs to. In Fig. 2, the first plot of each
row represents the manifold of the input space. In Cora and Citeseer datasets, the positioning
of the nodes is chaotic, while in Pubmed and WikiCS it is possible to note some areas where
the nodes of a specific class are clustered.
Each row’s second and third plots show the spatial representation obtained by the BF-GCN
after the first and second hidden layers, using X as a context and the prototypes-based gating
mechanism, respectively.
In the Cora, Citeseer and WikiCS datasets, it is possible to see that with one hidden
layer, BF-GCN representations already achieve a good class separation. At the second layer,
the separation tends to increase, at least visually. Obviously, backpropagation-based GCN
also achieves a good level of separation. Both with backpropagation-based GCN and the
backpropagation-free counterpart, we can observe that the models tend to cluster examples
belonging to the same class. In some cases, e.g., WikiCS, the resulting visualizations are
pretty close, while in other cases, e.g., Pubmed or Citeseer, the distribution of the examples
in the 2D space can differ. However, notice that both approaches tend to separate
examples belonging to different classes. It is worth noticing that both the adopted manifold
learning method (UMAP) and the hyperparameters of the model can influence the obtained
plots.
4.4 BF models and model selection
One of the most costly phases of the development of a model is the selection of hyperparam-
eters (i.e., the validation phase). Several experiments using different hyperparameter values
must be run to select the model that ensures the best generalization capabilities. In our specific
case, for the baseline models, we used a grid search approach, which resulted in more than
33,000 runs for each dataset (note that for each configuration, we perform five runs to have
a statistically significant evaluation of the model performance). The particular structure of
the BF-GNN allows us to considerably reduce the number of required runs. Regarding
the BF-GCN/GraphConv, all the units that compose a layer can be trained in
parallel. Thus, when adding units to the last layer, re-training the model from scratch is not
required, but a pre-trained model with fewer units can be used as a starting point. Moreover,
it has to be noticed that each unit (and so each layer) of the BF-GCN/GraphConv predicts
the classification of the input (in a one vs rest setting). The outputs of all the neurons in a
layer are then used as input for the next layer of the model. Therefore, stacking an additional
layer (the l-th) does not require performing full training of a novel model, but it can be done
incrementally from a previously trained model with l1 layers. Clearly, this incremental val-
idation policy significantly reduces the time required to perform the model selection phase. It
is important to notice that in BF-GCN/GraphConv, the layer-wise structure poses constraints
on the dependencies between the units in a layer and those in the following ones. Indeed, let
us consider a network whose first layer has m_1 units and whose second layer has m_2 units. If we
want to use a different number of units in the first layer, we have to re-train the second layer
from scratch. This particular constraint does not apply to BF-MRGNN, where
the computation of R_{l,T} and the application of the gating neurons are entirely decoupled.
Moreover, in BF-MRGNN the use of the geometric mean (which does not require training)
to mix the outputs of the various layers that compute the prediction for the
same class makes the addition of one or more neurons entirely transparent to the training
process. Finally, the independence between the training of units of the same layer
(and, in the case of BF-MRGNN, of all the units) brings down the
cost of training a single model. The resulting reduction of training cost can be appreciated in
Table 7, where we report the time required to train the best model selected through grid
search. In the table, under each BF-GNN, we report the speed-up computed with respect to
the corresponding baseline model, i.e., the one using the same convolution. In all cases, the
backpropagation-free version shows a significant speed-up. Specifically, BF-MRGNN
is always more than 10 times faster than MRGNN, reaching a speed-up
higher than 1000 times on Pubmed and WikiCS.

Table 7 Training-time comparison between the models proposed in this paper and the corresponding baselines (GCN, GraphConv and MRGNN)

Model           Citeseer          Cora              Pubmed             WikiCS
GCN             0.776 ± 0.189     1.610 ± 0.428     2.372 ± 1.983      1.995 ± 0.558
  (l, m)        (1, 8)            (2, 16)           (2, 48)            (1, 8)
GraphConv       0.681 ± 0.152     1.602 ± 0.371     1.108 ± 0.0912     4.395 ± 0.818
  (l, m)        (2, 96)           (2, 128)          (2, 32)            (1, 64)
MRGNN           0.637 ± 0.064     0.331 ± 0.0474    60.278 ± 11.352    111.465 ± 10.690
  (k)           (2)               (2)               (7)                (5)
BF-GCN          0.120 ± 0.001     0.247 ± 0.001     0.233 ± 0.003      0.484 ± 0.006
  Speed-up      6.3               6.5               10.2               4.1
BF-GraphConv    0.249 ± 0.001     0.246 ± 0.002     0.882 ± 0.038      0.242 ± 0.003
  Speed-up      2.7               6.5               1.4                18.2
BF-MRGNN        0.038 ± 0.001     0.025 ± 0.001     0.018 ± 0.001      0.079 ± 0.0242
  Speed-up      16.8              13.2              3338.8             1410.9

For each model and dataset, we report the average duration (and standard deviation), in seconds, of the
training phase over the five runs of the best-performing model. Under each baseline result, we report the
hyperparameters selected via the validation process. Under each backpropagation-free model, we report the
speed-up obtained with respect to the same convolution optimized using the backpropagation algorithm.
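To make this incremental, layer-wise scheme concrete, the sketch below illustrates the idea under simplifying assumptions: each unit is approximated by an independent one-vs-rest logistic regression on neighborhood-aggregated features (a stand-in for the actual gated graph-convolutional neurons), and the normalized adjacency A_hat, the features X and the labels y are placeholders. The point is that widening or deepening the model reuses the already trained units unchanged.

```python
# Sketch of the incremental, layer-wise training/model-selection policy.
# Simplifying assumptions: a one-vs-rest logistic regression stands in for each
# backpropagation-free unit, and aggregation is a plain matrix product with A_hat.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_layer(A_hat, H, y, num_units, classes):
    """Train the units of one layer independently (hence parallelizable)."""
    H_agg = A_hat @ H                                   # neighborhood aggregation
    units = []
    for u in range(num_units):
        c = classes[u % len(classes)]                   # each unit is a one-vs-rest predictor
        units.append(LogisticRegression(max_iter=200).fit(H_agg, (y == c).astype(int)))
    return units

def layer_output(A_hat, H, units):
    H_agg = A_hat @ H
    return np.stack([u.predict_proba(H_agg)[:, 1] for u in units], axis=1)

# Placeholder data (replace with the normalized adjacency, node features and labels).
rng = np.random.default_rng(0)
n, classes = 200, np.arange(4)
A_hat, X, y = np.eye(n), rng.normal(size=(n, 16)), rng.integers(0, 4, size=n)

layer1 = train_layer(A_hat, X, y, num_units=8, classes=classes)    # trained once
H1 = layer_output(A_hat, X, layer1)                                # frozen, reused below
layer2 = train_layer(A_hat, H1, y, num_units=8, classes=classes)   # only the new layer is trained
```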
5 Conclusions and future directions
In this paper, we extended our exploration [9] of locally trainable graph convolutional oper-
ators. We presented a framework for defining backpropagation-free graph neural networks
that is inspired by Gated Linear Networks. We explored variants of our approach exploit-
ing both the common message-passing-based convolution scheme (GCN and GraphConv)
and a multi-resolution graph architecture (MRGNN). The training relies on a representation
space of graph nodes that is shattered into different subspaces according to the node context.
Indeed, each neuron that composes the GC operator is defined as a set of weight vectors.
A gating mechanism within each neuron selects the weight vector to use for processing the
input based on its context. This mechanism allows each neuron to be trained independently,
without using backpropagation, resulting in a set of convex problems to solve. We studied three
variants of such a gating mechanism: two hard versions and a soft one.
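As a rough illustration of this per-neuron scheme, the sketch below implements a single prototype-gated neuron under simplifying assumptions (hard gating towards the nearest prototype, one logistic output per weight vector, and plain SGD steps on the resulting convex per-region log-loss); it is meant to convey the mechanism, not to reproduce the exact formulation used in the paper.

```python
# Sketch of a single hard-gated (prototype-based) neuron trained without backpropagation.
# The neuron stores one weight vector per context region; the region of a node is the
# Voronoi cell of the nearest prototype, and only the selected weight vector is updated.
import numpy as np

class GatedNeuron:
    def __init__(self, in_dim, prototypes, lr=0.1, seed=0):
        self.prototypes = np.asarray(prototypes)        # (num_regions, context_dim)
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(len(self.prototypes), in_dim))
        self.lr = lr

    def _region(self, context):
        # Hard gating: index of the closest prototype.
        return int(np.argmin(np.linalg.norm(self.prototypes - context, axis=1)))

    def predict(self, x, context):
        k = self._region(context)
        return 1.0 / (1.0 + np.exp(-self.W[k] @ x))     # logistic (one-vs-rest) output

    def update(self, x, context, target):
        # One SGD step on the log-loss of the selected weight vector only: the loss
        # is convex in W[k], and different neurons can be updated in parallel.
        k = self._region(context)
        p = 1.0 / (1.0 + np.exp(-self.W[k] @ x))
        self.W[k] -= self.lr * (p - target) * x

# Usage idea: context = the node's raw attributes (X), x = the aggregated neighborhood
# representation, target = the binary one-vs-rest label of the node.
```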
We empirically assessed the performance of BF-GCN, BF-GraphConv and BF-MRGNN
on four commonly adopted node classification benchmarks and verified their competitive
performance. Moreover, we analyzed the behavior of these models, considering different
options for shattering the context space associated with graph nodes. Finally, we evaluated
the impact in terms of computational efficiency, comparing the training time of the baselines
with that of their corresponding backpropagation-free versions.
The results show that the proposed BF-GNNs ensure a considerable speed-up,
thanks to the possibility of parallelizing the training of the units belonging to the same
layer. Moreover, we showed how the particular structure of the backpropagation-free models
allows us to design a very efficient model selection strategy. We implemented layer-wise
training via Stochastic Gradient Descent (SGD), but many other methods can be exploited
to solve the resulting convex problems, making the approach suitable for online and continual
learning scenarios. Indeed, we plan to explore the application of the proposed backpropagation-free
graph neural networks to continual learning tasks in the near future.
Author Contributions Luca Pasa and Nicolò Navarin helped in conceptualization, methodology, software,
writing. Wolfgang Erb and Alessandro Sperduti were involved in conceptualization, methodology, formal
analysis, writing.
Funding Open access funding provided by Università degli Studi di Padova
Declarations
Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included in the
article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is
not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: ICLR.
Available: arXiv:1609.02907
2. Morris C, Ritzert M, Fey M, Hamilton WL, Lenssen JE, Rattan G, Grohe M (2019) Weisfeiler and leman
go neural: higher-order graph neural networks. In: Proceedings of the AAAI conference on artificial
intelligence, vol 33, pp 4602–4609. Available: arXiv:1810.02244
3. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum
chemistry. In: Proceedings of the 34th international conference on machine learning, pp 1263–1272
4. Veness J, Lattimore T, Budden D, Bhoopchand A, Mattern C, Grabska-Barwinska A, Sezener E, Wang J,
Toth P, Schmitt S et al (2021) Gated linear networks. In: Proceedings of the AAAI conference on artificial
intelligence, vol 35, no 11, pp 10015–10023
5. Whittington JC, Bogacz R (2019) Theories of error back-propagation in the brain
6. Clark DG, Abbott LF, Chung S (2021) Credit assignment through broadcasting a global error vector.
arXiv:2106.04089 [cs, q-bio]
7. Chen L, Chen Z, Bruna J (2020) On graph neural networks versus graph-augmented MLPs. arXiv preprint
arXiv:2010.15116
8. Pasa L, Navarin N, Sperduti A (2021) Polynomial-based graph convolutional neural networks for graph
classification. Mach Learn
9. Pasa L, Navarin N, Erb W, Sperduti A (2022) Backpropagation-free graph neural networks. In: IEEE
international conference on data mining (ICDM), pp 388–397
10. Sperduti A, Starita A (1997) Supervised neural networks for the classification of structures. IEEE Trans
Neural Netw 8(3):714–735
11. Gärtner T (2003) A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter 5(1):49
12. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network
model. IEEE Trans Neural Netw 20(1):61–80
13. Micheli A (2009) Neural network for graphs: a contextual constructive approach. IEEE Trans Neural
Netw 20(3):498–511
14. Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: ICLR,
pp 1–14
15. Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast
localized spectral filtering. In: NIPS, pp 3844–3852
16. Hamilton W, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: NIPS, pp
1024–1034
17. Li Y, Tarlow D, Brockschmidt M, Zemel R (2016) Gated graph sequence neural networks. In: ICLR.
Available: arXiv:1511.05493
18. Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? In: International
conference on learning representations
19. Tran DV, Navarin N, Sperduti A (2018) On filter size in graph convolutional networks. In: IEEE SSCI.
IEEE, Bengaluru, pp 1534–1541. Available: https://ieeexplore.ieee.org/document/8628758/
20. Xinyi Z, Chen L (2019) Capsule graph neural network. In: ICLR
21. Wu F, Zhang T, de Souza AH, Fifty C, Yu T, Weinberger KQ (2019) Simplifying graph convolutional
networks. In: ICML
22. Chen T, Bian S, Sun Y (2019) Are powerful graph neural nets necessary? A dissection on graph
classification. arXiv preprint arXiv:1905.04579
23. Luan S, Zhao M, Chang X-W, Precup D (2019) Break the ceiling: stronger multi-scale deep graph
convolutional networks. In: Advances in neural information processing systems, pp 10945–10955
24. Rossi E, Frasca F, Chamberlain B, Eynard D, Bronstein M, Monti F (2020) Sign: scalable inception graph
neural networks. arXiv preprint arXiv:2004.11198
25. Pasa L, Navarin N, Sperduti A (2021) Multiresolution reservoir graph neural network. IEEE Trans Neural
Netw Learn Syst 1–12
26. Veness J, Lattimore T, Budden D, Bhoopchand A, Mattern C, Grabska-Barwinska A, Sezener E, Wang
J, Toth P, Schmitt S, Hutter M (2019) Gated linear networks. Available: arXiv:1910.01526
27. Veness J, Lattimore T, Bhoopchand A, Grabska-Barwinska A, Mattern C, Toth P (2017) Online learning
with gated linear networks. arXiv, pp 1–40
28. Munari M, Pasa L, Zambon D, Alippi C, Navarin N (2022) Understanding catastrophic forgetting of
gated linear networks in continual learning. In: 2022 International joint conference on neural networks
(IJCNN), pp 1–8
29. Mattern C (2013) Linear and geometric mixtures—analysis. In: Data compression conference proceed-
ings, pp 301–310
30. Navarin N, Erb W, Pasa L, Sperduti A (2020) Linear graph convolutional networks. In: European
symposium on artificial neural networks, computational intelligence and machine learning
31. Aloupis G, Pérez-Rosés H, Pineda-Villavicencio G, Taslakian P, Trinchet D (2013) Fitting voronoi
diagrams to planar tesselations. arXiv:1308.5550 [cs]
32. Cavoretto R, Rossi AD, Erb W (2021) Partition of unity methods for signal processing on graphs. J Fourier
Anal Appl 27:66
33. Mernyei P, Cangea C (2020) Wiki-cs: a Wikipedia-based benchmark for graph neural networks. arXiv
preprint arXiv:2007.02901
34. Fey M, Lenssen JE (2019) Fast graph representation learning with pytorch geometric. In: ICLR 2019
(RLGM Workshop)
35. Oneto L (2020) Model selection and error estimation in a nutshell. Springer, Berlin
36. Errica F, Podda M, Bacciu D, Micheli A (2020) A fair comparison of graph neural networks for graph
classification. In: International conference on learning representations
37. Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2017) Graph attention networks.
arXiv preprint arXiv:1710.10903
38. Pasa L, Navarin N, Erb W, Sperduti A (2023) Empowering simple graph convolutional networks. IEEE
Trans Neural Netw Learn Syst PP(99):1–15
39. Narayan A, Berger B, Cho H (2020) Density-preserving data visualization unveils dynamic patterns of
single-cell transcriptomic variability. bioRxiv. Available: https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Luca Pasa earned his master’s degree in Computer Science from the
University of Padova in 2013. Subsequently, at the same institution,
he successfully completed his Ph.D. in Mathematical Sciences (cur-
riculum Computer Science) in March 2017, under the supervision of
Professor Alessandro Sperduti. Between July 2017 and June 2019, he
served as a postdoctoral researcher at the Center of Translational Neu-
rophysiology Speech and Communication (CTNSC) within the Isti-
tuto Italiano di Tecnologia (IIT), working under the supervision of Dr.
Leonardo Badino. Subsequently, from July 2019 to December 2021,
he continued his research activity as a postdoctoral researcher at the
University of Padova, affiliated with the Department of Mathematics
and the Human Inspired Technologies Research Center. Since January
2022, he has held the position of Assistant Professor (RTDa) at the
Department of Mathematics, University of Padova. His main research
interests lie in the field of machine learning, including deep learning,
computational neuroscience, and automatic speech recognition. Cur-
rently, his research activity is focused on the application of deep learning methods on structured domains
(sequences, trees, and graphs). He has coauthored numerous research papers published in international ref-
ereed journals and conference proceedings. He has also served as a Program Committee member for several
machine learning conferences and has actively participated in organizing various special sessions at inter-
national machine learning conferences. He is a member of the IEEE Task Force on Deep Learning and the
Italian Association for Artificial Intelligence (AIxIA).
Nicolò Navarin is an associate professor in Computer Science at the
Department of Mathematics “Tullio Levi-Civita”, University of Padua,
Italy. He got his Ph.D. in Computer Science from the University of
Bologna, Italy, in 2014. He has been a visiting researcher at the Uni-
versity of Freiburg, Germany, at the Università della Svizzera Italiana,
Lugano, Switzerland, and at 3IA Côte d’Azur-Interdisciplinary Insti-
tute for Artificial Intelligence, Sophia Antipolis, France. He has been
a research fellow at the University of Nottingham, UK, and at the
University of Padua. His research interests lie in the field of machine
learning, including kernel methods and neural networks for structured
data, online and continual learning, trustworthy ML, and applications
to bioinformatics, business process mining, computer vision and com-
putational psychology. At the University of Padua, he is affiliated to the
Human Inspired Technology (HIT) Research Centre. He is a doctoral
committee member of the PhD Program in "Brain, Mind and Computer
Science". He has been serving as a PC member in major machine
learning conferences (ICML, IJCAI, NeurIPS, ECML, AAAI, ICLR), and he has been actively involved in
the organization of several special sessions (ESANN, WCCI, IJCNN) and conferences (INNS Big Data and
Deep Learning 2019, International Conference on Process Mining 2020, IEEE Symposium Series in Compu-
tational Intelligence 2021, IEEE World Congress on Computational Intelligence 2022). He edited two books,
and he served as guest editor for the journals Neurocomputing and IEEE Transactions on Human-Machine
Systems. He is an associate editor for the journals Evolving Systems (Springer) and Neurocomputing (Else-
vier), and an editorial board member for Intelligenza Artificiale (AIxIA, IOS Press). He is a member of IEEE
Computational Intelligence Society, IEEE Task Force on Deep learning, IEEE Task Force on Learning from
Structured Data, and of the Italian Association for Artificial Intelligence.
Wolfgang Erb is an Associate Professor in Numerical Analysis at the
Department of Mathematics "Tullio Levi-Civita", University of Padua,
Italy. He received his Ph.D. in Mathematics at the Technical Univer-
sity of Munich (Germany) in 2010. He has been a research fellow at
the University of Lübeck (Germany), at the University of Eichstätt-
Ingolstadt (Germany) and an Assistant Professor in Mathematics at the
University of Hawaii at Manoa (US). His research interests include
multivariate approximation theory, kernel methods for signal process-
ing and learning on graphs, fast and efficient reconstruction algorithms
for inverse problems, as well as applications in biomedical imaging, in
particular magnetic particle imaging. He is a member of RITA (Ital-
ian Research Network on Approximation), of the Italian Mathematical
Society UMI (working group TAA of approximation theory) and of
GNCS-INdAM.
Alessandro Sperduti received the PhD in 1993 from University of Pisa,
Italy. He is a Full Professor at the Department of Mathematics of
the University of Padova. Previously, he has been Associate Professor
(1998-2002) and Assistant Professor (1995-1998) at the Department of
Computer Science of the University of Pisa. His research interests are
mainly in neural networks, kernel methods, and process mining. He
was the recipient of the