arXiv:2110.03576v1 [cs.LG] 7 Oct 2021
TRAINING STABLE GRAPH NEURAL NETWORKS
THROUGH CONSTRAINED LEARNING
Juan Cerviño, Luana Ruiz and Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA
ABSTRACT
Graph Neural Networks (GNNs) rely on graph convolutions to learn features from network data. GNNs are stable to different types of perturbations of the underlying graph, a property that they inherit from graph filters. In this paper we leverage the stability property of GNNs as a starting point in order to seek representations that are stable within a distribution. We propose a novel constrained learning approach by imposing a constraint on the stability condition of the GNN within a perturbation of choice. We showcase our framework on real-world data, corroborating that we are able to obtain more stable representations while not compromising the overall accuracy of the predictor.
Index Terms—Graph Neural Networks, Constrained
Learning, Stability
1. INTRODUCTION
Graph Neural Networks (GNNs) are deep convolutional ar-
chitectures tailored to graph machine learning problems [1, 2]
which have achieved great success in fields such as biology
[3, 4] and robotics [5, 6], to name a few. GNNs consist of layers that stack graph convolutions and pointwise nonlinearities, and their empirical success can be explained by theoretical properties they inherit from graph convolutions. Indeed, convolutions are the reason why GNNs are invariant to
node relabelings [7, 8]; stable to deterministic [2], stochastic
[9], and space-time graph perturbations [10]; and transferable
from small to large graphs [11].
Stability is an especially important property because, in
practice, networks are prone to perturbations. For instance, in
a social network, friendship links can not only be added or re-
moved, but also strengthened or weakened depending on the
frequency of interaction. Similarly, in a wireless network the
channel states are dramatically affected by environment noise.
Because GNNs have been proved to be stable to such pertur-
bations, in theory any GNN should do well in these scenarios.
In practice, however, actual stability guarantees depend on
factors such as the type of graph perturbation, the smoothness
of the convolutions, the depth and width of the neural network, and the size of the graph [12]. In other words, GNNs are provably stable to graph perturbations, but we cannot always ensure that they will meet a certain stability requirement or constraint.
Supported by NSF CCF 1717120 and Theorinet Simons.
In this paper, our goal is thus to ensure that GNNs meet a specific stability requirement, which we do by changing the way in which the GNN is learned. Specifically, we modify the statistical learning problem by introducing GNN stability as a constraint, therefore giving rise to a constrained statistical learning problem. This leads to a non-convex constrained problem for which even a feasible solution may be challenging to obtain in practice. To overcome this limitation,
we resort to the dual domain, in which the problem becomes
a weighted unconstrained minimization problem that we can
solve using standard gradient descent techniques. By evaluat-
ing the constraint slackness, we iteratively update the weights
of this problem. This procedure is detailed in Algorithm 1.
In Theorem 1, we quantify the duality gap, i.e., the mismatch
between solving the primal and the dual problems. In The-
orem 2, we present convergence guarantees for Algorithm 1.
These results are illustrated numerically in Section 4, where
we observe that GNNs trained using Algorithm 1 successfully
meet stability requirements for a variety of perturbation mag-
nitudes and GNN architectures.
2. GRAPH NEURAL NETWORKS
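A graph is a triplet $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{W})$, where $\mathcal{V} = \{1, \ldots, N\}$ is its set of nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is its set of edges, and $\mathcal{W}$ is a function assigning weights $\mathcal{W}(i,j)$ to edges $(i,j) \in \mathcal{E}$. A graph may also be represented by the graph shift operator (GSO) $\mathbf{S} \in \mathbb{R}^{N \times N}$, a matrix which satisfies $S_{ij} \neq 0$ if and only if $(j,i) \in \mathcal{E}$ or $i = j$. The most common examples of GSOs are the graph adjacency matrix $\mathbf{A}$, with $[\mathbf{A}]_{ij} = \mathcal{W}(j,i)$, and the graph Laplacian $\mathbf{L} = \mathrm{diag}(\mathbf{A}\mathbf{1}) - \mathbf{A}$.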
We consider the graph $\mathcal{G}$ to be the support of data $\mathbf{x} = [x_1, \ldots, x_N]^\top$, which we call graph signals. The $i$th component of a graph signal $\mathbf{x}$, $x_i$, corresponds to the value of the data at node $i$. The operation $\mathbf{S}\mathbf{x}$ defines a graph shift of the signal $\mathbf{x}$. Leveraging this notion of shift, we define graph convolutional filters as linear shift-invariant graph filters. Explicitly, a graph convolutional filter with coefficients $\mathbf{h} = [h_0, \ldots, h_{K-1}]^\top$ is given by
$$\mathbf{y} = \mathbf{h} *_{\mathbf{S}} \mathbf{x} = \sum_{k=0}^{K-1} h_k \mathbf{S}^k \mathbf{x} \tag{1}$$
where $*_{\mathbf{S}}$ is the convolution operation parametrized by $\mathbf{S}$.
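The filter (1) can be evaluated with repeated graph shifts, so no matrix power is ever formed. The following is a minimal sketch, assuming NumPy; the function name `graph_filter`, the 3-node directed cycle, and the coefficient values are illustrative choices that do not come from the paper.

```python
import numpy as np

def graph_filter(S, x, h):
    """Apply y = sum_k h[k] * S^k x using repeated shifts."""
    y = np.zeros_like(x, dtype=float)
    z = x.astype(float)          # z holds S^k x, starting at k = 0
    for hk in h:
        y += hk * z
        z = S @ z                # one more graph shift
    return y

# Illustrative GSO: adjacency of the directed cycle 0 -> 1 -> 2 -> 0,
# stored with the convention [S]_ij != 0 iff (j, i) is an edge.
S = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
x = np.array([1.0, 0.0, 0.0])        # signal concentrated on node 0
h = np.array([0.5, 0.25, 0.25])      # K = 3 filter taps (hypothetical values)
y = graph_filter(S, x, h)
```

Each shift moves the unit of signal one hop along the cycle, so the output at a node is the coefficient matching its hop distance from node 0.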
GNNs are deep convolutional architectures consisting of $L$ layers, each of which contains a bank of graph convolutional filters like (1) and a pointwise nonlinearity $\rho$. Layer $l$ produces $F_l$ graph signals $\mathbf{x}_l^f$, called features. Defining a matrix $\mathbf{X}_l$ whose $f$th column corresponds to the $f$th feature of layer $l$ for $1 \leq f \leq F_l$, we can write the $l$th layer of the GNN as
$$\mathbf{X}_l = \rho\left(\sum_{k=0}^{K-1} \mathbf{S}^k \mathbf{X}_{l-1} \mathbf{H}_{lk}\right). \tag{2}$$
In this expression, $[\mathbf{H}_{lk}]_{gf}$ denotes the $k$th coefficient of the graph convolution (1) mapping feature $g$ to feature $f$ for $1 \leq g \leq F_{l-1}$ and $1 \leq f \leq F_l$. A more succinct representation of this GNN can be obtained by grouping all learnable parameters $\mathbf{H}_{lk}$, $1 \leq l \leq L$, in a tensor $\mathcal{H} = \{\mathbf{H}_{lk}\}_{l,k}$. This allows expressing the GNN as the parametric map $\mathbf{X}_L = \phi(\mathbf{X}_0, \mathbf{S}; \mathcal{H})$. For simplicity, in the following sections we assume that the input and output only have one feature, i.e., $\mathbf{X}_0 = \mathbf{x} \in \mathbb{R}^N$ and $\mathbf{X}_L = \mathbf{y} \in \mathbb{R}^N$.
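The layer recursion (2) translates almost line by line into code. The sketch below assumes NumPy and stores the coefficients of each layer as an array of shape (K, F_in, F_out); the names `gnn_layer` and `gnn`, the random graph, and the feature sizes are hypothetical and only serve to illustrate the map $\phi(\mathbf{X}_0, \mathbf{S}; \mathcal{H})$.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def gnn_layer(S, X, H, rho=relu):
    """One layer of (2): rho(sum_k S^k X H[k]); H has shape (K, F_in, F_out)."""
    out = np.zeros((X.shape[0], H.shape[2]))
    Z = X                        # Z holds S^k X, starting at k = 0
    for Hk in H:
        out += Z @ Hk
        Z = S @ Z
    return rho(out)

def gnn(S, X0, params, rho=relu):
    """Compose the layers; params is a list with one coefficient array per layer."""
    X = X0
    for H in params:
        X = gnn_layer(S, X, H, rho)
    return X

# Illustrative setup: random symmetric GSO, one input feature, one output feature.
rng = np.random.default_rng(0)
N, K = 6, 3
A = rng.standard_normal((N, N))
S = (A + A.T) / 2.0
X0 = rng.standard_normal((N, 1))
params = [rng.standard_normal((K, 1, 4)), rng.standard_normal((K, 4, 1))]
Y = gnn(S, X0, params)
```

Because the graph enters only through $\mathbf{S}$, relabeling the nodes of both the graph and the input relabels the output in the same way, which is the invariance to node relabelings mentioned in the introduction.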
2.1. Statistical Learning on Graphs
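To learn a GNN, we are given pairs $(\mathbf{x}, \mathbf{y})$ corresponding to an input graph signal $\mathbf{x} \in \mathbb{R}^N$ and a target output graph signal $\mathbf{y} \in \mathbb{R}^N$ sampled from the joint distribution $p(\mathbf{x}, \mathbf{y})$. Our objective is to find the filter coefficients $\mathcal{H}$ such that $\phi(\mathbf{x}, \mathbf{S}; \mathcal{H})$ approximates $\mathbf{y}$ over the joint probability distribution $p$. To do so, we introduce a nonnegative loss function $\ell: \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}_+$ which satisfies $\ell(\mathbf{y}, \phi(\mathbf{x})) = 0$ when $\phi(\mathbf{x}) = \mathbf{y}$. The GNN is learned by minimizing the expected loss over the probability distribution as follows,
$$\min_{\mathcal{H} \in \mathbb{R}^Q} \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\left[\ell(\mathbf{y}, \phi(\mathbf{x}, \mathbf{S}; \mathcal{H}))\right]. \tag{3}$$
Problem (3) is the Statistical Risk Minimization problem [13] for the GNN.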
2.2. Stability to Graph Perturbations
In the real world, it is not uncommon for graphs to be prone to small perturbations such as interference noise in wireless networks. Hence, stability to graph perturbations is an important property for GNNs. Explicitly, we define a graph perturbation as a graph that is $\epsilon$-close to the original graph,
$$\hat{\mathbf{S}} : \|\hat{\mathbf{S}} - \mathbf{S}\| \leq \epsilon. \tag{4}$$
An example of perturbation is an additive perturbation of the form $\hat{\mathbf{S}} = \mathbf{S} + \mathbf{E}$, where $\mathbf{E}$ is a stochastic perturbation with bounded norm $\|\mathbf{E}\| \leq \epsilon$ drawn from a distribution $\Delta$.
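As an illustration of (4), the sketch below draws an additive perturbation and checks that it is $\epsilon$-close to the original graph. The text does not specify the distribution $\Delta$, so the choice made here (a random symmetric Gaussian matrix rescaled to a spectral norm drawn uniformly in $[0, \epsilon)$) is purely an assumption, as is the function name `sample_additive_perturbation`.

```python
import numpy as np

def sample_additive_perturbation(S, eps, rng):
    """Return S_hat = S + E with a random symmetric E such that ||E||_2 <= eps.

    The sampling scheme is an assumed stand-in for the distribution Delta.
    """
    A = rng.standard_normal(S.shape)
    E = (A + A.T) / 2.0                      # symmetric direction
    target = rng.uniform(0.0, eps)           # spectral norm of E, below eps
    E *= target / np.linalg.norm(E, 2)       # ord=2 is the spectral norm
    return S + E

rng = np.random.default_rng(0)
N, eps = 5, 0.1
A = rng.standard_normal((N, N))
S = (A + A.T) / 2.0                          # illustrative symmetric GSO
S_hat = sample_additive_perturbation(S, eps, rng)
distance = np.linalg.norm(S_hat - S, 2)      # left-hand side of (4)
```

The spectral norm is used here, but any other operator norm could be substituted as long as it is used consistently in (4) and (5).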
The notion of GNN stability is formalized in Definition
1. Note that the maximum is taken in order to account for all
possible inputs.
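Definition 1 (GNN stability to graph perturbations) Let $\phi(\mathbf{x}, \mathbf{S}; \mathcal{H})$ be a GNN (2) and let $\hat{\mathbf{S}}$ be a graph perturbation (4) such that $\|\hat{\mathbf{S}} - \mathbf{S}\| \leq \epsilon$. The GNN $\phi(\mathbf{x}, \mathbf{S}; \mathcal{H})$ is $C$-stable if
$$\max_{\mathbf{x}} \|\phi(\mathbf{x}, \mathbf{S}; \mathcal{H}) - \phi(\mathbf{x}, \hat{\mathbf{S}}; \mathcal{H})\| \leq C\epsilon \tag{5}$$
for some finite constant $C$.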
A GNN is thus stable to a graph perturbation $\|\hat{\mathbf{S}} - \mathbf{S}\| \leq \epsilon$ if its output varies by at most $C\epsilon$. The smaller the value of $C$, the more stable the GNN is to perturbations $\hat{\mathbf{S}}$. Under mild smoothness assumptions on the graph convolutions, it is possible to show that any GNN can be made stable in the sense of Definition 1 [12]. However, for an arbitrary GNN the constant $C$ is not guaranteed to be small and, in fact, existing stability analyses show that it can vary with the GNN depth and width (i.e., the number of layers $L$ and the number of features $F$, respectively), the size of the graph $N$, and the misalignment between the eigenspaces of $\mathbf{S}$ and $\hat{\mathbf{S}}$. What is more, problem (3) does not impose any conditions on the stability of the GNN, thus solutions $\mathcal{H}^*$ may not have small stability bounds $C$. In this paper, our goal is to enforce stability for a constant $C$ of choice. In the following, we show that, on average over the support of the data, better stability can be achieved by introducing a modification of Definition 1 as a constraint of the statistical learning problem for the GNN.
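To make the constant $C$ in (5) concrete, consider the simplest case of a single graph filter (1) without nonlinearity. If inputs are restricted to unit norm (an assumption added here so that the maximum is finite), the left-hand side of (5) is exactly the spectral norm of the difference between the two filter matrices. The sketch below, assuming NumPy, computes this value and compares it with a Monte Carlo estimate over sampled inputs; all names and numerical values are illustrative.

```python
import numpy as np

def filter_matrix(S, h):
    """H(S) = sum_k h[k] S^k as an explicit N x N matrix."""
    H = np.zeros_like(S, dtype=float)
    P = np.eye(S.shape[0])       # P holds S^k, starting at k = 0
    for hk in h:
        H += hk * P
        P = P @ S
    return H

rng = np.random.default_rng(0)
N, eps = 8, 0.1
A = rng.standard_normal((N, N))
S = (A + A.T) / 2.0
S /= np.linalg.norm(S, 2)                    # normalize so that ||S||_2 = 1
B = rng.standard_normal((N, N))
E = (B + B.T) / 2.0
E *= eps / np.linalg.norm(E, 2)              # additive perturbation, ||E||_2 = eps
S_hat = S + E
h = np.array([1.0, 0.5, 0.25])               # hypothetical filter taps

D = filter_matrix(S, h) - filter_matrix(S_hat, h)
C_exact = np.linalg.norm(D, 2) / eps         # worst case over unit-norm inputs

X = rng.standard_normal((N, 1000))
X /= np.linalg.norm(X, axis=0)               # 1000 random unit-norm signals
C_sampled = np.linalg.norm(D @ X, axis=0).max() / eps
```

The sampled value can only underestimate the exact one, which is why Definition 1 takes a maximum over inputs. For a GNN with nonlinearities no such closed form is available, and sampling inputs only provides a lower bound on $C$.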
3. CONSTRAINED LEARNING
In order to address the stability of the GNN, we can explicitly require our learning procedure to account for differences between the unperturbed and the perturbed performance. To do this, we resort to constrained learning theory [14]. We modify the standard unconstrained statistical risk minimization problem (cf. (3)) by introducing a constraint that requires the solution $\mathcal{H}$ to attain at most a $C\epsilon$ difference between the perturbed and unperturbed losses,
$$\begin{aligned} P^* = \min_{\mathcal{H} \in \mathbb{R}^Q} \;& \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\left[\ell(\mathbf{y}, \phi(\mathbf{x}, \mathbf{S}; \mathcal{H}))\right] \\ \text{s.t.} \;& \mathbb{E}_{p(\mathbf{x},\mathbf{y},\Delta)}\left[\ell(\mathbf{y}, \phi(\mathbf{x}, \hat{\mathbf{S}}; \mathcal{H})) - \ell(\mathbf{y}, \phi(\mathbf{x}, \mathbf{S}; \mathcal{H}))\right] \leq C\epsilon. \end{aligned} \tag{6}$$
Note that if the constant $C$ is set at a sufficiently large value, the constraint becomes inactive, making problems (6) and (3) equivalent. As opposed to other methods based on heuristics or tailored solutions, our novel formulation admits a simple interpretation from an optimization perspective.
3.1. Dual Domain
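In order to solve problem (6), we will resort to the dual domain. To do so, we introduce the dual variable $\lambda \in \mathbb{R}_+$, and we define the Lagrangian function as follows,
$$\mathcal{L}(\mathcal{H}, \lambda) = (1 - \lambda)\,\mathbb{E}\left[\ell(\mathbf{y}, \phi(\mathbf{x}, \mathbf{S}; \mathcal{H}))\right] + \lambda\,\mathbb{E}\left[\ell(\mathbf{y}, \phi(\mathbf{x}, \hat{\mathbf{S}}; \mathcal{H})) - \epsilon C\right]. \tag{7}$$
We can introduce the dual function as the minimum of the Lagrangian $\mathcal{L}(\mathcal{H}, \lambda)$ over $\mathcal{H}$ for a fixed value of the dual variable $\lambda$ [15],
$$d(\lambda) = \min_{\mathcal{H} \in \mathbb{R}^Q} \mathcal{L}(\mathcal{H}, \lambda). \tag{8}$$
Note that to obtain the value of the dual function $d(\lambda)$ we need to solve an unconstrained optimization problem weighted by $\lambda$. Given that the dual function is a pointwise minimum of a family of functions that are affine in $\lambda$, it is concave, even when problem (6) is not convex. Maximizing the dual function $d(\lambda)$ over $\lambda$ is called the dual problem, and its optimal value $D^*$ is always a lower bound on $P^*$,
$$d(\lambda) \leq D^* \leq \min_{\mathcal{H} \in \mathbb{R}^Q} \max_{\lambda \in \mathbb{R}_+} \mathcal{L}(\mathcal{H}, \lambda) = P^*. \tag{9}$$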
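For completeness, the weighted form of the Lagrangian (7) and the lower bound in (9) can be recovered from problem (6) in two short steps. The derivation below is a sketch that uses only the definitions above; expectations are written without their subscripts for brevity.

```latex
\begin{align*}
\mathcal{L}(\mathcal{H},\lambda)
  &= \mathbb{E}\big[\ell(\mathbf{y},\phi(\mathbf{x},\mathbf{S};\mathcal{H}))\big]
   + \lambda\Big(\mathbb{E}\big[\ell(\mathbf{y},\phi(\mathbf{x},\hat{\mathbf{S}};\mathcal{H}))
   - \ell(\mathbf{y},\phi(\mathbf{x},\mathbf{S};\mathcal{H}))\big] - C\epsilon\Big) \\
  &= (1-\lambda)\,\mathbb{E}\big[\ell(\mathbf{y},\phi(\mathbf{x},\mathbf{S};\mathcal{H}))\big]
   + \lambda\,\mathbb{E}\big[\ell(\mathbf{y},\phi(\mathbf{x},\hat{\mathbf{S}};\mathcal{H})) - \epsilon C\big].
\end{align*}
% Weak duality: if H is feasible for (6) and lambda >= 0, the term in
% parentheses in the first line is nonpositive, hence
\begin{equation*}
d(\lambda) \;\leq\; \mathcal{L}(\mathcal{H},\lambda)
  \;\leq\; \mathbb{E}\big[\ell(\mathbf{y},\phi(\mathbf{x},\mathbf{S};\mathcal{H}))\big].
\end{equation*}
% Minimizing the right-hand side over feasible H gives d(lambda) <= P^* for
% every lambda >= 0, and therefore D^* <= P^*.
```

The first line is the usual objective plus $\lambda$ times the constraint; regrouping the unperturbed loss yields the weights $(1 - \lambda)$ and $\lambda$ that appear in (7).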
The advantage of delving into the dual domain and maximizing the dual function $d$ is that it allows us to search for solutions of problem (6) by solving unconstrained minimization problems. The difference between the optimal values of the primal problem $P^*$ (cf. (6)) and the dual problem $D^*$ is called the duality gap and is quantified in the following theorem.
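AS1 The loss function $\ell$ is $L$-Lipschitz, i.e., $\|\ell(\mathbf{x}, \cdot) - \ell(\mathbf{z}, \cdot)\| \leq L\|\mathbf{x} - \mathbf{z}\|$, strongly convex and bounded by $B$.
AS2 The conditional distribution $p(\mathbf{x}, \Delta \mid \mathbf{y})$ is non-atomic for all $\mathbf{y} \in \mathbb{R}^N$, and there is a finite number of target graph signals $\mathbf{y}$.
AS3 There exists a convex hypothesis class $\hat{\mathcal{C}}$ such that $\mathcal{C} \subseteq \hat{\mathcal{C}}$, and there exists a constant $\xi > 0$ such that for all $\hat{\phi} \in \hat{\mathcal{C}}$ there exists $\mathcal{H} \in \mathbb{R}^Q$ such that $\sup_{\mathbf{x} \in \mathcal{X}} \|\hat{\phi}(\mathbf{x}) - \phi(\mathbf{x}, \mathcal{H})\| \leq \xi$.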
Note that Assumption 1 is satisfied in practice by most loss functions (e.g., square loss, L1 loss) once a bound is imposed. Assumption 2 can be satisfied in practice by data augmentation [16]. Assumption 3 is related to the richness of the function class of GNNs $\mathcal{C}$; the parameter $\xi$ can be decreased by increasing the capacity of the GNNs in consideration. To obtain a convex hypothesis class $\hat{\mathcal{C}}$, it suffices to take the convex hull over the function class of GNNs.
Theorem 1 (Near-Zero Duality Gap) Under Assumptions 1, 2, and 3, the Constrained Graph Stability problem (6) has near-zero duality gap,
$$P^* - D^* \leq (2\lambda^* + 1) L \xi \tag{10}$$
where $\lambda^*$ is the optimal dual variable.
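To get a feel for the bound (10), it can help to plug in numbers. The values below are hypothetical and are not taken from the paper: a Lipschitz constant $L = 1$, an approximation error $\xi = 0.01$, and an optimal dual variable $\lambda^* = 0.5$.

```latex
\begin{equation*}
P^* - D^* \;\leq\; (2\lambda^* + 1)\,L\,\xi
        \;=\; (2 \cdot 0.5 + 1)\cdot 1 \cdot 0.01
        \;=\; 0.02 .
\end{equation*}
```

With $\lambda^*$ and $L$ held fixed, halving $\xi$ to $0.005$ halves the bound to $0.01$, which reflects how the bound tightens as the parametrization becomes richer.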
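Algorithm 1 Graph Stability Algorithm
1: Initialize model $\mathcal{H}_0$ and dual variable $\lambda = 0$
2: for epochs $e = 1, 2, \ldots$ do
3:   for batch $i$ in epoch $e$ do
4:     Obtain $N$ samples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_i \sim p(\mathbf{x}, \mathbf{y})$
5:     Obtain $M$ perturbations $\{\hat{\mathbf{S}}_j\}_j \sim \Delta$
6:     Get primal gradient $\nabla_{\mathcal{H}} \mathcal{L}(\mathcal{H}, \lambda)$ (cf. (11))
7:     Update params. $\mathcal{H}_{k+1} = \mathcal{H}_k - \eta_P \hat{\nabla}_{\mathcal{H}} \mathcal{L}(\mathcal{H}_k, \lambda)$
8:   end for
9:   Obtain $N$ samples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_i \sim p(\mathbf{x}, \mathbf{y})$
10:  Obtain $M$ perturbations $\{\hat{\mathbf{S}}_j\}_j \sim \Delta$
11:  Update dual variable $\lambda \leftarrow [\lambda + \eta_D \nabla_{\lambda} \mathcal{L}(\mathcal{H}, \lambda)]_+$
12: end for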
The importance of Theorem 1 is that it allows us to quantify the penalty that we incur by delving into the dual domain. Note that this penalty decreases as we make our parameterization richer, thereby decreasing $\xi$. Also note that the optimal dual variable $\lambda^*$ accounts for the difficulty of finding a feasible solution; thus we should expect this value to be small given the theoretical guarantees on GNN stability [12].
3.2. Algorithm Construction
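In order to solve problem (6), we iteratively minimize the Lagrangian to evaluate the dual function $d(\lambda)$, evaluate the constraint slack, and update the dual variable $\lambda$ accordingly. We assume that the distributions are unknown, but we have access to samples of both graph signals $(\mathbf{x}_i, \mathbf{y}_i) \sim p(\mathbf{x}, \mathbf{y})$ and perturbed graphs $\hat{\mathbf{S}}_j \sim \Delta$. As in a standard learning procedure, to minimize the Lagrangian $\mathcal{L}(\mathcal{H}, \lambda)$ with respect to $\mathcal{H}$ for a fixed dual variable $\lambda$, we can take stochastic gradients as follows,
$$\hat{\nabla}_{\mathcal{H}} \mathcal{L}(\mathcal{H}_k, \lambda) = \nabla_{\mathcal{H}} \left[ \frac{1 - \lambda}{N} \sum_{i=1}^{N} \ell(\mathbf{y}_i, \phi(\mathbf{x}_i, \mathbf{S}; \mathcal{H}_k)) + \frac{\lambda}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \ell(\mathbf{y}_i, \phi(\mathbf{x}_i, \hat{\mathbf{S}}_j; \mathcal{H}_k)) \right]. \tag{11}$$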
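The main difference with a regularized problem is that the dual variable $\lambda$ is also updated. To update the dual variable $\lambda$, we evaluate the constraint violation as follows,
$$\hat{\nabla}_{\lambda} \mathcal{L}(\mathcal{H}_k, \lambda) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \ell(\mathbf{y}_i, \phi(\mathbf{x}_i, \hat{\mathbf{S}}_j; \mathcal{H}_k)) - \frac{1}{N} \sum_{i=1}^{N} \ell(\mathbf{y}_i, \phi(\mathbf{x}_i, \mathbf{S}; \mathcal{H}_k)) - C\epsilon. \tag{12}$$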
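The primal update (11) and the dual update (12) can be sketched on a toy problem. To keep the gradient analytic, the sketch below replaces the GNN with a single linear graph filter (1) and uses the squared loss instead of the smooth L1 loss; both are simplifications made here, not the setup of the paper. The graph, the synthetic targets, the relative perturbation model with $\|\mathbf{E}\|_2 = \epsilon$, and all step sizes are likewise illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, n, M = 10, 3, 50, 3            # nodes, filter taps, samples, perturbations
eps, C = 0.1, 1.0 / 3.0              # perturbation size and stability constant
eta_P, eta_D = 0.05, 0.1             # primal and dual step sizes

A = rng.standard_normal((N, N))
S = (A + A.T) / 2.0
S /= np.linalg.norm(S, 2)            # normalize so that ||S||_2 = 1

def features(S, X):
    """Stack [X, SX, ..., S^(K-1) X] into an array of shape (K, N, n)."""
    Z, out = X, []
    for _ in range(K):
        out.append(Z)
        Z = S @ Z
    return np.stack(out)

def loss_and_grad(S, X, Y, h):
    """Mean squared error of the filter output and its gradient w.r.t. h."""
    Z = features(S, X)
    R = np.tensordot(h, Z, axes=1) - Y                # residuals, shape (N, n)
    return np.mean(R ** 2), 2.0 * np.sum(Z * R, axis=(1, 2)) / R.size

def sample_perturbation(S):
    """S + ES + SE with ||E||_2 = eps (an assumed choice for Delta)."""
    B = rng.standard_normal((N, N))
    E = (B + B.T) / 2.0
    E *= eps / np.linalg.norm(E, 2)
    return S + E @ S + S @ E

X = rng.standard_normal((N, n))
h_true = np.array([1.0, -0.5, 0.25])                  # hypothetical target filter
Y = np.tensordot(h_true, features(S, X), axes=1)      # synthetic targets

h, lam = np.zeros(K), 0.0
initial_loss, _ = loss_and_grad(S, X, Y, h)
for _epoch in range(30):
    for _step in range(20):                           # primal steps, cf. (11)
        _, g_nom = loss_and_grad(S, X, Y, h)
        g_pert = np.mean([loss_and_grad(sample_perturbation(S), X, Y, h)[1]
                          for _ in range(M)], axis=0)
        h = h - eta_P * ((1.0 - lam) * g_nom + lam * g_pert)
    l_nom, _ = loss_and_grad(S, X, Y, h)              # dual step, cf. (12)
    l_pert = np.mean([loss_and_grad(sample_perturbation(S), X, Y, h)[0]
                      for _ in range(M)])
    lam = max(0.0, lam + eta_D * (l_pert - l_nom - C * eps))
final_loss, _ = loss_and_grad(S, X, Y, h)
```

The projection `max(0.0, ...)` implements the $[\cdot]_+$ operation of Algorithm 1. Depending on $C$ and $\epsilon$, the constraint may remain inactive in this toy setting, in which case $\lambda$ stays at zero and the loop reduces to unconstrained training.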
The intuition behind the dual step is that the dual variable $\lambda$ will increase while the constraint is not being satisfied, adding weight to the stability condition in the minimization of the Lagrangian. Conversely, if the constraint is being satisfied, we will increase the relative weight of the objective function. This means that if the constraint is more restrictive, the optimal dual variable will be larger.

                        RMSE for 2-layer GNN                 RMSE for 3-layer GNN
Norm of perturbation    Unconstrained     Constrained       Unconstrained     Constrained
0                       0.8447 (0.1386)   0.8564 (0.0572)   0.8290 (0.1504)   0.8417 (0.0980)
0.0001                  0.9083 (0.2160)   0.8316 (0.0692)   3.3542 (0.5042)   0.8349 (0.1011)
0.001                   0.9084 (0.2160)   0.8317 (0.0691)   3.3542 (0.5043)   0.8341 (0.1011)
0.01                    0.9092 (0.2162)   0.8322 (0.0688)   3.5410 (0.5044)   0.8343 (0.1011)
0.1                     0.9170 (0.2173)   0.8371 (0.0667)   3.3484 (0.5086)   0.8362 (0.1010)
0.2                     0.9282 (0.2212)   0.8458 (0.0608)   3.3326 (0.8189)   0.8379 (0.1007)
0.5                     0.9565 (0.2223)   0.8749 (0.0528)   3.2070 (0.5920)   0.8468 (0.0932)

Table 1. Evaluations of the RMSE of the GNN trained for 20 epochs on the testing set (unseen data) for different magnitudes of relative perturbations. We consider GNNs of 2 and 3 layers with $K = 5$ filter taps in both cases and $F_1 = 64$ and $F_2 = 32$ features for the first and second layer, respectively. The constrained learning approach is able to keep a comparable performance on the unperturbed evaluation (i.e., norm of perturbation $= 0$) while it is more stable as the norm of the perturbation increases.
Theorem 2 (Convergence) Under Assumptions 1, 2, and 3, if for each dual variable $\lambda_k$ the Lagrangian is minimized up to a precision $\alpha > 0$, i.e., $\mathcal{L}(\mathcal{H}_{\lambda_k}, \lambda_k) \leq \min_{\mathcal{H} \in \mathbb{R}^Q} \mathcal{L}(\mathcal{H}, \lambda_k) + \alpha$, then for a fixed tolerance $\beta > 0$, the iterates generated by Algorithm 1 reach a neighborhood of the optimal value $P^*$ in finite time,
$$P^* + \alpha \geq \mathcal{L}(\mathcal{H}_k, \lambda_k) \geq P^* - (2\lambda^* + 1) L \xi - \alpha - \beta - \frac{\eta_D B^2}{2}.$$
Theorem 2 allows us to claim convergence of Algorithm 1 up to a neighborhood of the optimal value of problem (6) that depends on the tolerance $\beta$, the precision $\alpha$, the dual step size $\eta_D$ and the loss bound $B$.
4. EXPERIMENTS
We consider the problem of predicting the rating a movie will be given by a user. We leverage the MovieLens 100k dataset [17], which contains 100,000 integer ratings between 1 and 5 collected from $U = 943$ users on $M = 1682$ movies. For this example, we focus only on the movie “Contact”. To exploit the underlying graph structure of the problem, we build a movie similarity graph, obtained by computing the pairwise correlations between the different movies in the training set [18]. In order to showcase the stability properties of GNNs, we perturb the graph shift operator according to the relative perturbation modulo permutation model [12, Definition 3], $\hat{\mathbf{S}} = \mathbf{S} + \mathbf{E}\mathbf{S} + \mathbf{S}\mathbf{E}$. We consider a uniform distribution of $\mathbf{E}$ such that $\|\mathbf{E}\| \leq \epsilon$.
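The relative perturbation model can be sketched as follows. The text does not say how the uniform distribution over $\mathbf{E}$ is realized, so the sampler below (entries drawn uniformly in $[-1, 1]$, symmetrized, then rescaled to spectral norm $\epsilon$) is only one possible reading and should be treated as an assumption, as should the function name `relative_perturbation`.

```python
import numpy as np

def relative_perturbation(S, eps, rng):
    """Return S_hat = S + E S + S E for a random symmetric E with ||E||_2 = eps.

    The sampling scheme for E is an assumption, not the paper's procedure.
    """
    A = rng.uniform(-1.0, 1.0, size=S.shape)
    E = (A + A.T) / 2.0
    E *= eps / np.linalg.norm(E, 2)          # rescale to spectral norm eps
    return S + E @ S + S @ E

rng = np.random.default_rng(0)
N, eps = 6, 0.3                              # eps = 0.3 as in the experiments
B = rng.standard_normal((N, N))
S = (B + B.T) / 2.0                          # illustrative symmetric GSO
S_hat = relative_perturbation(S, eps, rng)

# By the triangle inequality, ||S_hat - S|| <= 2 * eps * ||S||, so the size of
# the change scales with the norm of the graph being perturbed.
change = np.linalg.norm(S_hat - S, 2)
bound = 2.0 * eps * np.linalg.norm(S, 2)
```

Unlike an additive perturbation, the change in each edge weight here depends on the graph itself, so heavier parts of the graph can be perturbed more in absolute terms.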
We split the dataset into 90% for training and 10% for testing, considering 10 independent random splits. For the optimizer we used a batch size of 5 samples and ADAM [19] with learning rate 0.005, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and no learning rate decay. We used the smooth L1 loss. For the GNNs, we used ReLU as the nonlinearity, and we considered two GNNs: (i) one layer with $F = 64$ features, and (ii) two layers with $F_1 = 64$ and $F_2 = 32$. In both cases, we used $K = 5$ filter taps per filter. For Algorithm 1 we used dual step size $\eta_D = 0.1$, stability constant $C = 1/3$, and magnitude of perturbation $\epsilon = 0.3$. For the number of perturbations per primal step we used $M = 3$, and to evaluate the constraint slackness we used 20% of the training set.
Table 1 shows the RMSE achieved on the test set when the GNN is trained using Algorithm 1 and when trained unconstrained (cf. (3)). We evaluate GNN performance for different magnitudes of perturbations of the graph. The numerical results shown in Table 1 support the claims that we put forward. First, using Algorithm 1 we are able to attain a comparable performance to the one we would have achieved by training while ignoring the perturbation. As seen in the first row, the evaluation of the trained GNNs produces comparable results for both the 2- and the 3-layer GNN. Second, with our formulation we are able to obtain more stable representations because, when the perturbation magnitude increases, the loss deteriorates at a slower rate. This effect is especially noticeable for the 3-layer GNN. It is well studied that GNN stability worsens as the number of layers increases; however, using Algorithm 1 we are able to curtail this undesirable effect.
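The comparison can be made concrete with a few lines of arithmetic on the mean RMSE values of Table 1. The snippet below only restates numbers from the first and last rows of the table (perturbation norms 0 and 0.5); the dictionary layout and variable names are illustrative.

```python
# Mean RMSE from Table 1 at perturbation norms 0 and 0.5.
rmse = {
    "2-layer unconstrained": (0.8447, 0.9565),
    "2-layer constrained":   (0.8564, 0.8749),
    "3-layer unconstrained": (0.8290, 3.2070),
    "3-layer constrained":   (0.8417, 0.8468),
}

# Increase in RMSE when moving from the unperturbed graph to norm 0.5.
increase = {name: round(at_half - at_zero, 4)
            for name, (at_zero, at_half) in rmse.items()}

for name, delta in increase.items():
    print(f"{name}: RMSE increase {delta:.4f}")
```

The increases are 0.1118 versus 0.0185 for the 2-layer GNN and 2.3780 versus 0.0051 for the 3-layer GNN, so the gap between unconstrained and constrained training is much larger for the deeper architecture.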
5. CONCLUSION
In this paper we introduced a constrained learning formulation to improve the stability of GNNs. By explicitly introducing a constraint on the stability of the GNN we are able to obtain filter coefficients that are more resilient to perturbations of the graph. The benefit of our novel procedure was benchmarked on a recommendation system problem with real-world data. For future work, we will improve our theoretical guarantees in order to assure stability, and consider other more demanding simulations such as robot swarms.
6. REFERENCES
[1] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang,
Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng
Li, and Maosong Sun, “Graph neural networks: A re-
view of methods and applications,” AI Open, vol. 1, pp.
57–81, 2020.
[2] Fernando Gama, Antonio G Marques, Geert Leus, and
Alejandro Ribeiro, “Convolutional neural network ar-
chitectures for signals supported on graphs,” IEEE
Transactions on Signal Processing, vol. 67, no. 4, pp.
1034–1049, 2018.
[3] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-
Hur, “Protein interface prediction using graph con-
volutional networks,” in Advances in Neural Infor-
mation Processing Systems, I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett, Eds. 2017, vol. 30, Curran Associates, Inc.
[4] David K Duvenaud, Dougal Maclaurin, Jorge Ipar-
raguirre, Rafael Bombarell, Timothy Hirzel, Alan
Aspuru-Guzik, and Ryan P Adams, “Convolutional net-
works on graphs for learning molecular fingerprints,”
in Advances in Neural Information Processing Systems,
C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and
R. Garnett, Eds. 2015, vol. 28, Curran Associates, Inc.
[5] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing
Shen, and Song-Chun Zhu, “Learning human-object in-
teractions by graph parsing neural networks,” in Pro-
ceedings of the European Conference on Computer Vi-
sion (ECCV), 2018, pp. 401–417.
[6] Qingbiao Li, Fernando Gama, Alejandro Ribeiro, and
Amanda Prorok, “Graph neural networks for decen-
tralized multi-robot path planning,” arXiv preprint
arXiv:1912.06095, 2019.
[7] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna, “On the equivalence between graph isomorphism testing and function approximation with GNNs,” arXiv preprint arXiv:1905.12560, 2019.
[8] Nicolas Keriven and Gabriel Peyré, “Universal invariant and equivariant graph neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[9] Zhan Gao, Elvin Isufi, and Alejandro Ribeiro, “Stabil-
ity of graph convolutional neural networks to stochastic
perturbations,” Signal Processing, p. 108216, 2021.
[10] Samar Hadou, Charilaos I. Kanatsoulis, and Alejandro
Ribeiro, “Space-time graph neural networks,” 2021.
[11] Luana Ruiz, Luiz Chamon, and Alejandro Ribeiro,
“Graphon neural networks and the transferability of
graph neural networks,” Advances in Neural Informa-
tion Processing Systems, vol. 33, 2020.
[12] Fernando Gama, Joan Bruna, and Alejandro Ribeiro,
“Stability properties of graph neural networks,” IEEE
Transactions on Signal Processing, vol. 68, pp. 5680–
5695, 2020.
[13] Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
[14] Luiz Chamon and Alejandro Ribeiro, “Probably approx-
imately correct constrained learning,” Advances in Neu-
ral Information Processing Systems, vol. 33, 2020.
[15] Stephen Boyd and Lieven Vandenberghe, Convex Opti-
mization, Cambridge University Press, 2009.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, vol. 1, MIT Press, Cambridge, 2016.
[17] F Maxwell Harper and Joseph A Konstan, “The MovieLens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 1–19, 2015.
[18] Weiyu Huang, Antonio G Marques, and Alejandro R
Ribeiro, “Rating prediction via graph signal process-
ing,” IEEE Transactions on Signal Processing, vol. 66,
no. 19, pp. 5066–5081, 2018.
[19] Diederik P. Kingma and Jimmy Ba, “Adam: A method
for stochastic optimization,” CoRR, vol. abs/1412.6980,
2015.