
arXiv:2110.03576v1 [cs.LG] 7 Oct 2021

TRAINING STABLE GRAPH NEURAL NETWORKS THROUGH CONSTRAINED LEARNING

Juan Cerviño, Luana Ruiz and Alejandro Ribeiro

Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA

ABSTRACT

Graph Neural Networks (GNNs) rely on graph convolutions to learn features from network data. GNNs are stable to different types of perturbations of the underlying graph, a property that they inherit from graph filters. In this paper we leverage the stability property of GNNs as a starting point to seek representations that are stable within a distribution. We propose a novel constrained learning approach, imposing a constraint on the stability condition of the GNN within a perturbation of choice. We showcase our framework on real-world data, corroborating that we are able to obtain more stable representations while not compromising the overall accuracy of the predictor.

Index Terms— Graph Neural Networks, Constrained Learning, Stability

1. INTRODUCTION

Graph Neural Networks (GNNs) are deep convolutional architectures tailored to graph machine learning problems [1, 2] which have achieved great success in fields such as biology [3, 4] and robotics [5, 6], to name a few. Consisting of layers that stack graph convolutions and pointwise nonlinearities, their successful empirical results can be explained by theoretical properties they inherit from graph convolutions. Indeed, convolutions are the reason why GNNs are invariant to node relabelings [7, 8]; stable to deterministic [2], stochastic [9], and space-time graph perturbations [10]; and transferable from small to large graphs [11].

Stability is an especially important property because, in practice, networks are prone to perturbations. For instance, in a social network, friendship links can not only be added or removed, but also strengthened or weakened depending on the frequency of interaction. Similarly, in a wireless network the channel states are dramatically affected by environment noise. Because GNNs have been proven to be stable to such perturbations, in theory any GNN should do well in these scenarios. In practice, however, actual stability guarantees depend on factors such as the type of graph perturbation, the smoothness of the convolutions, the depth and width of the neural network, and the size of the graph [12]. In other words, GNNs are provably stable to graph perturbations, but we cannot always ensure that they will meet a certain stability requirement or constraint.

Supported by NSF CCF 1717120 and Theorinet Simons.

In this paper, our goal is thus to enforce GNNs to meet a specific stability requirement, which we do by changing the way in which the GNN is learned. Specifically, we modify the statistical learning problem by introducing GNN stability as a constraint, therefore giving rise to a constrained statistical learning problem. This leads to a non-convex constrained problem for which even a feasible solution may be challenging to obtain in practice. To overcome this limitation, we resort to the dual domain, in which the problem becomes a weighted unconstrained minimization problem that we can solve using standard gradient descent techniques. By evaluating the constraint slackness, we iteratively update the weights of this problem. This procedure is detailed in Algorithm 1. In Theorem 1, we quantify the duality gap, i.e., the mismatch between solving the primal and the dual problems. In Theorem 2, we present convergence guarantees for Algorithm 1. These results are illustrated numerically in Section 4, where we observe that GNNs trained using Algorithm 1 successfully meet stability requirements for a variety of perturbation magnitudes and GNN architectures.

2. GRAPH NEURAL NETWORKS

A graph is a triplet G = (V, E, W), where V = {1, ..., N} is its set of nodes, E ⊆ V × V is its set of edges, and W is a function assigning weights W(i, j) to edges (i, j) ∈ E. A graph may also be represented by the graph shift operator (GSO) S ∈ R^{N×N}, a matrix which satisfies S_{ij} ≠ 0 if and only if (j, i) ∈ E or i = j. The most common examples of GSOs are the graph adjacency matrix A, with [A]_{ij} = W(j, i), and the graph Laplacian L = diag(A1) − A.
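As a concrete illustration of these two GSOs, the following numpy sketch (not from the paper; the toy weights are made up) builds a small adjacency matrix and the Laplacian L = diag(A1) − A:

```python
import numpy as np

# Toy undirected weighted graph on N = 4 nodes, given by its adjacency
# matrix A. Both A and the Laplacian below are valid graph shift operators.
A = np.array([[0., 1., 0., 2.],
              [1., 0., 3., 0.],
              [0., 3., 0., 1.],
              [2., 0., 1., 0.]])

# Graph Laplacian: L = diag(A 1) - A, where A 1 stacks the node degrees.
L = np.diag(A @ np.ones(A.shape[0])) - A

# Sanity check: Laplacian row sums are zero by construction.
assert np.allclose(L @ np.ones(4), 0.0)
```

For an undirected graph, both GSOs are symmetric and share the sparsity pattern of the edge set, which is the property the definition above requires.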

We consider the graph G to be the support of data x = [x_1, ..., x_N]^⊤, which we call graph signals. The i-th component of a graph signal x, x_i, corresponds to the value of the data at node i. The operation Sx defines a graph shift of the signal x. Leveraging this notion of shift, we define graph convolutional filters as linear shift-invariant graph filters. Explicitly, a graph convolutional filter with coefficients h = [h_0, ..., h_{K-1}]^⊤ is given by

y = h ∗_S x = \sum_{k=0}^{K-1} h_k S^k x    (1)

where ∗_S is the convolution operation parametrized by S.
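The filter in (1) can be evaluated without forming the powers S^k explicitly; the following numpy function (an illustrative sketch, not code from the paper) accumulates h_k S^k x by repeated shifting:

```python
import numpy as np

def graph_convolution(h, S, x):
    """Apply the graph convolutional filter y = sum_k h_k S^k x of eq. (1).

    h : (K,) filter taps h_0, ..., h_{K-1}
    S : (N, N) graph shift operator
    x : (N,) graph signal
    """
    y = np.zeros_like(x, dtype=float)
    Skx = x.astype(float)          # S^0 x = x
    for hk in h:
        y += hk * Skx              # accumulate h_k S^k x
        Skx = S @ Skx              # shift once more: S^{k+1} x
    return y
```

Each iteration only needs one matrix-vector product with S, so the cost scales with the number of edges rather than with dense powers of S.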

GNNs are deep convolutional architectures consisting of L layers, each of which contains a bank of graph convolutional filters like (1) and a pointwise nonlinearity ρ. Layer l produces F_l graph signals x_l^f, called features. Defining a matrix X_l whose f-th column corresponds to the f-th feature of layer l for 1 ≤ f ≤ F_l, we can write the l-th layer of the GNN as

X_l = ρ( \sum_{k=0}^{K-1} S^k X_{l-1} H_{lk} ).    (2)

In this expression, [H_{lk}]_{gf} denotes the k-th coefficient of the graph convolution (1) mapping feature g to feature f for 1 ≤ g ≤ F_{l-1} and 1 ≤ f ≤ F_l. A more succinct representation of this GNN can be obtained by grouping all learnable parameters H_{lk}, 1 ≤ l ≤ L, in a tensor H = {H_{lk}}_{l,k}. This allows expressing the GNN as the parametric map X_L = φ(X_0, S; H). For simplicity, in the following sections we assume that the input and output only have one feature, i.e., X_0 = x ∈ R^N and X_L = y ∈ R^N.
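The layer recursion (2) can be sketched as follows. This numpy implementation is illustrative only: ReLU stands in for the generic nonlinearity ρ, and the tensor H is passed as a list of per-layer coefficient arrays (one assumed layout among several possible ones):

```python
import numpy as np

def gnn_forward(x, S, H):
    """Forward pass of the GNN in (2): X_l = rho(sum_k S^k X_{l-1} H_lk).

    x : (N,) input graph signal (single input feature, X_0 = x)
    S : (N, N) graph shift operator
    H : list over layers; H[l] has shape (K, F_{l-1}, F_l), stacking the
        K coefficient matrices H_lk of layer l+1
    """
    X = x.reshape(-1, 1).astype(float)          # X_0 in R^{N x 1}
    for Hl in H:
        Z = np.zeros((X.shape[0], Hl.shape[2]))
        SkX = X                                 # S^0 X_{l-1}
        for k in range(Hl.shape[0]):
            Z += SkX @ Hl[k]                    # add S^k X_{l-1} H_lk
            SkX = S @ SkX
        X = np.maximum(Z, 0.0)                  # pointwise rho = ReLU
    return X
```

A call `gnn_forward(x, S, H)` with a single output feature returns the map φ(x, S; H) used throughout the paper.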

2.1. Statistical Learning on Graphs

To learn a GNN, we are given pairs (x, y) corresponding to an input graph signal x ∈ R^N and a target output graph signal y ∈ R^N sampled from the joint distribution p(x, y). Our objective is to find the filter coefficients H such that φ(x, S; H) approximates y over the joint probability distribution p. To do so, we introduce a nonnegative loss function ℓ : R^N × R^N → R_+ which satisfies ℓ(φ(x), y) = 0 when φ(x) = y. The GNN is learned by minimizing the loss averaged over the probability distribution as follows,

min_{H ∈ R^Q} E_{p(x,y)} [ℓ(φ(x, S; H), y)].    (3)

Problem (3) is the Statistical Risk Minimization problem [13] for the GNN.

2.2. Stability to Graph Perturbations

In the real world, it is not uncommon for graphs to be prone to small perturbations such as, e.g., interference noise in wireless networks. Hence, stability to graph perturbations is an important property for GNNs. Explicitly, we define a graph perturbation as a graph Ŝ that is ε-close to the original graph,

‖Ŝ − S‖ ≤ ε.    (4)

An example of perturbation is an additive perturbation of the form Ŝ = S + E, where E is a stochastic perturbation with bounded norm ‖E‖ ≤ ε drawn from a distribution Δ.

The notion of GNN stability is formalized in Definition 1. Note that the maximum is taken in order to account for all possible inputs.

Definition 1 (GNN stability to graph perturbations). Let φ(x, S; H) be a GNN (2) and let Ŝ be a graph perturbation (4) such that ‖Ŝ − S‖ ≤ ε. The GNN φ(x, S; H) is C-stable if

max_x ‖φ(x, S; H) − φ(x, Ŝ; H)‖ ≤ Cε    (5)

for some finite constant C.
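Definition 1 suggests a simple empirical check: evaluate the output difference over a batch of sampled inputs and divide by ε. The helper below is hypothetical (not from the paper) and only gives a lower bound on the smallest C for which (5) holds, since the true maximum runs over all possible inputs:

```python
import numpy as np

def empirical_stability(gnn, S, S_hat, eps, xs):
    """Estimate the constant C of Definition 1 over a batch of signals.

    gnn   : callable mapping (x, S) to the GNN output phi(x, S; H)
    S_hat : perturbed GSO with ||S_hat - S|| <= eps
    xs    : iterable of input graph signals
    Returns max over the batch of ||phi(x,S;H) - phi(x,S_hat;H)|| / eps,
    a lower bound on the smallest C satisfying (5).
    """
    gaps = [np.linalg.norm(gnn(x, S) - gnn(x, S_hat)) for x in xs]
    return max(gaps) / eps
```

In practice one would average or maximize this quantity over many sampled perturbations Ŝ ∼ Δ as well.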

A GNN is thus stable to a graph perturbation ‖Ŝ − S‖ ≤ ε if its output varies at most by Cε. The smaller the value of C, the more stable the GNN is to perturbations Ŝ. Under mild smoothness assumptions on the graph convolutions, it is possible to show that any GNN can be made stable in the sense of Definition 1 [12]. However, for an arbitrary GNN the constant C is not guaranteed to be small and, in fact, existing stability analyses show that it can vary with the GNN depth and width (i.e., the number of layers L and the number of features F, respectively), the size of the graph N, and the misalignment between the eigenspaces of S and Ŝ. What is more, problem (3) does not impose any conditions on the stability of the GNN, so its solutions H* may not have small stability bounds C. In this paper, our goal is to enforce stability for a constant C of choice. In the following, we show that, on average over the support of the data, better stability can be achieved by introducing a modification of Definition 1 as a constraint of the statistical learning problem for the GNN.

3. CONSTRAINED LEARNING

In order to address the stability of the GNN, we can explicitly enforce our learning procedure to account for differences between the unperturbed and the perturbed performance. To do this, we resort to constrained learning theory [14]. We modify the standard unconstrained statistical risk minimization problem (cf. (3)) by introducing a constraint that requires the solution H to attain at most a Cε difference between the perturbed and unperturbed losses,

P* = min_{H ∈ R^Q} E_{p(x,y)} [ℓ(φ(x, S; H), y)]    (6)
s.t.  E_{p(x,y,Δ)} [ℓ(φ(x, Ŝ; H), y) − ℓ(φ(x, S; H), y)] ≤ Cε.

Note that if the constant C is set at a sufficiently large value, the constraint becomes inactive, making problems (6) and (3) equivalent. As opposed to other methods based on heuristics or tailored solutions, our novel formulation admits a simple interpretation from an optimization perspective.

3.1. Dual Domain

In order to solve problem (6), we resort to the dual domain. To do so, we introduce the dual variable λ ≥ 0 and define the Lagrangian function as follows,

L(H, λ) = (1 − λ) E[ℓ(φ(x, S; H), y)] + λ E[ℓ(φ(x, Ŝ; H), y) − Cε].    (7)

We can introduce the dual function as the minimum of the Lagrangian L(H, λ) over H for a fixed value of the dual variable λ [15],

d(λ) = min_{H ∈ R^Q} L(H, λ).    (8)

Note that to obtain the value of the dual function d(λ) we need to solve an unconstrained optimization problem weighted by λ. Given that the dual function is a pointwise minimum of a family of affine functions, it is concave even when problem (6) is not convex. The maximum of the dual function d(λ) over λ is called the dual problem D*, and it is always a lower bound on the primal problem P*,

d(λ) ≤ D* ≤ min_{H ∈ R^Q} max_{λ ∈ R_+} L(H, λ) = P*.    (9)

The advantage of delving into the dual domain and maximizing the dual function d is that it allows us to search for solutions of problem (6) by minimizing an unconstrained problem. The difference between the dual problem D* and the primal problem P* (cf. (6)) is called the duality gap and will be quantified in the following theorem.

AS1. The loss function ℓ is L-Lipschitz, i.e., ‖ℓ(x, ·) − ℓ(z, ·)‖ ≤ L‖x − z‖, strongly convex, and bounded by B.

AS2. The conditional distribution p(x, Δ | y) is non-atomic for all y ∈ R^N, and there is a finite number of target graph signals y.

AS3. There exists a convex hypothesis class C̃ such that C ⊆ C̃, and there exists a constant ξ > 0 such that for all φ̃ ∈ C̃ there exists H ∈ R^Q such that sup_{x∈X} ‖φ̃(x) − φ(x; H)‖ ≤ ξ.

Note that Assumption 1 is satisfied in practice by most loss functions (e.g., the square loss and the L1 loss) by imposing a bound. Assumption 2 can be satisfied in practice by data augmentation [16]. Assumption 3 is related to the richness of the function class C of GNNs; the parameter ξ can be decreased by increasing the capacity of the GNNs under consideration. To obtain a convex hypothesis class C̃, it suffices to take the convex hull of the function class of GNNs.

Theorem 1 (Near-Zero Duality Gap). Under Assumptions 1, 2, and 3, the constrained graph stability problem (6) has near-zero duality gap,

P* − D* ≤ (2λ* + 1)Lξ    (10)

where λ* is the optimal dual variable.

Algorithm 1 Graph Stability Algorithm
1: Initialize model H_0 and dual variable λ = 0
2: for epochs e = 1, 2, ... do
3:   for batch i in epoch e do
4:     Obtain N samples {(x_i, y_i)}_i ∼ p(x, y)
5:     Obtain M perturbations {Ŝ_j}_j ∼ Δ
6:     Get primal gradient ∇̂_H L(H, λ) (cf. eq. (11))
7:     Update params. H_{k+1} = H_k − η_P ∇̂_H L(H_k, λ)
8:   end for
9:   Obtain N samples {(x_i, y_i)}_i ∼ p(x, y)
10:  Obtain M perturbations {Ŝ_j}_j ∼ Δ
11:  Update dual variable λ ← [λ + η_D ∇̂_λ L(H, λ)]_+
12: end for

The importance of Theorem 1 is that it allows us to quantify the penalty that we incur by delving into the dual domain. Note that this penalty decreases as we make our parameterization richer and thus decrease ξ. Also note that the optimal dual variable λ* accounts for the difficulty of finding a feasible solution, so we should expect this value to be small given the theoretical guarantees on GNN stability [12].

3.2. Algorithm Construction

In order to solve problem (6), we iteratively minimize the dual function d(λ), evaluate the constraint slack, and update the dual variable λ accordingly. We assume that the distributions are unknown, but that we have access to samples of both graph signals (x_i, y_i) ∼ p(x, y) and perturbed graphs Ŝ_j ∼ Δ. In a standard learning procedure, to minimize the Lagrangian L(H, λ) with respect to H for a fixed λ, we can take stochastic gradients as follows,

∇̂_H L(H, λ) = ∇_H [ (1 − λ)/N \sum_{i=1}^{N} ℓ(φ(x_i, S; H_k), y_i) + λ/(NM) \sum_{i=1}^{N} \sum_{j=1}^{M} ℓ(φ(x_i, Ŝ_j; H_k), y_i) ].    (11)

The main difference with a regularized problem is that the dual variable λ is also updated. To update λ, we evaluate the constraint violation as follows,

∇̂_λ L(H, λ) = 1/(NM) \sum_{i=1}^{N} \sum_{j=1}^{M} ℓ(φ(x_i, Ŝ_j; H_k), y_i) − 1/N \sum_{i=1}^{N} ℓ(φ(x_i, S; H_k), y_i) − Cε.    (12)

The intuition behind the dual step is that the dual variable λ will increase while the constraint is not being satisfied, adding weight to the stability condition in the minimization of the Lagrangian. Conversely, if the constraint is being satisfied,

Norm of Perturbation | 2-Layer Unconstrained | 2-Layer Constrained | 3-Layer Unconstrained | 3-Layer Constrained
0      | 0.8447 (0.1386) | 0.8564 (0.0572) | 0.8290 (0.1504) | 0.8417 (0.0980)
0.0001 | 0.9083 (0.2160) | 0.8316 (0.0692) | 3.3542 (0.5042) | 0.8349 (0.1011)
0.001  | 0.9084 (0.2160) | 0.8317 (0.0691) | 3.3542 (0.5043) | 0.8341 (0.1011)
0.01   | 0.9092 (0.2162) | 0.8322 (0.0688) | 3.5410 (0.5044) | 0.8343 (0.1011)
0.1    | 0.9170 (0.2173) | 0.8371 (0.0667) | 3.3484 (0.5086) | 0.8362 (0.1010)
0.2    | 0.9282 (0.2212) | 0.8458 (0.0608) | 3.3326 (0.8189) | 0.8379 (0.1007)
0.5    | 0.9565 (0.2223) | 0.8749 (0.0528) | 3.2070 (0.5920) | 0.8468 (0.0932)

Table 1. RMSE of the trained GNNs, trained for 20 epochs, on the testing set (unseen data) for different magnitudes of relative perturbations. We consider GNNs of 2 and 3 layers with K = 5 filter taps in both cases and F_1 = 64 and F_2 = 32 features for the first and second layers, respectively. The constrained learning approach is able to keep a comparable performance on the unperturbed evaluation (i.e., norm of perturbation = 0) while being more stable as the norm of the perturbation increases.

we will increase the relative weight of the objective function. This means that if the constraint is more restrictive, the optimal dual variable will be larger.
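The primal-dual mechanics just described can be sketched on a generic objective and constraint. The toy below is illustrative only (not the paper's GNN training code): it alternates a descent step on the Lagrangian with a projected ascent step on λ driven by the constraint slack, mirroring the structure of Algorithm 1:

```python
import numpy as np

def primal_dual(grad_obj, grad_con, con, theta0,
                eta_p=0.05, eta_d=0.05, iters=2000):
    """Primal-dual iteration on min f(theta) s.t. g(theta) <= 0.

    grad_obj(theta): gradient of the objective f
    grad_con(theta): gradient of the constraint function g
    con(theta)     : constraint slack g(theta) (feasible iff <= 0)
    """
    theta, lam = theta0, 0.0
    for _ in range(iters):
        # primal step: descend the Lagrangian for the current dual variable
        theta = theta - eta_p * (grad_obj(theta) + lam * grad_con(theta))
        # dual step: ascend in lambda with the slack, project onto lambda >= 0
        lam = max(lam + eta_d * con(theta), 0.0)
    return theta, lam
```

On a toy convex problem such as min (θ − 2)² subject to θ ≤ 1, the iterates settle near the constrained optimum θ = 1 with dual variable λ ≈ 2, illustrating how a tighter constraint yields a larger optimal dual variable.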

Theorem 2 (Convergence). Under Assumptions 1, 2, and 3, if for each dual variable λ_k the Lagrangian is minimized up to a precision α > 0, i.e., L(H_{λ_k}, λ_k) ≤ min_{H ∈ R^Q} L(H, λ_k) + α, then for a fixed tolerance β > 0 the iterates generated by Algorithm 1 reach a neighborhood of the optimal value P* in finite time,

P* + α ≥ L(H_k, λ_k) ≥ P* − (2λ* + 1)Lξ − α − β − η_D B²/2.

Theorem 2 allows us to claim convergence of Algorithm 1 up to a neighborhood of the optimum of problem (6) that depends on the tolerance β, the precision α, the dual step size η_D, and the loss bound B.

4. EXPERIMENTS

We consider the problem of predicting the rating a movie will be given by a user. We leverage the MovieLens 100k dataset [17], which contains 100,000 integer ratings between 1 and 5 collected from U = 943 users and M = 1682 movies. For this example, we focus only on the movie "Contact". To exploit the underlying graph structure of the problem, we build a movie similarity graph, obtained by computing the pairwise correlations between the different movies in the training set [18]. In order to showcase the stability properties of GNNs, we perturb the graph shift operator according to the relative perturbation model of [12, Definition 3], Ŝ = S + ES + SE. We consider a uniform distribution of E such that ‖E‖ ≤ ε.
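One simple way to sample such a relative perturbation is sketched below. The choice of distribution is illustrative (a symmetric uniform draw rescaled to operator norm ε) and is not necessarily the exact sampler used in the experiments:

```python
import numpy as np

def relative_perturbation(S, eps, seed=0):
    """Sample S_hat = S + E S + S E with a random symmetric E, ||E|| <= eps.

    Draws E with i.i.d. uniform entries, symmetrizes it, and rescales it
    so that its operator (spectral) norm equals eps.
    """
    rng = np.random.default_rng(seed)
    N = S.shape[0]
    E = rng.uniform(-1.0, 1.0, size=(N, N))
    E = (E + E.T) / 2.0                      # symmetrize
    E *= eps / np.linalg.norm(E, 2)          # enforce ||E||_2 = eps
    return S + E @ S + S @ E
```

Repeated calls with different seeds give the M perturbed shift operators Ŝ_j used per primal step in Algorithm 1.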

We split the dataset into 90% for training and 10% for testing, considering 10 independent random splits. For the optimizer, we used a batch size of 5 samples and ADAM [19] with learning rate 0.005, β_1 = 0.9, β_2 = 0.999, and no learning rate decay. We used the smooth L1 loss. For the GNNs, we used ReLU as the nonlinearity, and we considered two architectures: (i) one layer with F = 64 features, and (ii) two layers with F_1 = 64 and F_2 = 32. In both cases, we used K = 5 filter taps per filter. For Algorithm 1, we used dual step size η_D = 0.1, stability constant C = 1/3, and perturbation magnitude ε = 0.3. We used M = 3 perturbations per primal step, and to evaluate the constraint slackness we used 20% of the training set.

Table 1 shows the RMSE achieved on the test set when the GNN is trained using Algorithm 1 and when trained unconstrained (cf. (3)). We evaluate GNN performance for different magnitudes of perturbations of the graph. The numerical results shown in Table 1 corroborate the claims that we put forward. First, using Algorithm 1 we attain performance comparable to the one we would have achieved by training while ignoring the perturbation. As seen in the first row, the evaluation of the trained GNNs produces comparable results for both the 2- and the 3-layer GNN. Second, with our formulation we obtain more stable representations: when the perturbation magnitude increases, the loss deteriorates at a slower rate. This effect is especially noticeable for the 3-layer GNN. It is well studied that GNN stability worsens as the number of layers increases; however, using Algorithm 1 we are able to curtail this undesirable effect.

5. CONCLUSION

In this paper we introduced a constrained learning formulation to improve the stability of GNNs. By explicitly introducing a constraint on the stability of the GNN, we are able to obtain filter coefficients that are more resilient to perturbations of the graph. The benefit of our novel procedure was benchmarked on a recommendation system problem with real-world data. For future work, we will improve our theoretical guarantees in order to ensure stability, and consider other more demanding applications such as robot swarms.

6. REFERENCES

[1] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun, "Graph neural networks: A review of methods and applications," AI Open, vol. 1, pp. 57–81, 2020.

[2] Fernando Gama, Antonio G. Marques, Geert Leus, and Alejandro Ribeiro, "Convolutional neural network architectures for signals supported on graphs," IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 1034–1049, 2018.

[3] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur, "Protein interface prediction using graph convolutional networks," in Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017.

[4] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, vol. 28, Curran Associates, Inc., 2015.

[5] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu, "Learning human-object interactions by graph parsing neural networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 401–417.

[6] Qingbiao Li, Fernando Gama, Alejandro Ribeiro, and Amanda Prorok, "Graph neural networks for decentralized multi-robot path planning," arXiv preprint arXiv:1912.06095, 2019.

[7] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna, "On the equivalence between graph isomorphism testing and function approximation with GNNs," arXiv preprint arXiv:1905.12560, 2019.

[8] Nicolas Keriven and Gabriel Peyré, "Universal invariant and equivariant graph neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2019.

[9] Zhan Gao, Elvin Isufi, and Alejandro Ribeiro, "Stability of graph convolutional neural networks to stochastic perturbations," Signal Processing, p. 108216, 2021.

[10] Samar Hadou, Charilaos I. Kanatsoulis, and Alejandro Ribeiro, "Space-time graph neural networks," 2021.

[11] Luana Ruiz, Luiz Chamon, and Alejandro Ribeiro, "Graphon neural networks and the transferability of graph neural networks," Advances in Neural Information Processing Systems, vol. 33, 2020.

[12] Fernando Gama, Joan Bruna, and Alejandro Ribeiro, "Stability properties of graph neural networks," IEEE Transactions on Signal Processing, vol. 68, pp. 5680–5695, 2020.

[13] Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.

[14] Luiz Chamon and Alejandro Ribeiro, "Probably approximately correct constrained learning," Advances in Neural Information Processing Systems, vol. 33, 2020.

[15] Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, 2009.

[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, vol. 1, MIT Press, 2016.

[17] F. Maxwell Harper and Joseph A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 1–19, 2015.

[18] Weiyu Huang, Antonio G. Marques, and Alejandro R. Ribeiro, "Rating prediction via graph signal processing," IEEE Transactions on Signal Processing, vol. 66, no. 19, pp. 5066–5081, 2018.

[19] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.