Learning Laplacian Matrix from Graph Signals with
Sparse Spectral Representation
Pierre Humbert1∗  humbertp@cmla.ens-cachan.fr
Batiste Le Bars1∗†  lebars@cmla.ens-cachan.fr
Laurent Oudre  laurent.oudre@univ-paris13.fr
Argyris Kalogeratos∗  kalogeratos@cmla.ens-cachan.fr
Nicolas Vayatis∗  vayatis@cmla.ens-cachan.fr
∗ CMLA, ENS Paris-Saclay – CNRS – University Paris-Saclay, Cachan, 94230, France.
L2TI, University Paris 13, Villetaneuse, 93430, France.
† Sigfox R&D, 31670 Labège, France.
1. Authors with equal contribution to this work.
Abstract
In this paper, we consider the problem of learning a graph structure from multivariate
signals, known as graph signals. Such signals are multivariate observations carrying
measurements that correspond to the nodes of an unknown graph, which we wish to infer. They
are assumed to enjoy a sparse representation in the graph spectral domain, a feature which
is known to carry information related to the cluster structure of a graph. The signals are
also assumed to behave smoothly with respect to the underlying graph structure. For the
graph learning problem, we propose a new optimization program to learn the Laplacian of
this graph and provide two algorithms to solve it, called IGL-3SR and FGL-3SR. Based
on a 3-step alternating procedure, both algorithms rely on standard minimization methods
– such as manifold gradient descent or linear programming – and have lower complexity
compared to state-of-the-art algorithms. While IGL-3SR ensures convergence, FGL-3SR
acts as a relaxation and is significantly faster since its alternating process relies on multiple
closed-form solutions. Both algorithms are evaluated on synthetic and real data. They
are shown to perform as well as or better than their competitors in terms of both numerical
performance and scalability. Finally, we present a probabilistic interpretation of the
optimization program as a Factor Analysis Model.
Keywords: Graph learning, graph Laplacian, non-convex optimization, graph signal
processing, sparse coding, clustering
1. Introduction
Hidden structures in multivariate or multimodal signals can be captured through the notion
of a graph. The availability of such a graph is a core assumption in many computational tasks
such as spectral clustering, semi-supervised learning, graph signal processing, etc. However,
in most situations, no natural graph can be derived or defined and the underlying graph
must be inferred from available data. This task, often referred to as graph learning, has
received significant attention in fields such as machine learning, signal processing, biology,
meteorology, and others (Friedman et al., 2008; Hecker et al., 2009; William et al., 2017).
Learning a graph is an ill-posed problem as several graphs can explain the same set
of observations. Previous works have been devoted to introducing underlying models or
constraints that would narrow down the range of possible solutions. For instance, physical
constraints can be imposed to suggest epidemic models or other information propagation
and interaction models (Rodriguez et al., 2011; Du et al., 2012; Gomez-Rodriguez et al.,
2016). From a statistical perspective, the graph learning task is seen as the estimation
of the parameters of a certain probability distribution parametrized by the graph itself.
Generally, the assumed class of distributions is either a Bayesian Network in the case of
directed graphs, or a Markov Random Field for undirected graphs (Koller et al., 2009; Yang
et al., 2015; Wang and Kolar, 2016; Tarzanagh and Michailidis, 2018). Hence, the graph
structure encompasses the conditional dependencies between variables: two variables are
connected in the graph if they are dependent conditionally on all the other variables.
In the particular case of Gaussian Random Fields, graph estimation amounts to
estimating the inverse covariance matrix, known as the precision matrix (Banerjee et al.,
2008; Friedman et al., 2008). In the latter reference, the proposed estimation method
corresponds to the well-known Graph-Lasso algorithm, which relies on the assumption that
the precision matrix is subject to a sparsity constraint.
More recently, Graph Signal Processing (GSP) (Shuman et al., 2013; Djuric and Richard,
2018) has generalized the standard concepts and tools of signal processing to multivariate
signals recorded over graph structures. Notions such as smoothness, sampling, filtering,
etc., have been adapted to this framework, opening a new field that paves the way to
further developments in graph learning (Pasdeloup et al., 2017; Thanou et al., 2017). In
this framework, the smoothness of observations with respect to the true underlying graph
is a common assumption (Dong et al., 2018; Daitch et al., 2009; Kalofolias, 2016; Egilmez
et al., 2016; Chepuri et al., 2017), which asks for graphs on which signals have small local
variations among adjacent nodes. Another naturally arising property of real-world problems
is the sparsity of the observations in the graph spectral basis (Valsesia et al., 2018; Sardellitti
et al., 2019). In data clustering, for instance, the vector of labels, seen as a signal over the
vertices of a graph, exhibits a sparse spectral representation: it is smooth within each cluster
and varies across different clusters (Figure 1). Hence, building such a graph is relevant for
graph-based clustering approaches, such as spectral clustering. Furthermore, such a sparsity
assumption is also relevant for the sampling task. Indeed, by making use of this property,
it is possible under mild conditions to reconstruct the observations for nodes that have
not been sampled (Chen et al., 2015). These properties, all borrowed from the GSP field,
can be seen as constraints or regularizations for the graph learning task, and offer a new
perspective on the topic.
Aim and main contributions. In the present paper, we introduce an optimization
problem to learn a graph from signals that are assumed to be smooth and to admit a
sparse representation in the spectral domain of the graph. The main contributions can be
summarized as follows:
• The graph learning task is cast as the optimization of a smooth nonconvex
objective function over a nonconvex set (Section 2). This challenging problem is
efficiently solved by introducing a framework that combines barrier methods, alternating
minimization, and manifold optimization (Section 3). A relaxed algorithm is also
proposed, which scales better with the graph dimension (Section 4).
• A factor analysis model for smooth graph signals with sparse spectral representation
is introduced (Section 5). This model provides a probabilistic interpretation of our
optimization program and links its objective function to a maximum a posteriori
estimation.
• The proposed algorithms are tested on several synthetic and real databases, and compared
to state-of-the-art approaches (Section 7). Experimental results show that our
approach obtains similar or better performance than standard existing methods
while significantly lowering the necessary computing resources.
Background and notations. Throughout the paper, we consider an undirected and
weighted graph $G$ with no self-loops. It is defined as a pair $G = (V, E)$ with vertices (or
nodes) $V = \{1, ..., N\}$ and set of edges $E = \{(i, j, w_{ij}),\ i, j \in V\}$ with weights $w_{ij} \in \mathbb{R}_+$
arranged in a weight matrix $W \in \mathbb{R}_+^{N \times N}$. More particularly, we focus on its combinatorial
graph Laplacian matrix, which entirely describes the graph and is given by $L = D - W$, where $D$ is
the diagonal degree matrix and $W$ the weight matrix. As $G$ is undirected, $L$ is a symmetric
positive semi-definite matrix. Its eigenvalue decomposition can be written as $L = X \Lambda X^T$,
with $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_N)$ a diagonal matrix holding the eigenvalues and $X = (x_1, ..., x_N)$ a
matrix with the eigenvectors as columns. We also consider graph signals (or graph functions)
on this graph. A graph signal is defined as a function $y : V \to \mathbb{R}$ that assigns a
scalar value to each vertex. This function can be represented as a vector $y \in \mathbb{R}^N$, with $y_i$
the function value at the $i$-th vertex. Also, with $1_N$ we denote the vector of ones
of size $N$, and with $0_N$ the vector of zeros. All remaining notations are given
throughout the paper and are collected in Table 2 (Section 9).
Next, we give important definitions for two graph signal properties: first the smoothness,
and then the spectral sparsity that allows us to create a spectral representation of a graph
signal $y$ adapted to a graph, using the Graph Fourier Transform (GFT).
Definition 1 (Smoothness) – Let $G = (V, E)$ be a graph, $L$ be its Laplacian matrix, and
$y \in \mathbb{R}^N$ be a graph signal seen as a vector. Given a smoothness level $s \geq 0$, the graph
signal $y$ is said to be $s$-smooth with respect to the graph $G$ if
$$ y^T L y = \frac{1}{2} \sum_{i,j} w_{ij} (y_i - y_j)^2 \leq s. \qquad (1) $$
Intuitively, a graph signal $y$ is $s$-smooth with respect to $G$ if adjacent nodes of the graph
carry sufficiently similar signal values.
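As an illustration, the following minimal Python sketch computes $y^T L y$ for a toy weighted graph and checks it against a given smoothness level; the graph and the signal below are arbitrary examples, not data used elsewhere in the paper.

```python
import numpy as np

# Toy 3-node weighted graph (arbitrary example) and its combinatorial Laplacian L = D - W.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

# A signal whose values vary little across connected nodes.
y = np.array([1.0, 1.1, 0.9])

# y^T L y equals 0.5 * sum_{i,j} w_ij (y_i - y_j)^2, as in Equation (1).
smoothness = y @ L @ y
print(smoothness, smoothness <= 1.0)   # the signal is s-smooth for s = 1
```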
Definition 2 (Graph Fourier Transform) – Let $G = (V, E)$ be an undirected graph with
no self-loops, and $L = X \Lambda X^T$ be the eigenvalue decomposition of its Laplacian matrix.
Then, the GFT of a graph signal $y \in \mathbb{R}^N$ is given by $h = X^T y$, where the components of
$h$ are interpreted as Fourier coefficients, the eigenvalues $\Lambda$ as distinct frequencies, and the
eigenvectors $X$ as the decomposition basis.

Figure 1: Two graph signals observed on the same graph of 200 nodes. (a) The first signal admits
smoothness at the level of adjacent nodes, and a 100-sparse spectral representation. (b) The second
signal also admits smoothness, but in this case it extends to larger clusters of connected nodes. As
a consequence, this graph signal enjoys a 3-sparse spectral representation.
Definition 3 (Spectral sparsity) – Consider the notations of Definition 2. Given $k \in \mathbb{N}_+$,
we say that a graph signal $y$ admits a $k$-sparse spectral representation with respect to a graph
$G$ if, for $h = X^T y$,
$$ \|h\|_0 \leq k, \qquad (2) $$
where $\|h\|_0$ stands for the number of non-zero elements of $h$.
Regarding this definition, $y$ admits a $k$-sparse spectral representation if the number of non-zero
elements in its vector of Fourier coefficients is less than or equal to $k$. In the following, we also
use the term $k$-bandlimited when referring to a $k$-sparse signal.
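A minimal sketch of the GFT and of the $k$-sparsity check, reusing the same toy graph as above; the numerical tolerance is an arbitrary implementation choice.

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
y = np.array([1.0, 1.1, 0.9])

# Eigendecomposition L = X diag(eigvals) X^T, eigenvalues in ascending order.
eigvals, X = np.linalg.eigh(L)

# Graph Fourier Transform (Definition 2): Fourier coefficients of y.
h = X.T @ y

# ||h||_0 up to numerical precision (Definition 3).
k = np.count_nonzero(np.abs(h) > 1e-10)
print(f"y admits a {k}-sparse spectral representation")
```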
2. Problem Statement
This section describes the graph learning problem for smooth and sparse graph signals.
2.1 Setup and working assumptions
The general task of graph learning aims at building a graph $G$ that best explains the structure
of $n$ observed graph signals $\{y^{(k)}\}_{k=1}^{n}$ of size $N$, composing a matrix $Y = [y^{(1)}, \cdots, y^{(n)}] \in \mathbb{R}^{N \times n}$.
The proposed graph learning framework takes as input the matrix $Y$ and outputs the
Laplacian matrix $L$ associated to $G$ (note that the two notions are equivalent). Our learning
process is based on the following assumptions:
Assumption 1 (Assumption on the graph $G$) – $G$ is undirected, with no self-loops, and has
a single connected component.
With Assumption 1, $L$ is a symmetric positive semi-definite matrix with eigenvalue decomposition
$L = X \Lambda X^T$, where $\lambda_1 = 0$ and $x_1 = \frac{1}{\sqrt{N}} 1_N$ (Chung and Graham, 1997).
Assumption 2 (Assumption on the signals $Y$) – Graph signals $Y$ defined over the true
underlying graph $G$ are assumed to be $s$-smooth and to admit a $k$-sparse spectral representation,
with unknown values for $s$ and $k$.
On the smoothness assumption. According to Equation (1), low $s$ values tend to favor
smooth signals for which adjacent nodes carry similar signal values. This property has
consequently been widely considered for the graph learning task (Daitch et al., 2009; Dong
et al., 2016).
On the spectral sparsity assumption. This property is known as bandlimitedness in
the GSP field. In general, it assumes that the null components of $h$ are those associated
to the largest eigenvalues (frequencies). Essentially, this additional hypothesis expresses a
fundamental principle of signal processing which suggests filtering-out the high-frequency
band of a signal, as it carries mainly noise and little or no information. This assumption is
very common for graph signals, especially in GSP where it is the main hypothesis of several
graph sampling methods (Anis et al., 2014; Narang et al., 2013; Chen et al., 2015; Marques
et al., 2016). To further justify this property, note that it intuitively implies that the
signal is smooth at a larger scale, across subgraphs that coincide to a large extent with
the cluster structure of the underlying graph. As a result, such a graph signal admits a
$k$-sparse spectral representation.
Figure 1 shows two example graph signals that illustrate the intuition behind
our two core assumptions.
2.2 Graph Learning for Smooth and Sparse Spectral Representation
A general graph learning scheme consists in learning the adjacency or the Laplacian matrix.
However, since the constraint of Assumption 2 (sparsity of the graph signals over the eigenbasis
of the Laplacian matrix) is easier to express in the spectral domain, in this
article we focus on learning the eigendecomposition of the Laplacian matrix $L = X \Lambda X^T$.
The optimization problem incorporates a linear least-squares regression term depending on
$Y$, $X$, and $H$, which controls the distance of the new representation $XH$ to the observations
$Y$. In addition, due to Assumption 2, we add two penalization terms: one to control the
smoothness of the new representation, depending on $\Lambda$ and $H$; the other to control the
sparsity in the spectral domain, which only depends on $H$. Finally, as we want to learn
a Laplacian matrix satisfying Assumption 1, equality and inequality constraints relative to
$X$ and $\Lambda$ are necessary. To that end, we introduce the following optimization problem:
$$ \min_{H, X, \Lambda} \; \|Y - XH\|_F^2 + \alpha \|\Lambda^{1/2} H\|_F^2 + \beta \|H\|_S, \qquad (3) $$
$$ \text{s.t.} \quad
\begin{cases}
X^T X = I_N, \; x_1 = \frac{1}{\sqrt{N}} 1_N, & \text{(a)}\\
(X \Lambda X^T)_{k,\ell} \leq 0 \quad k \neq \ell, & \text{(b)}\\
\Lambda = \mathrm{diag}(0, \lambda_2, ..., \lambda_N) \succeq 0, & \text{(c)}\\
\mathrm{tr}(\Lambda) = N \in \mathbb{R}_+^*, & \text{(d)}
\end{cases} $$
where $I_N$ is the identity matrix of size $N$, $\mathrm{tr}(\cdot)$ denotes the trace, and $\Lambda \succeq 0$ indicates that
the matrix is positive semi-definite.
This problem aims at conjointly learning the Laplacian $L$ (i.e. $(X, \Lambda)$) and a smooth
bandlimited approximation $XH$ of the observed signals $Y$. Here, $H$ is the same size as $Y$
and corresponds to the spectral representation of the graph signals through the GFT.
Interpretation of the terms. In the objective function (3), the first term corresponds
to the quadratic approximation error of $Y$ by $XH$, where $\|\cdot\|_F$ is the Frobenius norm.
The second term is a smoothness regularization imposed on the approximation $XH$. Rewriting
the smoothness equation (1) for the set of graph signals $XH$, we obtain
$$ \|L^{1/2} X H\|_F^2 = \|X \Lambda^{1/2} X^T X H\|_F^2 = \|\Lambda^{1/2} H\|_F^2 = \sum_{i=1}^{N} \lambda_i \|H_{i,:}\|_2^2, $$
where $H_{i,:}$ is the $i$-th row of the matrix $H$. This kind of regularization is very common
in graph learning (Kalofolias, 2016; Chepuri et al., 2017). From its definition, we can see
that it tends to be low when high values of $\{\lambda_i\}_{i=1}^{N}$ are associated to rows of $H$ with low
$\ell_2$-norm. This corroborates the idea that the $\{\lambda_i\}_{i=1}^{N}$ can be interpreted as frequencies and
the elements of $H$ as Fourier coefficients.
The last term, $\beta \|H\|_S$, is a sparsity regularization. In this work, we propose to use either
the $\ell_{2,1}$-norm (sum of the $\ell_2$-norms of the rows of $H$) or the $\ell_{2,0}$-norm (number of rows with
non-zero $\ell_2$-norm), both of which induce a row-sparse solution $\widehat{H}$.
Remark on the choice of $\|\cdot\|_S$ – In the context of GSP, it is natural to assume that
the graph signals are bandlimited at the same dimensions. This property is enforced by
$\|\cdot\|_S$ and has two main advantages: it is a key assumption for sampling over a graph, and
this particular structure is better suited for inferring graphs with clusters (Sardellitti et al., 2019).
Therefore, in this article, the use of the classical $\ell_0$-norm and $\ell_1$-norm has not been
investigated, since they would impose sparsity at every dimension of the matrix $H$ 'independently',
which would consequently break the bandlimitedness assumption.
The hyperparameters $\alpha, \beta > 0$ control, respectively, the smoothness of the approximated
signals and the sparsity of $H$. A discussion on the influence of these hyperparameters
and an efficient way to fix them is provided in Section 7.3.1. Finally, the first three
constraints (3a), (3b), (3c) enforce $X \Lambda X^T$ to be the Laplacian matrix of a graph with a single
connected component (Assumption 1). More specifically, by definition, $L = D - W$ with
$W \in \mathbb{R}_+^{N \times N}$, thus we necessarily have $\forall k \neq \ell,\ L_{k,\ell} = (X \Lambda X^T)_{k,\ell} \leq 0$ (constraint (3b)).
Furthermore, as $X \Lambda X^T$ is the eigendecomposition of the Laplacian matrix of an undirected
graph with a single connected component (Assumption 1), we have $X^T X = I_N$, $x_1 = \frac{1}{\sqrt{N}} 1_N$ and
$\lambda_1 = 0 < \lambda_2 \leq ... \leq \lambda_N$ (constraints (3a) and (3c)). The last constraint (3d) was proposed
in Dong et al. (2016) to impose structure on the learned graph so that the trivial solution
$\widehat{\Lambda} = 0$ is avoided. A discussion about values other than $N$ is made in Kalofolias (2016).
Properties of the objective function (3). The objective function (3) is not jointly
convex, but it is convex with respect to each of the block-variables $H$, $X$, or $\Lambda$ taken independently.
A natural approach to solve this problem is to alternate between the three variables,
minimizing over one while keeping the others fixed. However, due to the equality constraint
(3a) and the inequalities (3b), the feasible set is not convex with respect to $X$. Hence, this
approach raises several difficulties that will be discussed and handled in the following section.
2.3 Reformulation of the problem
As stated in Section 2.2, problem (3) is not jointly convex and cannot be solved easily with
constraints (3a) and (3b). In this section, we propose to rewrite constraints (3a) and (3b), in
order to define a new equivalent optimization problem that can be solved with well-known
techniques.
2.3.1 Reformulation of the constraint (3a)
In this section, we show that the constraint (3a) can be reformulated as a constraint over
the space of orthogonal matrices in $\mathbb{R}^{(N-1) \times (N-1)}$. Although such a transformation does not
change the convexity of the feasible set, we will see in Section 3.3 that there exist efficient
algorithms that perform optimization over such a manifold.
Definition 4 (Orthogonal group) – The space of orthogonal matrices in $\mathbb{R}^{N \times N}$, called the
orthogonal group, is the space
$$ \mathrm{Orth}(N) = \{ X \in \mathbb{R}^{N \times N} \,|\, X^T X = I_N \}. $$
Lemma 5 – Given $X, X_0 \in \mathbb{R}^{N \times N}$ two orthogonal matrices, both having their first column
equal to $\frac{1}{\sqrt{N}} 1_N$ (constraint (3a)), we have the following equality
$$ X = X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & [X_0^T X]_{2:,2:} \end{pmatrix}, $$
with $[X_0^T X]_{2:,2:}$ denoting the submatrix of $X_0^T X$ containing everything but its first row and
column. Furthermore, $[X_0^T X]_{2:,2:}$ is in $\mathrm{Orth}(N-1)$.
The above lemma allows us to build an equivalent formulation of Problem (3) given by
the following proposition.
Proposition 6 – Given $X_0 \in \mathbb{R}^{N \times N}$ an orthogonal matrix with first column equal to
$\frac{1}{\sqrt{N}} 1_N$, an equivalent formulation of the optimization problem (3) is given by
$$ \min_{H, U, \Lambda} \; \Big\| Y - X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} H \Big\|_F^2 + \alpha \|\Lambda^{1/2} H\|_F^2 + \beta \|H\|_S \;\triangleq\; f(H, U, \Lambda), \qquad (4) $$
$$ \text{s.t.} \quad
\begin{cases}
U^T U = I_{N-1}, & \text{(a')}\\
\Big( X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} \Lambda \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T \end{pmatrix} X_0^T \Big)_{k,\ell} \leq 0 \quad k \neq \ell, & \text{(b')}\\
\Lambda = \mathrm{diag}(0, \lambda_2, ..., \lambda_N) \succeq 0, & \text{(c)}\\
\mathrm{tr}(\Lambda) = N \in \mathbb{R}_+^*. & \text{(d)}
\end{cases} $$
The latter proposition says that, since the first column of $X$ is fixed and known, it is sufficient
to look for an optimal rotation of a valid matrix $X_0$ that preserves the first column. Such
a rotation matrix is given above and is parametrized by a matrix $U$ in $\mathrm{Orth}(N-1)$. Note that in
practice, to find a matrix $X_0$ satisfying (3a), we build the Laplacian of any graph with a
single connected component and take its eigenvectors.
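As an illustrative sketch (not the exact code of our released implementation), a valid $X_0$ can be obtained from the complete-graph Laplacian, and the rotated basis of Proposition 6 built from any $U \in \mathrm{Orth}(N-1)$; the function names below are purely indicative.

```python
import numpy as np

def initial_basis(N):
    # A valid X0: eigenvectors of the complete-graph Laplacian (a graph with a
    # single connected component). Its first eigenvector is the constant vector
    # 1/sqrt(N) * 1_N up to sign, which we fix.
    L0 = N * np.eye(N) - np.ones((N, N))
    _, X0 = np.linalg.eigh(L0)       # eigenvalues in ascending order (0 first)
    if X0[0, 0] < 0:
        X0[:, 0] *= -1.0             # enforce the +1/sqrt(N) sign convention
    return X0

def rotate_basis(X0, U):
    # X = X0 * [[1, 0], [0, U]] with U in Orth(N-1), as in Proposition 6.
    N = X0.shape[0]
    R = np.eye(N)
    R[1:, 1:] = U
    return X0 @ R
```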
2.3.2 Log-barrier method for constraint (4b’)
In order to deal with constraint (4b'), we propose to use a log-barrier method. This barrier
function allows us to consider an approximation of problem (4) where the inequality
constraint (4b') is made implicit in the objective function. Denoting by $f(\cdot)$ the objective
function of (4), we want to solve
$$ \min_{H, U, \Lambda} \; f(H, U, \Lambda) + \frac{1}{t} \phi(U, \Lambda) \quad \text{s.t. (4a'), (4c), (4d)}, \qquad (5) $$
where $t$ is a fixed positive constant and $\phi(\cdot)$ is the log-barrier function associated to the
constraint (4b').
Definition 7 (Log-barrier function) – Let the following matrix in $\mathbb{R}^{N \times N}$:
$$ h(U, \Lambda) = X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} \Lambda \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T \end{pmatrix} X_0^T, $$
involved in the constraint (4b'). The associated log-barrier function $\phi : \mathbb{R}^{(N-1) \times (N-1)} \times \mathbb{R}^{N \times N} \to \mathbb{R}$
is defined by
$$ \phi(U, \Lambda) = - \sum_{k=1}^{N-1} \sum_{\ell > k}^{N} \log\big( -h(U, \Lambda)_{k,\ell} \big), \qquad (6) $$
with $\mathrm{dom}(\phi) = \{ (U, \Lambda) \in \mathbb{R}^{(N-1) \times (N-1)} \times \mathbb{R}^{N \times N} \,|\, \forall\, 1 \leq k < \ell \leq N,\ h(U, \Lambda)_{k,\ell} < 0 \}$, i.e. its
domain is the set of points that strictly satisfy the inequality constraints (4b').
This barrier function allows us to perform a block-coordinate descent over three easier-to-solve
subproblems, as we discuss in the next section.
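For illustration, a direct (unoptimized) sketch of $h(U, \Lambda)$ and of the log-barrier $\phi$ of Definition 7 is given below; returning $+\infty$ outside the domain is an implementation convenience, not part of the definition.

```python
import numpy as np

def h_matrix(X0, U, Lam):
    # h(U, Lambda) = X0 [[1,0],[0,U]] Lambda [[1,0],[0,U]]^T X0^T  (Definition 7).
    N = X0.shape[0]
    R = np.eye(N)
    R[1:, 1:] = U
    return X0 @ R @ Lam @ R.T @ X0.T

def log_barrier(X0, U, Lam):
    # phi(U, Lambda) = - sum_{k < l} log( -h(U, Lambda)_{k,l} ), Equation (6).
    Hm = h_matrix(X0, U, Lam)
    off = Hm[np.triu_indices_from(Hm, k=1)]
    if np.any(off >= 0):
        return np.inf        # outside dom(phi): some off-diagonal entry is >= 0
    return -np.sum(np.log(-off))
```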
3. Resolution of the problem: IGL-3SR
In this section, we describe our method, the Iterative Graph Learning for Smooth and Sparse
Spectral Representation (IGL-3SR), and its different steps to solve Problem (5). Given a
fixed $t > 0$, we propose to use a block-coordinate descent on $H$, $U$, and $\Lambda$, which splits
the problem into three partial minimizations that we discuss in this section. One
of the main advantages of IGL-3SR is that each subproblem can be solved efficiently and,
as the objective function is lower-bounded by 0, this procedure ensures convergence. The
method is summarized in Algorithm 1.
3.1 Optimization with respect to H
For fixed $U$ and $\Lambda$, the minimization of Problem (5) with respect to $H$ is:
$$ \min_H \; \|Y - XH\|_F^2 + \alpha \|\Lambda^{1/2} H\|_F^2 + \beta \|H\|_S, \quad \text{where } X = X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix}. \qquad (7) $$
When $\|\cdot\|_S$ is set to $\|\cdot\|_{2,0}$ (resp. $\|\cdot\|_{2,1}$), this problem is a particular case of what is known
as Sparsifying Transform Learning (Ravishankar and Bresler, 2012) (resp. a particular case
of the Group Lasso (Yuan and Lin, 2006) known as Multi-Task Feature Learning (Argyriou
et al., 2007)). Moreover, as $X$ is orthogonal, we are able to derive closed-form solutions
(Proposition 8).
Proposition 8 (Closed-form solutions for the $\ell_{2,0}$ and $\ell_{2,1}$-norms) – The solutions of Problem (7)
when $\|\cdot\|_S$ is set to $\|\cdot\|_{2,0}$ or $\|\cdot\|_{2,1}$ are given in the following.
• Using the $\ell_{2,0}$-norm, the optimal solution of (7) is given by the matrix $\widehat{H} \in \mathbb{R}^{N \times n}$
where, for $1 \leq i \leq N$,
$$ \widehat{H}_{i,:} = \begin{cases} 0 & \text{if } \frac{1}{1+\alpha\lambda_i} \|(X^T Y)_{i,:}\|_2^2 \leq \beta, \\ \frac{1}{1+\alpha\lambda_i} (X^T Y)_{i,:} & \text{else}. \end{cases} \qquad (8) $$
• Using the $\ell_{2,1}$-norm, the optimal solution of (7) is given by the matrix $\widehat{H} \in \mathbb{R}^{N \times n}$
where, for $1 \leq i \leq N$,
$$ \widehat{H}_{i,:} = \frac{1}{1+\alpha\lambda_i} \Big( 1 - \frac{\beta}{2} \frac{1}{\|(X^T Y)_{i,:}\|_2} \Big)_+ (X^T Y)_{i,:}, \qquad (9) $$
where $(t)_+ \triangleq \max\{0, t\}$ is the positive-part function.
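A minimal sketch of the corresponding row-wise updates; the vector `lam` collects the eigenvalues $(\lambda_1, ..., \lambda_N)$, and the small constant guarding against division by zero is an implementation detail, not part of Proposition 8.

```python
import numpy as np

def h_step_l21(Y, X, lam, alpha, beta):
    # Row-wise closed form (9): shrink each row of X^T Y by the l2,1 threshold.
    XtY = X.T @ Y
    norms = np.linalg.norm(XtY, axis=1)
    shrink = np.maximum(0.0, 1.0 - beta / (2.0 * np.maximum(norms, 1e-12)))
    return (shrink / (1.0 + alpha * lam))[:, None] * XtY

def h_step_l20(Y, X, lam, alpha, beta):
    # Row-wise closed form (8): keep row i only if ||(X^T Y)_i||_2^2 / (1 + alpha*lam_i) > beta.
    XtY = X.T @ Y
    keep = np.linalg.norm(XtY, axis=1) ** 2 / (1.0 + alpha * lam) > beta
    return keep[:, None] * XtY / (1.0 + alpha * lam)[:, None]
```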
3.2 Optimization with respect to Λ
For fixed $H$ and $U$, the optimization of Problem (5) with respect to $\Lambda$ is:
$$ \min_{\Lambda} \; \alpha \underbrace{\mathrm{tr}(H H^T \Lambda)}_{\|\Lambda^{1/2} H\|_F^2} + \frac{1}{t} \phi(U, \Lambda) \quad \text{s.t.} \quad \begin{cases} \Lambda = \mathrm{diag}(0, \lambda_2, ..., \lambda_N) \succeq 0, & \text{(c)}\\ \mathrm{tr}(\Lambda) = N \in \mathbb{R}_+^*. & \text{(d)} \end{cases} \qquad (10) $$
This objective function is differentiable and convex with respect to $\Lambda$, and the constraints
define a simplex. Thus, several convex optimization solvers can be employed, such as those
implemented in CVXPY (Diamond and Boyd, 2016). Popular algorithms are interior-point
methods or projected gradient descent methods (Maingé, 2008). Using an algorithm of
the latter type, we compute the gradient of (10) and project each iterate onto the simplex
(Duchi et al., 2008).
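As an illustration of this strategy, the sketch below implements the simplex projection of Duchi et al. (2008) and a projected gradient iteration for (10); the step size, the number of iterations, and the callable `grad_barrier` (assumed to return the gradient of $\frac{}{}\phi(U, \cdot)$ with respect to the eigenvalues) are illustrative assumptions rather than the exact settings of IGL-3SR.

```python
import numpy as np

def project_simplex(v, z):
    # Euclidean projection of v onto {x : x >= 0, sum(x) = z} (Duchi et al., 2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def lambda_step_pgd(H, grad_barrier, lam0, alpha, t, lr=1e-2, n_iter=200):
    # Projected gradient sketch for the Lambda-step (10). The gradient of
    # alpha * tr(H H^T Lambda) w.r.t. lambda_i is alpha * ||H_{i,:}||_2^2.
    N = len(lam0)
    g_smooth = alpha * np.sum(H * H, axis=1)
    lam = lam0.astype(float).copy()
    for _ in range(n_iter):
        lam = lam - lr * (g_smooth + grad_barrier(lam) / t)
        lam[0] = 0.0                                      # constraint (c): lambda_1 = 0
        lam[1:] = project_simplex(lam[1:], z=float(N))    # constraints (c)-(d): >= 0, trace = N
    return lam
```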
Figure 2: The principle of manifold gradient descent, given schematically. $T_X \mathrm{Orth}(N)$ is the
tangent space of $\mathrm{Orth}(N)$ at $X$. The red line corresponds to a curve in $\mathrm{Orth}(N)$ passing through
the point $X$ in the direction of the arrow. At each iteration, considering that $X$ is the current
solution, a search direction belonging to $T_X \mathrm{Orth}(N)$ is first defined, and then a descent
along a curve of the manifold is performed (in the direction of the black arrow along the red line).
3.3 Optimization with respect to U
For fixed $H$ and $\Lambda$, the optimization of Problem (5) with respect to $U$ is:
$$ \min_U \; \Big\| Y - X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} H \Big\|_F^2 + \frac{1}{t} \phi(U, \Lambda) \quad \text{s.t.} \quad U^T U = I_{N-1}. \quad \text{(a')} \qquad (11) $$
The objective function is not convex but twice differentiable, and the constraint (a') involves
the set of orthogonal matrices $\mathrm{Orth}(N-1)$, which is not convex. The orthogonality constraint
is central to many machine learning optimization problems including Principal Component
Analysis (PCA), Sparse PCA, and Independent Component Analysis (ICA) (Hyvärinen
and Oja, 2000; Zou et al., 2006; Shalit and Chechik, 2014). Unfortunately, optimizing over
this constraint is a major challenge since simple updates such as matrix addition usually
break orthonormality. One class of algorithms tackles this issue by taking into account that
the orthogonal group $\mathrm{Orth}(N)$ is a Riemannian submanifold embedded in $\mathbb{R}^{N \times N}$. In this
article, we focus on manifold adaptations of descent algorithms to solve Problem (11).
The generalization of gradient descent methods to a manifold consists in selecting, at
each iteration, a search direction belonging to the tangent space of the manifold defined at
the current point $X$, and then performing a descent along a curve of the manifold. Figure
2 pictures this principle.
Definition 9 (Tangent space at a point of Orth(N)) – Let $X \in \mathrm{Orth}(N)$. The tangent
space of $\mathrm{Orth}(N)$ at point $X$, denoted by $T_X \mathrm{Orth}(N)$, is a $\frac{1}{2}N(N-1)$-dimensional vector
space defined by:
$$ T_X \mathrm{Orth}(N) = \{ X\Omega \,|\, \Omega \in \mathbb{R}^{N \times N} \text{ is skew-symmetric} \}. $$
When we endow each tangent space with the standard inner product, we are able to define
a notion of Riemannian gradient that allows us to find the best descent direction.
For an objective function $\bar{f} : \mathbb{R}^{N \times N} \to \mathbb{R}$, the Riemannian gradient defined over $\mathrm{Orth}(N)$
is given by:
$$ \mathrm{grad}\, \bar{f}(X) = P_X\big( \nabla_X \bar{f}(X) \big), \qquad (12) $$
where $P_X$ is the projection onto the tangent space at $X$, which is equal to
$P_X(\xi) = \frac{1}{2} X (X^T \xi - \xi^T X)$, and $\nabla_X$ is the standard Euclidean gradient. At each iteration, the manifold
gradient descent computes the Riemannian gradient (12), which gives a direction in the
tangent space. Then the update is obtained by applying a retraction to this direction, up
to a step-size. A retraction is an update mapping from the tangent space to the
manifold, and there are many possible ways to perform it (Edelman et al., 1998; Absil
et al., 2009; Arora, 2009; Meyer, 2011). From the last equation, we see that in order to solve
Problem (11) with this method, we need the Euclidean gradients of the objective function,
namely those of $f(\cdot)$ and $\phi(\cdot)$. These are given in the following proposition.
Proposition 10 (Euclidean gradients with respect to U) – The Euclidean gradients of $f(\cdot)$
and $\phi(\cdot)$ with respect to $U$ are:
$$ \nabla_U f(H, U, \Lambda) = -2 \big[ (H Y^T X_0)_{2:,2:} \big]^T + 2\, U (H H^T)_{2:,2:}\,, $$
$$ \nabla_U \phi(U, \Lambda) = - \sum_{k=1}^{N-1} \sum_{\ell > k}^{N} \frac{\big( B_{k,\ell} + B_{k,\ell}^T \big) U \Lambda_{2:,2:}}{h(U, \Lambda)_{k,\ell}}\,, $$
with $\forall\, 1 \leq k, \ell \leq N,\ B_{k,\ell} = \big( X_0^T e_k e_\ell^T X_0 \big)_{2:,2:}$, and $h(\cdot)$ from Definition 7.
3.4 Log-barrier method and initialization
Choice of the parameter t. The quality of the approximation of Problem (4) by
Problem (5) improves as $t > 0$ grows. However, taking a too large $t$ at the beginning may
lead to numerical issues. As a solution, we use the path-following method, which computes
the solution for a sequence of increasing values of $t$ until the desired accuracy is reached. This
method requires an initial value for $t$, denoted $t^{(0)}$, and a parameter $\mu$ such that $t^{(\ell+1)} = \mu t^{(\ell)}$.
For an in-depth discussion we refer to Boyd and Vandenberghe (2004).
Initialization. IGL-3SR requires a feasible solution to initialize the algorithm. One
possible choice is to take $U$ as the identity matrix $I_{N-1}$ and to set $(X_0, \Lambda)$ to the eigenvalue
decomposition of the Laplacian of the complete graph, with trace equal to $N$. Indeed, this
eigenvalue decomposition always satisfies the constraints and belongs to the domain of the
barrier function. The initialization of $H$ is not needed as we start directly with the $H$-step.
IGL-3SR is summarized in Algorithm 1.
Algorithm 1 The IGL-3SR algorithm with ℓ2,1-norm
Input: Y ∈ R^(N×n), α, β
Input of the barrier method: t^(0), t_max, µ – see Section 3.4
Output: Ĥ, X̂, Λ̂
Initialization: L_0 (e.g. the Laplacian of a complete graph) – see Section 3.4
t ← t^(0)
(X_0, Λ) ← SVD(L_0)
U ← I_(N−1)
while t ≤ t_max do
    while not converged do
        ▷ H-step: compute the closed-form solution of Proposition 8
        for i = 1, ..., N do
            H_(i,:) ← (1 / (1 + α λ_i)) · (1 − β / (2 ‖(X^T Y)_(i,:)‖_2))_+ · (X^T Y)_(i,:)
        end for
        ▷ Λ-step: solve Problem (10)
        Λ ← argmin_Λ α tr(H H^T Λ) + (1/t) φ(U, Λ)  s.t.  Λ = diag(0, λ_2, ..., λ_N) ⪰ 0,  tr(Λ) = N
        ▷ U-step: solve Problem (11)
        while not converged do
            U ← retraction(U ((H Y^T X_0)_(2:,2:) U − U^T ((H Y^T X_0)_(2:,2:))^T))
        end while
    end while
    t ← µ t
end while

3.5 Computational complexity of IGL-3SR
Considering a graph with N nodes and n > N graph signals:
• H-step (non-iterative) – The closed-form solution requires computing the matrix product X^T Y, which is of complexity O(nN²).
• Λ-step (iterative) – When using a projected gradient descent method, the complexity of each iteration is O(nN²) to compute the gradient and O(N log N) for the projection (Duchi et al., 2008). Hence, denoting by τ_Λ the number of iterations in each Λ-step, the complexity is O(τ_Λ · nN²).
• X-step (iterative) – The complexity of each iteration is O(nN²) to compute the Riemannian gradient and O(N³) when we use the QR factorization as retraction (Boyd and Vandenberghe, 2018). Hence, denoting by τ_X the number of iterations in each X-step, the complexity is O(τ_X · nN²).
Overall – The complexity of one pass through the main loop of IGL-3SR (i.e. once through
each of the H, Λ, and X steps) is of order O(max(τ_Λ, τ_X) · nN²). However, recall that τ_Λ
and τ_X can be large in practice for reaching a good solution. In the following, we propose
a relaxation for a faster resolution that relies on closed-form solutions.
4. A relaxation for a faster resolution: FGL-3SR
In this section, we propose another algorithm, first introduced in Le Bars et al. (2019),
called Fast Graph Learning for Smooth and Sparse Spectral Representation (FGL-3SR), to
approximately solve the initial Problem (3). FGL-3SR has a significantly reduced computational
complexity due to a well-chosen relaxation. As in the previous section, we use
a block-coordinate descent on $H$, $X$, and $\Lambda$, which decomposes the problem into
three partial minimizations. FGL-3SR relies on a simplification of the minimization step in
$X$ obtained by removing the constraint (3b). This simplification allows us to compute a closed-form
solution for this step, which greatly accelerates the minimization. However, the constraints (3a) and
(3b) are equally important to obtain a valid Laplacian matrix at the end, and relaxing the
problem does not ensure that constraint (3b) will be satisfied. The following proposition
explains why we can drop constraint (3b) at the X-step, while still being able to ensure
that the matrix will be a proper Laplacian at the end of the algorithm.
Proposition 11 (Feasible eigenvalues) – Given any $X \in \mathbb{R}^{N \times N}$ being an orthogonal matrix
with first column equal to $\frac{1}{\sqrt{N}} 1_N$ (constraint (3a)), there always exists a matrix $\Lambda \in \mathbb{R}^{N \times N}$
such that the following constraints are satisfied:
$$ (X \Lambda X^T)_{i,j} \leq 0 \quad i \neq j, \quad \text{(3b)} \qquad \Lambda = \mathrm{diag}(0, \lambda_2, ..., \lambda_N) \succeq 0, \quad \text{(3c)} \qquad \mathrm{tr}(\Lambda) = c \in \mathbb{R}_+^*. \quad \text{(3d)} $$
In Proposition 12 of the next section, we will see that, by ignoring constraint (3b) at the X-step,
we can compute a closed-form solution to the optimization problem. For this reason,
we propose to use the closed-form solution that we derive to learn $X$, and to always
optimize with respect to $\Lambda$ right after. Hence, we are sure that we will obtain a proper Laplacian at the
end of the process (Proposition 11). The initialization and the optimization with respect
to $H$ are not concerned by this relaxation and can therefore be performed as in IGL-3SR
(see Sections 3.1 and 3.4).
4.1 Optimization with respect to X
As already explained, during the X-step we solve the program
$$ \min_X \; \|Y - XH\|_F^2 \quad \text{s.t.} \quad X^T X = I_N, \; x_1 = \tfrac{1}{\sqrt{N}} 1_N, \quad \text{(3a)} \qquad (13) $$
where the constraint (3b) is dropped. The closed-form solution is given next.
Proposition 12 (Closed-form solution of Problem (13)) – Let $X_0$ be any matrix that belongs
to the constraint set (3a), and $M = (X_0^T Y H^T)_{2:,2:}$ the submatrix containing everything
but the input's first row and first column. Finally, let $P D Q^T$ be the SVD of $M$. Then,
the problem admits the following closed-form solution:
$$ \widehat{X} = X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & P Q^T \end{pmatrix}. \qquad (14) $$
In practice, $X_0$ can be fixed to the current value of $X$.
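A direct sketch of this closed form (the function name is ours):

```python
import numpy as np

def x_step(Y, H, X0):
    # Closed-form X-step of FGL-3SR (Proposition 12):
    # SVD of M = (X0^T Y H^T)_{2:,2:}, then X = X0 [[1, 0], [0, P Q^T]].
    M = (X0.T @ Y @ H.T)[1:, 1:]
    P, _, Qt = np.linalg.svd(M)
    N = X0.shape[0]
    R = np.eye(N)
    R[1:, 1:] = P @ Qt
    return X0 @ R
```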
4.2 Optimization with respect to Λ
With respect to $\Lambda$, the optimization Problem (3) becomes:
$$ \min_{\Lambda} \; \alpha \underbrace{\mathrm{tr}(H H^T \Lambda)}_{\|\Lambda^{1/2} H\|_F^2} \quad \text{s.t.} \quad \begin{cases} (X \Lambda X^T)_{i,j} \leq 0 \quad i \neq j, & \text{(b)}\\ \Lambda = \mathrm{diag}(0, \lambda_2, ..., \lambda_N) \succeq 0, & \text{(c)}\\ \mathrm{tr}(\Lambda) = N \in \mathbb{R}_+^*, & \text{(d)} \end{cases} \qquad (15) $$
which is a linear program that can be solved efficiently using linear cone programs. Note
that this involves an optimization over N parameters with ½N(N−1) + N + 1 constraints.
FGL-3SR is summarized in Algorithm 2.

Algorithm 2 The FGL-3SR algorithm with ℓ2,1-norm
Input: Y ∈ R^(N×n), α, β
Output: Ĥ, X̂, Λ̂
Initialization: L_0 (e.g. the Laplacian of a complete graph) – see Section 3.4
(X, Λ) ← SVD(L_0)
for t = 1, 2, ... do
    ▷ H-step: compute the closed-form solution of Proposition 8
    for i = 1, ..., N do
        H_(i,:) ← (1 / (1 + α λ_i)) · (1 − β / (2 ‖(X^T Y)_(i,:)‖_2))_+ · (X^T Y)_(i,:)
    end for
    ▷ X-step: compute the closed-form solution of Proposition 12
    M ← (X^T Y H^T)_(2:,2:)
    (P, D, Q^T) ← SVD(M)
    X ← X [1, 0^T_(N−1); 0_(N−1), P Q^T]
    ▷ Λ-step: solve the linear program (15)
    Λ ← argmin_Λ α tr(H H^T Λ)  s.t.  (X Λ X^T)_(i,j) ≤ 0 for i ≠ j,  Λ = diag(0, λ_2, ..., λ_N) ⪰ 0,  tr(Λ) = N
end for
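For illustration, the Λ-step linear program (15) can be written with CVXPY (the package we use for the Λ-step); the modeling below is a minimal sketch, the constant factor α is dropped since it only rescales the objective, and the function name is ours.

```python
import cvxpy as cp
import numpy as np

def lambda_step_lp(H, X):
    # Sketch of the linear program (15): minimize tr(H H^T Lambda) over diagonal
    # Lambda with lambda_1 = 0, lambda_i >= 0, trace N, and non-positive
    # off-diagonal entries of X Lambda X^T.
    N = X.shape[0]
    lam = cp.Variable(N)
    Lap = X @ cp.diag(lam) @ X.T
    off_diag = np.ones((N, N)) - np.eye(N)
    constraints = [cp.multiply(off_diag, Lap) <= 0,
                   lam[0] == 0, lam[1:] >= 0, cp.sum(lam) == N]
    # tr(H H^T Lambda) = sum_i ||H_{i,:}||_2^2 * lambda_i
    objective = cp.Minimize(cp.sum(cp.multiply(np.sum(H * H, axis=1), lam)))
    cp.Problem(objective, constraints).solve()
    return lam.value
```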
4.3 Computational complexity of FGL-3SR
Considering a graph with N nodes and n graph signals:
• H-step – The closed-form solution requires computing the matrix product X^T Y, which is of complexity O(nN²).
• X-step – The closed-form solution requires computing the SVD of (X_0^T Y H^T)_(2:,2:) ∈ R^((N−1)×(N−1)), which is of complexity O(N³) (Cline and Dhillon, 2006).
• Λ-step – Solving the LP can be done with interior-point methods or with the ellipsoid method (Vandenberghe, 2010). For accuracy ε, the ellipsoid method yields a complexity of O(max(m, N) · N³ log(1/ε)), where m = ½N(N−1) + N + 1 is the number of constraints (Bubeck, 2015).
Overall – As m > N, the complexity of FGL-3SR is of order O(N⁵) when using the
ellipsoid method. In contrast, the most competitive related algorithm in the literature
(ESA-GL (Sardellitti et al., 2019)) relies on a semi-definite program and is of order at
least O(N⁸) (see Section 6). As will be clearly demonstrated in Section 7, in practice the
empirical execution time of FGL-3SR is lower than that of IGL-3SR and ESA-GL.
5. A probabilistic interpretation
In this section, we introduce a new representation model adapted to smooth graph signals
with sparse spectral representation. The goal of this model is to provide a probabilistic
interpretation of Problem (3) and link its objective function to a maximum a posteriori
estimation (Proposition 16).
Given a Laplacian matrix $L = X \Lambda X^T$, we propose the following Factor Analysis framework
to model a graph signal $y$:
$$ y = Xh + m_y + \varepsilon, \qquad (16) $$
where $m_y \in \mathbb{R}^N$ is the mean of the graph signal $y$ and $\varepsilon$ is a Gaussian noise with zero
mean and covariance $\sigma^2 I_N$. Here, the latent variable $h = (h_1, ..., h_N)$ controls $y$ through
the eigenvector matrix $X$ of $L$. The choice of $X$ as representation matrix is particularly
suitable since it reflects the topology of the graph and provides a spectral embedding of
its vertices. Moreover, as seen in Section 2, $X$ can be interpreted as a graph Fourier basis,
which makes it an intuitive choice for the representation matrix. In a noiseless scenario
with $m_y = 0$, $h$ corresponds exactly to the GFT of $y$.
To comply with the spectral sparsity assumption (Assumption 2), we now propose a
distribution that allows $h$ to admit zero-valued components. To this end, we introduce
independent latent Bernoulli variables $\gamma_i$ with success probability $p_i \in [0,1]$. Knowing
$\gamma_1, ..., \gamma_N$, the conditional distribution of $h$ is:
$$ h \,|\, \gamma \sim \mathcal{N}(0, \widetilde{\Lambda}^{\dagger}), \qquad (17) $$
where $\widetilde{\Lambda}^{\dagger}$ is the Moore-Penrose pseudo-inverse of the diagonal matrix containing the values
$\{\lambda_i \mathbb{1}\{\gamma_i = 1\}\}_{i=1}^N$. In this model, $\gamma_i$ controls the sparsity of the $i$-th element of $h$. Indeed,
if $\gamma_i = 0$, then $h_i = 0$ almost surely. On the other hand, if $\gamma_i = 1$, then $h_i$ follows a Gaussian
distribution with zero mean and variance equal to $1/\lambda_i$. This is consistent with the smoothness
hypothesis: for a high value of $\lambda_i$ (high frequency), the distribution of $h_i$ concentrates more
around 0, leading to a small value of $\lambda_i h_i^2$. The associated probability of success $p_i$ can be
chosen a priori. One way to choose it is to take $p_i$ inversely proportional to $\lambda_i$; indeed, this
would increase the probability of being sparse at dimensions where the associated eigenvalue
is high. Note that, since $\lambda_1 = 0$, $h_1$ follows a centered degenerate Gaussian, i.e. $h_1$ is equal
to 0 almost surely. Furthermore, if $p_i = 1$ for all $i$, our model reduces to the one proposed
by Dong et al. (2016), which only focused on the smoothness assumption.
Definition 13 (Prior and conditional distributions) – The following equations summarize
the prior and important conditional distributions of our model:
$$ p(h_i \,|\, \gamma_i, \lambda_i) \propto \exp(-\lambda_i h_i^2)\, \mathbb{1}\{\gamma_i = 1\} + \mathbb{1}\{h_i = 0, \gamma_i = 0\}, \qquad (18) $$
$$ p(y \,|\, h, X) \propto \exp\Big( -\frac{1}{\sigma^2} \|y - Xh - m_y\|_2^2 \Big), \qquad (19) $$
$$ p(\gamma_i) \propto p_i^{\gamma_i} (1 - p_i)^{1 - \gamma_i}. \qquad (20) $$
For simplicity, in the following we consider that $m_y = 0$ and $p_1 = 0$.
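For illustration, graph signals can be sampled from this generative model as in the sketch below; the function signature and the numerical tolerance are ours, and `p[0]` is expected to be 0 so that the constant component stays inactive, consistently with $p_1 = 0$.

```python
import numpy as np

def sample_graph_signals(X, eigvals, p, n, sigma=0.5, seed=None):
    # Sketch of Model (16)-(20): gamma_i ~ Bernoulli(p_i); h_i | gamma_i = 1 is
    # N(0, 1/lambda_i), h_i = 0 otherwise; y = X h + Gaussian noise of std sigma.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Y = np.zeros((N, n))
    for j in range(n):
        gamma = rng.random(N) < p
        h = np.zeros(N)
        active = gamma & (eigvals > 1e-10)          # skip lambda_1 = 0
        h[active] = rng.normal(0.0, 1.0 / np.sqrt(eigvals[active]))
        Y[:, j] = X @ h + rng.normal(0.0, sigma, size=N)
    return Y
```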
Lemma 14 – Assume the proposed Model (16). If $p_1 = 0$ and $p_i \in (0,1)$, $\forall i \geq 2$, then:
$$ -\log(p(h \,|\, y, X, \Lambda)) \propto \frac{1}{\sigma^2} \|y - Xh\|_2^2 + \frac{1}{2} h^T \Lambda h
+ \sum_{i=1}^{N} \mathbb{1}\{h_i \neq 0\} \Big( p_i \log\big(\tfrac{\lambda_i}{\sqrt{2\pi}}\big) - \log(p_i) - \log\big(\tfrac{\lambda_i}{\sqrt{2\pi}}\big) \Big). $$
Definition 15 (Lambert W-Function) – The Lambert W-function, denoted by $W(\cdot)$, is
the inverse function of $f : w \mapsto w e^w$. In particular, we consider $W$ to be the principal
branch of the Lambert function, defined over $[-1/e, \infty)$.
Proposition 16 (A posteriori distribution of h) – Let $C > 0$, and assume for all $i \geq 2$ that
$p_i = e^{-C}$ if $\lambda_i = \sqrt{2\pi}$, whereas $p_i = -W\Big( -e^{-C} \frac{\log(\lambda_i/\sqrt{2\pi})}{\lambda_i/\sqrt{2\pi}} \Big) \frac{1}{\log(\lambda_i/\sqrt{2\pi})}$ otherwise. Then,
$p_i \in (0,1)$ and there exist constants $\alpha, \beta > 0$ such that:
$$ -\log(p(h \,|\, y, X, \Lambda)) \propto \|y - Xh\|_2^2 + \alpha\, h^T \Lambda h + \beta \|h\|_0. $$
This proposition tells us that, for a given Laplacian matrix, the maximum a posteriori
estimate of $h$ corresponds to the minimum of Problem (3).
6. Related work on GSP-based graph learning methods
Here we detail the two state-of-the-art methods for graph learning in the GSP context that
are closest to our work and that will be used for our experimental comparison in Section 7.
GL-SigRep (Dong et al., 2016). This method supposes that the observed graph signals
are smooth with respect to the underlying graph, but does not consider the spectral sparsity
assumption. To learn the graph, the authors propose to solve the optimization problem:
$$ \min_{L, \tilde{Y}} \; \|Y - \tilde{Y}\|_F^2 + \alpha \|L^{1/2} \tilde{Y}\|_F^2 + \beta \|L\|_F^2 \quad \text{s.t.} \quad \begin{cases} L_{k,\ell} = L_{\ell,k} \leq 0 \quad k \neq \ell, \\ L 1 = 0, \\ \mathrm{tr}(L) = N \in \mathbb{R}_+^*. \end{cases} \qquad (21) $$
Remark that, since no constraints are imposed on the spectral representation of the signals,
the Laplacian matrix is directly learned. The optimization procedure to solve (21) consists
in an alternating minimization over $L$ and $\tilde{Y}$. With respect to $\tilde{Y}$ the problem has a
closed-form solution, whereas for $L$ the authors propose to use a Quadratic Program solver
involving $\frac{1}{2}N(N-1)$ parameters and $\frac{1}{2}N(N-1) + N + 1$ constraints.
ESA-GL (Sardellitti et al., 2019). This is a two-step algorithm where the signals are
supposed to admit a sparse representation with respect to the learned graph. The difference
to our work is two-fold. First, ESA-GL does not include the smoothness assumption while
learning the Fourier basis $X$, which leads to a different two-step optimization program. Second,
the complexity of the ESA-GL algorithm (at least $O(N^8)$) is much higher than ours ($O(N^5)$
for FGL-3SR – see Section 4.3), and hence is prohibitive for large graphs. The first step
consists in fitting an orthonormal basis such that the observed graph signals $Y$ admit a
sparse representation with respect to this basis. They consider the problem:
$$ \min_{H, X} \; \|Y - XH\|_F^2 \quad \text{s.t.} \quad \begin{cases} X^T X = I_N, \; x_1 = \frac{1}{\sqrt{N}} 1_N, \\ \|H\|_{2,0} \leq K \in \mathbb{N}, \end{cases} \qquad (22) $$
which is solved using an alternating minimization. Once estimates for $H$ and $X$ have been
computed, they solve a second optimization problem in order to learn the Laplacian $L$
associated to the learned basis $\widehat{X}$. This is done by minimizing:
$$ \min_{L \in \mathbb{R}^{N \times N},\, C_K \in \mathbb{R}^{K \times K}} \; \mathrm{tr}\big( \widehat{H}_K^T C_K \widehat{H}_K \big) + \mu \|L\|_F^2 \quad \text{s.t.} \quad \begin{cases} L_{k,\ell} = L_{\ell,k} \leq 0 \quad k \neq \ell, \\ L 1_N = 0_N, \\ L \widehat{X}_K = \widehat{X}_K C_K, \; C_K \succeq 0, \\ \mathrm{tr}(L) = N \in \mathbb{R}_+^*, \end{cases} \qquad (23) $$
where $C_K \in \mathbb{R}^{K \times K}$ and $\widehat{X}_K$ corresponds to the columns of $\widehat{X}$ associated to the non-zero rows
of $\widehat{H}$, denoted $\widehat{H}_K$. Thus, the second step aims at estimating a Laplacian that enforces the
smoothness of the learned signal representation $\widehat{X}\widehat{H}$. This semi-definite program requires
the computation of over $\frac{1}{2}N(N-1) + \frac{1}{2}K(K-1)$ parameters which, as we show empirically
in the next section, can be difficult to compute for graphs with a large number of nodes. For
more details on the optimization program and the additional matrix $C_K$, the reader may
refer to the aforementioned paper.
7. Experimental evaluation
The two proposed algorithms, IGL-3SR and FGL-3SR, are now evaluated and compared
with the two state-of-the-art methods presented earlier, GL-SigRep and ESA-GL. The
results of our empirical evaluation are organized as follows: Sections 7.2 and 7.3
use synthetic data, first to compare the different methods and then to study the influence
of the hyperparameters; Sections 7.4 to 7.6 present several examples on real-world data.
All experiments were conducted on a single personal laptop with 4-core 2.5 GHz Intel
CPUs and a Linux/Ubuntu OS. For the Λ-step of both algorithms,
we use Python's CVXPY package (Diamond and Boyd, 2016). For the X-step of IGL-3SR,
we use the conjugate gradient descent solver combined with an adaptive line search, both
provided by Pymanopt (Townsend et al., 2016), a Python toolbox for optimization on manifolds.
Note that this package only requires the gradients given in Proposition 10. The source
code of our implementations is available at https://github.com/pierreHmbt/GL-3SR.
7.1 Evaluation metrics
We provide visual and quantitative comparisons of the learned Laplacian $\widehat{L}$ and its weight
matrix $\widehat{W}$ using the performance measures Recall, Precision, and F1-measure, which are
standard for this type of evaluation (Pasdeloup et al., 2017). The F1-measure evaluates the
quality of the estimated support – the non-zero entries – of the graph and is given by:
$$ F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. $$
As in Pasdeloup et al. (2017), the F1-measure is computed on a thresholded version of the
estimated weight matrix $\widehat{W}$. This threshold is equal to the average value of the off-diagonal
entries of $\widehat{W}$ (same process as in Sardellitti et al. (2019)).
In addition, we compute the correlation coefficient $\rho(L, \widehat{L})$ between the true Laplacian
entries $L_{i,j}$ and their estimates $\widehat{L}_{i,j}$:
$$ \rho(L, \widehat{L}) = \frac{\sum_{ij} (L_{ij} - L_m)(\widehat{L}_{ij} - \widehat{L}_m)}{\sqrt{\sum_{ij} (L_{ij} - L_m)^2} \sqrt{\sum_{ij} (\widehat{L}_{ij} - \widehat{L}_m)^2}}, \qquad (24) $$
where $L_m$ and $\widehat{L}_m$ are the average values of the entries of the true and estimated Laplacian
matrices, respectively. This $\rho$ coefficient evaluates the quality of the weight distribution
over the edges.
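A minimal sketch of these two metrics (function names ours; the F1 computation assumes a binary ground-truth support, as in our synthetic experiments):

```python
import numpy as np

def f1_support(W_true, W_est, threshold=None):
    # Precision/recall/F1 on the off-diagonal support of the estimated weight
    # matrix, thresholded at the mean off-diagonal value as in Section 7.1.
    N = W_est.shape[0]
    off = ~np.eye(N, dtype=bool)
    if threshold is None:
        threshold = W_est[off].mean()
    est = (W_est > threshold) & off
    true = (W_true > 0) & off
    tp = np.sum(est & true)
    precision = tp / max(np.sum(est), 1)
    recall = tp / max(np.sum(true), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def laplacian_correlation(L_true, L_est):
    # Correlation coefficient rho(L, L_hat) of Equation (24).
    a, b = L_true - L_true.mean(), L_est - L_est.mean()
    return np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))
```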
7.2 Experiments on synthetic data
We now evaluate and compare all algorithms on several types of synthetic data. The graphs,
the associated graph signals, and the evaluation protocol used for the experiments are detailed in the sequel.
Graphs and signals. We carried out experiments on graphs with 20, 50, and 100 vertices,
following: i) a Random Geometric (RG) graph model with a 2-D uniform distribution for
the coordinates of the nodes and a truncated Gaussian kernel of width 0.5 for the edges,
where weights smaller than 0.75 were set to 0; ii) an Erdős-Rényi (ER) model with edge
probability 0.2.
Given a graph, the sampling process followed Model (18) presented
in Section 5. The mean value of each signal was set to 0, the variance of the noise was set to
0.5, and the sparsity was chosen to obtain observations with a k-sparse spectral representation,
where k is equal to half the number of nodes (i.e. 10, 20, 50).
For each type of graph, we ran 10 experiments with 1000 graph signals generated as
explained above. For all the methods, the hyperparameters α and β are set by maximizing
the F1-measure on the thresholded estimate of W, as explained in Section 7.1.
Choice of ‖·‖_S. In the following, we run all experiments for IGL-3SR and FGL-3SR
with the ℓ2,1-norm. This choice is motivated by an important fact brought by the closed-form
solutions of Proposition 8. Indeed, with the ℓ2,1-norm the sparsity of the estimated H is only controlled
by β (Equation (9)). On the contrary, when using the ℓ2,0-norm, the value of α also
influences the sparsity (Equation (8)). This is an important behavior, as β
and α become independent – at least with respect to the H-step – and therefore, as we
will see in Section 7.3.1, easier to tune.
Quantitative results. Average evaluation metrics and their standard deviations are
collected in Table 1. The results show that the use of the sparsity constraint improves the
quality of the learned graphs. Indeed, the two proposed methods IGL-3SR and FGL-3SR,
as well as ESA-GL, have better overall performance in all the metrics than GL-SigRep,
which only considers the smoothness aspect. This is to be expected, as our methods match
the sparse (bandlimited) condition perfectly.
RG graph model
N    Metric       IGL-3SR          FGL-3SR          ESA-GL           GL-SigRep
20   Precision    0.973 (±0.042)   0.952 (±0.042)   0.899 (±0.054)   0.929 (±0.068)
     Recall       0.974 (±0.018)   0.985 (±0.023)   0.968 (±0.052)   0.967 (±0.028)
     F1-measure   0.974 (±0.028)   0.968 (±0.027)   0.929 (±0.032)   0.947 (±0.040)
     ρ(L, L̂)      0.938 (±0.052)   0.903 (±0.029)   0.925 (±0.050)   0.786 (±0.037)
     Time         <1 min           <10 s            <5 s             <5 s
50   Precision    0.901 (±0.022)   0.817 (±0.041)   0.845 (±0.088)   0.791 (±0.055)
     Recall       0.902 (±0.018)   0.807 (±0.036)   0.910 (±0.040)   0.720 (±0.059)
     F1-measure   0.901 (±0.014)   0.812 (±0.017)   0.868 (±0.036)   0.750 (±0.001)
     ρ(L, L̂)      0.863 (±0.020)   0.743 (±0.031)   0.832 (±0.033)   0.549 (±0.022)
     Time         <17 mins         <40 s            <60 s            <40 s
100  Precision    0.713 (±0.012)   0.711 (±0.029)   0.667 (±0.022)   –
     Recall       0.751 (±0.067)   0.584 (±0.011)   0.743 (±0.017)   –
     F1-measure   0.732 (±0.034)   0.641 (±0.010)   0.703 (±0.012)   –
     ρ(L, L̂)      0.612 (±0.045)   0.483 (±0.015)   0.596 (±0.033)   –
     Time         <50 mins         <2 mins          <4 mins          –

ER graph model
N    Metric       IGL-3SR          FGL-3SR          ESA-GL           GL-SigRep
20   Precision    0.952 (±0.045)   0.819 (±0.080)   0.931 (±0.045)   0.704 (±0.125)
     Recall       0.927 (±0.046)   0.824 (±0.105)   0.951 (±0.041)   0.899 (±0.075)
     F1-measure   0.938 (±0.028)   0.816 (±0.068)   0.941 (±0.038)   0.779 (±0.071)
     ρ(L, L̂)      0.917 (±0.035)   0.730 (±0.063)   0.897 (±0.045)   0.199 (±0.074)
     Time         <1 min           <10 s            <5 s             <5 s
50   Precision    0.820 (±0.027)   0.791 (±0.047)   0.854 (±0.038)   0.476 (±0.037)
     Recall       0.812 (±0.042)   0.740 (±0.049)   0.830 (±0.051)   0.856 (±0.023)
     F1-measure   0.815 (±0.021)   0.761 (±0.031)   0.841 (±0.021)   0.610 (±0.026)
     ρ(L, L̂)      0.783 (±0.026)   0.728 (±0.020)   0.816 (±0.058)   0.058 (±0.002)
     Time         <17 mins         <40 s            <60 s            <40 s
100  Precision    0.677 (±0.044)   0.640 (±0.033)   0.654 (±0.038)   –
     Recall       0.580 (±0.021)   0.543 (±0.027)   0.637 (±0.023)   –
     F1-measure   0.623 (±0.009)   0.586 (±0.016)   0.589 (±0.019)   –
     ρ(L, L̂)      0.551 (±0.016)   0.512 (±0.0223)  0.644 (±0.023)   –
     Time         <50 mins         <2 mins          <4 mins          –

Table 1: Comparison of the four methods on five quality metrics (avg ± std) for graphs of N =
{20, 50, 100} nodes, and for a fixed number of n = 1000 graph signals.
Comparing the results across the different types of synthetic graphs, our methods are robust,
while being more efficient on RG graphs. In general, IGL-3SR and FGL-3SR present
similar performance to ESA-GL. However, IGL-3SR seems preferable in the case of RG
graphs. For 100 nodes, the computational resources necessary for GL-SigRep were already
too demanding, therefore only the results for the other three methods are reported. We can
see that, while IGL-3SR has better results than FGL-3SR, the time necessary to estimate
the graph is much longer. In addition, examples of learned graphs are displayed in Figure 3,
with the ground-truth on the left and the learned weighted adjacency matrices (after
thresholding). The evolution of the F1-measure with respect to the value of the threshold is also
displayed and shows that a large range of thresholds could have been used to obtain similar
performance. All these results, combined with those of Table 1, indicate that in this sampling
process the proposed FGL-3SR method managed to infer accurate graphs despite the
relaxation.
Speed performance. Figure 4 displays the evolution of the empirical computation time
as the number of nodes increases. FGL-3SR appears to be much faster than the other
methods. Furthermore, we observe that our methods are scalable over a wider range of
graph sizes than the competitors. Indeed, even quite small graphs of 100 and 150 nodes,
respectively, were already too ‘large’ for the two competitors to be able to produce results,
and they even led to memory allocation errors.
7.3 Influence of the hyperparameters
We now study how the hyperparameters of IGL-3SR and FGL-3SR influence their overall
performance, with respect to the F1-measure. This study is made on a RG graph with N = 20
nodes and 10-bandlimited signals Y ∈ R^(20×1000).
Figure 3: Graph learning results on random synthetic graphs of 20 nodes: (a) for a RG graph,
and (b) for an ER graph. Each of the two subfigures presents: (top row) the evolution of the F1-measure
with respect to different threshold values, where the dashed line indicates the chosen threshold
value; (bottom row) the ground-truth adjacency matrix on the left, followed by the respective
learned (thresholded) adjacency matrices of the compared methods (IGL-3SR, FGL-3SR, ESA-GL, GL-SigRep).
7.3.1 Influence of α and β
We first highlight the influence of α and β on FGL-3SR. We run and collect the F1-measure
for 20 values of α (resp. β) in [10⁻⁵, 100] (resp. [10⁻⁵, 60]). The resulting heatmaps are
displayed in Figure 5. The most important observation is that the value of α does not seem
to impact the quality of the resulting graphs. Indeed, for a fixed value of β, the F1-measure
is stable when α varies. However, it is interesting that the convergence curve of FGL-3SR
(Figure 6) is directly impacted by α: large values of α tend to produce oscillations in the
convergence curves. Thus, setting α > 0 to a small value is suggested. Contrary to α, tuning
the parameter β is critical since high β values cause a drastic decrease in F1-measure.
This sharp decrease appears when the chosen β imposes too much sparsity on the learned
H. One may note that the best β corresponds to the value just before the sharp decrease,
and this is the value that should be chosen. Although the previous analysis has been done
on FGL-3SR, during our experimental studies α and β influenced the F1-measure similarly
when using IGL-3SR.

Figure 4: Average and standard deviation of the computation time over 10 trials for IGL-3SR,
FGL-3SR, ESA-GL, and GL-SigRep, as the number of nodes increases. GL-SigRep and ESA-GL
failed to produce a result for graphs with more than 100 and 150 nodes, respectively. (a) The total
computation times (standard scale), and (b) the time needed for a single iteration of each algorithm
(semi-log scale). For IGL-3SR and FGL-3SR, a single iteration means one computation of the 3 steps.
7.3.2 Influence of t
We now highlight the influence of t on IGL-3SR. Figure 7 shows the learned graphs for
several values of t ∈ [10, 10⁴]. This experiment brings two main messages. First, when t is
too low, the learned graph is very close to the complete graph, whereas when t increases the
learned graph becomes more structured and tends to be sparse. This result was expected
since a larger t brings the barrier closer to the true constraint, i.e. we allow elements of
the resulting Laplacian matrix to be closer to 0. Second, it appears that α also influences
the final results in a similar way to t. Again, this was expected as the minimization of the
objective function during the Λ-step of Problem (5) is equivalent to the minimization of
$\mathrm{tr}(H H^T \Lambda) + \frac{1}{\alpha t} \phi(U, \Lambda)$.
For a discussion on the initial value of t, t^(0), and the step size µ such that t^(ℓ+1) = µ t^(ℓ),
both relative to the barrier method, we refer the reader to Boyd and Vandenberghe (2004).
However, recall that t is not a hyperparameter to tune in practice, and should be taken as
large as possible; its mere goal is to prevent numerical issues. Fortunately, a wide range
of values for t^(0) and µ achieves that goal (Boyd and Vandenberghe, 2004).
Figure 5: Evolution of the average (panels a, c) and standard deviation (panels b, d) of the F1-measure
over 10 runs of FGL-3SR on RG graphs with 20 nodes. In the top row (a, b) β ∈ [0, 100], and in the
bottom row (c, d) β ∈ [20, 70] (a zoomed-in view).
Figure 6: Convergence curves of the objective function as the number of iterations increases, using
FGL-3SR with (a) a low value α = 10⁻⁵, (b) a medium value α = 10⁻¹, and (c) a high value α = 1.
Figure 7: Learned graphs with increasing t values: (top row) α = 10⁻⁴, (bottom row) α = 10⁻³.
Columns show the ground-truth followed by the graphs learned with t = 10, 100, 1000, 10000.

Tuning the hyperparameters. The hyperparameter α does not seem to have a substantial
impact on the F1-measure. However, a low value may be preferred in FGL-3SR
for convergence purposes (Figure 6). The parameter t always needs to be as large as possible, provided
that it does not cause numerical issues. Classical heuristics and methods, like the one
presented in Section 3.4, can be used to tune t (Boyd and Vandenberghe, 2004). Hence,
according to our experiments, only β remains as a critical hyperparameter to tune for
both these methods. Based on Figure 5, one way to fix it is to find the largest β value
that leads to satisfying results in terms of signal reconstruction. Alternatively, if we have
an idea of the number of clusters k that reside in the graph, we could select a β value
that produces a k-sparse spectral representation. Bearing in mind that other related works
require the tuning of two hyperparameters, our approach turns out to be of higher value for
practical applications on real data, where these parameters are unknown and must be tuned.
7.4 Temperature data
We used hourly temperature (°C) measurements from 32 weather stations in Brittany, France,
over a period of 31 days (Chepuri et al., 2017). The dataset contains 24 × 31 = 744
multivariate observations, i.e. Y ∈ R^(32×744), that are assumed to correspond to an unknown
graph, which is our objective to infer. For our two algorithms, we set α = 10⁻⁴, and β is
chosen so that we obtain a 2-sparse spectral representation, which amounts to assuming that
there are two clusters of weather stations.
The graphs obtained with each of the methods are displayed in Figure 8 (a-b). They are
in accordance with the one found in Chepuri et al. (2017) on the same dataset. Both proposed
methods provide similar results, which shows that the relaxation used in FGL-3SR
has a moderate influence in practice on this real-world problem. Although ground-truth is
not available for this use-case, the quality of the learned graph can be assessed by using it
as input in standard tasks such as graph clustering or sampling. For instance, when applying
spectral clustering (Ng et al., 2001) with two clusters on the resulting Laplacian matrices,
it can be seen that both methods split the learned graph in two parts corresponding to the
north and the south of the region of Brittany (Figure 8 (c-d)), which is an expected natural
segmentation.
Figure 8: (Top row) Learned graph with (a) IGL-3SR and (b) FGL-3SR. The node color corresponds
to the average temperature in °C during the whole observation period. (Bottom row) Graph
segmentation in two parts (red vs. green nodes) with spectral clustering using the Laplacian matrix
learned by (c) IGL-3SR and (d) FGL-3SR.
The learned graphs can also be employed in the graph sampling task. Indeed, due to the constraints used in the optimization problem, the graph signals are bandlimited with respect to the learned graphs; in this example, the graph signals are 2-bandlimited. This property means that it is possible to select only 2 nodes and to reconstruct the graph signal values of the 30 remaining nodes using linear interpolation. Figure 9 displays an example of such a reconstruction: thanks to the learned graph structure, using only 2 nodes allows us to reconstruct the whole data matrix sufficiently well, with a mean absolute error of 0.614. Again, this is a very interesting result that indirectly shows the quality of the learned graph.
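The sketch below illustrates the interpolation step under this bandlimited model: with $X$ the eigenvector matrix of the learned Laplacian and a set of active frequencies, observing the selected nodes suffices to recover the spectral coefficients by least squares. The specific node and frequency indices in the usage comment are placeholders, not the ones actually selected in Figure 9.

```python
import numpy as np

def reconstruct_bandlimited(y_obs, X, sampled_nodes, active_freqs):
    """Reconstruct a graph signal that is bandlimited to `active_freqs`, given
    its values `y_obs` on the nodes listed in `sampled_nodes`.

    Solves y_obs = X[sampled_nodes][:, active_freqs] @ h in the least-squares
    sense, then extrapolates to all nodes as X[:, active_freqs] @ h."""
    A = X[np.ix_(sampled_nodes, active_freqs)]
    h, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    return X[:, active_freqs] @ h

# Hypothetical usage on the temperature data (node/frequency indices are illustrative):
# Y_hat = reconstruct_bandlimited(Y[[4, 27], :], X, sampled_nodes=[4, 27], active_freqs=[0, 1])
```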
7.5 Cancer genome data
In this second experiment, we consider the RNA-Seq Cancer Genome Atlas Research Network dataset (Weinstein et al., 2013). The dataset consists of 801 samples, each of them characterized by a set of 20,531 genetic features and labeled with one out of 5 types of cancer: breast carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear-cell carcinoma (KIRC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD).
Figure 9: (a) The 2 nodes kept for the signal interpolation are shown in black. (b) The true signal at the target node (shown in red on the left) and its reconstruction using only the 2 selected nodes shown on the left (in black).
The goal of the considered task is to build a graph of the 801 individuals using the genetic features and to determine whether the learned graph is able to group the samples according to their tumor type using a graph-based clustering approach; for the proposed FGL-3SR, as previously, we use spectral clustering (Ng et al., 2001) on the learned graph.
As the number of nodes of the graph is too large, ESA-GL and Sig-Rep are not able to run in reasonable time. Therefore, we compare FGL-3SR to two other state-of-the-art methods, which are however not GSP-oriented but rather specialized in obtaining a graph that facilitates data clustering. The two competitors are the Constrained Laplacian Rank (CLR) algorithm (Nie et al., 2016), which builds a special graph from the available data, and the Structured Graph Learning (SGL) algorithm (Kumar et al., 2019), which takes as input the sample covariance matrix of the data. As quality measure we use the clustering accuracy, which was also used in the competitors' associated papers, from which we take their reported results. The results for the three methods are respectively:
FGL-3SR: 0.9887, CLR: 0.9862, SGL: 0.9987.
The first interesting result is that FGL-3SR achieves an accuracy similar to CLR and SGL, even though it is not a graph learning method specifically designed for clustering like its competitors. Secondly, while FGL-3SR comes second in terms of accuracy after SGL, two important remarks need to be made about the SGL method: 1) it must fix the right number of clusters of the learned graph a priori to obtain such a result; 2) it has an additional hyperparameter to tune compared to FGL-3SR. Bearing in mind the above results, the fact that SGL is fine-tailored for the undertaken clustering task and has higher tuning complexity, and finally the limitations of ESA-GL and Sig-Rep that prevent them from being applied in this scenario, FGL-3SR seems to be a promising alternative for large-scale graph-based learning applications.
7.6 Results on the ADHD dataset
In this third experiment, we consider the Attention Deficit Hyperactivity Disorder (ADHD) dataset (Bellec et al., 2017), composed of functional Magnetic Resonance Imaging (fMRI) data. ADHD is a mental pathophysiology characterized by excessive activity (Boyle et al., 2011). We study the resting-state fMRI of 20 subjects with ADHD and 20 healthy subjects, available from Nilearn (Abraham et al., 2014). Each fMRI consists of a series of images measuring brain activity. These images are processed as follows. First, we split the brain into 39 Regions Of Interest (ROIs) with the Multi-Subject Dictionary Learning atlas (Varoquaux et al., 2011) (see Figure 10a). Each ROI defines a node of our graph, and the signal value at a given node is an aggregation of the fMRI values over the associated ROI. For each subject, we therefore obtain a matrix in $\mathbb{R}^{n \times 39}$, where $n$ stands for the number of images in the subject's fMRI.

Figure 10: (a) Indicative ROIs from the Multi-Subject Dictionary Learning atlas extracted in Varoquaux et al. (2011) with sparse dictionary learning. (b)-(c) Graphs returned by FGL-3SR, separately for (b) an ADHD patient and (c) a healthy subject, where darker edges indicate larger connection weights.
We then estimate the graph of each subject by applying the graph learning methods discussed earlier in the article to the extracted signals. Examples of graphs learned with FGL-3SR for an ADHD subject and a healthy subject are displayed in Figure 10. Visually, they reveal strong symmetric links between the right and left hemispheres of the brain. This phenomenon is common in resting-state fMRI, where one hemisphere tends to correlate highly with the homologous anatomical location in the opposite hemisphere (Damoiseaux et al., 2006; Smith et al., 2009). Pointing out differences, though, the graph from the ADHD subject seems less structured and contains several spurious links (diagonal and north-south connections).
Aiming to better highlight the potential value of well-learned graphs for such studies, we proceed to use the Laplacian matrices of the brain graphs to classify the subjects, as proposed in several resting-state fMRI studies (Abraham et al., 2017; Dadi et al., 2019). First, we subtract the average graph over all subjects (which in fact removes the symmetrical connections common to all subjects), and then we use a 3-Nearest Neighbors classification algorithm. We use the correlation coefficient of Equation (24) as distance metric between Laplacian matrices, and a leave-one-out cross-validation strategy. The classification accuracy of the described approach reaches 65%. This level should be compared with the performance obtained using simple correlation graphs (Abraham et al., 2017), which, on these 40 subjects, leads to an accuracy of 52.5%. It appears that in this context the use of a more sophisticated graph learning process allows a subject characterization that goes beyond basic statistical correlation effects. Interestingly, this score is also comparable with state-of-the-art results reported in Sen et al. (2018) for the same task, but on a larger database (67.3% accuracy), using more sophisticated and specially-tailored processing steps, as well as carefully chosen classifiers.
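For completeness, a minimal sketch of this classification pipeline is given below. Since the correlation coefficient of Equation (24) is not reproduced here, the sketch substitutes a plain Pearson-correlation distance between vectorized, mean-centered Laplacians; the arrays laplacians and labels are assumed inputs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def correlation_distance(a, b):
    """1 - Pearson correlation between two vectorized Laplacians (a stand-in
    for the metric of Equation (24))."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

def loo_knn_accuracy(laplacians, labels, k=3):
    """Leave-one-out k-NN accuracy; `laplacians` has shape (n_subjects, N, N)."""
    X = laplacians - laplacians.mean(axis=0)        # subtract the average graph
    X = X.reshape(len(labels), -1)
    clf = KNeighborsClassifier(n_neighbors=k, metric=correlation_distance)
    return cross_val_score(clf, X, labels, cv=LeaveOneOut()).mean()
```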
At this point, we should make absolutely clear that with this case study we only illustrate the general usefulness and potential of the graphs learned with our approach. Besides the lack of a deeper clinical evaluation of the findings, there are two more technical reasons that would limit the value of any further comparison or conclusion. First, we do not know whether our assumptions (sparsity and bandlimitedness) are the best, or even perfectly valid, for this specific problem. Second, the small size of the dataset limits the statistical significance of any comparative results based on common evaluation metrics, such as the accuracy. Nevertheless, we believe that the reported result motivates future work on ADHD or other related disorders in the direction of further exploring the potential of sophisticated graph learning techniques.
8. Conclusions
This article presented a data-driven graph learning approach that combines two assumptions. The first is standard in the related literature and concerns the smoothness of graph signals with respect to the underlying graph structure. The second is the spectral sparsity assumption, a consequence of the presence of clusters in real-world graphs. We proposed two algorithms to solve the corresponding optimization problem. The first one, IGL-3SR, effectively minimizes the objective function and has the advantage of decreasing it at each iteration. To address its low speed of convergence, we proposed FGL-3SR, a fast and scalable alternative. The findings of our empirical evaluation on synthetic data showed that the proposed approaches perform as well as or better than the reference state-of-the-art algorithms, both in terms of reconstructing the unknown underlying graph and of computational cost (running time). Experiments on real-world benchmark use-cases suggest that our algorithms learn graphs that are useful and promising for graph-based machine learning methodologies, such as graph clustering and sampling. Finally, by including the two assumptions in a probabilistic model, we link our optimization problem to a maximum a posteriori estimation and pave the way for further statistical understanding.
9. Technical proofs
This section provides the technical proofs of the different propositions exposed in the paper. Recall that lower-case variables refer to vectors or scalars, while upper-case variables denote matrices. The table below summarizes the main notations used in the technical discussion that follows.
$x^T$, $M^T$: Transpose of vector $x$, matrix $M$.
$\mathrm{tr}(M)$: Trace of matrix $M$.
$\mathrm{diag}(x)$: Diagonal matrix containing the vector $x$.
$M_{k,l}$: $(k,l)$-th element of the matrix $M$.
$M_{k,:}$: $k$-th row of $M$.
$M_{:,l}$: $l$-th column of $M$.
$M_{k:,l:}$: Submatrix containing the elements of $M$ from the $k$-th row to the last row, and from the $l$-th column to the last column.
$M \succeq 0$: $M$ is a positive semi-definite matrix.
$M^{\dagger}$: The Moore-Penrose pseudoinverse of $M$.
$e_k$: Vector containing zeros except a 1 at position $k$.
$I_n$: Identity matrix of size $n$.
$0_n$: Vector of size $n$ containing only zeros.
$1_n$: Vector of size $n$ containing only ones.
$\mathbb{1}_A(\cdot)$: The indicator function over the set $A$.
$\|x\|_0$: The number of non-zero elements of a vector $x$.
$\|\cdot\|_F$: The Frobenius norm.
$\|\cdot\|_{2,0}$: The $\ell_{2,0}$-norm, with $\|M\|_{2,0} = \sum_{i} \mathbb{1}\{\|M_{i,:}\|_2 \neq 0\}$.
$\|\cdot\|_{2,1}$: The $\ell_{2,1}$-norm, with $\|M\|_{2,1} = \sum_{i} \|M_{i,:}\|_2$.
$\nabla f$: Gradient of the function $f$.
$\langle \cdot, \cdot \rangle$: Inner product function.
$\mathrm{Orth}(N)$: The set of all orthogonal matrices of size $N \times N$.
$O(\cdot)$: Order of magnitude (e.g., of computational complexity).
$\tau$: Number of iterations needed for an optimization procedure.
Table 2: Table of notations used throughout the article.
Lemma 5 – Given $X, X_0 \in \mathbb{R}^{N\times N}$ two orthogonal matrices with first column equal to $\frac{1}{\sqrt{N}} 1_N$ (constraint (3a)), we have the following equality:
$$X = X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & [X_0^T X]_{2:,2:} \end{pmatrix},$$
with $[X_0^T X]_{2:,2:}$ denoting the submatrix of $X_0^T X$ containing everything but its first row and first column. Furthermore, remark that $[X_0^T X]_{2:,2:} \in \mathrm{Orth}(N-1)$.
Proof Let us consider $X, X_0 \in \mathbb{R}^{N\times N}$ two orthogonal matrices with first column equal to $\frac{1}{\sqrt{N}} 1_N$. We have the following equalities:
$$X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & [X_0^T X]_{2:,2:} \end{pmatrix}
= \Big( X_{0,(:,1)} \;\big|\; X_{0,(:,2:)}\, [X_0^T X]_{2:,2:} \Big)
= \Big( \tfrac{1}{\sqrt{N}} 1_N \;\big|\; X_{:,2:} \Big) = X\,.$$
Furthermore, thanks to the orthogonality of $X$ and $X_0$, we have
$$[X_0^T X]_{2:,2:} \left[ [X_0^T X]_{2:,2:} \right]^T
= X_{0,(2:,:)}^T X_{:,2:} \left[ X_{0,(2:,:)}^T X_{:,2:} \right]^T
= X_{0,(2:,:)}^T X_{:,2:}\, X_{:,2:}^T \left[ X_{0,(2:,:)}^T \right]^T = I_{N-1}\,.$$
By symmetry, we conclude that $[X_0^T X]_{2:,2:} \in \mathrm{Orth}(N-1)$.
Proposition 6 – Given $X_0 \in \mathbb{R}^{N\times N}$ an orthogonal matrix with first column equal to $\frac{1}{\sqrt{N}} 1_N$, an equivalent formulation of optimization problem (3) is given by:
$$\min_{H, U, \Lambda} \;\; \left\| Y - X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} H \right\|_F^2 + \alpha \|\Lambda^{1/2} H\|_F^2 + \beta \|H\|_{\mathcal{S}} \;\triangleq\; f(H, U, \Lambda)\,,$$
subject to
$$U^T U = I_{N-1}\,, \tag{a'}$$
$$\left( X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} \Lambda \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T \end{pmatrix} X_0^T \right)_{k,\ell} \le 0 \quad \forall\, k \neq \ell\,, \tag{b'}$$
$$\Lambda = \mathrm{diag}(0, \lambda_2, \ldots, \lambda_N) \succeq 0\,, \tag{c}$$
$$\mathrm{tr}(\Lambda) = N \in \mathbb{R}^{+}_{*}\,. \tag{d}$$
Proof From the previous lemma, we know that $X$ can be decomposed into the two orthogonal matrices $X_0$ and $U = [X_0^T X]_{2:,2:}$. Hence, we can optimize with respect to $U$ instead of $X$, and the second part of constraint (3a) is automatically satisfied. To obtain the equivalence, we simply replace $X$ in the main optimization problem by $X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix}$, where $U$ is now imposed to be orthogonal.
Proposition 8 (Closed-form solution for the $\ell_{2,0}$ and $\ell_{2,1}$-norms) – The solutions of problem (7) when $\|\cdot\|_{\mathcal{S}}$ is set to $\|\cdot\|_{2,0}$ or $\|\cdot\|_{2,1}$ are given in the following.
• Using the $\ell_{2,0}$-norm, the optimal solution of (7) is given by the matrix $\widehat{H} \in \mathbb{R}^{N\times n}$ where, for $1 \le i \le N$,
$$\widehat{H}_{i,:} = \begin{cases} 0 & \text{if } \|(X^T Y)_{i,:}\|_2^2 / (1 + \alpha\lambda_i) \le \beta\,, \\ (X^T Y)_{i,:} / (1 + \alpha\lambda_i) & \text{otherwise}\,. \end{cases}$$
• Using the $\ell_{2,1}$-norm, the optimal solution of (7) is given by the matrix $\widehat{H} \in \mathbb{R}^{N\times n}$ where, for $1 \le i \le N$,
$$\widehat{H}_{i,:} = \frac{1}{1 + \alpha\lambda_i} \left( 1 - \frac{\beta}{2}\,\frac{1}{\|(X^T Y)_{i,:}\|_2} \right)_{+} (X^T Y)_{i,:}\,,$$
where $(t)_+ \triangleq \max\{0, t\}$ is the positive part function.
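A direct NumPy transcription of these two closed forms might look as follows (a sketch; X, Y, and lam denote the current orthogonal matrix, the data, and the diagonal of $\Lambda$).

```python
import numpy as np

def update_H(X, Y, lam, alpha, beta, norm="l20"):
    """Row-wise closed-form minimizer of Proposition 8.

    X: (N, N) orthogonal matrix, Y: (N, n) data, lam: (N,) diagonal of Lambda."""
    B = X.T @ Y                                   # (X^T Y)
    scale = 1.0 + alpha * lam                     # (1 + alpha * lambda_i), shape (N,)
    row_norms = np.linalg.norm(B, axis=1)
    if norm == "l20":
        keep = row_norms ** 2 / scale > beta      # hard row-thresholding rule
        H = (B / scale[:, None]) * keep[:, None]
    else:                                         # "l21": soft row-thresholding rule
        shrink = np.maximum(0.0, 1.0 - beta / (2.0 * np.maximum(row_norms, 1e-12)))
        H = (shrink / scale)[:, None] * B
    return H
```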
Proof In the following, we suppose that $Y \neq 0$; in the trivial case $Y = 0$, the solution is simply given by $\widehat{H} = 0$.
Closed-form solution for the $\ell_{2,0}$-norm. Recalling that $\|H\|_{2,0} = \sum_{i=1}^{N} \mathbb{1}\{\|H_{i,:}\|_2 \neq 0\}$, the objective function can be written as:
\begin{align*}
f(X, \Lambda, H) &= \|X^T Y - H\|_F^2 + \alpha \|\Lambda^{1/2} H\|_F^2 + \beta \|H\|_{2,0} \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \left( \sum_{j=1}^{n} \big( H_{i,j}^2 - 2 (X^T Y)_{i,j} H_{i,j} + \alpha\lambda_i H_{i,j}^2 \big) + \beta\, \mathbb{1}\{\|H_{i,:}\|_2 \neq 0\} \right) \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \left( \|H_{i,:}\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, H_{i,:} \big\rangle + \alpha\lambda_i \|H_{i,:}\|_2^2 + \beta\, \mathbb{1}\{\|H_{i,:}\|_2 \neq 0\} \right) \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \left( (1 + \alpha\lambda_i) \|H_{i,:}\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, H_{i,:} \big\rangle + \beta\, \mathbb{1}\{\|H_{i,:}\|_2 \neq 0\} \right) \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \tilde{f}_i(X, \Lambda, H_{i,:})\,.
\end{align*}
The objective function is thus written as a sum of independent objective functions, each associated with a different $H_{i,:}$. Hence, we can optimize the problem for each $i$ separately. For a given $i$, the problem is:
$$\min_{H_{i,:} \in \mathbb{R}^n} \;\; (1 + \alpha\lambda_i) \|H_{i,:}\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, H_{i,:} \big\rangle + \beta\, \mathbb{1}\{\|H_{i,:}\|_2 \neq 0\}\,.$$
When we restrict the minimization to $\|H_{i,:}\|_2 = 0$, the unique solution is $\widehat{H}_{i,:} = 0_n$ and $\tilde{f}_i(X, \Lambda, \widehat{H}_{i,:}) = 0$.
When $\|H_{i,:}\|_2 \neq 0$, the objective function is convex and differentiable, so it suffices to set the following derivative to zero:
$$\frac{\partial}{\partial H_{i,:}} \tilde{f}_i(H_{i,:}) = 2 (1 + \alpha\lambda_i) H_{i,:} - 2 (X^T Y)_{i,:} = 0\,, \qquad \text{i.e.}\quad \widehat{H}_{i,:} = (X^T Y)_{i,:} / (1 + \alpha\lambda_i)\,.$$
With this solution, the objective function $\tilde{f}_i$ equals:
\begin{align*}
\tilde{f}_i(X, \Lambda, \widehat{H}_{i,:}) &= (1 + \alpha\lambda_i) \big\| (X^T Y)_{i,:} / (1 + \alpha\lambda_i) \big\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, (X^T Y)_{i,:} / (1 + \alpha\lambda_i) \big\rangle + \beta \\
&= \frac{1}{1 + \alpha\lambda_i} \|(X^T Y)_{i,:}\|_2^2 - \frac{2}{1 + \alpha\lambda_i} \|(X^T Y)_{i,:}\|_2^2 + \beta
= \beta - \frac{1}{1 + \alpha\lambda_i} \|(X^T Y)_{i,:}\|_2^2\,.
\end{align*}
Hence, whenever $\frac{1}{1 + \alpha\lambda_i} \|(X^T Y)_{i,:}\|_2^2 \le \beta$, this value is non-negative, which makes $\widehat{H}_{i,:} = 0_n$ a better choice for the minimization, and conversely. In conclusion, for all $1 \le i \le N$, the solution is:
$$\widehat{H}_{i,:} = \begin{cases} 0 & \text{if } \|(X^T Y)_{i,:}\|_2^2 / (1 + \alpha\lambda_i) \le \beta\,, \\ (X^T Y)_{i,:} / (1 + \alpha\lambda_i) & \text{otherwise}\,. \end{cases}$$
Closed-form solution for the $\ell_{2,1}$-norm. Similarly to the $\ell_{2,0}$ case, the objective function decomposes into a sum of independent objective functions:
\begin{align*}
f(X, \Lambda, H) &= \|X^T Y - H\|_F^2 + \alpha \|\Lambda^{1/2} H\|_F^2 + \beta \|H\|_{2,1} \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \left( \sum_{j=1}^{n} \big( H_{i,j}^2 - 2 (X^T Y)_{i,j} H_{i,j} + \alpha\lambda_i H_{i,j}^2 \big) + \beta \sqrt{\textstyle\sum_{j=1}^{n} H_{i,j}^2} \right) \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \left( \|H_{i,:}\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, H_{i,:} \big\rangle + \alpha\lambda_i \|H_{i,:}\|_2^2 + \beta \|H_{i,:}\|_2 \right) \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \left( (1 + \alpha\lambda_i) \|H_{i,:}\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, H_{i,:} \big\rangle + \beta \|H_{i,:}\|_2 \right) \\
&= \|Y\|_F^2 + \sum_{i=1}^{N} \tilde{f}_i(X, \Lambda, H_{i,:})\,.
\end{align*}
Again, we can optimize the problem independently for each row $i$ of $H$. For a given $i$, the problem is:
$$\min_{H_{i,:} \in \mathbb{R}^n} \;\; (1 + \alpha\lambda_i) \|H_{i,:}\|_2^2 - 2 \big\langle (X^T Y)_{i,:},\, H_{i,:} \big\rangle + \beta \|H_{i,:}\|_2\,. \tag{25}$$
Although non-differentiable at $H_{i,:} = 0_n$, this function is convex, and we need to find $H_{i,:}$ such that the vector $0_n$ belongs to the subdifferential of $\tilde{f}_i$, denoted by $\partial \tilde{f}_i(H_{i,:})$ and equal to:
$$\partial \tilde{f}_i(H_{i,:}) = \begin{cases} \mathcal{B}_2\big({-2}\,(X^T Y)_{i,:},\, \beta\big) & \text{if } H_{i,:} = 0_n\,, \\[4pt] 2 \left( 1 + \alpha\lambda_i + \dfrac{\beta}{2}\dfrac{1}{\|H_{i,:}\|_2} \right) H_{i,:} - 2 (X^T Y)_{i,:} & \text{otherwise}\,, \end{cases}$$
where $\mathcal{B}_2(c, r)$ stands for the $\ell_2$-norm ball of center $c$ and radius $r$.
Remark that when $\|(X^T Y)_{i,:}\|_2 \le \frac{\beta}{2}$, we have $0_n \in \mathcal{B}_2\big({-2}\,(X^T Y)_{i,:}, \beta\big)$, and thus in this case $\widehat{H}_{i,:} = 0_n$.
On the contrary, when $\|(X^T Y)_{i,:}\|_2 > \frac{\beta}{2}$, we must find $H_{i,:}$ such that:
$$\left( 1 + \alpha\lambda_i + \frac{\beta}{2}\frac{1}{\|H_{i,:}\|_2} \right) H_{i,:} = (X^T Y)_{i,:}\,.$$
Taking the norm of the previous equation, we obtain
\begin{align*}
\left( 1 + \alpha\lambda_i + \frac{\beta}{2}\frac{1}{\|H_{i,:}\|_2} \right) \|H_{i,:}\|_2 &= \|(X^T Y)_{i,:}\|_2 \\
\Longleftrightarrow \quad (1 + \alpha\lambda_i) \|H_{i,:}\|_2 + \frac{\beta}{2} &= \|(X^T Y)_{i,:}\|_2 \\
\Longleftrightarrow \quad \|H_{i,:}\|_2 &= \Big( \|(X^T Y)_{i,:}\|_2 - \frac{\beta}{2} \Big) \Big/ (1 + \alpha\lambda_i) \;>\; 0\,.
\end{align*}
We can now substitute $\|H_{i,:}\|_2$ back into the initial equation and recover $H_{i,:}$:
\begin{align*}
\left( 1 + \alpha\lambda_i + \frac{\beta (1 + \alpha\lambda_i)}{2 \|(X^T Y)_{i,:}\|_2 - \beta} \right) H_{i,:}
= \frac{(1 + \alpha\lambda_i) \|(X^T Y)_{i,:}\|_2}{\|(X^T Y)_{i,:}\|_2 - \beta/2}\, H_{i,:} &= (X^T Y)_{i,:} \\
\Longleftrightarrow \quad H_{i,:} = \frac{\|(X^T Y)_{i,:}\|_2 - \beta/2}{(1 + \alpha\lambda_i) \|(X^T Y)_{i,:}\|_2}\, (X^T Y)_{i,:}
&= \frac{1}{1 + \alpha\lambda_i} \left( 1 - \frac{\beta}{2}\frac{1}{\|(X^T Y)_{i,:}\|_2} \right) (X^T Y)_{i,:}\,,
\end{align*}
which concludes the proof.
Proposition 10 (Euclidean gradient with respect to U) – The Euclidean gradients of $f$ and $\phi$ with respect to $U$ are
$$\nabla_U f(H, U, \Lambda) = -2\left[(H Y^T X_0)_{2:,2:}\right]^T + 2\, U\, (H H^T)_{2:,2:}\,,$$
$$\nabla_U \phi(U, \Lambda) = -\sum_{k=1}^{N-1} \sum_{\ell > k}^{N} \frac{\big(B_{k,\ell} + B_{k,\ell}^T\big)\, U\, \Lambda_{2:,2:}}{h(U,\Lambda)_{k,\ell}}\,,$$
with $B_{k,\ell} = \big(X_0^T e_k e_\ell^T X_0\big)_{2:,2:}$ for all $1 \le k, \ell \le N$, and $h(\cdot)$ from Definition 7.
Proof We begin by computing the gradient of the main objective with respect to $U$. Recall the $U$-dependent part of the objective function:
$$f(H, U, \Lambda) = -2\,\mathrm{tr}\!\left(Y^T X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} H\right) + \mathrm{tr}\!\left(H^T \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T U \end{pmatrix} H\right).$$
The corresponding gradient is the following:
\begin{align*}
\nabla_U f(H, U, \Lambda)
&= -2\,\nabla_U\, \mathrm{tr}\!\left(Y^T X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} H\right) + \nabla_U\, \mathrm{tr}\!\left(H^T \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T U \end{pmatrix} H\right) \\
&= -2\,\nabla_U\, \mathrm{tr}\!\left(H Y^T X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix}\right) + \nabla_U\, \mathrm{tr}\!\left(H H^T \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T U \end{pmatrix}\right) \\
&= -2\,\nabla_U \Big( (H Y^T X_0)_{1,1}\cdot 1 + \mathrm{tr}\big((H Y^T X_0)_{2:,2:}\, U\big) \Big) + \nabla_U \Big( (H H^T)_{1,1}\cdot 1 + \mathrm{tr}\big((H H^T)_{2:,2:}\, U^T U\big) \Big) \\
&= -2\left[(H Y^T X_0)_{2:,2:}\right]^T + 2\, U\, (H H^T)_{2:,2:}\,.
\end{align*}
We now derive the gradient of the barrier function $\phi(U, \Lambda)$ with respect to $U$:
$$\nabla_U \phi(U, \Lambda) = -\sum_{k=1}^{N-1} \sum_{\ell > k}^{N} \nabla_U \log\big({-h(U,\Lambda)_{k,\ell}}\big)
= -\sum_{k=1}^{N-1} \sum_{\ell > k}^{N} \frac{1}{h(U,\Lambda)_{k,\ell}}\, \nabla_U\, h(U,\Lambda)_{k,\ell}\,.$$
We can write the $h$ function as:
\begin{align*}
h(U,\Lambda)_{k,\ell}
&= \Big\langle e_k e_\ell^T,\; h(U,\Lambda) \Big\rangle
= \Big\langle X_0^T e_k e_\ell^T X_0,\; \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U \end{pmatrix} \Lambda \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & U^T \end{pmatrix} \Big\rangle \\
&= \Big\langle X_0^T e_k e_\ell^T X_0,\; \begin{pmatrix} \lambda_1 & 0_{N-1}^T \\ 0_{N-1} & U \Lambda_{2:,2:} U^T \end{pmatrix} \Big\rangle
= \mathrm{tr}\!\left( X_0^T e_\ell e_k^T X_0 \begin{pmatrix} 0 & 0_{N-1}^T \\ 0_{N-1} & U \Lambda_{2:,2:} U^T \end{pmatrix} \right) \\
&= \big(X_0^T e_\ell e_k^T X_0\big)_{1,1}\cdot 0 + \mathrm{tr}\!\left( \big(X_0^T e_\ell e_k^T X_0\big)_{2:,2:}\, U \Lambda_{2:,2:} U^T \right)
= \mathrm{tr}\!\left( B_{k,\ell}^T\, U \Lambda_{2:,2:} U^T \right).
\end{align*}
In conclusion, we have $\nabla_U\, h(U,\Lambda)_{k,\ell} = \big(B_{k,\ell} + B_{k,\ell}^T\big)\, U \Lambda_{2:,2:}$, which finishes the proof.
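The expression for $\nabla_U f$ can be sanity-checked with a few finite differences, as in the sketch below; all dimensions and random inputs are arbitrary and only serve to exercise the formula (this checks the Euclidean gradient only, before any Riemannian projection).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 12
# Orthogonal X0 with first column 1/sqrt(N) * 1_N, random orthogonal U, random Y and H.
A = rng.standard_normal((N, N)); A[:, 0] = 1.0
X0, _ = np.linalg.qr(A); X0[:, 0] *= np.sign(X0[0, 0])
U, _ = np.linalg.qr(rng.standard_normal((N - 1, N - 1)))
Y, H = rng.standard_normal((N, n)), rng.standard_normal((N, n))

def blkdiag1(U):
    """Block-diagonal matrix diag(1, U)."""
    B = np.eye(U.shape[0] + 1); B[1:, 1:] = U
    return B

def f_of_U(U):
    # U-dependent part of the objective (the alpha and beta terms do not involve U).
    return np.linalg.norm(Y - X0 @ blkdiag1(U) @ H, "fro") ** 2

grad = -2.0 * (H @ Y.T @ X0)[1:, 1:].T + 2.0 * U @ (H @ H.T)[1:, 1:]

eps = 1e-6
for i, j in [(0, 0), (2, 3), (4, 1)]:
    E = np.zeros_like(U); E[i, j] = 1.0
    fd = (f_of_U(U + eps * E) - f_of_U(U - eps * E)) / (2.0 * eps)
    assert np.isclose(fd, grad[i, j], rtol=1e-4, atol=1e-6)
```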
Proposition 11 (Feasible eigenvalues) – Given any orthogonal matrix $X \in \mathbb{R}^{N\times N}$ with first column equal to $\frac{1}{\sqrt{N}} 1_N$ (constraint (3a)), there always exists a matrix $\Lambda \in \mathbb{R}^{N\times N}$ such that the following constraints are satisfied:
$$(X \Lambda X^T)_{i,j} \le 0 \quad \forall\, i \neq j\,, \tag{3b}$$
$$\Lambda = \mathrm{diag}(0, \lambda_2, \ldots, \lambda_N) \succeq 0\,, \tag{3c}$$
$$\mathrm{tr}(\Lambda) = c \in \mathbb{R}^{+}_{*}\,. \tag{3d}$$
Proof Let us consider a positive real value $c > 0$. Taking $\Lambda = \mathrm{diag}(0, c, \ldots, c)/(N-1)$ leads to $\mathrm{tr}(\Lambda) = c$ and, for all $i \neq j$, $(X \Lambda X^T)_{i,j} = -c/(N(N-1)) < 0$. However, this solution with constant non-zero eigenvalues actually corresponds to the complete graph. For our purpose, it is the worst-case scenario, as it contains no structural information about the nodes.
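A short numerical illustration of this feasible point (the values of $N$ and $c$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 5, 3.0
A = rng.standard_normal((N, N)); A[:, 0] = 1.0
X, _ = np.linalg.qr(A); X[:, 0] *= np.sign(X[0, 0])   # orthogonal, first column 1/sqrt(N) * 1_N
Lam = np.diag([0.0] + [c / (N - 1)] * (N - 1))

L = X @ Lam @ X.T
off_diag = L[~np.eye(N, dtype=bool)]
print(np.isclose(np.trace(Lam), c))                   # True: trace constraint satisfied
print(np.allclose(off_diag, -c / (N * (N - 1))))      # True: constant negative off-diagonal
```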
Proposition 12 (Closed-form solution of problem (13)) – Consider the optimization problem (13). Let $X_0$ be any matrix that belongs to the constraint set (a), and let $M = (X_0^T Y H^T)_{2:,2:}$ be the submatrix containing everything but the input's first row and first column. Finally, let $P D Q^T$ be the SVD of $M$. Then, the problem admits the following closed-form solution:
$$\widehat{X} = X_0 \begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & P Q^T \end{pmatrix}.$$
Proof One can observe that the relaxed optimization problem is equivalent to finding:
$$\widehat{G} = \operatorname*{argmin}_{G} \;\; \Big\| Y - X_0 \underbrace{\begin{pmatrix} 1 & 0_{N-1}^T \\ 0_{N-1} & G \end{pmatrix}}_{\triangleq\, \widetilde{G}}\, H \Big\|_F^2\,, \qquad \text{s.t. } G^T G = I_{N-1}\,. \tag{26}$$
This is obtained by replacing $X$ with $X_0 \widetilde{G}$.
Solving the above Equation (26) is equivalent to finding:
$$\widehat{G} = \operatorname*{argmax}_{G} \;\; \mathrm{tr}\big( H Y^T X_0 \widetilde{G} \big) = \operatorname*{argmax}_{G} \;\; \mathrm{tr}\big( M^T G \big)\,, \qquad \text{s.t. } G^T G = I_{N-1}\,.$$
Then, as proved in Zou et al. (2006), we finally have $\widehat{G} = P Q^T$, which completes the proof.
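In code, this closed form amounts to a single SVD; the helper below is a sketch of the corresponding X-update, not the authors' released implementation.

```python
import numpy as np

def update_X(Y, H, X0):
    """Closed-form minimizer of Proposition 12: X_hat = X0 * blkdiag(1, P Q^T),
    where P D Q^T is the SVD of M = (X0^T Y H^T)[1:, 1:]."""
    N = X0.shape[0]
    M = (X0.T @ Y @ H.T)[1:, 1:]
    P, _, Qt = np.linalg.svd(M)
    B = np.eye(N)
    B[1:, 1:] = P @ Qt
    return X0 @ B
```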
Lemma 14 – Assume the proposed Model (16). If $p_1 = 0$ and $p_i \in (0,1)$, $\forall i \ge 2$, then
$$-\log\big(p(h \mid y, X, \Lambda)\big) \;\propto\; \frac{1}{2\sigma^2}\,\|y - X h\|_2^2 + \frac{1}{2}\, h^T \Lambda h
+ \sum_{i=1}^{N} \mathbb{1}\{h_i \neq 0\} \left( p_i \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big) - \log(p_i) - \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big) \right).$$
Proof Based on the Factor Analysis model and the independence of the $h_i$'s,
\begin{align*}
\log\big(p(h \mid y, X, \Lambda)\big) &\propto \log\big(p(y \mid h, X, \Lambda)\big) + \log\big(p(h \mid X, \Lambda)\big) \\
&\propto -\frac{1}{2\sigma^2}\|y - X h\|_2^2 + \sum_{i=1}^{N} \log\big(p(h_i \mid \lambda_i)\big)\,. \tag{27}
\end{align*}
Let us now focus on $\log\big(p(h_i \mid \lambda_i)\big)$, for which we have:
\begin{align*}
\log\big(p(h_i \mid \lambda_i)\big)
&= \log\Big( \sum_{\gamma_i \in \{0,1\}} p(h_i, \gamma_i \mid \lambda_i) \Big)
= \log\Big( \sum_{\gamma_i \in \{0,1\}} p(h_i, \gamma_i \mid \lambda_i)\, \frac{p(\gamma_i \mid h_i, \lambda_i)}{p(\gamma_i \mid h_i, \lambda_i)} \Big) \\
&\underset{(=)}{\ge} \sum_{\gamma_i \in \{0,1\}} p(\gamma_i \mid h_i, \lambda_i)\, \log \frac{p(h_i, \gamma_i \mid \lambda_i)}{p(\gamma_i \mid h_i, \lambda_i)}\,.
\end{align*}
The last inequality is obtained using the concavity of the logarithm and Jensen's inequality; in this particular case it is in fact an equality. Then we have:
$$\log\big(p(h_i \mid \lambda_i)\big)
= \underbrace{\sum_{\gamma_i \in \{0,1\}} p(\gamma_i \mid h_i, \lambda_i)\, \log\big(p(h_i, \gamma_i \mid \lambda_i)\big)}_{(\star)}
\;-\; \underbrace{\sum_{\gamma_i \in \{0,1\}} p(\gamma_i \mid h_i, \lambda_i)\, \log\big(p(\gamma_i \mid h_i, \lambda_i)\big)}_{(\star\star)}\,.$$
Before computing the previous two sums, we need to observe that:
$$p(\gamma_i = 1 \mid h_i) = \begin{cases} 1 & \text{if } h_i \neq 0\,, \\ p_i & \text{if } h_i = 0\,. \end{cases}$$
We can now compute $(\star)$ and $(\star\star)$ as follows:
\begin{align*}
(\star) &= \sum_{\gamma_i \in \{0,1\}} p(\gamma_i \mid h_i, \lambda_i)\, \big[ \log\big(p(h_i \mid \gamma_i, \lambda_i)\big) + \log\big(p(\gamma_i \mid \lambda_i)\big) \big] \\
&= \big( \mathbb{1}\{h_i \neq 0\} + p_i\, \mathbb{1}\{h_i = 0\} \big) \left( \log\frac{\lambda_i}{\sqrt{2\pi}} - \frac{1}{2}\lambda_i h_i^2 + \log(p_i) \right)
+ (1 - p_i)\, \mathbb{1}\{h_i = 0\} \big( \log \mathbb{1}\{h_i = 0\} + \log(1 - p_i) \big)\,, \\
(\star\star) &= \big[ p_i \log(p_i) + (1 - p_i)\log(1 - p_i) \big]\, \mathbb{1}\{h_i = 0\}\,.
\end{align*}
Finally, we can compute $\log\big(p(h_i \mid \lambda_i)\big)$:
\begin{align*}
\log\big(p(h_i \mid \lambda_i)\big) &= (\star) - (\star\star) \\
&= \mathbb{1}\{h_i \neq 0\} \left( \log\frac{\lambda_i}{\sqrt{2\pi}} - \frac{1}{2}\lambda_i h_i^2 + \log(p_i) \right) + p_i \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big)\, \mathbb{1}\{h_i = 0\} \\
&= \mathbb{1}\{h_i \neq 0\} \left( \log\frac{\lambda_i}{\sqrt{2\pi}} + \log(p_i) - p_i \log\frac{\lambda_i}{\sqrt{2\pi}} \right) + p_i \log\frac{\lambda_i}{\sqrt{2\pi}} - \frac{1}{2}\lambda_i h_i^2 \\
&\propto \mathbb{1}\{h_i \neq 0\} \left( \log\frac{\lambda_i}{\sqrt{2\pi}} + \log(p_i) - p_i \log\frac{\lambda_i}{\sqrt{2\pi}} \right) - \frac{1}{2}\lambda_i h_i^2\,.
\end{align*}
Note that, with our parametrization, the particular case $i = 1$ leads to $\log\big(p(h_1 \mid \lambda_1)\big) = 0$. Now, plugging this result into equation (27) and multiplying both sides by $-1$, we get the final result.
Proposition 16 (A posteriori distribution of h) – Let $C > 0$, and assume for all $i \ge 2$ that $p_i = e^{-C}$ if $\lambda_i = \sqrt{2\pi}$, and $p_i = -W\!\left( -\frac{e^{-C} \log(\lambda_i/\sqrt{2\pi})}{\lambda_i/\sqrt{2\pi}} \right) \Big/ \log(\lambda_i/\sqrt{2\pi})$ otherwise. Then, $p_i \in (0,1)$ and there exist constants $\alpha, \beta > 0$ such that:
$$-\log\big(p(h \mid y, X, \Lambda)\big) \;\propto\; \|y - X h\|_2^2 + \alpha\, h^T \Lambda h + \beta\, \|h\|_0\,.$$
Proof To show that the $p_i$'s are well-defined and belong to $(0,1)$, it suffices to apply Lemma 17 with $x = \lambda_i/\sqrt{2\pi}$.
We now prove the main result of the proposition. If $\lambda_i = \sqrt{2\pi}$, then $p_i = e^{-C} < 1$ and
$$p_i \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big) - \log(p_i) - \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big) = -\log(p_i) = C\,.$$
If $\lambda_i \neq \sqrt{2\pi}$, then $-p_i \log(\lambda_i/\sqrt{2\pi}) = W\!\left( -\frac{e^{-C}\log(\lambda_i/\sqrt{2\pi})}{\lambda_i/\sqrt{2\pi}} \right)$. Since $W$ is the inverse function of $f(W) = W e^{W}$, we have:
\begin{align*}
-p_i \log(\lambda_i/\sqrt{2\pi})\, e^{-p_i \log(\lambda_i/\sqrt{2\pi})} &= -\frac{e^{-C}\log(\lambda_i/\sqrt{2\pi})}{\lambda_i/\sqrt{2\pi}} \\
\Longleftrightarrow \quad p_i \log(\lambda_i/\sqrt{2\pi})\, e^{-p_i \log(\lambda_i/\sqrt{2\pi})} &= \frac{e^{-C}\log(\lambda_i/\sqrt{2\pi})}{\lambda_i/\sqrt{2\pi}} \\
\Longleftrightarrow \quad \log\big(p_i \log(\lambda_i/\sqrt{2\pi})\big) - p_i \log(\lambda_i/\sqrt{2\pi}) &= \log\!\left( \frac{e^{-C}\log(\lambda_i/\sqrt{2\pi})}{\lambda_i/\sqrt{2\pi}} \right) \\
\Longleftrightarrow \quad \log(p_i) + \log\big(\log(\lambda_i/\sqrt{2\pi})\big) - p_i \log(\lambda_i/\sqrt{2\pi}) &= -C + \log\big(\log(\lambda_i/\sqrt{2\pi})\big) - \log(\lambda_i/\sqrt{2\pi})\,.
\end{align*}
As in the case $\lambda_i = \sqrt{2\pi}$, the final equality gives us:
$$p_i \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big) - \log(p_i) - \log\Big(\frac{\lambda_i}{\sqrt{2\pi}}\Big) = C\,. \tag{28}$$
Plugging equation (28) into the final result of Lemma 14, we obtain:
\begin{align*}
-\log\big(p(h \mid y, X, \Lambda)\big) &\propto \frac{1}{2\sigma^2}\|y - X h\|_2^2 + \frac{1}{2}\, h^T \Lambda h + C\, \|h\|_0 \\
&\propto \|y - X h\|_2^2 + \alpha\, h^T \Lambda h + \beta\, \|h\|_0\,,
\end{align*}
taking $\alpha = \sigma^2$ and $\beta = 2C\sigma^2$. This concludes the proof.
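The choice of $p_i$ and identity (28) can be checked numerically with SciPy's Lambert W implementation (the values of $C$ and $\lambda_i$ below are arbitrary):

```python
import numpy as np
from scipy.special import lambertw

C = 1.0
for lam in [0.5, 2.0, 10.0]:
    t = lam / np.sqrt(2 * np.pi)                 # t = lambda_i / sqrt(2*pi)
    if np.isclose(t, 1.0):
        p = np.exp(-C)
    else:
        p = -np.real(lambertw(-np.exp(-C) * np.log(t) / t)) / np.log(t)
    assert 0.0 < p < 1.0                          # p_i is a valid probability
    assert np.isclose(p * np.log(t) - np.log(p) - np.log(t), C)   # identity (28)
```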
Lemma 17 – Let $C > 0$. For any $x > 0$,
$$0 \;\le\; -\,W\!\left( -\frac{e^{-C}\log(x)}{x} \right) \Big/ \log(x) \;\le\; 1\,. \tag{29}$$
Proof First, we show that the function is decreasing for $x > 0$. Its derivative is given by
$$\frac{\partial}{\partial x}\left( -\,\frac{W\!\left(-\frac{e^{-C}\log(x)}{x}\right)}{\log(x)} \right)
= \frac{W\!\left(-\frac{e^{-C}\log(x)}{x}\right)\left( W\!\left(-\frac{e^{-C}\log(x)}{x}\right) + \log(x) \right)}
{x \log^2(x) \left( W\!\left(-\frac{e^{-C}\log(x)}{x}\right) + 1 \right)}\,. \tag{30}$$
For $x > 0$ and $C > 0$,
$$-1/e \;<\; -e^{-(C+1)} \;=\; \min_{x > 0}\left( -\frac{e^{-C}\log(x)}{x} \right) \;\le\; -\frac{e^{-C}\log(x)}{x}\,. \tag{31}$$
As $W(\cdot)$ is strictly increasing for arguments greater than $-1/e$, we have $W\!\left(-\frac{e^{-C}\log(x)}{x}\right) > W(-1/e) = -1$. Hence, the denominator of (30) is always positive.
For $0 < x \le 1$, $W\!\left(-\frac{e^{-C}\log(x)}{x}\right)$ is positive. Furthermore,
$$-\frac{e^{-C}\log(x)}{x} < -\frac{\log(x)}{x}
\;\Longleftrightarrow\; W\!\left(-\frac{e^{-C}\log(x)}{x}\right) < W\!\left(-\frac{\log(x)}{x}\right) = -\log(x) \tag{32}$$
$$\;\Longleftrightarrow\; W\!\left(-\frac{e^{-C}\log(x)}{x}\right) + \log(x) < 0\,. \tag{33}$$
Hence, when $0 < x \le 1$, the numerator of (30) is negative.
For $1 < x \le e$, $W\!\left(-\frac{e^{-C}\log(x)}{x}\right)$ is negative. Furthermore,
$$-\frac{1}{e} \le -\frac{\log(x)}{x} < -\frac{e^{-C}\log(x)}{x}
\;\Longleftrightarrow\; W\!\left(-\frac{\log(x)}{x}\right) = -\log(x) < W\!\left(-\frac{e^{-C}\log(x)}{x}\right) \tag{34}$$
$$\;\Longleftrightarrow\; W\!\left(-\frac{e^{-C}\log(x)}{x}\right) + \log(x) > 0\,. \tag{35}$$
Hence, when $1 < x \le e$, the numerator of (30) is again negative.
For $x > e$, $W\!\left(-\frac{e^{-C}\log(x)}{x}\right)$ is negative. Furthermore, $W\!\left(-\frac{e^{-C}\log(x)}{x}\right) > -1$ and $\log(x) > 1$, so their sum is positive and the numerator of (30) is negative once more.
We have thus shown that the derivative is negative for $x > 0$; hence, the function is decreasing on this interval. We now go back to the initial inequality (29). The left part of the inequality is straightforward: for $x$ large enough, the function is the product of two positive functions, and since it is decreasing, the lower bound follows.
For the upper bound, recall that for $y > e$ we have the inequality $W(y) < \log(y)$ (Hoorfar and Hassani, 2007). Let $f(x) = -\frac{e^{-C}\log(x)}{x}$; for $x$ small enough we have:
\begin{align*}
W(f(x)) < \log(f(x)) \;&\Longleftrightarrow\; -W(f(x)) > -\log(f(x)) \\
&\Longleftrightarrow\; -W(f(x))/\log(x) < -\log(f(x))/\log(x)\,.
\end{align*}
Taking the limit as $x \to 0^+$ concludes the proof:
\begin{align*}
\lim_{x \to 0^+} -\log(f(x))/\log(x)
&= \lim_{x \to 0^+} -\log\!\left( -\frac{e^{-C}\log(x)}{x} \right) \Big/ \log(x) \\
&= \lim_{x \to 0^+} -\Big( \log(e^{-C}) + \log(-\log(x)) - \log(x) \Big) \Big/ \log(x) \\
&= \lim_{x \to 0^+} \left( \frac{C}{\log(x)} + \frac{\log(\log(1/x))}{\log(1/x)} + 1 \right) = 1\,.
\end{align*}
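A quick numerical check of the bound (29) over a grid of $x$ values, avoiding $x = 1$ where the expression is of the form 0/0 (the grid and the value of $C$ are arbitrary):

```python
import numpy as np
from scipy.special import lambertw

C = 1.0
x = np.concatenate([np.linspace(1e-3, 0.999, 500), np.linspace(1.001, 50.0, 500)])
g = -np.real(lambertw(-np.exp(-C) * np.log(x) / x)) / np.log(x)
print(g.min() >= 0.0 and g.max() <= 1.0)   # expected: True, as stated by Lemma 17
```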
References
A. Abraham, F. Pedregosa, M. Eickenberg, P. Gervais, A. Mueller, J. Kossaifi, A. Gramfort,
B. Thirion, and G. Varoquaux. Machine learning for neuroimaging with scikit-learn.
Frontiers in Neuroinformatics, 8:14, 2014.
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras, B. Thirion, and
G. Varoquaux. Deriving reproducible biomarkers from multi-site resting-state data: an
autism-based example. NeuroImage, 147:736–745, 2017.
P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds.
Princeton University Press, 2009.
A. Anis, A. Gadde, and A. Ortega. Towards a sampling theorem for signals on arbitrary
graphs. In Proceedings of the IEEE International Conference on Acoustics, Speech and
Signal Processing, pages 3864–3868, 2014.
A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in
Neural Information Processing Systems, pages 41–48, 2007.
R. Arora. On learning rotations. In Advances in Neural Information Processing Systems,
pages 55–63, 2009.
O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maxi-
mum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine
Learning Research, 9:485–516, 2008.
P. Bellec, C. Chu, F. Chouinard-Decorte, Y. Benhajali, D. S. Margulies, and R. C. Craddock.
The neuro bureau ADHD-200 preprocessed repository. Neuroimage, 144:275–286, 2017.
S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
S. Boyd and L. Vandenberghe. Introduction to applied linear algebra: vectors, matrices,
and least squares. Cambridge University Press, 2018.
C. A. Boyle, S. Boulet, L. A. Schieve, R. A. Cohen, S. J. Blumberg, M. Yeargin-Allsopp,
S. Visser, and M. D. Kogan. Trends in the prevalence of developmental disabilities in US
children, 1997–2008. Pediatrics, 127(6):1034–1042, 2011.
S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
S. Chen, A. Sandryhaila, and J. Kovačević. Sampling theory for graph signals. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3392–3396, 2015.
S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero. Learning sparse graphs under smoothness
prior. In Proceedings of the IEEE International Conference on Acoustics, Speech and
Signal Processing, pages 6508–6512, 2017.
F. R. K. Chung and F. C. Graham. Spectral graph theory. Number 92. American Mathe-
matical Society, 1997.
A. K. Cline and I. S. Dhillon. Computation of the singular value decomposition. 2006.
K. Dadi, M. Rahim, A. Abraham, D. Chyzhyk, M. Milham, B. Thirion, G. Varoquaux,
Alzheimer’s Disease Neuroimaging Initiative, et al. Benchmarking functional connectome-
based predictive models for resting-state fMRI. Neuroimage, 192:115–134, 2019.
S. I. Daitch, J. A. Kelner, and D. A. Spielman. Fitting a graph to vector data. In Proceedings
of the International Conference on Machine Learning, pages 201–208, 2009.
J. S. Damoiseaux, S. Rombouts, F. Barkhof, P. Scheltens, C. J. Stam, S. M. Smith, and C. F. Beckmann. Consistent resting-state networks across healthy subjects. Proceedings of the National Academy of Sciences, 103(37):13848–13853, 2006.
S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17:1–5, 2016.
P. M. Djuric and C. Richard, editors. Cooperative and Graph Signal Processing – Principles
and Applications. Elsevier, 2018.
X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst. Learning Laplacian matrix in
smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):
6160–6173, 2016.
X. Dong, D. Thanou, M. Rabbat, and P. Frossard. Learning graphs from data: A signal
representation perspective. preprint arXiv:1806.00848, 2018.
N. Du, L. Song, M. Yuan, and A. J. Smola. Learning networks of heterogeneous influence.
In Advances in Neural Information Processing Systems, pages 2780–2788, 2012.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning, pages 272–279, 2008.
A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality
constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
H. E. Egilmez, E. Pavez, and A. Ortega. Graph learning from data under structural and
Laplacian constraints. preprint arXiv:1611.05181, 2016.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the
graphical lasso. Biostatistics, 9(3):432–441, 2008.
M. Gomez-Rodriguez, L. Song, H. Daneshmand, and B. Schölkopf. Estimating diffusion networks: Recovery conditions, sample complexity & soft-thresholding algorithm. Journal of Machine Learning Research, 17:3092–3120, 2016.
M. Hecker, S. Lambeck, S. Toepfer, E. Van Someren, and R. Guthke. Gene regulatory
network inference: data integration in dynamic models — A review. Biosystems, 96(1):
86–103, 2009.
A. Hoorfar and M. Hassani. Approximation of the Lambert W function and hyperpower function. Research report collection, 10(2), 2007.
A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
V. Kalofolias. How to learn a graph from smooth signals. In Proceedings of the Conference
on Artificial Intelligence and Statistics, pages 920–929, 2016.
D. Koller, N. Friedman, and F. Bach. Probabilistic graphical models: principles and tech-
niques. MIT press, 2009.
S. Kumar, J. Ying, J. V. de Cardoso, D. P. Palomar, et al. Structured graph learning via Laplacian spectral constraints. In Advances in Neural Information Processing Systems, 2019.
B. Le Bars, P. Humbert, L. Oudre, and A. Kalogeratos. Learning Laplacian matrix from
bandlimited graph signals. In IEEE International Conference on Acoustics, Speech and
Signal Processing, pages 2937–2941, 2019.
P.-E. Maingé. Strong convergence of projected subgradient methods for nonsmooth and nonstrictly convex minimization. Set-Valued Analysis, 16(7-8):899–912, 2008.
A. G. Marques, S. Segarra, G. Leus, and A. Ribeiro. Sampling of graph signals with
successive local aggregations. IEEE Transactions on Signal Processing, 64(7):1832–1843,
2016.
G. Meyer. Geometric optimization algorithms for linear regression on fixed-rank matrices.
PhD thesis, 2011.
S. K. Narang, A. Gadde, and A. Ortega. Signal processing techniques for interpolation in graph structured data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5445–5449, 2013.
A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2001.
F. Nie, X. Wang, M. I. Jordan, and H. Huang. The constrained Laplacian rank algorithm for graph-based clustering. In AAAI Conference on Artificial Intelligence, 2016.
B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat. Characterization
and inference of graph diffusion processes from observations of stationary signals. IEEE
Transactions on Signal and Information Processing over Networks, 2017.
S. Ravishankar and Y. Bresler. Learning sparsifying transforms. IEEE Transactions on
Signal Processing, 61(5):1072–1086, 2012.
M. G. Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In Proceedings of the International Conference on Machine Learning, pages 561–568, 2011.
S. Sardellitti, S. Barbarossa, and P. Di Lorenzo. Graph topology inference based on spar-
sifying transform learning. IEEE Transactions on Signal Processing, 67(7):1712–1727,
2019.
B. Sen, N. C. Borle, R. Greiner, and M. R. G. Brown. A general prediction model for the detection of ADHD and autism using structural and functional MRI. PLoS ONE, 13(4):e0194856, 2018.
U. Shalit and G. Chechik. Coordinate-descent for learning orthogonal matrices through Givens rotations. In Proceedings of the International Conference on Machine Learning, pages 548–556, 2014.
D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging
field of signal processing on graphs: Extending high-dimensional data analysis to networks
and other irregular domains. Signal Processing Magazine, 30(3):83–98, 2013.
S. M. Smith, P. T. Fox, K. L. Miller, D. C. Glahn, P. M. Fox, C. E. Mackay, N. Filippini,
K. E. Watkins, R. Toro, A. R. Laird, et al. Correspondence of the brain’s functional
architecture during activation and rest. Proceedings of the National Academy of Sciences,
106(31):13040–13045, 2009.
D. A. Tarzanagh and G. Michailidis. Estimation of graphical models through structured
norm minimization. Journal of Machine Learning Research, 18, 2018.
D. Thanou, X. Dong, D. Kressner, and P. Frossard. Learning heat diffusion graphs. IEEE
Transactions on Signal and Information Processing over Networks, 3(3):484–499, 2017.
J. Townsend, N. Koep, and S. Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17:1–5, 2016.
D. Valsesia, G. Fracastoro, and E. Magli. Sampling of graph signals via randomized local
aggregations. preprint arXiv:1804.06182, 2018.
L. Vandenberghe. The CVXOPT linear and quadratic cone program solvers. 2010.
G. Varoquaux, A. Gramfort, F. Pedregosa, V. Michel, and B. Thirion. Multi-subject dic-
tionary learning to segment an atlas of brain spontaneous activity. In Proceedings of the
Biennial International Conference on Information Processing in Medical Imaging, pages
562–573. Springer, 2011.
J. Wang and M. Kolar. Inference for high-dimensional exponential family graphical models.
In Artificial Intelligence and Statistics, pages 1042–1050, 2016.
J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. Mills Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, Cancer Genome Atlas Research Network, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113, 2013.
L. H. William, Y. Rex, and J. Leskovec. Representation learning on graphs: Methods and
applications. preprint arXiv:1709.05584, 2017.
E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via univariate exponential
family distributions. Journal of Machine Learning Research, 16:3813–3847, 2015.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67,
2006.
H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of
Computational and Graphical Statistics, 15(2):265–286, 2006.