ARTICLE Communicated by Sebastian Seung
Why Do Similarity Matching Objectives Lead
to Hebbian/Anti-Hebbian Networks?
Cengiz Pehlevan
cpehlevan@atironinstitute.org
Center for Computational Biology, Flatiron Institute, New York,
NY 10010, U.S.A.
Anirvan M. Sengupta
anirvans@physics.rutgers.edu
Center for Computational Biology, Flatiron Institute, New York,
NY 10010, U.S.A., and Physics and Astronomy Department,
Rutgers University, New Brunswick, NJ 08901, U.S.A.
Dmitri B. Chklovskii
dchklovskii@atironinstitute.org
Center for Computational Biology, Flatiron Institute, New York, NY 10010, U.S.A.,
and NYU Langone Medical Center, New York 10016, U.S.A.
Modeling self-organization of neural networks for unsupervised learn-
ing using Hebbian and anti-Hebbian plasticity has a long history in
neuroscience. Yet derivations of single-layer networks with such local
learning rules from principled optimization objectives became possible
only recently, with the introduction of similarity matching objectives.
What explains the success of similarity matching objectives in deriving
neural networks with local learning rules? Here, using dimensionality
reduction as an example, we introduce several variable substitutions
that illuminate the success of similarity matching. We show that the full
network objective may be optimized separately for each synapse us-
ing local learning rules in both the ofine and online settings. We for-
malize the long-standing intuition of the rivalry between Hebbian and
anti-Hebbian rules by formulating a min-max optimization problem. We
introduce a novel dimensionality reduction objective using fractional
matrix exponents. To illustrate the generality of our approach, we ap-
ply it to a novel formulation of dimensionality reduction combined with
whitening. We conrm numerically that the networks with learning rules
derived from principled objectives perform better than those with heuris-
tic learning rules.
Neural Computation 30, 84–124 (2018) © 2017 Massachusetts Institute of Technology
doi:10.1162/NECO_a_01018
1 Introduction
The human brain generates complex behaviors via the dynamics of electrical activity in a network of approximately $10^{11}$ neurons, each making about $10^4$ synaptic connections. As there is no known centralized authority determining which specific connections a neuron makes or specifying the weights of individual synapses, synaptic connections must be established based on local rules. Therefore, a major challenge in neuroscience is to determine local synaptic learning rules that would ensure that the network acts coherently, that is, to guarantee robust network self-organization.
Much work has been devoted to the self-organization of neural net-
works for solving unsupervised computational tasks using Hebbian and
anti-Hebbian learning rules (Földiak, 1989, 1990; Rubner & Tavan, 1989;
Rubner & Schulten, 1990; Carlson, 1990; Plumbley, 1993a, 1993b; Leen, 1991;
Linsker, 1997). The unsupervised setting is natural in biology because large-
scale labeled data sets are typically unavailable. Hebbian and anti-Hebbian
learning rules are biologically plausible because they are local: the weight
of an (anti-)Hebbian synapse is proportional to the (minus) correlation in
activity between the two neurons the synapse connects.
In networks for dimensionality reduction, for example, feedforward connections are trained with Hebbian rules and lateral connections with anti-Hebbian rules (see Figure 1).
Hebbian rules attempt to align each neuronal feature vector, whose com-
ponents are the weights of synapses impinging onto the neuron, with the
input space direction of greatest variance. Anti-Hebbian rules mediate com-
petition among neurons, which prevents their feature vectors from aligning
in the same direction. A rivalry between the two kinds of rules results in
the equilibrium where synaptic weight vectors span the principal subspace
of the input covariance matrix—the subspace spanned by the eigenvectors
corresponding to the largest eigenvalues.
However, in most existing single-layer networks (see Figure 1), Heb-
bian and anti-Hebbian learning rules were postulated rather than derived
from a principled objective. Having such a derivation should yield better-performing rules and a deeper understanding than has been achieved using
heuristic rules. But until recently, all derivations of single-layer networks
from principled objectives led to biologically implausible nonlocal learning
rules, where the weight of a synapse depends on the activities of neurons
other than the two the synapse connects.
Recently, single-layer networks with local learning rules have been
derived from similarity matching objective functions (Pehlevan, Hu, &
Chklovskii, 2015; Pehlevan & Chklovskii, 2014; Hu, Pehlevan, & Chklovskii,
2014). But why do similarity matching objectives lead to neural networks
with local, Hebbian, and anti-Hebbian learning rules? A clear answer to this
question has been lacking.
Here, we answer this question by performing several illuminating variable transformations. Specifically, we reduce the full network optimization problem to a set of trivial optimization problems for each synapse that can be solved locally. Eliminating neural activity variables leads to a min-max objective in terms of feedforward and lateral synaptic weight matrices. This finally formalizes the long-held intuition about the adversarial relationship of Hebbian and anti-Hebbian learning rules.

Figure 1: Dimensionality reduction neural networks derived by min-max optimization in the online setting. (A) Network with autapses. (B) Network without autapses.
In this article, we make the following contributions. In section 2, we
present a more transparent derivation of the previously proposed online
similarity matching algorithm for principal subspace projection (PSP). In
section 3, we propose a novel objective for PSP combined with spheriz-
ing, or whitening, the data, which we name principal subspace whitening
(PSW), and derive from it a biologically plausible online algorithm. In sec-
tions 2 and 3, we also demonstrate that stability in the offline setting guarantees projection onto the principal subspace and give principled learning rate recommendations. In section 4, by eliminating activity variables from the objectives, we derive min-max formulations of PSP and PSW that lend themselves to game-theoretical interpretations. In section 5, by expressing
the optimization objectives in terms of feedforward synaptic weights only,
we arrive at novel formulations of dimensionality reduction in terms of frac-
tional powers of matrices. In section 6, we demonstrate numerically that the
performance of our online algorithms is superior to the heuristic ones.
2 From Similarity Matching to Hebbian/Anti-Hebbian
Networks for PSP
2.1 Derivation of a Mixed PSP from Similarity Matching. The PSP
problem is formulated as follows. Given $T$ centered input data samples, $x_t \in \mathbb{R}^n$, find $T$ projections, $y_t \in \mathbb{R}^k$, onto the principal subspace ($k \leq n$), the subspace spanned by the eigenvectors corresponding to the $k$ top eigenvalues of the input covariance matrix:
$$ C \equiv \frac{1}{T}\sum_{t=1}^{T} x_t x_t^\top = \frac{1}{T} X X^\top, \qquad (2.1) $$
where we resort to a matrix notation by concatenating input column vectors into $X = [x_1, \ldots, x_T]$. Similarly, outputs are $Y = [y_1, \ldots, y_T]$.
Our goal is to derive a biologically plausible single-layer neural network
implementing PSP by optimizing a principled objective. Biological plausi-
bility requires that the learning rules are local, that is, the update of a synaptic weight depends only on the activity of the two neurons the synapse connects.
The only PSP objective known to yield a single-layer neural network with
local learning rules is based on similarity matching (Pehlevan et al., 2015).
This objective, borrowed from multidimensional scaling (MDS), minimizes
the mismatch between the similarity of inputs and outputs (Mardia, Kent,
& Bibby, 1980; Williams, 2001; Cox & Cox, 2000):
$$ \text{PSP:}\quad \min_{Y \in \mathbb{R}^{k\times T}} \frac{1}{T^2}\left\| X^\top X - Y^\top Y \right\|_F^2. \qquad (2.2) $$
Here, similarity is quantified by the inner products between all pairs of inputs (outputs) comprising the Grammians $X^\top X$ ($Y^\top Y$).
One can understand intuitively that the objective, equation 2.2, is optimized by the projection onto the principal subspace by considering the following (for a rigorous proof, see Pehlevan & Chklovskii, 2015; Mardia et al., 1980; Cox & Cox, 2000). First, substitute a singular value decomposition (SVD) for the matrices $X$ and $Y$ and note that the mismatch is minimized by matching the right singular vectors of $Y$ to those of $X$. Then, rotating the Grammians to the diagonal basis reduces the minimization problem to minimizing the mismatch between the corresponding squared singular values. Therefore, $Y$ is given by the top $k$ right singular vectors of $X$ scaled by the corresponding singular values. As the objective 2.2 is invariant to the left-multiplication of $Y$ by an orthogonal matrix, it has infinitely many degenerate solutions. One such solution corresponds to principal component analysis (PCA).
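As an illustration of this closed-form solution (not part of the original article), the following minimal NumPy sketch computes an optimal $Y$ for equation 2.2 from the top-$k$ right singular vectors of $X$; the variable names and the random test data are ours.

import numpy as np

rng = np.random.default_rng(0)
n, k, T = 10, 3, 2000
X = rng.standard_normal((n, T))           # centered inputs, one column per sample
X -= X.mean(axis=1, keepdims=True)

# Optimal Y for the similarity matching objective (eq. 2.2):
# top-k right singular vectors of X scaled by the corresponding singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = s[:k, None] * Vt[:k]                  # k x T matrix of principal components (scores)

# The objective is invariant to any rotation Q of the output: Q @ Y is equally optimal.
mismatch = np.linalg.norm(X.T @ X - Y.T @ Y) ** 2 / T ** 2
print("similarity mismatch:", mismatch)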
Unlike non-neural-network formulations of PSP or PCA, similarity matching outputs principal components (scores) rather than principal eigenvectors of the input covariance (loadings). Such a difference in formulation is motivated by our interest in PSP or PCA neural networks (Diamantaras & Kung, 1996) that output principal components, $y_t$, rather than principal eigenvectors. Principal eigenvectors are not transmitted downstream of the network but can be recovered computationally from the synaptic weight matrices. Although synaptic weights do not enter objective 2.2 (Pehlevan et al., 2015), they arise naturally in the derivation of the online algorithm (see below) and store correlations between input and output neural activities.
Next, we derive the min-max PSP objective from equation 2.2, starting by expanding the square of the Frobenius norm:
$$ \arg\min_{Y\in\mathbb{R}^{k\times T}} \frac{1}{T^2}\left\|X^\top X - Y^\top Y\right\|_F^2 = \arg\min_{Y\in\mathbb{R}^{k\times T}} \frac{1}{T^2}\operatorname{Tr}\left(-2\, X^\top X\, Y^\top Y + Y^\top Y\, Y^\top Y\right). \qquad (2.3) $$
We can rewrite equation 2.3 by introducing two new dynamical variable matrices in place of the covariance matrices $\frac{1}{T}XY^\top$ and $\frac{1}{T}YY^\top$:
$$ \min_{Y\in\mathbb{R}^{k\times T}} \min_{W\in\mathbb{R}^{k\times n}} \max_{M\in\mathbb{R}^{k\times k}} L_{\mathrm{PSP}}(W,M,Y), \quad\text{where} \qquad (2.4) $$
$$ L_{\mathrm{PSP}}(W,M,Y) \equiv \operatorname{Tr}\left(-\frac{4}{T} X^\top W^\top Y + \frac{2}{T} Y^\top M Y\right) + 2\operatorname{Tr}\left(W^\top W\right) - \operatorname{Tr}\left(M^\top M\right). \qquad (2.5) $$
To see that equation 2.5 is equivalent to equation 2.3, find the optimal $W^* = \frac{1}{T} Y X^\top$ and $M^* = \frac{1}{T} Y Y^\top$ by setting the corresponding derivatives of objective 2.5 to zero. Then substitute $W^*$ and $M^*$ into equation 2.5 to obtain equation 2.3.
Finally, we exchange the order of minimization with respect to $Y$ and $W$, as well as the order of minimization with respect to $Y$ and maximization with respect to $M$ in equation 2.5. The last exchange is justified by the saddle-point property (see proposition 1 in appendix A). Then we arrive at the following min-max optimization problem,
$$ \min_{W\in\mathbb{R}^{k\times n}} \max_{M\in\mathbb{R}^{k\times k}} \min_{Y\in\mathbb{R}^{k\times T}} L_{\mathrm{PSP}}(W,M,Y), \qquad (2.6) $$
where $L_{\mathrm{PSP}}(W,M,Y)$ is defined in equation 2.5. We call this a mixed objective because it includes both output variables, $Y$, and covariances, $W$ and $M$.
2.2 Ofine PSP Algorithm. In this section, we present an ofine op-
timization algorithm to solve the PSP problem and analyze xed points
of the corresponding dynamics. These results will be used in the next sec-
tion for the biologically plausible online algorithm implemented by neural
networks.
In the ofine setting, we can solve equation 2.6 by the alternat-
ing optimization approach used commonly in neural networks literature
(Olshausen & Field, 1996, 1997; Arora, Ge, Ma, & Moitra, 2015). We rst
minimize with respect to Ywhile keeping Wand Mxed,
Y=arg min
YRk×T
LPSP(W,M,Y),(2.7)
Why Do Similarity Matching Objectives Lead to Local Networks? 89
and, second, make a gradient descent-ascent step with respect to Wand M
while keeping Yxed:
WM
←− WM
+1
2ηLPSP(W,M,Y)
W
η
τ
LPSP(W,M,Y)
M,
(2.8)
where η/2istheWlearning rate and τ>0 is a ratio of learning rates for W
and M. In appendix C, we analyze how τaffects the linear stability of the
xed point dynamics. These two phases are iterated until convergence (see
algorithm 1).1
Optimal Yin equation 2.9 exists because Mstays positive denite if ini-
tialized as such.
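Algorithm 1 itself is given in the original article but is not reproduced above; the following is a minimal NumPy sketch (ours) of the alternating optimization it describes, equations 2.7 and 2.8, under assumed initialization choices (random $W$, identity $M$) and the hypothetical function name offline_psp.

import numpy as np

def offline_psp(X, k, n_iter=2000, eta=1e-3, tau=0.5, seed=0):
    # Offline PSP by alternating optimization of the min-max objective (eq. 2.6):
    # (i) Y = argmin_Y L_PSP = M^{-1} W X for fixed W, M (eq. 2.7);
    # (ii) gradient descent in W and ascent in M (eq. 2.8).
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)                      # positive definite initialization keeps M invertible
    for _ in range(n_iter):
        Y = np.linalg.solve(M, W @ X)  # optimal outputs for current W, M
        W += 2 * eta * (Y @ X.T / T - W)        # descent in W
        M += (eta / tau) * (Y @ Y.T / T - M)    # ascent in M
    return W, M

# Usage (hypothetical data): rows of F = M^{-1} W span the principal subspace at convergence.
# W, M = offline_psp(X, k=3); F = np.linalg.solve(M, W)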
2.3 Linearly Stable Fixed Points of Algorithm 1 Correspond to the PSP. Here we demonstrate that convergence of algorithm 1 to fixed $W$ and $M$ implies that $Y$ is a PSP of $X$. To this end, we approximate the gradient descent-ascent dynamics in the limit of small learning rate with the system of differential equations:
$$ Y(t) = M^{-1}(t)\, W(t)\, X, \qquad \frac{dW(t)}{dt} = \frac{2}{T} Y(t) X^\top - 2 W(t), \qquad \tau\frac{dM(t)}{dt} = \frac{1}{T} Y(t) Y(t)^\top - M(t), \qquad (2.11) $$
where $t$ is now the time index for the gradient descent-ascent dynamics.
To state our main result in theorem 1, we define the filter matrix $F(t)$ whose rows are neural filters,
$$ F(t) := M^{-1}(t)\, W(t), \qquad (2.12) $$
so that, according to equation 2.9,
$$ Y(t) = F(t)\, X. \qquad (2.13) $$
Theorem 1. Fixed points of the dynamical system 2.11 have the following properties:
1. The neural filters, $F$, are orthonormal, that is, $F F^\top = I$.
2. The neural filters span a $k$-dimensional subspace in $\mathbb{R}^n$ spanned by some $k$ eigenvectors of the input covariance matrix.
3. Stability of a fixed point requires that the neural filters span the principal subspace of $X$.
4. Suppose the neural filters span the principal subspace. Define
$$ \gamma_{ij} := 2 + \frac{(\sigma_i - \sigma_j)^2}{\sigma_i \sigma_j}, \qquad (2.14) $$
where $i = 1,\ldots,k$, $j = 1,\ldots,k$, and $\{\sigma_1,\ldots,\sigma_k\}$ are the top $k$ principal eigenvalues of $C$. We assume $\sigma_k \neq \sigma_{k+1}$. This fixed point is linearly stable if and only if
$$ \tau < \frac{1}{2 - 4/\gamma_{ij}} \qquad (2.15) $$
for all $(i,j)$ pairs. By linearly stable, we mean that linear perturbations of $W$ and $M$ converge to a configuration in which the new neural filters are merely rotations within the principal subspace of the original neural filters.
Proof. See appendix C.
Based on theorem 1, we claim that provided the dynamics converges to a fixed point, algorithm 1 has found a PSP of the input data. Note that the orthonormality of the neural filters is desired and consistent with PSP since, in this approach, the outputs, $Y$, are interpreted as coordinates with respect to a basis spanning the principal subspace.
Theorem 1 yields a practical recommendation for choosing learning rate parameters in simulations. In a typical situation, one will not know the eigenvalues of the covariance matrix a priori but can rely on the fact that $\gamma_{ij} \geq 2$. Then equation 2.15 implies that for $\tau \leq 1/2$, the principal subspace is linearly stable, leading to numerical convergence and stability.
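As a small illustration of this recommendation (ours, not from the article), one can evaluate $\gamma_{ij}$ of equation 2.14 and the bound of equation 2.15 for a hypothetical spectrum:

import numpy as np

# Top-k principal eigenvalues of C (hypothetical values).
sigma = np.array([3.0, 2.0, 1.0])

# gamma_ij (eq. 2.14) and the stability bound on tau (eq. 2.15) over all pairs.
gamma = 2 + (sigma[:, None] - sigma[None, :]) ** 2 / (sigma[:, None] * sigma[None, :])
with np.errstate(divide="ignore"):
    bound = 1.0 / (2.0 - 4.0 / gamma)      # infinite when gamma = 2 (i = j)
tau_max = bound.min()
print("principal subspace is linearly stable for tau <", tau_max)  # always >= 1/2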
2.4 Online Neural Min-Max Optimization Algorithms. Unlike the offline setting considered so far, where all the input data are available from the outset, in the online setting, input data are streamed to the algorithm sequentially, one at a time. The algorithm must compute the corresponding output before the next input arrives and transmit it downstream. Once transmitted, the output cannot be altered. Moreover, the algorithm cannot store in memory any sizable fraction of past inputs or outputs but only a few, $O(nk)$, state variables.
Whereas developing algorithms for the online setting is more challenging than for the offline setting, it is necessary both for data analysis and for modeling biological neural networks. The size of modern data sets may exceed that of available RAM, or the output must be computed before the data set is fully streamed. Biological neural networks operating on the data streamed by the sensory organs are incapable of storing any significant fraction of it and compute the output on the fly.
Pehlevan et al. (2015) gave a derivation of a neural online algorithm for PSP, starting from the original similarity matching cost function, equation 2.2. Here, instead, we start from the min-max form of similarity matching, equation 2.6, and end up with a class of algorithms that reduce to the algorithm of Pehlevan et al. (2015) for special choices of learning rates. Our main contribution, however, is that the current derivation is much more intuitive and simpler, with insights into why similarity matching leads to local learning rules.
We start by rewriting the min-max PSP objective, equation 2.6, as a sum of time-separable terms that can be optimized independently:
$$ \min_{W\in\mathbb{R}^{k\times n}} \max_{M\in\mathbb{R}^{k\times k}} \frac{1}{T}\sum_{t=1}^{T} l_{\mathrm{PSP},t}(W,M), \qquad (2.16) $$
where
$$ l_{\mathrm{PSP},t}(W,M) \equiv 2\operatorname{Tr}\left(W^\top W\right) - \operatorname{Tr}\left(M^\top M\right) + \min_{y_t\in\mathbb{R}^{k\times 1}} l_t(W,M,y_t) \qquad (2.17) $$
and
$$ l_t(W,M,y_t) = -4\, x_t^\top W^\top y_t + 2\, y_t^\top M y_t. \qquad (2.18) $$
This separation in time is a benefit of the min-max PSP objective, equation 2.6, and leads to a natural way to derive an online algorithm that was not available for the original similarity matching cost function, equation 2.2.
To solve the optimization problem, equation 2.16, in the online setting, we optimize each $l_{\mathrm{PSP},t}$ sequentially. For each $t$, first, minimize equation 2.18 with respect to $y_t$ while keeping $W_t$ and $M_t$ fixed. Second, make a gradient descent-ascent step with respect to $W_t$ and $M_t$ for fixed $y_t$:
$$ W_{t+1} = W_t - \frac{\eta_t}{2}\frac{\partial l_{\mathrm{PSP},t}(W_t,M_t)}{\partial W_t}, \qquad M_{t+1} = M_t + \frac{\eta_t}{2\tau}\frac{\partial l_{\mathrm{PSP},t}(W_t,M_t)}{\partial M_t}, \qquad (2.19) $$
where $0 \leq \eta_t/2 < 1$ is the $W$ learning rate and $\tau > 0$ is the ratio of the $W$ and $M$ learning rates. As before, proposition 2 (see appendix B) ensures that the alternating optimization (Olshausen & Field, 1996, 1997; Arora et al., 2015) of $l_{\mathrm{PSP},t}$ follows from gradient descent-ascent.
Algorithm 2 can be implemented by a biologically plausible neural network. The dynamics, equation 2.20, corresponds to neural activity in a recurrent circuit, where $W_t$ is the feedforward synaptic weight matrix and $M_t$ is the lateral synaptic weight matrix (see Figure 1A). Since $M_t$ is always positive definite, equation 2.18 is a Lyapunov function for neural activity. Hence the dynamics is guaranteed to converge to a unique fixed point, $y_t = M_t^{-1} W_t x_t$, where the matrix inversion is computed iteratively in a distributed manner.
Updates of covariance matrices, equation 2.21, can be interpreted as synaptic learning rules: Hebbian for feedforward and anti-Hebbian (due to the "$-$" sign in equation 2.20) for lateral synaptic weights. Importantly, these rules are local (the weight of each synapse depends only on the activity of the pair of neurons that synapse connects) and therefore biologically plausible.
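Algorithm 2 is given in the original article and not reproduced above; the sketch below (ours, in NumPy) is one way to implement the described updates, with the neural dynamics, equation 2.20, replaced by its fixed point $y_t = M_t^{-1} W_t x_t$ and with hypothetical initialization choices.

import numpy as np

def online_psp(X, k, eta=1e-3, tau=0.5, seed=0):
    # Streaming PSP consistent with eqs. 2.19-2.21: for each input x_t,
    # (i) the neural dynamics settles at y_t = M_t^{-1} W_t x_t,
    # (ii) Hebbian update of feedforward W, anti-Hebbian update of lateral M.
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)
    Y = np.zeros((k, T))
    for t in range(T):
        x = X[:, t]
        y = np.linalg.solve(M, W @ x)            # fixed point of the recurrent dynamics
        W += 2 * eta * (np.outer(y, x) - W)      # local, Hebbian: depends only on pre (x) and post (y)
        M += (eta / tau) * (np.outer(y, y) - M)  # local; enters the dynamics with a minus sign (anti-Hebbian)
        Y[:, t] = y
    return W, M, Y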
Even requiring full optimization with respect to $y_t$ versus a gradient step with respect to $W_t$ and $M_t$ may have a biological justification. As neural activity dynamics is typically faster than synaptic plasticity, it may settle before the arrival of the next input.
To see why similarity matching leads to local learning rules, let us consider equations 2.6 and 2.16. Aside from separating in time, which is useful for the derivation of online learning rules, $L_{\mathrm{PSP}}(W,M,Y)$ also separates in synaptic weights and their pre- and postsynaptic neural activities:
$$ L_{\mathrm{PSP}}(W,M,Y) = \frac{1}{T}\sum_t \left[ \sum_{ij}\left( 2 W_{ij}^2 - 4 W_{ij}\, x_{t,j}\, y_{t,i} \right) - \sum_{ij}\left( M_{ij}^2 - 2 M_{ij}\, y_{t,j}\, y_{t,i} \right) \right]. \qquad (2.22) $$
Therefore, a derivative with respect to a synaptic weight depends only on quantities accessible to that synapse.
Finally, we address two potential criticisms of the neural PSP algorithm. First is the existence of autapses (i.e., self-coupling of neurons) in our network, manifested in the nonzero diagonal of the lateral connectivity matrix, $M$ (see Figure 1A). Whereas autapses are encountered in the brain, they are rarely seen in principal neurons (Ikeda & Bekkers, 2006). Second is the symmetry of lateral synaptic weights in our network, which is not observed experimentally.
We derive an autapse-free network architecture (zeros on the diagonal of the lateral synaptic weight matrix $M_t$) with asymmetric lateral connectivity, Figure 1B, by using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage, equation 2.20 (see appendix F). The resulting algorithm produces the same outputs as the current algorithm and, for the special case $\tau = 1/2$ and $\eta_t = \eta/2$, reduces to the algorithm with "forgetting" of Pehlevan et al. (2015).
3 From Constrained Similarity Matching to Hebbian/Anti-Hebbian
Networks for PSW
The variable substitution method we introduced in the previous section can
be applied to other computational objectives in order to derive neural net-
works with local learning rules. To give an example, we derive a neural
network for PSW, which can be formulated as a constrained similarity
matching problem. This example also illustrates how an optimization con-
straint can be implemented by biological mechanisms.
3.1 Derivation of PSW from Constrained Similarity Matching. The PSW problem is closely related to PSP: project centered input data samples onto the principal subspace ($k \leq n$), and "spherize" the data in the subspace so that the variances in all directions are 1. To derive a neural PSW algorithm, we use the similarity matching objective with an additional constraint:
$$ \text{PSW:}\quad \min_{Y\in\mathbb{R}^{k\times T}} \frac{1}{T^2}\left\| X^\top X - Y^\top Y \right\|_F^2, \quad \text{s.t. } \frac{1}{T} Y Y^\top = I. \qquad (3.1) $$
We rewrite equation 3.1 by expanding the Frobenius norm squared and dropping the $\operatorname{Tr}\left(Y^\top Y\, Y^\top Y\right)$ term, which is constant under the constraint, thus reducing equation 3.1 to a constrained similarity alignment problem:
$$ \min_{Y\in\mathbb{R}^{k\times T}} -\frac{1}{T^2}\operatorname{Tr}\left( X^\top X\, Y^\top Y \right), \quad \text{s.t. } \frac{1}{T} Y Y^\top = I. \qquad (3.2) $$
To see that objective 3.2 is optimized by the PSW, first substitute a singular value decomposition (SVD) for the matrices $X$ and $Y$ and note that the alignment is maximized by matching the right singular vectors of $Y$ to those of $X$ and rotating to the diagonal basis (for a rigorous proof, see Pehlevan & Chklovskii, 2015). Since, under the constraint, the squared singular values of $Y$ all equal $T$, the objective, equation 3.2, is reduced to a summation of $k$ squared singular values of $X$ and is optimized by choosing the top $k$. Then $Y$ is given by the top $k$ right singular vectors of $X$ scaled by $\sqrt{T}$. As before, objective 3.2 is invariant to the left-multiplication of $Y$ by an orthogonal matrix and therefore has infinitely many degenerate solutions.
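As an illustration (ours, not from the article), the following NumPy sketch constructs an optimal $Y$ for equation 3.2 and checks the whitening constraint; the test data are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n, k, T = 10, 3, 2000
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)

# Optimal PSW output (eq. 3.2): top-k right singular vectors of X scaled by sqrt(T).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = np.sqrt(T) * Vt[:k]                       # k x T, whitened projections

print(np.allclose(Y @ Y.T / T, np.eye(k)))    # whitening constraint (1/T) Y Y^T = I holds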
Next, we derive a mixed PSW objective from equation 3.2 by introducing two new dynamical variable matrices: the input-output correlation matrix, $W = \frac{1}{T} Y X^\top$, and the Lagrange multiplier matrix, $M$, for the whitening constraint:
$$ \min_{Y\in\mathbb{R}^{k\times T}} \min_{W\in\mathbb{R}^{k\times n}} \max_{M\in\mathbb{R}^{k\times k}} L_{\mathrm{PSW}}(W,M,Y), \qquad (3.3) $$
where
$$ L_{\mathrm{PSW}}(W,M,Y) \equiv -\frac{2}{T}\operatorname{Tr}\left( X^\top W^\top Y \right) + \operatorname{Tr}\left( W^\top W \right) + \operatorname{Tr}\left( M\left( \frac{1}{T} Y Y^\top - I \right)\right). \qquad (3.4) $$
To see that equation 3.4 is equivalent to equation 3.2, find the optimal $W^* = \frac{1}{T} Y X^\top$ by setting the corresponding derivative of objective 3.4 to zero. Then substitute $W^*$ into equation 3.4 to obtain the Lagrangian of equation 3.2.
Finally, we exchange the order of minimization with respect to $Y$ and $W$, as well as the order of minimization with respect to $Y$ and maximization with respect to $M$ in equation 3.4 (see proposition 5 in appendix D for a proof). Then we arrive at the following min-max optimization problem with a mixed objective:
$$ \min_{W\in\mathbb{R}^{k\times n}} \max_{M\in\mathbb{R}^{k\times k}} \min_{Y\in\mathbb{R}^{k\times T}} L_{\mathrm{PSW}}(W,M,Y), \qquad (3.5) $$
where $L_{\mathrm{PSW}}(W,M,Y)$ is defined in equation 3.4.
3.2 Ofine PSW Algorithm. Next, we give an ofine algorithm for the
PSW problem, using the alternating optimization procedure as before. We
solve equation 3.5 by, rst, optimizing with respect to Yfor xed Wand M
and, second, making a gradient descent-ascent step with respect to Wand
Mwhile keeping Yxed.2We arrive at algorithm 3.
Convergence of algorithm 3 requires the input covariance matrix, C,to
have at least knonzero eigenvalues. Otherwise, a consistent solution can-
not be found because update 3.7 forces Yto be full rank while equation 3.6
lowers its rank.
3.3 Linearly Stable Fixed Points of Algorithm 3 Correspond to PSW. Here we claim that convergence of algorithm 3 to fixed $W$ and $M$ implies PSW of $X$. In the limit of small learning rate, the gradient descent-ascent dynamics can be approximated with the system of differential equations:
$$ Y(t) = M^{-1}(t)\, W(t)\, X, \qquad \frac{dW(t)}{dt} = \frac{2}{T} Y(t) X^\top - 2 W(t), \qquad \tau\frac{dM(t)}{dt} = \frac{1}{T} Y(t) Y(t)^\top - I, \qquad (3.8) $$
where $t$ is now the time index for the gradient descent-ascent dynamics. We again define the neural filter matrix $F = M^{-1} W$.
Theorem 2. Fixed points of the dynamical system, equation 3.8, have the following properties:
1. The outputs are whitened, that is, $\frac{1}{T} Y Y^\top = I$.
2. The neural filters span a $k$-dimensional subspace in $\mathbb{R}^n$ spanned by some $k$ eigenvectors of the input covariance matrix.
3. Stability of the fixed point requires that the neural filters span the principal subspace of $X$.
4. Suppose the neural filters span the principal subspace. This fixed point is linearly stable if and only if
$$ \tau < \frac{\sigma_i + \sigma_j}{2\left(\sigma_i - \sigma_j\right)^2} \qquad (3.9) $$
for all $(i,j)$ pairs, $i \neq j$. By linear stability, we mean that linear perturbations of $W$ and $M$ converge to a rotation of the original neural filters within the principal subspace.
Proof. See appendix E.
Based on theorem 2, we claim that, provided algorithm 3 converges, this fixed point corresponds to a PSW of the input data. Unlike the PSP case, the neural filters are not orthonormal.
3.4 Online Algorithm for PSW. As before, we start by rewriting the min-max PSW objective, equation 3.5, as a sum of time-separable terms that can be optimized independently:
$$ \min_{W\in\mathbb{R}^{k\times n}} \max_{M\in\mathbb{R}^{k\times k}} \frac{1}{T}\sum_{t=1}^{T} l_{\mathrm{PSW},t}(W,M), \qquad (3.10) $$
where
$$ l_{\mathrm{PSW},t}(W,M) \equiv \operatorname{Tr}\left(W^\top W\right) - \operatorname{Tr}\left(M\right) + \frac{1}{2}\min_{y_t\in\mathbb{R}^{k\times 1}} l_t(W,M,y_t), \qquad (3.11) $$
and $l_t(W,M,y_t)$ is defined in equation 2.18.
In the online setting, equation 3.10 can be optimized by sequentially minimizing each $l_{\mathrm{PSW},t}$. For each $t$, first, minimize equation 2.18 with respect to $y_t$ for fixed $W_t$ and $M_t$; second, update $W_t$ and $M_t$ according to a gradient descent-ascent step for fixed $y_t$:
$$ W_{t+1} = W_t - \eta_t \frac{\partial l_{\mathrm{PSW},t}(W_t,M_t)}{\partial W_t}, \qquad M_{t+1} = M_t + \frac{\eta_t}{\tau}\frac{\partial l_{\mathrm{PSW},t}(W_t,M_t)}{\partial M_t}, \qquad (3.12) $$
where $0 \leq \eta_t < 1$ is the $W$ learning rate and $\tau > 0$ is the ratio of the $W$ and $M$ learning rates. As before, proposition 2 (see appendix B) ensures that the alternating optimization (Olshausen & Field, 1996, 1997; Arora et al., 2015) of $l_{\mathrm{PSW},t}$ follows from gradient descent-ascent.
Algorithm 4 can be implemented by a biologically plausible single-layer neural network with lateral connections, as in algorithm 2 (see Figure 1A). Updates to synaptic weights, equation 3.14, are local Hebbian/anti-Hebbian plasticity rules. An autapse-free network architecture (see Figure 1B) may be obtained using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage, equation 3.13 (see appendix G).
The lateral connection weights are the Lagrange multipliers introduced in the offline problem, equation 3.4. In the PSP network, in contrast, they resulted from a variable transformation of the output covariance matrix. This difference carries over to the learning rules: in algorithm 4, the lateral learning rule enforces the whitening of the output, whereas in algorithm 2, the lateral learning rule sets the lateral weight matrix to the output covariance matrix.
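Algorithm 4 appears in the original article and is not reproduced above; the following NumPy sketch (ours) follows the updates described by equations 3.12 to 3.14, again replacing the neural dynamics with its fixed point and using hypothetical initialization choices.

import numpy as np

def online_psw(X, k, eta=1e-3, tau=0.1, seed=0):
    # Streaming PSW consistent with eqs. 3.12-3.14: the feedforward weights learn as in PSP,
    # while the lateral (Lagrange multiplier) weights push the output toward whiteness.
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)
    for t in range(T):
        x = X[:, t]
        y = np.linalg.solve(M, W @ x)                    # neural dynamics fixed point
        W += 2 * eta * (np.outer(y, x) - W)              # Hebbian feedforward update
        M += (eta / tau) * (np.outer(y, y) - np.eye(k))  # lateral update enforcing (1/T) Y Y^T = I
    return W, M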
4 Game-Theoretical Interpretation of Hebbian/Anti-Hebbian Learning

In the original similarity matching objective, equation 2.2, the only variables are neuronal activities, which, at the optimum, represent principal components. In section 2, we rewrote these objectives by introducing matrices $W$ and $M$ corresponding to synaptic connection weights, equation 2.5. Here, we eliminate neural activity variables altogether and arrive at a min-max formulation in terms of feedforward, $W$, and lateral, $M$, connection weight matrices only. This formulation lends itself to a game-theoretical interpretation.
Since in the ofine PSP setting, optimal Min equation 2.6 is an invert-
ible matrix (because M=1
TYY; see also appendix A), we can restrict
our optimization to invertible matrices, M, only. Then, we can optimize
objective, equation 2.5, with respect to Yand substitute its optimal value
Y=M1WX into equations 2.5 and 2.6 to obtain
min
WRk×nmax
MRk×k2
TTr XWM1WX+2Tr WWTr MM,
s.t. Mis invertible. (4.1)
This min-max objective admits a game-theoretical interpretation where
feedforward, W, and lateral, M, synaptic weight matrices oppose each
other. To reduce the objective, feedforward synaptic weight vectors of each
output neuron attempt to align with the direction of maximum variance of
input data. However, if this was the only driving force, then all output neu-
rons would learn the same synaptic weight vectors and represent the same
top principal component. At the same time, linear dependency between dif-
ferent feedforward synaptic weight vectors can be exploited by the lateral
synaptic weights to increase the objective by cancelling the contributions of
different components. To avoid this, the feedforward synaptic weight vec-
tors become linearly independent and span the principal subspace.
A similar interpretation can be given for PSW, where feedforward, W,
and lateral, M, synaptic weight matrices oppose each other adversarially.
5 Novel Formulations of Dimensionality Reduction Using
Fractional Exponents
In this section, we point to a new class of dimensionality reduction objective functions that naturally follow from the min-max objectives 2.5 and 2.6. Eliminating both the neural activity variables, $Y$, and the lateral connection weight matrix, $M$, we arrive at optimization problems in terms of the feedforward weight matrix, $W$, only. The rows of the optimal $W$ form a nonorthogonal basis of the principal subspace. Such formulations of principal subspace problems involve fractional exponents of matrices and, to the best of our knowledge, have not been proposed previously.
By replacing the $\max_M \min_Y$ optimization in the min-max PSP objective, equation 2.6, by its saddle-point value (see proposition 1 in appendix A), we find the following objective expressed solely in terms of $W$:
$$ \min_{W\in\mathbb{R}^{k\times n}} \operatorname{Tr}\left( -\frac{3}{T^{2/3}}\left( W X X^\top W^\top \right)^{2/3} + 2\, W W^\top \right). \qquad (5.1) $$
The rows of the optimal $W$ are not principal eigenvectors; rather, the row space of $W$ spans the principal subspace.
By replacing the $\max_M \min_Y$ optimization in the min-max PSW objective, equation 3.5, by its optimal value (see proposition 5 in appendix D), we obtain
$$ \min_{W\in\mathbb{R}^{k\times n}} \operatorname{Tr}\left( -\frac{2}{T^{1/2}}\left( W X X^\top W^\top \right)^{1/2} + W W^\top \right). \qquad (5.2) $$
As before, the rows of the optimal $W$ are not principal eigenvectors; rather, the row space of $W$ spans the principal eigenspace.
We observe that the only material difference between equations 5.1 and 5.2 is in the value of the fractional exponent. Based on this, we conjecture that any objective function of this form, with a fractional exponent from a continuous range, is optimized by a $W$ spanning the principal subspace. Such solutions would differ in the eigenvalues associated with the corresponding components.
A supporting argument for our conjecture comes from the work of Miao and Hua (1998), who studied the cost
$$ \min_{W\in\mathbb{R}^{k\times n}} \operatorname{Tr}\left( -\log\left( W X X^\top W^\top \right) + W W^\top \right). \qquad (5.3) $$
Equation 5.3 can be seen as a limiting case of our conjecture, where the fractional exponent goes to zero. Indeed, Miao and Hua (1998) proved that the rows of the optimal $W$ form an orthonormal basis for the principal eigenspace.
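The objectives 5.1 and 5.2 can be evaluated numerically with an eigendecomposition-based matrix power; the sketch below (ours, in NumPy) uses a hypothetical helper sym_power and assumes the argument $W X X^\top W^\top$ is symmetric positive semidefinite.

import numpy as np

def sym_power(S, p):
    # Fractional power of a symmetric positive semidefinite matrix via eigendecomposition.
    lam, Q = np.linalg.eigh(S)
    return (Q * np.clip(lam, 0.0, None) ** p) @ Q.T

def psp_objective(W, X):
    # Equation 5.1: Tr( -(3 / T^(2/3)) (W X X^T W^T)^(2/3) + 2 W W^T ).
    T = X.shape[1]
    G = W @ X @ X.T @ W.T
    return np.trace(-3.0 / T ** (2.0 / 3.0) * sym_power(G, 2.0 / 3.0) + 2.0 * W @ W.T)

def psw_objective(W, X):
    # Equation 5.2: Tr( -(2 / T^(1/2)) (W X X^T W^T)^(1/2) + W W^T ).
    T = X.shape[1]
    G = W @ X @ X.T @ W.T
    return np.trace(-2.0 / np.sqrt(T) * sym_power(G, 0.5) + W @ W.T)

# Usage (hypothetical data): psp_objective(np.random.randn(3, 10), np.random.randn(10, 2000))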
6 Numerical Experiments
Next, we test our ndings using a simple articial data set. We generated
an n=10 dimensional data set and simulated our ofine and online algo-
rithms to reduce this data set to k=3 dimensions, using different values of
the parameter τ. The results are plotted in Figures 2, 3, 4, and 5, along with
details of the simulations in the gures’ caption.
Consistent with theorems 1 and 2, small perturbations to the PSP and PSW fixed points decayed (solid lines) or grew (dashed lines) depending on the value of $\tau$ (see Figure 2A). Offline simulations that start from random initial conditions converged to the PSP (or the PSW) solution if the fixed point was linearly stable (see Figure 2B). Interestingly, the online algorithms' performances were very close to those of the offline (see Figure 2C).

Figure 2: Demonstration of the stability of the PSP (top row) and PSW (bottom row) algorithms. We constructed an $n = 10$ by $T = 2000$ data matrix $X$ from its SVD, where the left and right singular vectors are chosen randomly; the top three singular values are set to $\{\sqrt{3T}, \sqrt{2T}, \sqrt{T}\}$; and the rest of the singular values are chosen uniformly in $[0, 0.1\sqrt{T}]$. Learning rates were $\eta_t = 1/(10^3 + t)$. Errors were defined using the deviation of the neural filters from their optimal values (Pehlevan et al., 2015). Let $U$ be the $10\times 3$ matrix whose columns are the top three left singular vectors of $X$. PSP error: $\|F(t)^\top F(t) - U U^\top\|_F$; PSW error: $\|F(t)^\top F(t) - U S U^\top\|_F$, with $S = \mathrm{diag}([1/3, 1/2, 1])$ in Matlab notation. Solid (dashed) lines indicate linearly stable (unstable) choices of $\tau$. (A) Small perturbations to the fixed point. $W$ and $M$ matrices were initialized by adding a random gaussian variable, $\mathcal{N}(0, 10^{-6})$, elementwise to their fixed-point values. (B) Offline algorithm, initialized with random $W$ and $M$ matrices. (C) Online algorithm, initialized with the same initial condition as in panel B. A random column of $X$ is processed at each time.
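A minimal NumPy sketch (ours) of the data construction and error metrics described in the Figure 2 caption; the random seed and helper names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n, k, T = 10, 3, 2000

# Data matrix built from its SVD with the prescribed singular values.
s = np.concatenate([np.sqrt([3 * T, 2 * T, T]), rng.uniform(0, 0.1 * np.sqrt(T), n - 3)])
U = np.linalg.qr(rng.standard_normal((n, n)))[0]          # random left singular vectors
V = np.linalg.qr(rng.standard_normal((T, n)))[0]          # random right singular vectors
X = U @ np.diag(s) @ V.T

Utop = U[:, :3]                                           # top three left singular vectors
S = np.diag([1 / 3, 1 / 2, 1])

def psp_error(F):
    return np.linalg.norm(F.T @ F - Utop @ Utop.T)

def psw_error(F):
    return np.linalg.norm(F.T @ F - Utop @ S @ Utop.T)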
The error for linearly unstable simulations in Figure 2 saturates rather than blowing up. This may seem at odds with theorems 1 and 2, which stated that if there is a stable fixed point of the dynamics, it should be the PSP/PSW solution. A closer look resolves this paradox. In Figure 3, we plot the evolution of an element of the $M$ matrix in the offline algorithms for stable and unstable choices of $\tau$. When the principal subspace is linearly unstable, the synaptic weights exhibit undamped oscillations. The dynamics seems to be confined to a manifold with a fixed distance (in terms of the error metric) from the principal subspace. That the error does not grow to infinity is a result of the stabilizing effect of the min-max antagonism of the synaptic weights. Online algorithms behave similarly.

Figure 3: Evolution of a synaptic weight. The same data set was used as in Figure 2. $\eta = 10^{-3}$.

Figure 4: Effect of $\tau$ on performance. Error after $2\times 10^4$ gradient steps is plotted as a function of different choices of $\tau$. The same data set was used as in Figure 2 with the same network initialization and learning rates. Both curves start from $\tau = 0.01$ and go to the maximum value allowed for linear stability.
Figure 5: Comparison of the online PSP algorithm with the subspace network (Oja, 1989) and the GHA (Sanger, 1989). The data set and the error metric are as in Figure 2. For fairness of comparison, the learning rates in all networks were set to $\eta = 10^{-3}$. $\tau = 1/2$ for the online PSP algorithm. Feedforward connectivity matrices were initialized randomly. For the online PSP algorithm, the lateral connectivity matrix was initialized to the identity matrix. Curves show averages over 10 trials.
Next, we studied in detail the effect of the $\tau$ parameter on convergence (see Figure 4). In the offline algorithm, we plot the error after a fixed number of gradient steps as a function of $\tau$. For PSP, there is an optimal $\tau$. Decreasing $\tau$ beyond the optimal value does not lead to a degradation in performance; however, increasing it leads to a rapid increase in the error. For PSW, there is a plateau of low error for low values of $\tau$ but a rapid increase as one approaches the linear instability threshold. Online algorithms behave similarly.
Finally, we compared the performance of our online PSP algorithm to neural PSP algorithms with heuristic learning rules, such as the subspace network (Oja, 1989) and the generalized Hebbian algorithm (GHA) (Sanger, 1989), on the same data set. We found that our algorithm converges much faster (see Figure 5). Previously, the original similarity matching network (Pehlevan et al., 2015), a special case of the online PSP algorithm of this article, was shown to converge faster than the APEX (Kung, Diamantaras, & Taur, 1994) and Földiak's (1989) networks.
7 Discussion
In this article, through transparent variable substitutions, we demonstrated
why biologically plausible neural networks can be derived from similarity
matching objectives, mathematically formalized the adversarial relation-
ship between Hebbian feedforward and anti-Hebbian lateral connections
using min-max optimization, and formulated dimensionality reduction tasks as optimizations of fractional powers of matrices. The formalism we developed should generalize to unsupervised tasks other than dimensionality reduction and could provide a theoretical foundation for both natural and artificial neural networks.
In comparing our networks with biological ones, most importantly, our
networks rely only on local learning rules that can be implemented by
synaptic plasticity. While Hebbian learning is famously observed in neu-
ral circuits (Bliss & Lømo, 1973; Bliss & Gardner-Medwin, 1973), our net-
works also require anti-Hebbian learning, which can be interpreted as the
long-term potentiation of inhibitory postsynaptic potentials. Experimen-
tally, such long-term potentiation can arise from pairing action potentials
in inhibitory neurons with subthreshold depolarization of postsynaptic
pyramidal neurons (Komatsu, 1994; Maffei, Nataraj, Nelson, & Turrigiano,
2006). However, plasticity in inhibitory synapses does not have to be Heb-
bian, that is, depend on the correlation between pre- and postsynaptic ac-
tivity (Kullmann, Moreau, Bakiri, & Nicholson, 2012).
To make progress, we had to make several simplifications sacrificing biological realism. In particular, we assumed that neuronal activity is a continuous variable that would correspond to membrane depolarization (in graded potential neurons) or firing rate (in spiking neurons). We ignored the nonlinearity of the neuronal input-output function. Such a linear regime could be implemented via a resting state bias (in graded potential neurons) or resting firing rate (in spiking neurons).
The applicability of our networks as models of biological networks can
be judged by experimentally testing the following predictions. First, we pre-
dict a relationship between the feedforward and lateral synaptic weight ma-
trices that could be tested using modern connectomics data sets. Second, we
suggest that similarity of output activity matches that of the input, which
could be tested by neuronal population activity measurements using cal-
cium imaging.
Often the choice of a learning rate is crucial to the learning performance of neural networks. Here, we encountered a nuanced case where the ratio of feedforward and lateral weight learning rates, $\tau$, affects the learning performance significantly. First, there is a maximum value of such a ratio, beyond which the principal subspace solution is linearly unstable. The maximum value depends on the principal eigenvalues, but for PSP, $\tau \leq 1/2$ is always linearly stable. For PSW, there is not always a safe choice. Having the same learning rates for feedforward and lateral weights, $\tau = 1$, may actually be unstable. Second, linear stability is not the only thing that affects performance. In simulations, we observed for PSP that there is an optimal value of $\tau$. For PSW, decreasing $\tau$ seems to increase performance until a plateau is reached. This difference between PSP and PSW may be attributed to the difference of origins of lateral connectivity. In PSW algorithms, lateral weights originate from Lagrange multipliers enforcing an optimization constraint. Low $\tau$, meaning higher lateral learning rates, forces the network to satisfy the constraint during the whole evolution of the algorithm.
Based on these observations, we can make practical suggestions for the $\tau$ parameter. For PSP, $\tau = 1/2$ seems to be a good choice, which is also preferred from another derivation of an online similarity matching algorithm (Pehlevan et al., 2015). For PSW, the smaller the $\tau$, the better, although one should make sure that the lateral weight learning rate $\eta/\tau$ is still sufficiently small.
Appendix A: Proof of Strong Min-Max Property for PSP Objective

Here we show that minimization with respect to $Y$ and maximization with respect to $M$ can be exchanged in equation 2.5. We will make use of the following min-max theorem (Boyd & Vandenberghe, 2004), for which we give a proof for completeness.

Theorem 3. Let $f : \mathbb{R}^n \times \mathbb{R}^m \longrightarrow \mathbb{R}$. Suppose the saddle-point property holds, that is, $\exists\, a^* \in \mathbb{R}^n$, $b^* \in \mathbb{R}^m$ such that $\forall\, a \in \mathbb{R}^n$, $b \in \mathbb{R}^m$,
$$ f(a^*, b) \leq f(a^*, b^*) \leq f(a, b^*). \qquad (A.1) $$
Then
$$ \max_b \min_a f(a,b) = \min_a \max_b f(a,b) = f(a^*, b^*). \qquad (A.2) $$
Proof. $\forall\, c \in \mathbb{R}^n$, $\min_a \max_b f(a,b) \leq \max_b f(c,b)$, which implies
$$ \min_a \max_b f(a,b) \leq \max_b f(a^*, b) = f(a^*, b^*) = \min_a f(a, b^*) \leq \max_b \min_a f(a,b). \qquad (A.3) $$
Since $\max_b \min_a f(a,b) \leq \min_a \max_b f(a,b)$ is always true, we get an equality.
Now, we present the main result of this section.

Proposition 1. Define
$$ f(Y, M, A) := \operatorname{Tr}\left( -\frac{4}{T} A^\top Y + \frac{2}{T} Y^\top M Y \right) - \operatorname{Tr}\left( M^\top M \right), \qquad (A.4) $$
where $Y$, $M$, and $A$ are arbitrarily sized, real-valued matrices. $f$ obeys a strong min-max property:
$$ \min_Y \max_M f(Y,M,A) = \max_M \min_Y f(Y,M,A) = -\frac{3}{T^{2/3}}\operatorname{Tr}\left( \left( A A^\top \right)^{2/3} \right). \qquad (A.5) $$
Proof. We will show that the saddle-point property holds for equation A.4. Then the result follows from theorem 3.
If the saddle point exists, it is where $\nabla f = 0$:
$$ M^* = \frac{1}{T} Y^* Y^{*\top}, \qquad M^* Y^* = A. \qquad (A.6) $$
Note that $M^*$ is symmetric and positive semidefinite. Multiplying the first equation by $M^*$ on the left and the right, and using the second equation, we arrive at
$$ M^{*3} = \frac{1}{T} A A^\top. \qquad (A.7) $$
Solutions to equation A.6 are not unique because $M^*$ may not be invertible, depending on $A$. However, all solutions give the same value of $f$:
$$ f(Y^*, M^*, A) = \operatorname{Tr}\left( -\frac{4}{T} A^\top Y^* + \frac{2}{T} Y^{*\top} M^* Y^* \right) - \operatorname{Tr}\left( M^{*2} \right) = \operatorname{Tr}\left( -\frac{4}{T} Y^{*\top} M^* Y^* + \frac{2}{T} Y^{*\top} M^* Y^* \right) - \operatorname{Tr}\left( M^{*2} \right) = -3\operatorname{Tr}\left( M^{*2} \right) = -\frac{3}{T^{2/3}}\operatorname{Tr}\left( \left( A A^\top \right)^{2/3} \right). \qquad (A.8) $$
Now, we check if the saddle-point property, equation A.1, holds. The first inequality is satisfied:
$$ f(Y^*, M^*, A) - f(Y^*, M, A) = \operatorname{Tr}\left( \frac{2}{T} Y^{*\top}\left( M^* - M \right) Y^* \right) - \operatorname{Tr}\left( M^{*2} \right) + \operatorname{Tr}\left( M^\top M \right) = -2\operatorname{Tr}\left( M^* M \right) + \operatorname{Tr}\left( M^{*2} \right) + \operatorname{Tr}\left( M^\top M \right) = \left\| M^* - M \right\|_F^2 \geq 0. \qquad (A.9) $$
The second inequality is also satisfied:
$$ f(Y, M^*, A) - f(Y^*, M^*, A) = \operatorname{Tr}\left( -\frac{4}{T} A^\top\left( Y - Y^* \right) + \frac{2}{T} Y^\top M^* Y - \frac{2}{T} Y^{*\top} M^* Y^* \right) = \operatorname{Tr}\left( -\frac{4}{T} Y^{*\top} M^* Y + \frac{2}{T} Y^\top M^* Y + \frac{2}{T} Y^{*\top} M^* Y^* \right) = \frac{2}{T}\operatorname{Tr}\left( \left( Y - Y^* \right)^\top M^*\left( Y - Y^* \right) \right) \geq 0, \qquad (A.10) $$
where the last line follows from $M^*$ being positive semidefinite.
Equations A.9 and A.10 show that the saddle-point property, equation A.1, holds, and therefore max and min can be exchanged and the value of $f$ at the saddle point is $f(Y^*, M^*, A) = -\frac{3}{T^{2/3}}\operatorname{Tr}\left( \left( A A^\top \right)^{2/3} \right)$.
Appendix B: Taking a Derivative Using a Chain Rule

Proposition 2. Suppose a differentiable, scalar function $H(a_1, \ldots, a_m)$, where $a_i \in \mathbb{R}^{d_i}$ with arbitrary $d_i$. Assume a finite minimum with respect to $a_m$ exists for a given set of $\{a_1, \ldots, a_{m-1}\}$:
$$ h(a_1, \ldots, a_{m-1}) = \min_{a_m} H(a_1, \ldots, a_m), \qquad (B.1) $$
and the optimal $a_m^* = \arg\min_{a_m} H(a_1, \ldots, a_m)$ is a stationary point,
$$ \left.\frac{\partial H}{\partial a_m}\right|_{\{a_1,\ldots,a_{m-1},a_m^*\}} = 0. \qquad (B.2) $$
Then, for $i = 1, \ldots, m-1$,
$$ \left.\frac{\partial h}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1}\}} = \left.\frac{\partial H}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1},a_m^*\}}. \qquad (B.3) $$
Proof. The result follows from application of the chain rule and the stationarity of the minimum:
$$ \left.\frac{\partial h}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1}\}} = \left.\frac{\partial H}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1},a_m^*\}} + \left.\frac{\partial H}{\partial a_m}\right|_{\{a_1,\ldots,a_{m-1},a_m^*\}}\left.\frac{\partial a_m^*}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1}\}}, \qquad (B.4) $$
where the second term is zero due to equation B.2.
Appendix C: Proof of Theorem 1

Here we prove theorem 1 using methodology from Pehlevan et al. (2015).
The fixed points of equation 2.11 are (using $\bar{\ }$ for fixed point):
$$ \bar{W} = \bar{F} C, \qquad \bar{M} = \bar{F} C \bar{F}^\top, \qquad (C.1) $$
where $C$ is the input covariance matrix defined as in equation 2.1.

C.1 Proof of Item 1. The result follows from equations 2.12 and C.1:
$$ I = \bar{M}^{-1}\bar{M} = \bar{M}^{-1}\bar{F} C \bar{F}^\top = \bar{M}^{-1}\bar{W}\bar{F}^\top = \bar{F}\bar{F}^\top. \qquad (C.2) $$
C.2 Proof of Item 2. First note that at fixed points, $\bar{F}^\top\bar{F}$ and $C$ commute:
$$ \bar{F}^\top\bar{F}\, C = C\, \bar{F}^\top\bar{F}. \qquad (C.3) $$
Proof. The result follows from equations 2.12 and C.1:
$$ \bar{F}^\top\bar{F}\, C = \bar{F}^\top\bar{W} = \bar{F}^\top\bar{M}\bar{F} = \bar{W}^\top\bar{F} = C\,\bar{F}^\top\bar{F}. \qquad (C.4) $$
$\bar{F}^\top\bar{F}$ and $C$ share the same eigenvectors because they commute. Orthonormality of the neural filters, equation C.2, implies that the $k$ rows of $\bar{F}$ are degenerate eigenvectors of $\bar{F}^\top\bar{F}$ with unit eigenvalue. To see this, note that $\bar{F}\bar{F}^\top\bar{F} = \bar{F}$. Because the filters are degenerate, the corresponding $k$ shared eigenvectors of $C$ may not be the filters themselves but linear combinations of them. Nevertheless, the shared eigenvectors composed of filters span the same space as the filters.
Since we are interested in PSP, it is desirable that it is the top $k$ eigenvectors of $C$ that span the filter space. A linear stability analysis around the fixed point reveals that any other combination is unstable and that the principal subspace is stable if $\tau$ is chosen appropriately.
C.3 Proof of Item 3

C.3.1 Preliminaries. In order to perform a linear stability analysis, we linearize the system of equations 2.11 around the fixed point. Although equation 2.11 depends on $W$ and $M$, we will find it convenient to change variables and work with $F$ and $M$ instead.
Using the relation $F = M^{-1} W$, one can express linear perturbations of $F$ around its fixed point, $\delta F$, in terms of perturbations of $W$ and $M$:
$$ \delta F = \delta\!\left(M^{-1}\right)\bar{W} + \bar{M}^{-1}\delta W = -\bar{M}^{-1}\delta M\,\bar{F} + \bar{M}^{-1}\delta W. \qquad (C.5) $$
Linearization of equation 2.11 gives
$$ \frac{d\,\delta W}{dt} = 2\,\delta F\, C - 2\,\delta W \qquad (C.6) $$
and
$$ \tau\frac{d\,\delta M}{dt} = \delta F\, C\,\bar{F}^\top + \bar{F}\, C\,\delta F^\top - \delta M. \qquad (C.7) $$
Using these, we arrive at
$$ \frac{d\,\delta F}{dt} = -\frac{1}{\tau}\bar{M}^{-1}\left( \delta F\, C\,\bar{F}^\top + \bar{F}\, C\,\delta F^\top + (2\tau - 1)\,\delta M \right)\bar{F} + 2\,\bar{M}^{-1}\delta F\, C - 2\,\delta F. \qquad (C.8) $$
Equations C.7 and C.8 define a closed system of equations.
It will be useful to decompose $\delta F$ into components:³
$$ \delta F = \delta A\,\bar{F} + \delta S\,\bar{F} + \delta B\,\bar{G}, \qquad (C.9) $$
where $\delta A$ is a $k\times k$ antisymmetric matrix, $\delta S$ is a $k\times k$ symmetric matrix, and $\delta B$ is a $k\times(n-k)$ matrix. $\bar{G}$ is an $(n-k)\times n$ matrix with orthonormal rows, which are orthogonal to the rows of $\bar{F}$. $\delta A$ and $\delta S$ are perturbations that keep the neural filters within the filter space. Antisymmetric $\delta A$ corresponds to rotations of filters within the filter space, preserving orthonormality. Symmetric $\delta S$ destroys orthonormality. $\delta B$ is a perturbation that takes the neural filters outside the filter space.
³See lemma 3 in Pehlevan et al. (2015) for a proof of why such a decomposition always exists.
Let $v_{1,\ldots,n}$ be the eigenvectors of $C$ and $\sigma_{1,\ldots,n}$ be the corresponding eigenvalues. We label them such that $\bar{F}$ spans the same space as the space spanned by the first $k$ eigenvectors. We choose the rows of $\bar{G}$ to be the remaining eigenvectors, $\bar{G} := [v_{k+1}, \ldots, v_n]^\top$. Note that with this choice,
$$ \sum_m C_{im}\,\bar{G}^\top_{mj} = \sigma_{k+j}\,\bar{G}^\top_{ij}. \qquad (C.10) $$
C.3.2 Proof. The proof of item 3 in theorem 1 follows from studying the stability of the $\delta B$ component.
Multiplying equation C.8 on the right by $\bar{G}^\top$, one arrives at a decoupled equation for $\delta B$:
$$ \frac{d\,\delta B^j_i}{dt} = \sum_m P^j_{im}\,\delta B^j_m, \qquad P^j_{im} := 2\left( \bar{M}^{-1}_{im}\,\sigma_{k+j} - \delta_{im} \right), \qquad (C.11) $$
where, for convenience, we changed our notation to $\delta B_{kj} = \delta B^j_k$. For each $j$, the dynamics is linearly stable if all eigenvalues of $P^j$ are negative. In turn, this implies that for stability, the eigenvalues of $\bar{M}$ should be greater than $\sigma_{k+1,\ldots,n}$.
The eigenvalues of $\bar{M}$ are
$$ \mathrm{eig}(\bar{M}) = \{\sigma_1, \ldots, \sigma_k\}. \qquad (C.12) $$
Proof. The eigenvalue equation,
$$ \bar{F} C \bar{F}^\top u_\lambda = \lambda\, u_\lambda, \qquad (C.13) $$
implies that
$$ C\, \bar{F}^\top u_\lambda = \lambda\, \bar{F}^\top u_\lambda, \qquad (C.14) $$
which can be seen by multiplying equation C.13 on the left by $\bar{F}^\top$, using the commutation of $\bar{F}^\top\bar{F}$ and $C$, and the orthonormality of the neural filters. Further, orthonormality of the neural filters implies
$$ \bar{F}^\top\bar{F}\,\bar{F}^\top u_\lambda = \bar{F}^\top u_\lambda. \qquad (C.15) $$
Then $\bar{F}^\top u_\lambda$ is a shared eigenvector between $C$ and $\bar{F}^\top\bar{F}$.⁴ The shared eigenvectors of $C$ with unit eigenvalue in $\bar{F}^\top\bar{F}$ are $v_1, \ldots, v_k$. Since the eigenvalue of $\bar{F}^\top u_\lambda$ with respect to $\bar{F}^\top\bar{F}$ is 1, $\bar{F}^\top u_\lambda$ must be one of $v_1, \ldots, v_k$. Then equation C.14 implies that $\lambda \in \{\sigma_1, \ldots, \sigma_k\}$ and
$$ \mathrm{eig}(\bar{M}) = \{\sigma_1, \ldots, \sigma_k\}. \qquad (C.16) $$
⁴One might worry that $\bar{F}^\top u_\lambda = 0$, but this would require $\bar{F}\bar{F}^\top u_\lambda = u_\lambda = 0$, which is a contradiction.
Then it follows that linear stability requires
$$ \{\sigma_1, \ldots, \sigma_k\} > \{\sigma_{k+1}, \ldots, \sigma_n\}. \qquad (C.17) $$
This proves our claim that if, at the fixed point, the neural filters span a subspace other than the principal subspace, the fixed point is linearly unstable.
C.4 Proof of Item 4. We now assume that the fixed point is the principal subspace. From item 3, we know that the $\delta B$ perturbations are stable. The proof of item 4 in theorem 1 follows from the linear stabilities of $\delta A$ and $\delta S$.
Multiplying equation C.8 on the right by $\bar{F}^\top$,
$$ \frac{d\,\delta A}{dt} + \frac{d\,\delta S}{dt} = \left(2 - \frac{1}{\tau}\right)\left( \bar{M}^{-1}(\delta A + \delta S)\bar{M} - \bar{M}^{-1}\delta M - \delta A \right) - \left(2 + \frac{1}{\tau}\right)\delta S. \qquad (C.18) $$
Unlike the case of $\delta B$, this equation is coupled to $\delta M$, whose dynamics, equation C.7, reduces to
$$ \tau\frac{d\,\delta M}{dt} = (\delta A + \delta S)\bar{M} + \bar{M}(\delta S - \delta A) - \delta M. \qquad (C.19) $$
We will consider only symmetric $\delta M$ perturbations, although if antisymmetric perturbations were allowed, they would stably decay to zero because the only antisymmetric term on the right-hand side of equation C.19 would come from $\delta M$.
From equations C.18 and C.19, it follows that
$$ \frac{d}{dt}\left( \delta A + \delta S - (2\tau - 1)\,\bar{M}^{-1}\delta M \right) = -4\,\delta S. \qquad (C.20) $$
The right-hand side is symmetric. Therefore, the antisymmetric part of the left-hand side must equal zero. This gives us an integral of the dynamics,
$$ \Theta := \delta A(t) - \left(\tau - \tfrac{1}{2}\right)\left( \bar{M}^{-1}\delta M(t) - \delta M(t)\bar{M}^{-1} \right), \qquad (C.21) $$
where $\Theta$ is a constant, skew-symmetric matrix. This reveals an interesting point: after the perturbation, $\delta A$ and $\delta M$ will not decay to $0$ even if the fixed point is stable. In hindsight, this is expected because, due to the symmetry of the problem, there is a manifold of stable fixed points (bases in the principal subspace), and perturbations within this manifold should not decay. A similar situation was observed in Pehlevan et al. (2015).
The symmetric part of equation C.20 gives
$$ \frac{d}{dt}\left( \delta S - \left(\tau - \tfrac{1}{2}\right)\left( \bar{M}^{-1}\delta M + \delta M\bar{M}^{-1} \right) \right) = -4\,\delta S, \qquad (C.22) $$
which, using equation C.19, implies
$$ \frac{d\,\delta S}{dt} = \left(1 - \frac{1}{2\tau}\right)\left( \bar{M}^{-1}\delta A\bar{M} - \bar{M}\delta A\bar{M}^{-1} \right) + \left(1 - \frac{1}{2\tau}\right)\left( \bar{M}^{-1}\delta S\bar{M} + \bar{M}\delta S\bar{M}^{-1} + 2\,\delta S \right) - 4\,\delta S - \left(1 - \frac{1}{2\tau}\right)\left( \bar{M}^{-1}\delta M + \delta M\bar{M}^{-1} \right). \qquad (C.23) $$
To summarize, we analyze the linear stability of the system of equations defined by equations C.19, C.21, and C.23.
Next, we change to a basis where $\bar{M}$ is diagonal. $\bar{M}$ is symmetric, its eigenvalues are the principal eigenvalues $\{\sigma_1, \ldots, \sigma_k\}$ as shown in appendix C.3, and it has an orthonormal set of eigenvectors. Let $U$ be the matrix that contains the eigenvectors of $\bar{M}$ in its columns. Define
$$ \delta A^U := U^\top \delta A\, U, \qquad \delta S^U := U^\top \delta S\, U, \qquad \delta M^U := U^\top \delta M\, U, \qquad \Theta^U := U^\top \Theta\, U. \qquad (C.24) $$
Expressing equations C.19, C.21, and C.23 in this new basis, in component form, and eliminating $\delta A^U_{ij}$:
$$ \frac{d}{dt}\begin{pmatrix} \delta M^U_{ij} \\ \delta S^U_{ij} \end{pmatrix} = H_{ij}\begin{pmatrix} \delta M^U_{ij} \\ \delta S^U_{ij} \end{pmatrix} + \begin{pmatrix} \frac{1}{\tau}\left(\sigma_j - \sigma_i\right) \\ \left(1 - \frac{1}{2\tau}\right)\left(\frac{\sigma_j}{\sigma_i} - \frac{\sigma_i}{\sigma_j}\right) \end{pmatrix}\Theta^U_{ij}, \qquad (C.25) $$
where
$$ H_{ij} := \begin{pmatrix} \left(1 - \frac{1}{2\tau}\right)\left(\sigma_j - \sigma_i\right)\left(\frac{1}{\sigma_i} - \frac{1}{\sigma_j}\right) - \frac{1}{\tau} & \frac{1}{\tau}\left(\sigma_j + \sigma_i\right) \\ \left(1 - \frac{1}{2\tau}\right)\left(\left(\frac{\sigma_j}{\sigma_i} - \frac{\sigma_i}{\sigma_j}\right)\left(\tau - \frac{1}{2}\right)\left(\frac{1}{\sigma_i} - \frac{1}{\sigma_j}\right) - \left(\frac{1}{\sigma_i} + \frac{1}{\sigma_j}\right)\right) & \left(1 - \frac{1}{2\tau}\right)\left(\frac{\sigma_j}{\sigma_i} + \frac{\sigma_i}{\sigma_j} + 2\right) - 4 \end{pmatrix}. \qquad (C.26) $$
This is a closed system of equations for each $(i,j)$ pair! The fixed point of this system of equations is at
$$ \delta S^U_{ij} = 0, \qquad \delta M^U_{ij} = \frac{\Theta^U_{ij}}{\frac{1}{\sigma_j - \sigma_i} - \left(\tau - \frac{1}{2}\right)\left(\frac{1}{\sigma_i} - \frac{1}{\sigma_j}\right)}. \qquad (C.27) $$
Hence, if the linear perturbations are stable, the perturbations that destroy the orthonormality of the neural filters will decay to zero and orthonormality will be restored.
The stability of the fixed point is governed by the trace and the determinant of the matrix $H_{ij}$. The trace is
$$ \operatorname{Tr}\left(H_{ij}\right) = -4 - \frac{1}{\tau} + \left(2 - \frac{1}{\tau}\right)\left(\frac{\sigma_i}{\sigma_j} + \frac{\sigma_j}{\sigma_i}\right), \qquad (C.28) $$
and the determinant is
$$ \det\left(H_{ij}\right) = 8 + \left(\frac{2}{\tau} - 4\right)\left(\frac{\sigma_i}{\sigma_j} + \frac{\sigma_j}{\sigma_i}\right). \qquad (C.29) $$
The system C.25 is linearly stable if both the trace is negative and the determinant is positive.
Defining the following function of covariance eigenvalues,
$$ \gamma_{ij} := \frac{\sigma_i}{\sigma_j} + \frac{\sigma_j}{\sigma_i} = 2 + \frac{\left(\sigma_i - \sigma_j\right)^2}{\sigma_i\sigma_j}, \qquad (C.30) $$
the trace is negative if and only if
$$ \tau < \frac{1 + 1/\gamma_{ij}}{2 - 4/\gamma_{ij}}. \qquad (C.31) $$
The determinant is positive if and only if
$$ \tau < \frac{1}{2 - 4/\gamma_{ij}}. \qquad (C.32) $$
Since $\gamma_{ij} > 0$, equation C.32 implies equation C.31. For stability, equation C.32 has to be satisfied for all $(i,j)$ pairs. When $i = j$, $\gamma_{ii} = 2$ and equation C.32 is satisfied because the right-hand side is infinity. When $i \neq j$, equation C.32 is nontrivial and depends on relations between covariance eigenvalues. Since $\gamma_{ij} \geq 2$, $\tau \leq 1/2$ is always stable.
Collectively, our results prove item 4 of theorem 1.
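A numerical sanity check (ours) of the reconstruction above: assemble $H_{ij}$ as in equation C.26 and verify that its trace and determinant match equations C.28 and C.29, and that a pair is stable for $\tau = 1/2$. The eigenvalues used are hypothetical.

import numpy as np

def H(si, sj, tau):
    # Matrix H_ij of equation C.26 for a pair of principal eigenvalues (sigma_i, sigma_j).
    d = 1 - 1 / (2 * tau)
    return np.array([
        [d * (sj - si) * (1 / si - 1 / sj) - 1 / tau, (sj + si) / tau],
        [d * ((sj / si - si / sj) * (tau - 0.5) * (1 / si - 1 / sj) - (1 / si + 1 / sj)),
         d * (sj / si + si / sj + 2) - 4],
    ])

si, sj, tau = 3.0, 1.0, 0.5
Hij = H(si, sj, tau)
gamma = si / sj + sj / si
print(np.isclose(np.trace(Hij), -4 - 1 / tau + (2 - 1 / tau) * gamma))   # eq. C.28
print(np.isclose(np.linalg.det(Hij), 8 + (2 / tau - 4) * gamma))         # eq. C.29
print(np.trace(Hij) < 0 and np.linalg.det(Hij) > 0)                      # stable pair for tau <= 1/2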
Appendix D: Proof of Strong Min-Max Property for PSW Objective

Here we show that minimization with respect to $Y$ and maximization with respect to $M$ can be exchanged in equation 3.4. We do this by explicitly calculating the value of
$$ -\frac{2}{T}\operatorname{Tr}\left( X^\top W^\top Y \right) + \operatorname{Tr}\left( M\left( \frac{1}{T} Y Y^\top - I \right)\right) \qquad (D.1) $$
with respect to min-max and max-min optimizations, and showing that the value does not change.
Proposition 3. Let $A \in \mathbb{R}^{k\times T}$ with $k \leq T$. Then
$$ \min_{Y\in\mathbb{R}^{k\times T}} \max_{M\in\mathbb{R}^{k\times k}} -\frac{2}{T}\operatorname{Tr}\left( A^\top Y \right) + \operatorname{Tr}\left( M\left( \frac{1}{T} Y Y^\top - I \right)\right) = -\frac{2}{T^{1/2}}\operatorname{Tr}\left( \left( A A^\top \right)^{1/2} \right). \qquad (D.2) $$
Proof. The left side of equation D.2 is a constrained optimization problem:
$$ \min_{Y\in\mathbb{R}^{k\times T}} -\frac{2}{T}\operatorname{Tr}\left( A^\top Y \right) \quad \text{s.t. } \frac{1}{T} Y Y^\top = I. \qquad (D.3) $$
Suppose an SVD of $A = \sum_{i=1}^{k}\sigma_{A,i}\, u_{A,i} v_{A,i}^\top$ and an SVD of $Y = \sum_{i=1}^{k}\sigma_{Y,i}\, u_{Y,i} v_{Y,i}^\top$. The constraint sets $\sigma_{Y,i} = \sqrt{T}$. Then the optimization problem reduces to
$$ \min_{u_{Y,1},\ldots,u_{Y,k},\, v_{Y,1},\ldots,v_{Y,k}} -\frac{2}{\sqrt{T}}\sum_{i=1}^{k}\sigma_{A,i}\sum_{j=1}^{k} u_{A,i}^\top u_{Y,j}\; v_{A,i}^\top v_{Y,j}, \quad \text{s.t. } u_{Y,i}^\top u_{Y,j} = \delta_{ij},\ v_{Y,i}^\top v_{Y,j} = \delta_{ij}. \qquad (D.4) $$
Note that $\sum_{j=1}^{k} u_{A,i}^\top u_{Y,j}\; v_{A,i}^\top v_{Y,j} \leq 1$,⁵ and therefore the cost is lower bounded by $-\frac{2}{\sqrt{T}}\sum_{i=1}^{k}\sigma_{A,i}$. The lower bound is achieved when $u_{A,i} = u_{Y,i}$ and $v_{A,i} = v_{Y,i}$, with the optimal value of the objective $-\frac{2}{\sqrt{T}}\sum_{i=1}^{k}\sigma_{A,i} = -\frac{2}{\sqrt{T}}\operatorname{Tr}\left( \left( A A^\top \right)^{1/2} \right)$.
⁵Define $\alpha_j := u_{A,i}^\top u_{Y,j}$ and $\beta_j := v_{A,i}^\top v_{Y,j}$. Because $u_{Y,i}^\top u_{Y,j} = v_{Y,i}^\top v_{Y,j} = \delta_{ij}$, it follows that $\sum_{j=1}^{k}\alpha_j^2 = 1$ and $\sum_{j=1}^{k}\beta_j^2 \leq 1$. The sum in question is $\sum_{j=1}^{k}\alpha_j\beta_j$, which is an inner product of a unit vector and a vector with magnitude less than or equal to 1. Hence, the maximal inner product can be 1.
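A numerical check of proposition 3 (ours, not from the article): build the optimizer used in the proof of equation D.4 for a random full-rank $A$ and compare the constrained objective to the claimed value.

import numpy as np

rng = np.random.default_rng(0)
k, T = 3, 50
A = rng.standard_normal((k, T))             # full rank k with probability 1

Ua, sa, Vat = np.linalg.svd(A, full_matrices=False)
Y = np.sqrt(T) * Ua @ Vat                   # optimizer constructed as in the proof

lhs = -2 / T * np.trace(A.T @ Y)            # constrained objective D.3 at this Y
rhs = -2 / np.sqrt(T) * np.sum(sa)          # equals -(2/sqrt(T)) Tr((A A^T)^(1/2))
print(np.allclose(Y @ Y.T / T, np.eye(k)), np.isclose(lhs, rhs))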
Proposition 4. Let $A \in \mathbb{R}^{k\times T}$ with $k \leq T$. Then
$$ \max_{M\in\mathbb{R}^{k\times k}} \min_{Y\in\mathbb{R}^{k\times T}} -\frac{2}{T}\operatorname{Tr}\left( A^\top Y \right) + \operatorname{Tr}\left( M\left( \frac{1}{T} Y Y^\top - I \right)\right) = -\frac{2}{T^{1/2}}\operatorname{Tr}\left( \left( A A^\top \right)^{1/2} \right). \qquad (D.5) $$
Proof. Note that we need to consider only the symmetric part of $M$, because its antisymmetric component does not contribute to the cost. Below, we use $M$ to mean its symmetric part. We will evaluate the value of the objective
$$ -\frac{2}{T}\operatorname{Tr}\left( A^\top Y \right) + \operatorname{Tr}\left( M\left( \frac{1}{T} Y Y^\top - I \right)\right) \qquad (D.6) $$
considering the following cases:
1. $A = 0$. In this case, the first term in equation D.6 drops. Minimization of the second term with respect to $Y$ gives $-\infty$ if $M$ has a negative eigenvalue, or $0$ if $M$ is positive semidefinite. Hence, the max-min objective is zero, and the proposition holds.
2. $A \neq 0$ and $A$ is full rank.
   a. $M$ has at least one negative eigenvalue. Then minimization of equation D.6 with respect to $Y$ gives $-\infty$.
   b. $M$ is positive semidefinite and has at least one zero eigenvalue. Then minimization of equation D.6 with respect to $Y$ gives $-\infty$. To achieve this solution, one chooses all columns of $Y$ to be one of the zero eigenvectors. The sign of the eigenvector is chosen such that $\operatorname{Tr}\left(A^\top Y\right)$ is positive. Multiplying $Y$ by a positive scalar, one can reduce the objective indefinitely.
   c. $M$ is positive definite. Then $Y^* = M^{-1} A$ minimizes equation D.6 with respect to $Y$. Plugging this back into equation D.6, we get the objective
$$ -\frac{1}{T}\operatorname{Tr}\left( A^\top M^{-1} A \right) - \operatorname{Tr}\left( M \right). \qquad (D.7) $$
   The positive-definite $M$ that maximizes equation D.7 can be found by setting its derivative to zero:
$$ M^2 = \frac{1}{T} A A^\top. \qquad (D.8) $$
   Plugging this back into equation D.7, one gets the objective
$$ -\frac{2}{\sqrt{T}}\operatorname{Tr}\left( \left( A A^\top \right)^{1/2} \right), \qquad (D.9) $$
   which is maximal with respect to all possible $M$. Therefore, the proposition holds.
3. $A \neq 0$ and $A$ has rank $r < k$.
   a. $M$ has at least one negative eigenvalue. Then minimization of equation D.6 with respect to $Y$ gives $-\infty$, as before.
   b. $M$ is positive semidefinite and has at least one zero eigenvalue.
      i. If at least one of the zero eigenvectors of $M$ is not a left zero-singular vector of $A$, then minimization of equation D.6 with respect to $Y$ gives $-\infty$. To achieve this solution, one chooses all columns of $Y$ to be the zero eigenvector of $M$ that is not a left zero-singular vector of $A$. The sign of the eigenvector is chosen such that $\operatorname{Tr}\left(A^\top Y\right)$ is positive. Multiplying $Y$ by a positive scalar, one can reduce the objective indefinitely.
ii. If all of the zero eigenvectors of Mare also left zero-singular
vectors of A, then equation D.6 can be reformulated in the
subspace spanned by the top reigenvectors of M. Suppose an
SVD for A=r
i=1σA,iuA,iv
M,iwith σA,1σA,2...σA,r.
One can decompose Y=YA+Y, where columns of Yare
perpendicular to the space spanned by {uA,1,...,uA,r}. Then
the value of the objective equation D.6 depends only on
YA. Dening new matrices ˜
Ai,:=u
A,iA,˜
Yi,:=u
A,iYA,˜
Mij =
u
A,iMuA,j, where i,j=1,...,r, we can rewrite equation D.6
as
2
TTr ˜
A˜
Y+Tr ˜
M1
T˜
Y˜
YI.(D.10)
Now ˜
Ais full rank and ˜
Mis positive denite. As in 2c, the
objective, which is maximal with respect to positive-denite
˜
Mmatrices, is
2
TTr ˜
A˜
A1/2=− 2
TTr AA1/2.(D.11)
c. Mis positive denite. As in 2c, the objective, which is maximal
with respect to positive-denite Mmatrices, is
2
TTr AA1/2.(D.12)
This is also maximal with respect to all possible M. Therefore
the proposition holds.
Collectively, these arguments prove equation D.5.
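The saddle point identified in case 2c can be checked numerically. The following minimal Python sketch (ours, not part of the original derivation; it assumes only NumPy and SciPy and an arbitrary random matrix A) evaluates the objective of equation D.6 at $M=(AA^{\top}/T)^{1/2}$ and $Y=M^{-1}A$ and compares it with the claimed optimal value $-\frac{2}{\sqrt{T}}\operatorname{Tr}(AA^{\top})^{1/2}$.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
k, T = 3, 50
A = rng.standard_normal((k, T))

def objective(Y, M):
    # -2/T Tr(A^T Y) + Tr(M (Y Y^T / T - I)), the expression in equation D.6
    return (-2.0 / T) * np.trace(A.T @ Y) + np.trace(M @ (Y @ Y.T / T - np.eye(k)))

M_star = np.real(sqrtm(A @ A.T / T))   # equation D.8
Y_star = np.linalg.solve(M_star, A)    # Y = M^{-1} A from case 2c
value = objective(Y_star, M_star)
predicted = -2.0 / np.sqrt(T) * np.trace(np.real(sqrtm(A @ A.T)))
print(value, predicted)                # the two numbers agree up to numerical error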
Propositions 3 and 4 imply the strong min-max property for the PSW cost.

Proposition 5. The strong min-max property for the PSW cost:
\[
\begin{aligned}
&\min_{Y\in\mathbb{R}^{k\times T}}\;\max_{M\in\mathbb{R}^{k\times k}}\;-\frac{2}{T}\operatorname{Tr}\!\left(X^{\top}W^{\top}Y\right)+\operatorname{Tr}\!\left(M\Big(\frac{1}{T}YY^{\top}-I\Big)\right)\\
&\qquad=\max_{M\in\mathbb{R}^{k\times k}}\;\min_{Y\in\mathbb{R}^{k\times T}}\;-\frac{2}{T}\operatorname{Tr}\!\left(X^{\top}W^{\top}Y\right)+\operatorname{Tr}\!\left(M\Big(\frac{1}{T}YY^{\top}-I\Big)\right)\\
&\qquad=-\frac{2}{T^{1/2}}\operatorname{Tr}\!\left(WXX^{\top}W^{\top}\right)^{1/2}.
\end{aligned}
\tag{D.13}
\]
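As a sanity check of equation D.13, the min-max side can be evaluated directly: the inner maximization over $M$ forces the constraint $YY^{\top}/T=I$, and the constrained minimizer then follows from the SVD of $A=WX$ as in proposition 3. The short Python sketch below is ours; the matrix sizes and random data are arbitrary choices.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n, k, T = 5, 3, 100
X = rng.standard_normal((n, T))
W = rng.standard_normal((k, n))
A = W @ X

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U diag(s) Vt with U: k x k, Vt: k x T
Y_opt = np.sqrt(T) * U @ Vt                          # satisfies Y Y^T / T = I and aligns with A

min_max_value = (-2.0 / T) * np.trace(A.T @ Y_opt)   # the constraint term vanishes
closed_form = -2.0 / np.sqrt(T) * np.trace(np.real(sqrtm(W @ X @ X.T @ W.T)))
print(min_max_value, closed_form)                    # both equal -2/sqrt(T) times the sum of singular values of A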
Appendix E: Proof of Theorem 2
Here we prove theorem 2.
E.1 Proof of Item 1. Item 1 directly follows from the fixed-point equations of the dynamical system 3.8 (the bar denotes fixed-point values):
\[
\bar{W}=\frac{1}{T}\bar{Y}X^{\top}=\bar{F}C,
\qquad
I=\frac{1}{T}\bar{Y}\bar{Y}^{\top}=\bar{F}C\bar{F}^{\top}.
\tag{E.1}
\]
E.2 Proof of Item 2. We will prove item 2 making use of the normalized neural filters,
\[
R:=FC^{1/2},
\tag{E.2}
\]
where the input covariance matrix $C$ is defined as in equation 2.1. At the fixed point, the normalized neural filters are orthonormal:
\[
\bar{R}\bar{R}^{\top}=\bar{F}C\bar{F}^{\top}=\frac{1}{T}\bar{Y}\bar{Y}^{\top}=I.
\tag{E.3}
\]
The normalized filters commute with the covariance matrix:
\[
\bar{R}^{\top}\bar{R}C=C\bar{R}^{\top}\bar{R}.
\tag{E.4}
\]
Proof.
\[
\bar{R}^{\top}\bar{R}C
=C^{1/2}\bar{F}^{\top}\bar{F}C^{3/2}
=C^{1/2}\bar{F}^{\top}\bar{W}C^{1/2}
=C^{1/2}\bar{F}^{\top}\bar{M}\bar{F}C^{1/2}
=C^{1/2}\bar{W}^{\top}\bar{F}C^{1/2}
=C\,C^{1/2}\bar{F}^{\top}\bar{F}C^{1/2}
=C\bar{R}^{\top}\bar{R}.
\tag{E.5}
\]
Therefore, as argued in section C.2, the rows of $\bar{R}$ span a subspace spanned by some $k$ eigenvectors of $C$. If $C$ is invertible, the row space of $\bar{F}$ is the same as that of $\bar{R}$ (this follows from equation E.2), and item 2 follows.
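Items 1 and 2 can be illustrated numerically. The sketch below is ours; the candidate fixed point built from the top-$k$ eigenvectors of $C$ and the convention $C=XX^{\top}/T$ are assumptions consistent with the statements above, not taken from the original text. It verifies equations E.1 to E.4 and the relation $\bar{W}=\bar{M}\bar{F}$.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n, k, T = 5, 2, 1000
X = np.diag([3.0, 2.0, 1.5, 1.0, 0.5]) @ rng.standard_normal((n, T))
C = X @ X.T / T                                   # input covariance (assumed convention for equation 2.1)

evals, V = np.linalg.eigh(C)                      # eigenvalues in ascending order
Vk, lam_k = V[:, -k:], evals[-k:]                 # top-k eigenvectors and eigenvalues
F_bar = np.diag(lam_k ** -0.5) @ Vk.T             # candidate neural filters
W_bar = F_bar @ C                                 # fixed-point feedforward weights, equation E.1
R_bar = F_bar @ np.real(sqrtm(C))                 # normalized neural filters, equation E.2
M_bar = R_bar @ C @ R_bar.T                       # lateral weights, equation E.12

print(np.allclose(F_bar @ C @ F_bar.T, np.eye(k)))            # whitened outputs (E.1, E.3)
print(np.allclose(R_bar @ R_bar.T, np.eye(k)))                # orthonormal normalized filters (E.3)
print(np.allclose(R_bar.T @ R_bar @ C, C @ R_bar.T @ R_bar))  # commutation with C (E.4)
print(np.allclose(M_bar @ F_bar, W_bar))                      # consistency of W = M F at the fixed point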
E.3 Proof of Item 3.

E.3.1 Preliminaries. In order to perform a linear stability analysis, we linearize the system of equations 3.8 around the fixed point. The evolution of the $W$ and $M$ perturbations follows from linearization of equation 3.8:
\[
\tau\frac{d\,\delta M}{dt}=\delta R\,\bar{R}^{\top}+\bar{R}\,\delta R^{\top},
\qquad
\frac{d\,\delta W}{dt}=2\,\delta R\,C^{1/2}-2\,\delta W.
\tag{E.6}
\]
Although equation 3.8 depends on $W$ and $M$, we will find it convenient to change variables and work with $R$, as defined in equation E.2, and $M$ instead. Since $R$, $W$, and $M$ are interdependent, we express the perturbations of $R$ in terms of the $W$ and $M$ perturbations:
\[
\delta R=\delta\!\left(M^{-1}\right)\bar{W}C^{1/2}+\bar{M}^{-1}\delta W\,C^{1/2}
=-\bar{M}^{-1}\delta M\,\bar{R}+\bar{M}^{-1}\delta W\,C^{1/2},
\tag{E.7}
\]
which implies that
\[
\frac{d\,\delta R}{dt}=-\bar{M}^{-1}\frac{d\,\delta M}{dt}\bar{R}+\bar{M}^{-1}\frac{d\,\delta W}{dt}C^{1/2}.
\tag{E.8}
\]
Plugging these in and eliminating $\delta W$, we arrive at a linearized equation for $\delta R$:
\[
\frac{d\,\delta R}{dt}
=-\frac{1}{\tau}\bar{M}^{-1}\!\left(\delta R\,\bar{R}^{\top}+\bar{R}\,\delta R^{\top}+2\tau\,\delta M\right)\bar{R}
+2\bar{M}^{-1}\delta R\,C-2\,\delta R.
\tag{E.9}
\]
To assess the stability of $\delta R$, we expand it as in section C.3:
\[
\delta R=\delta A\,\bar{R}+\delta S\,\bar{R}+\delta B\,\bar{G},
\tag{E.10}
\]
where $\delta A$ is a $k\times k$ skew-symmetric matrix, $\delta S$ is a $k\times k$ symmetric matrix, and $\delta B$ is a $k\times(n-k)$ matrix. $\bar{G}$ is an $(n-k)\times n$ matrix with orthonormal rows, chosen to be orthogonal to the rows of $\bar{R}$. As before, the skew-symmetric $\delta A$ corresponds to rotations of the filters within the normalized filter space, the symmetric $\delta S$ keeps the normalized filter space invariant but destroys orthonormality, and $\delta B$ is a perturbation that takes the normalized neural filters outside the filter space.

Let $\mathbf{v}_{1},\ldots,\mathbf{v}_{n}$ be the eigenvectors of $C$ and $\sigma_{1},\ldots,\sigma_{n}$ be the corresponding eigenvalues. We label them such that $\bar{R}$ spans the same space as the space spanned by the first $k$ eigenvectors. We choose the rows of $\bar{G}$ to be the remaining eigenvectors, $\bar{G}:=[\mathbf{v}_{k+1},\ldots,\mathbf{v}_{n}]^{\top}$.
E.3.2 Proof. The proof of item 3 of theorem 2 follows from studying the stability of the $\delta B$ component. Multiplying equation E.9 on the right by $\bar{G}^{\top}$, we arrive at a decoupled evolution equation:
\[
\frac{d\,\delta B^{j}_{i}}{dt}=\sum_{m}P^{j}_{im}\,\delta B^{j}_{m},
\qquad
P^{j}_{im}:=2\left(\bar{M}^{-1}_{im}\,\sigma_{j+k}-\delta_{im}\right),
\tag{E.11}
\]
where for convenience we changed our notation to $\delta B_{kj}=\delta B^{j}_{k}$.
Equations E.1 and E.3 imply $\bar{M}^{2}=\bar{W}C\bar{W}^{\top}=\bar{R}C^{2}\bar{R}^{\top}$ and hence
\[
\bar{M}=\bar{R}C\bar{R}^{\top}.
\tag{E.12}
\]
Taking into account equations E.3 and E.4, the case at hand reduces to the proof presented in section C.3: stable solutions are those for which
\[
\{\sigma_{1},\ldots,\sigma_{k}\}>\{\sigma_{k+1},\ldots,\sigma_{n}\}.
\tag{E.13}
\]
This proves that if, at the fixed point, the normalized neural filters span a subspace other than the principal subspace, then the fixed point is linearly unstable. Since the span of the normalized neural filters is that of the neural filters, item 3 follows.
E.4 Proof of Item 4. The proof of item 4 follows from the linear stabilities of $\delta A$ and $\delta S$. Multiplying equation E.9 on the right by $\bar{R}^{\top}$ and separating the resulting equation into its symmetric and antisymmetric parts, we arrive at
\[
\begin{aligned}
\frac{d\,\delta A}{dt}&=-\frac{1}{\tau}\left(\bar{M}^{-1}\delta S-\delta S\,\bar{M}^{-1}\right)-\left(\bar{M}^{-1}\delta M-\delta M\,\bar{M}^{-1}\right)-2\,\delta A\\
&\quad+\bar{M}^{-1}\delta A\,\bar{M}+\bar{M}\,\delta A\,\bar{M}^{-1}+\bar{M}^{-1}\delta S\,\bar{M}-\bar{M}\,\delta S\,\bar{M}^{-1},\\
\frac{d\,\delta S}{dt}&=-\frac{1}{\tau}\left(\bar{M}^{-1}\delta S+\delta S\,\bar{M}^{-1}\right)-\left(\bar{M}^{-1}\delta M+\delta M\,\bar{M}^{-1}\right)-2\,\delta S\\
&\quad+\bar{M}^{-1}\delta A\,\bar{M}-\bar{M}\,\delta A\,\bar{M}^{-1}+\bar{M}^{-1}\delta S\,\bar{M}+\bar{M}\,\delta S\,\bar{M}^{-1}.
\end{aligned}
\tag{E.14}
\]
To obtain a closed set of equations, we complement these equations with the $\delta M$ evolution, which we obtain by plugging the expansion E.10 into equation E.6:
\[
\tau\frac{d\,\delta M}{dt}=2\,\delta S.
\tag{E.15}
\]
We consider only symmetric $\delta M$ below, since our algorithm preserves the symmetry of $M$ at runtime.

We now change to a basis in which $\bar{M}$ is diagonal. $\bar{M}$ is symmetric and has an orthonormal set of eigenvectors. Its eigenvalues are the principal eigenvalues $\{\sigma_{1},\ldots,\sigma_{k}\}$ (from section C.3). Let $U$ be the matrix that contains the eigenvectors of $\bar{M}$ in its columns. Define
\[
\delta A^{U}:=U^{\top}\delta A\,U,
\qquad
\delta S^{U}:=U^{\top}\delta S\,U,
\qquad
\delta M^{U}:=U^{\top}\delta M\,U.
\tag{E.16}
\]
In this new basis, the linearized equations, in component form, become
\[
\frac{d}{dt}
\begin{pmatrix}
\delta M^{U}_{ij}\\
\delta A^{U}_{ij}\\
\delta S^{U}_{ij}
\end{pmatrix}
=H_{ij}
\begin{pmatrix}
\delta M^{U}_{ij}\\
\delta A^{U}_{ij}\\
\delta S^{U}_{ij}
\end{pmatrix},
\tag{E.17}
\]
where
\[
H_{ij}:=
\begin{pmatrix}
0 & 0 & \dfrac{2}{\tau}\\[6pt]
\dfrac{1}{\sigma_{j}}-\dfrac{1}{\sigma_{i}} & -2+\dfrac{\sigma_{j}}{\sigma_{i}}+\dfrac{\sigma_{i}}{\sigma_{j}} & -\dfrac{1}{\tau}\!\left(\dfrac{1}{\sigma_{i}}-\dfrac{1}{\sigma_{j}}\right)+\dfrac{\sigma_{j}}{\sigma_{i}}-\dfrac{\sigma_{i}}{\sigma_{j}}\\[6pt]
-\dfrac{1}{\sigma_{j}}-\dfrac{1}{\sigma_{i}} & \dfrac{\sigma_{j}}{\sigma_{i}}-\dfrac{\sigma_{i}}{\sigma_{j}} & -\dfrac{1}{\tau}\!\left(\dfrac{1}{\sigma_{i}}+\dfrac{1}{\sigma_{j}}\right)+\dfrac{\sigma_{j}}{\sigma_{i}}+\dfrac{\sigma_{i}}{\sigma_{j}}-2
\end{pmatrix}.
\tag{E.18}
\]
Linear stability is governed by the three eigenvalues of $H_{ij}$. One of the eigenvalues is 0, due to the rotational symmetry in the problem. The corresponding eigenvector is $\left(\sigma_{j}-\sigma_{i},\,1,\,0\right)^{\top}$. Note that the third element of the eigenvector is zero, showing that the orthogonality of the normalized neural filters is not spoiled even in this mode.

For stability of the principal subspace, the other two eigenvalues must be negative, which means that their sum should be negative and their product should be positive. It is easy to show that both the negativity of the sum and the positivity of the product hold if and only if, for all $(i,j)$ pairs with $i\neq j$,
\[
\tau<\frac{\sigma_{i}+\sigma_{j}}{2\left(\sigma_{i}-\sigma_{j}\right)^{2}}.
\tag{E.19}
\]
Hence we have shown that linear perturbations of the fixed-point weights decay to a configuration in which the normalized neural filters are rotations of the original normalized neural filters within the subspace. It follows from equation E.2 that the same holds for the neural filters.
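The structure of $H_{ij}$ and the threshold in equation E.19 can be checked directly. The following sketch is ours; the particular eigenvalues and learning rate values are arbitrary. It builds $H_{ij}$ for a pair $(\sigma_i,\sigma_j)$, confirms the zero mode with eigenvector $(\sigma_j-\sigma_i,1,0)^{\top}$, and checks that the two remaining eigenvalues have negative real part exactly when $\tau$ is below the bound of equation E.19.

import numpy as np

def H(si, sj, tau):
    # the stability matrix of equation E.18 for the pair (sigma_i, sigma_j)
    return np.array([
        [0.0, 0.0, 2.0 / tau],
        [1/sj - 1/si, -2 + sj/si + si/sj, -(1/si - 1/sj)/tau + sj/si - si/sj],
        [-1/sj - 1/si, sj/si - si/sj, -(1/si + 1/sj)/tau + sj/si + si/sj - 2],
    ])

si, sj = 4.0, 1.5
threshold = (si + sj) / (2.0 * (si - sj) ** 2)      # right-hand side of equation E.19
for tau in (0.1, 2.0):                               # one value below, one above the threshold
    eig = np.linalg.eigvals(H(si, sj, tau))
    nonzero = [e.real for e in eig if abs(e) > 1e-9]
    print(tau < threshold, all(e < 0 for e in nonzero))
    print(np.allclose(H(si, sj, tau) @ np.array([sj - si, 1.0, 0.0]), 0.0))  # zero mode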
Appendix F: Autapse-Free Similarity Matching Network with
Asymmetric Lateral Connectivity
Here, we derive an alternative neural network algorithm for PSP that does not feature autaptic connections and has asymmetric lateral connections. To this end, we replace the gradient descent neural dynamics defined by equation 2.20 with coordinate descent dynamics.
In the coordinate descent approach, at every step one finds the optimal value of one component of $\mathbf{y}_{t}$ while keeping the rest fixed. By taking the derivative of the cost $-4\mathbf{x}_{t}^{\top}W_{t}^{\top}\mathbf{y}_{t}+2\mathbf{y}_{t}^{\top}M_{t}\mathbf{y}_{t}$ with respect to $y_{t,i}$ and setting it to zero, we find
\[
y_{t,i}=\sum_{j}\frac{W_{t,ij}}{M_{t,ii}}\,x_{t,j}-\sum_{j\neq i}\frac{M_{t,ij}}{M_{t,ii}}\,y_{t,j}.
\tag{F.1}
\]
The components can be cycled through in any order until the iteration converges to a fixed point. The iteration is guaranteed to converge under very mild assumptions: the diagonal elements of $M$ have to be positive (Luo & Tseng, 1991), which is satisfied if $M$ is initialized that way (see equation 2.21). Finally, equation F.1 can be interpreted as a Gauss-Seidel iteration, and generalizations to other iterative schemes are possible (see Pehlevan et al., 2015).
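A minimal Python sketch of the coordinate descent iteration F.1 follows. It is ours, not part of the original algorithm description; the weight matrices, the input, and the fixed number of sweeps (in place of a convergence test) are illustrative assumptions.

import numpy as np

def neural_dynamics(W, M, x, n_sweeps=500):
    # Cycle through the output components, applying equation F.1 repeatedly.
    k = W.shape[0]
    y = np.zeros(k)
    b = W @ x                                    # feedforward drive, fixed during the iteration
    for _ in range(n_sweeps):
        for i in range(k):
            lateral = M[i] @ y - M[i, i] * y[i]  # sum over j != i of M_ij y_j
            y[i] = (b[i] - lateral) / M[i, i]
    return y

rng = np.random.default_rng(3)
n, k = 10, 3
W = rng.standard_normal((k, n))
G = rng.standard_normal((k, k))
M = G @ G.T + np.eye(k)                          # symmetric with positive diagonal, so the iteration converges
x = rng.standard_normal(n)
y = neural_dynamics(W, M, x)
print(np.allclose(M @ y, W @ x))                 # the fixed point satisfies M y = W x

At convergence the output satisfies $M\mathbf{y}_{t}=W\mathbf{x}_{t}$, the minimizer of the quadratic cost above.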
The coordinate descent iteration, equation F.1, can be interpreted as the dynamics of an asynchronous autapse-free neural network (see Figure 1B), where the synaptic weights are
\[
\tilde{W}_{t,ij}=\frac{W_{t,ij}}{M_{t,ii}},
\qquad
\tilde{M}_{t,ij}=\frac{M_{t,ij}}{M_{t,ii}},
\qquad
\tilde{M}_{t,ii}=0.
\tag{F.2}
\]
With this definition, the lateral weights are now asymmetric because, in general, $M_{t,ii}\neq M_{t,jj}$ for $i\neq j$.
We can derive updates for these synaptic weights from the updates for $W_{t}$ and $M_{t}$ (see equation 2.21). By defining another scalar state variable for each neuron $i$, $\tilde{D}_{t,i}:=\tau M_{t,ii}/\eta_{t-1}$, we arrive at (see footnote 6)
\[
\begin{aligned}
\tilde{D}_{t+1,i}&=\frac{\eta_{t-1}}{\eta_{t}}\left(1-\frac{\eta_{t}}{\tau}\right)\tilde{D}_{t,i}+y_{t,i}^{2},\\
\tilde{W}_{t+1,ij}&=\frac{1-2\eta_{t}}{1-\eta_{t}/\tau}\,\tilde{W}_{t,ij}
+\frac{1}{\tilde{D}_{t+1,i}}\left(2\tau\,y_{t,i}x_{t,j}-\frac{1-2\eta_{t}}{1-\eta_{t}/\tau}\,y_{t,i}^{2}\,\tilde{W}_{t,ij}\right),\\
\tilde{M}_{t+1,i,j\neq i}&=\tilde{M}_{t,ij}+\frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j}-y_{t,i}^{2}\,\tilde{M}_{t,ij}\right),\\
\tilde{M}_{t+1,ii}&=0.
\end{aligned}
\tag{F.3}
\]
Here, in addition to the synaptic weights, the neurons need to keep track of a postsynaptic-activity-dependent variable $\tilde{D}_{t,i}$ and the gradient descent-ascent learning rate parameters $\eta_{t}$, $\eta_{t-1}$, and $\tau$. The updates are local.

Footnote 6: These update rules can be derived as follows. Start from the definition of the synaptic weights, equation F.2: $M_{t+1,ii}\tilde{M}_{t+1,ij}=M_{t+1,ij}$. By the gradient-descent update, equation 2.21, $M_{t+1,ij}=\left(1-\frac{\eta_{t}}{\tau}\right)M_{t,ij}+\frac{\eta_{t}}{\tau}y_{t,i}y_{t,j}=\left(1-\frac{\eta_{t}}{\tau}\right)\tilde{M}_{t,ij}M_{t,ii}+\frac{\eta_{t}}{\tau}y_{t,i}y_{t,j}$, where in the second equality we again used equation F.2. But note that $\left(1-\frac{\eta_{t}}{\tau}\right)M_{t,ii}=M_{t+1,ii}-\frac{\eta_{t}}{\tau}y_{t,i}^{2}$, from equation 2.21. Combining all of these, $\tilde{M}_{t+1,ij}=\tilde{M}_{t,ij}+\frac{\eta_{t}}{\tau M_{t+1,ii}}\left(y_{t,i}y_{t,j}-y_{t,i}^{2}\tilde{M}_{t,ij}\right)$. A similar derivation can be given for the feedforward updates.
For the special case of $\tau=1/2$ and $\eta_{t}=\eta/2$, these plasticity rules simplify to
\[
\begin{aligned}
\tilde{D}_{t+1,i}&=(1-\eta)\,\tilde{D}_{t,i}+y_{t,i}^{2},\\
\tilde{W}_{t+1,ij}&=\tilde{W}_{t,ij}+\frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}x_{t,j}-y_{t,i}^{2}\,\tilde{W}_{t,ij}\right),\\
\tilde{M}_{t+1,i,j\neq i}&=\tilde{M}_{t,ij}+\frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j}-y_{t,i}^{2}\,\tilde{M}_{t,ij}\right),\\
\tilde{M}_{t+1,ii}&=0,
\end{aligned}
\tag{F.4}
\]
which is precisely the neural online similarity matching algorithm we previously gave in Pehlevan et al. (2015). Both the feedforward and lateral updates have the same form as the single-neuron Oja rule (Oja, 1982).
Note that the algorithm derived above is essentially the same as the one in the main text: given the same initial conditions and the same inputs, $\mathbf{x}_{t}$, they will produce the same outputs, $\mathbf{y}_{t}$. The only difference is a rearrangement of synaptic weights in the neural network implementation.
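For concreteness, here is a short Python sketch of the simplified rules in equation F.4. It is ours, not the authors' code; the learning rate, the initialization, and the use of coordinate descent in the rescaled weights for the neural dynamics are illustrative assumptions.

import numpy as np

def neural_dynamics(W_til, M_til, x, n_sweeps=50):
    # Asynchronous dynamics in the rescaled weights: y_i = (W~ x)_i - sum_j M~_ij y_j,
    # where M~ has zero diagonal (no autapses), i.e., equation F.1 after the rescaling F.2.
    y = np.zeros(W_til.shape[0])
    for _ in range(n_sweeps):
        for i in range(len(y)):
            y[i] = W_til[i] @ x - M_til[i] @ y
    return y

def online_step(x, W_til, M_til, D, eta=0.05):
    # One application of the local plasticity rules in equation F.4.
    y = neural_dynamics(W_til, M_til, x)
    D_new = (1.0 - eta) * D + y ** 2                                   # per-neuron scalar state
    W_new = W_til + (np.outer(y, x) - (y ** 2)[:, None] * W_til) / D_new[:, None]
    M_new = M_til + (np.outer(y, y) - (y ** 2)[:, None] * M_til) / D_new[:, None]
    np.fill_diagonal(M_new, 0.0)                                       # keep the network autapse-free
    return W_new, M_new, D_new, y

rng = np.random.default_rng(4)
n, k = 10, 3
W_til, M_til, D = 0.1 * rng.standard_normal((k, n)), np.zeros((k, k)), np.ones(k)
for _ in range(200):
    W_til, M_til, D, y = online_step(rng.standard_normal(n), W_til, M_til, D)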
Appendix G: Autapse-Free Constrained Similarity Matching Network
with Asymmetric Lateral Connectivity
Following steps similar to those in appendix F, we derive an autapse-free PSW neural algorithm with asymmetric lateral connections. We replace the gradient descent neural dynamics defined by equation 3.13 with coordinate descent dynamics in which, at every step, one finds the optimal value of one component of $\mathbf{y}_{t}$ while keeping the rest fixed:
\[
y_{t,i}=\sum_{j}\frac{W_{t,ij}}{M_{t,ii}}\,x_{t,j}-\sum_{j\neq i}\frac{M_{t,ij}}{M_{t,ii}}\,y_{t,j}.
\tag{G.1}
\]
The components can be cycled through in any order until the iteration converges to a fixed point.
The coordinate descent iteration, equation G.1, can be interpreted as the dynamics of an asynchronous autapse-free neural network (see Figure 1B) with synaptic weights
\[
\tilde{W}_{t,ij}=\frac{W_{t,ij}}{M_{t,ii}},
\qquad
\tilde{M}_{t,ij}=\frac{M_{t,ij}}{M_{t,ii}},
\qquad
\tilde{M}_{t,ii}=0.
\tag{G.2}
\]
As in appendix F, the new lateral weights are asymmetric.
Updates for these synaptic weights can be derived from the updates for $W_{t}$ and $M_{t}$ (see equation 3.14). Defining another scalar state variable for each neuron $i$, $\tilde{D}_{t,i}:=\tau M_{t,ii}/\eta_{t-1}$, we arrive at
\[
\begin{aligned}
\tilde{D}_{t+1,i}&=\frac{\eta_{t-1}}{\eta_{t}}\left(1-\frac{\eta_{t}}{\tau}\right)\tilde{D}_{t,i}+y_{t,i}^{2}-1,\\
\tilde{W}_{t+1,ij}&=\left(1-2\eta_{t}\right)\tilde{W}_{t,ij}
+\frac{1}{\tilde{D}_{t+1,i}}\left(2\tau\,y_{t,i}x_{t,j}-\left(1-2\eta_{t}\right)\left(y_{t,i}^{2}-1\right)\tilde{W}_{t,ij}\right),\\
\tilde{M}_{t+1,i,j\neq i}&=\tilde{M}_{t,ij}+\frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j}-\left(y_{t,i}^{2}-1\right)\tilde{M}_{t,ij}\right),\\
\tilde{M}_{t+1,ii}&=0.
\end{aligned}
\tag{G.3}
\]
As in appendix F, in addition to the synaptic weights, the neurons need to keep track of a postsynaptic-activity-dependent variable $\tilde{D}_{t,i}$ and the gradient descent-ascent learning rate parameters $\eta_{W,t}$, $\eta_{M,t}$, and $\eta_{M,t-1}$.
For the special case of $\eta_{t}=\eta/2$ and $\tau=1/2$, these plasticity rules simplify to
\[
\begin{aligned}
\tilde{D}_{t+1,i}&=(1-\eta)\,\tilde{D}_{t,i}+y_{t,i}^{2},\\
\tilde{W}_{t+1,ij}&=(1-\eta)\,\tilde{W}_{t,ij}+\frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}x_{t,j}-(1-\eta)\left(y_{t,i}^{2}-1\right)\tilde{W}_{t,ij}\right),\\
\tilde{M}_{t+1,i,j\neq i}&=\tilde{M}_{t,ij}+\frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j}-\left(y_{t,i}^{2}-1\right)\tilde{M}_{t,ij}\right),
\end{aligned}
\]