
ARTICLE Communicated by Sebastian Seung

Why Do Similarity Matching Objectives Lead

to Hebbian/Anti-Hebbian Networks?

Cengiz Pehlevan

cpehlevan@flatironinstitute.org

Center for Computational Biology, Flatiron Institute, New York,

NY 10010, U.S.A.

Anirvan M. Sengupta

anirvans@physics.rutgers.edu

Center for Computational Biology, Flatiron Institute, New York,

NY 10010, U.S.A., and Physics and Astronomy Department,

Rutgers University, New Brunswick, NJ 08901, U.S.A.

Dmitri B. Chklovskii

dchklovskii@flatironinstitute.org

Center for Computational Biology, Flatiron Institute, New York, NY 10010, U.S.A.,

and NYU Langone Medical Center, New York 10016, U.S.A.

Modeling self-organization of neural networks for unsupervised learning using Hebbian and anti-Hebbian plasticity has a long history in neuroscience. Yet derivations of single-layer networks with such local learning rules from principled optimization objectives became possible only recently, with the introduction of similarity matching objectives. What explains the success of similarity matching objectives in deriving neural networks with local learning rules? Here, using dimensionality reduction as an example, we introduce several variable substitutions that illuminate the success of similarity matching. We show that the full network objective may be optimized separately for each synapse using local learning rules in both the offline and online settings. We formalize the long-standing intuition of the rivalry between Hebbian and anti-Hebbian rules by formulating a min-max optimization problem. We introduce a novel dimensionality reduction objective using fractional matrix exponents. To illustrate the generality of our approach, we apply it to a novel formulation of dimensionality reduction combined with whitening. We confirm numerically that the networks with learning rules derived from principled objectives perform better than those with heuristic learning rules.

Neural Computation 30, 84–124 (2018) © 2017 Massachusetts Institute of Technology

doi:10.1162/NECO_a_01018


1 Introduction

The human brain generates complex behaviors via the dynamics of electrical activity in a network of approximately 10^11 neurons, each making about 10^4 synaptic connections. As there is no known centralized authority determining which specific connections a neuron makes or specifying the weights of individual synapses, synaptic connections must be established based on local rules. Therefore, a major challenge in neuroscience is to determine local synaptic learning rules that would ensure that the network acts coherently, that is, to guarantee robust network self-organization.

Much work has been devoted to the self-organization of neural networks for solving unsupervised computational tasks using Hebbian and anti-Hebbian learning rules (Földiak, 1989, 1990; Rubner & Tavan, 1989; Rubner & Schulten, 1990; Carlson, 1990; Plumbley, 1993a, 1993b; Leen, 1991; Linsker, 1997). The unsupervised setting is natural in biology because large-scale labeled data sets are typically unavailable. Hebbian and anti-Hebbian learning rules are biologically plausible because they are local: the weight of an (anti-)Hebbian synapse is proportional to the (minus) correlation in activity between the two neurons the synapse connects.

In networks for dimensionality reduction, for example, feedforward connections are updated by Hebbian rules and lateral connections by anti-Hebbian rules (see Figure 1). Hebbian rules attempt to align each neuronal feature vector, whose components are the weights of synapses impinging onto the neuron, with the input space direction of greatest variance. Anti-Hebbian rules mediate competition among neurons, which prevents their feature vectors from aligning in the same direction. A rivalry between the two kinds of rules results in the equilibrium where synaptic weight vectors span the principal subspace of the input covariance matrix—the subspace spanned by the eigenvectors corresponding to the largest eigenvalues.

However, in most existing single-layer networks (see Figure 1), Hebbian and anti-Hebbian learning rules were postulated rather than derived from a principled objective. Such a derivation should yield better-performing rules and a deeper understanding than has been achieved with heuristic rules. But until recently, all derivations of single-layer networks from principled objectives led to biologically implausible nonlocal learning rules, where the weight of a synapse depends on the activities of neurons other than the two the synapse connects.

Recently, single-layer networks with local learning rules have been derived from similarity matching objective functions (Pehlevan, Hu, & Chklovskii, 2015; Pehlevan & Chklovskii, 2014; Hu, Pehlevan, & Chklovskii, 2014). But why do similarity matching objectives lead to neural networks with local, Hebbian, and anti-Hebbian learning rules? A clear answer to this question has been lacking.

Here, we answer this question by performing several illuminating variable transformations. Specifically, we reduce the full network optimization problem to a set of trivial optimization problems for each synapse that can be solved locally. Eliminating neural activity variables leads to a min-max objective in terms of feedforward and lateral synaptic weight matrices. This finally formalizes the long-held intuition about the adversarial relationship of Hebbian and anti-Hebbian learning rules.

Figure 1: Dimensionality reduction neural networks derived by min-max optimization in the online setting. (A) Network with autapses. (B) Network without autapses.

In this article, we make the following contributions. In section 2, we present a more transparent derivation of the previously proposed online similarity matching algorithm for principal subspace projection (PSP). In section 3, we propose a novel objective for PSP combined with spherizing, or whitening, the data, which we name principal subspace whitening (PSW), and derive from it a biologically plausible online algorithm. In sections 2 and 3, we also demonstrate that stability in the offline setting guarantees projection onto the principal subspace and give principled learning rate recommendations. In section 4, by eliminating activity variables from the objectives, we derive min-max formulations of PSP and PSW that lend themselves to game-theoretical interpretations. In section 5, by expressing the optimization objectives in terms of feedforward synaptic weights only, we arrive at novel formulations of dimensionality reduction in terms of fractional powers of matrices. In section 6, we demonstrate numerically that the performance of our online algorithms is superior to that of the heuristic ones.

2 From Similarity Matching to Hebbian/Anti-Hebbian

Networks for PSP

2.1 Derivation of a Mixed PSP from Similarity Matching. The PSP problem is formulated as follows. Given T centered input data samples, x_t ∈ R^n, find T projections, y_t ∈ R^k, onto the principal subspace (k ≤ n)—the subspace spanned by eigenvectors corresponding to the k top eigenvalues of the input covariance matrix:

C ≡ (1/T) Σ_{t=1}^{T} x_t x_tᵀ = (1/T) X Xᵀ,   (2.1)

where we resort to a matrix notation by concatenating input column vectors into X = [x_1, ..., x_T]. Similarly, outputs are Y = [y_1, ..., y_T].

Our goal is to derive a biologically plausible single-layer neural network implementing PSP by optimizing a principled objective. Biological plausibility requires that the learning rules are local, that is, that each synaptic weight update depends only on the activity of the two neurons the synapse connects. The only PSP objective known to yield a single-layer neural network with local learning rules is based on similarity matching (Pehlevan et al., 2015). This objective, borrowed from multidimensional scaling (MDS), minimizes the mismatch between the similarity of inputs and outputs (Mardia, Kent, & Bibby, 1980; Williams, 2001; Cox & Cox, 2000):

PSP:  min_{Y ∈ R^{k×T}} (1/T²) ‖XᵀX − YᵀY‖_F².   (2.2)

Here, similarity is quantified by the inner products between all pairs of inputs (outputs) comprising the Grammians XᵀX (YᵀY).

One can understand intuitively that the objective, equation 2.2, is optimized by the projection onto the principal subspace by considering the following (for a rigorous proof, see Pehlevan & Chklovskii, 2015; Mardia et al., 1980; Cox & Cox, 2000). First, substitute a singular value decomposition (SVD) for the matrices X and Y and note that the mismatch is minimized by matching the right singular vectors of Y to those of X. Then, rotating the Grammians to the diagonal basis reduces the minimization problem to minimizing the mismatch between the corresponding squared singular values. Therefore, Y is given by the top k right singular vectors of X scaled by the corresponding singular values. As the objective 2.2 is invariant to left-multiplication of Y by an orthogonal matrix, it has infinitely many degenerate solutions. One such solution corresponds to principal component analysis (PCA).
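This SVD argument is easy to check numerically. The sketch below (our illustrative NumPy code, not part of the article) verifies that the scaled top-k right singular vectors of X minimize objective 2.2 and that the cost is invariant to orthogonal rotations of Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, k = 10, 500, 3

# Centered data matrix X (n x T)
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)

# Optimal output: top-k right singular vectors scaled by singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y_opt = np.diag(s[:k]) @ Vt[:k]          # k x T

def psp_objective(Y):
    """Similarity-matching cost (1/T^2) ||X'X - Y'Y||_F^2 (eq. 2.2)."""
    D = X.T @ X - Y.T @ Y
    return np.sum(D**2) / T**2

# Any other k-dimensional output (e.g., a random projection) does worse
Y_rand = rng.standard_normal((k, n)) @ X
assert psp_objective(Y_opt) < psp_objective(Y_rand)

# Left-multiplying Y by an orthogonal matrix leaves the cost unchanged
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
assert np.isclose(psp_objective(Y_opt), psp_objective(Q @ Y_opt))
```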

Unlike non-neural-network formulations of PSP or PCA, similarity matching outputs principal components (scores) rather than principal eigenvectors of the input covariance (loadings). This difference in formulation is motivated by our interest in PSP or PCA neural networks (Diamantaras & Kung, 1996) that output principal components, y_t, rather than principal eigenvectors. Principal eigenvectors are not transmitted downstream of the network but can be recovered computationally from the synaptic weight matrices. Although synaptic weights do not enter objective 2.2 (Pehlevan et al., 2015), they arise naturally in the derivation of the online algorithm (see below) and store correlations between input and output neural activities.

Next, we derive the min-max PSP objective from equation 2.2, starting by expanding the square of the Frobenius norm:

arg min_{Y ∈ R^{k×T}} (1/T²) ‖XᵀX − YᵀY‖_F²
    = arg min_{Y ∈ R^{k×T}} (1/T²) Tr(−2 XᵀX YᵀY + YᵀY YᵀY).   (2.3)

We can rewrite equation 2.3 by introducing two new dynamical variable matrices in place of the covariance matrices (1/T) Y Xᵀ and (1/T) Y Yᵀ:

min_{Y ∈ R^{k×T}} min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} L_PSP(W, M, Y),   where   (2.4)

L_PSP(W, M, Y) ≡ Tr(−(4/T) XᵀWᵀY + (2/T) YᵀMY) + 2 Tr(WᵀW) − Tr(MᵀM).   (2.5)

To see that equation 2.5 is equivalent to equation 2.3, find the optimal W* = (1/T) Y Xᵀ and M* = (1/T) Y Yᵀ by setting the corresponding derivatives of objective 2.5 to zero. Then substitute W* and M* into equation 2.5 to obtain equation 2.3.
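The equivalence can be verified numerically: plugging the optimal W* and M* into equation 2.5 reproduces the expanded objective of equation 2.3. A minimal NumPy check (ours, illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, k = 10, 200, 3
X = rng.standard_normal((n, T))
Y = rng.standard_normal((k, T))

# Optimal internal variables from setting dL/dW = dL/dM = 0
W_star = (Y @ X.T) / T
M_star = (Y @ Y.T) / T

# L_PSP (eq. 2.5) evaluated at W*, M*
L = np.trace(-4/T * X.T @ W_star.T @ Y + 2/T * Y.T @ M_star @ Y) \
    + 2 * np.trace(W_star.T @ W_star) - np.trace(M_star.T @ M_star)

# Expanded similarity-matching objective (eq. 2.3)
L_expanded = np.trace(-2 * (X.T @ X) @ (Y.T @ Y) + (Y.T @ Y) @ (Y.T @ Y)) / T**2

assert np.isclose(L, L_expanded)
```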

Finally, we exchange the order of minimization with respect to Y and W, as well as the order of minimization with respect to Y and maximization with respect to M in equation 2.5. The last exchange is justified by the saddle point property (see proposition 1 in appendix A). Then we arrive at the following min-max optimization problem,

min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} min_{Y ∈ R^{k×T}} L_PSP(W, M, Y),   (2.6)

where L_PSP(W, M, Y) is defined in equation 2.5. We call this a mixed objective because it includes both the output variables, Y, and the covariances, W and M.

2.2 Offline PSP Algorithm. In this section, we present an offline optimization algorithm to solve the PSP problem and analyze fixed points of the corresponding dynamics. These results will be used in the next section for the biologically plausible online algorithm implemented by neural networks.

In the offline setting, we can solve equation 2.6 by the alternating optimization approach used commonly in the neural networks literature (Olshausen & Field, 1996, 1997; Arora, Ge, Ma, & Moitra, 2015). We first minimize with respect to Y while keeping W and M fixed,

Y* = arg min_{Y ∈ R^{k×T}} L_PSP(W, M, Y),   (2.7)

and, second, make a gradient descent-ascent step with respect to W and M while keeping Y fixed:

W ← W − (η/2) ∂L_PSP(W, M, Y*)/∂W,
M ← M + (η/(2τ)) ∂L_PSP(W, M, Y*)/∂M,   (2.8)

where η/2 is the W learning rate and τ > 0 is the ratio of the learning rates for W and M. In appendix C, we analyze how τ affects the linear stability of the fixed point dynamics. These two phases are iterated until convergence (see algorithm 1).¹

The optimal Y in equation 2.9 exists because M stays positive definite if initialized as such.
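The two phases above can be sketched in a few lines (our illustrative NumPy implementation; learning rates, data, and variable names are our choices, not the article's reference code):

```python
import numpy as np

def offline_psp(X, k, eta=0.05, tau=0.5, n_iter=2000, seed=0):
    """Offline PSP sketch: alternate exact minimization over Y with an
    Euler step of the gradient descent-ascent dynamics."""
    n, T = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)                               # positive-definite initialization
    for _ in range(n_iter):
        Y = np.linalg.solve(M, W @ X)           # argmin_Y L_PSP = M^{-1} W X
        W += 2 * eta * ((Y @ X.T) / T - W)      # descent in W
        M += (eta / tau) * ((Y @ Y.T) / T - M)  # ascent in M
    return W, M

# The converged neural filters F = M^{-1} W span the principal subspace:
# F'F approaches the projector UU' onto the top-k eigenspace of C.
rng = np.random.default_rng(3)
X = rng.standard_normal((10, 400)) * np.linspace(2.0, 0.2, 10)[:, None]
X -= X.mean(axis=1, keepdims=True)
W, M = offline_psp(X, k=3)
F = np.linalg.solve(M, W)
U = np.linalg.svd(X)[0][:, :3]
assert np.linalg.norm(F.T @ F - U @ U.T) < 0.1
```

Here `np.linalg.solve` implements Y* = M⁻¹WX without forming the inverse explicitly.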

2.3 Linearly Stable Fixed Points of Algorithm 1 Correspond to the PSP. Here we demonstrate that convergence of algorithm 1 to fixed W and M implies that Y is a PSP of X. To this end, we approximate the gradient descent-ascent dynamics in the limit of small learning rate with the system of differential equations:

Y(t) = M(t)⁻¹ W(t) X,
dW(t)/dt = (2/T) Y(t) Xᵀ − 2 W(t),
τ dM(t)/dt = (1/T) Y(t) Y(t)ᵀ − M(t),   (2.11)

where t is now the time index for the gradient descent-ascent dynamics.

¹This alternating optimization is identical to a gradient descent-ascent (see proposition 2 in appendix B) in W and M on the objective l_PSP(W, M) ≡ min_{Y ∈ R^{k×T}} L_PSP(W, M, Y).

To state our main result in theorem 1, we define the filter matrix F(t), whose rows are neural filters,

F(t) := M(t)⁻¹ W(t),   (2.12)

so that, according to equation 2.9,

Y(t) = F(t) X.   (2.13)

Theorem 1. Fixed points of the dynamical system 2.11 have the following properties:

1. The neural filters, F, are orthonormal, that is, F Fᵀ = I.
2. The neural filters span a k-dimensional subspace in R^n spanned by some k eigenvectors of the input covariance matrix.
3. Stability of a fixed point requires that the neural filters span the principal subspace of X.
4. Suppose the neural filters span the principal subspace. Define

γ_ij := 2 + (σ_i − σ_j)²/(σ_i σ_j),   (2.14)

where i = 1, ..., k, j = 1, ..., k, and {σ_1, ..., σ_k} are the top k principal eigenvalues of C. We assume σ_k ≠ σ_{k+1}. This fixed point is linearly stable if and only if

τ < 1/(2 − 4/γ_ij)   (2.15)

for all (i, j) pairs. By linearly stable, we mean that linear perturbations of W and M converge to a configuration in which the new neural filters are merely rotations within the principal subspace of the original neural filters.

Proof. See appendix C.

Based on theorem 1, we claim that provided the dynamics converges to a fixed point, algorithm 1 has found a PSP of the input data. Note that the orthonormality of the neural filters is desired and consistent with PSP since, in this approach, the outputs, Y, are interpreted as coordinates with respect to a basis spanning the principal subspace.

Theorem 1 yields a practical recommendation for choosing learning rate parameters in simulations. In a typical situation, one will not know the eigenvalues of the covariance matrix a priori but can rely on the fact that γ_ij ≥ 2. Then equation 2.15 implies that for τ ≤ 1/2, the principal subspace is linearly stable, leading to numerical convergence and stability.

2.4 Online Neural Min-Max Optimization Algorithms. Unlike the offline setting considered so far, where all the input data are available from the outset, in the online setting, input data are streamed to the algorithm sequentially, one at a time. The algorithm must compute the corresponding output before the next input arrives and transmit it downstream. Once transmitted, the output cannot be altered. Moreover, the algorithm cannot store in memory any sizable fraction of past inputs or outputs but only a few, O(nk), state variables.

Whereas developing algorithms for the online setting is more challenging than for the offline one, it is necessary for both data analysis and modeling biological neural networks. The size of modern data sets may exceed that of available RAM, or the output must be computed before the data set is fully streamed. Biological neural networks operating on the data streamed by the sensory organs are incapable of storing any significant fraction of it and compute the output on the fly.

Pehlevan et al. (2015) gave a derivation of a neural online algorithm for PSP, starting from the original similarity matching cost function, equation 2.2. Here, instead, we start from the min-max form of similarity matching, equation 2.6, and end up with a class of algorithms that reduce to the algorithm of Pehlevan et al. (2015) for special choices of learning rates. Our main contribution, however, is that the current derivation is much more intuitive and simpler, with insights into why similarity matching leads to local learning rules.

We start by rewriting the min-max PSP objective, equation 2.6, as a sum of time-separable terms that can be optimized independently:

min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} (1/T) Σ_{t=1}^{T} l_PSP,t(W, M),   (2.16)

where

l_PSP,t(W, M) ≡ 2 Tr(WᵀW) − Tr(MᵀM) + min_{y_t ∈ R^{k×1}} l_t(W, M, y_t)   (2.17)

and

l_t(W, M, y_t) = −4 x_tᵀ Wᵀ y_t + 2 y_tᵀ M y_t.   (2.18)

This separation in time is a benefit of the min-max PSP objective, equation 2.6, and leads to a natural way of deriving an online algorithm that was not available for the original similarity matching cost function, equation 2.2.

To solve the optimization problem, equation 2.16, in the online setting, we optimize each l_PSP,t sequentially. For each t, first, minimize equation 2.18 with respect to y_t while keeping W_t and M_t fixed. Second, make a gradient descent-ascent step with respect to W_t and M_t for fixed y_t:

W_{t+1} = W_t − (η_t/2) ∂l_PSP,t(W_t, M_t)/∂W_t,
M_{t+1} = M_t + (η_t/(2τ)) ∂l_PSP,t(W_t, M_t)/∂M_t,   (2.19)

where 0 < η_t/2 < 1 is the W learning rate and τ > 0 is the ratio of the W and M learning rates. As before, proposition 2 (see appendix B) ensures that the alternating optimization (Olshausen & Field, 1996, 1997; Arora et al., 2015) of l_PSP,t follows from gradient descent-ascent.

Algorithm 2 can be implemented by a biologically plausible neural network. The dynamics, equation 2.20, corresponds to neural activity in a recurrent circuit, where W_t is the feedforward synaptic weight matrix and −M_t is the lateral synaptic weight matrix (see Figure 1A). Since M_t is always positive definite, equation 2.18 is a Lyapunov function for the neural activity. Hence the dynamics is guaranteed to converge to a unique fixed point, y_t = M_t⁻¹ W_t x_t, where the matrix inversion is computed iteratively in a distributed manner.

Updates of the covariance matrices, equation 2.21, can be interpreted as synaptic learning rules: Hebbian for feedforward and anti-Hebbian (due to the “−” sign in equation 2.20) for lateral synaptic weights. Importantly, these rules are local—the weight of each synapse depends only on the activity of the pair of neurons that synapse connects—and therefore biologically plausible.
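A minimal sketch of one step of the resulting online algorithm (our illustrative NumPy code; the recurrent loop plays the role of the neural dynamics, equation 2.20, and the parameter choices are ours):

```python
import numpy as np

def online_psp_step(x, W, M, eta=0.01, tau=0.5, n_dyn=100, dt=0.1):
    """One online PSP step (our sketch). x: input (n,);
    W: feedforward weights (k, n); M: lateral weights (k, k), positive definite."""
    # Recurrent neural dynamics: y relaxes to the fixed point M^{-1} W x,
    # so the matrix inversion is computed iteratively and in a distributed way.
    y = np.zeros(M.shape[0])
    for _ in range(n_dyn):
        y += dt * (W @ x - M @ y)
    # Local synaptic updates: Hebbian feedforward, anti-Hebbian lateral.
    W += 2 * eta * (np.outer(y, x) - W)
    M += (eta / tau) * (np.outer(y, y) - M)
    return y, W, M
```

Iterating `online_psp_step` over a data stream with a decaying learning rate drives the filters F = M⁻¹W toward the principal subspace; note that both updates use only the pre- and postsynaptic activities x and y, that is, they are local.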

Even requiring full optimization with respect to y_t versus a gradient step with respect to W_t and M_t may have a biological justification. As neural activity dynamics is typically faster than synaptic plasticity, it may settle before the arrival of the next input.

To see why similarity matching leads to local learning rules, let us consider equations 2.6 and 2.16. Aside from separating in time, which is useful for the derivation of online learning rules, L_PSP(W, M, Y) also separates in synaptic weights and their pre- and postsynaptic neural activities:

L_PSP(W, M, Y) = (1/T) Σ_t [ Σ_ij (2 W_ij² − 4 W_ij x_{t,j} y_{t,i}) − Σ_ij (M_ij² − 2 M_ij y_{t,j} y_{t,i}) ].   (2.22)

Therefore, the derivative with respect to a synaptic weight depends only on quantities accessible to that synapse.

Finally, we address two potential criticisms of the neural PSP algorithm. First is the existence of autapses (i.e., self-coupling of neurons) in our network, manifested in the nonzero diagonal of the lateral connectivity matrix, M (see Figure 1A). Whereas autapses are encountered in the brain, they are rarely seen in principal neurons (Ikeda & Bekkers, 2006). Second is the symmetry of the lateral synaptic weights in our network, which is not observed experimentally.

We derive an autapse-free network architecture (zeros on the diagonal of the lateral synaptic weight matrix M_t) with asymmetric lateral connectivity, Figure 1B, by using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage, equation 2.20 (see appendix F). The resulting algorithm produces the same outputs as the current algorithm and, for the special case τ = 1/2 and η_t = η/2, reduces to the algorithm with “forgetting” of Pehlevan et al. (2015).

3 From Constrained Similarity Matching to Hebbian/Anti-Hebbian Networks for PSW

The variable substitution method we introduced in the previous section can be applied to other computational objectives in order to derive neural networks with local learning rules. To give an example, we derive a neural network for PSW, which can be formulated as a constrained similarity matching problem. This example also illustrates how an optimization constraint can be implemented by biological mechanisms.

3.1 Derivation of PSW from Constrained Similarity Matching. The PSW problem is closely related to PSP: project centered input data samples onto the principal subspace (k ≤ n), and “spherize” the data in the subspace so that the variances in all directions are 1. To derive a neural PSW algorithm, we use the similarity matching objective with an additional constraint:

PSW:  min_{Y ∈ R^{k×T}} (1/T²) ‖XᵀX − YᵀY‖_F²,   s.t.   (1/T) Y Yᵀ = I.   (3.1)

We rewrite equation 3.1 by expanding the squared Frobenius norm and dropping the Tr(YᵀY YᵀY) term, which is constant under the constraint, thus reducing equation 3.1 to a constrained similarity alignment problem:

min_{Y ∈ R^{k×T}} −(1/T²) Tr(XᵀX YᵀY),   s.t.   (1/T) Y Yᵀ = I.   (3.2)

To see that objective 3.2 is optimized by the PSW, first substitute a singular value decomposition (SVD) for the matrices X and Y and note that the alignment is maximized by matching the right singular vectors of Y to those of X and rotating to the diagonal basis (for a rigorous proof, see Pehlevan & Chklovskii, 2015). Since, under the constraint, the squared singular values of Y are fixed, the objective, equation 3.2, is reduced to a summation of k squared singular values of X and is optimized by choosing the top k. Then Y is given by the top k right singular vectors of X scaled by √T. As before, objective 3.2 is invariant to left-multiplication of Y by an orthogonal matrix and therefore has infinitely many degenerate solutions.
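This characterization of the optimum can be checked numerically (our illustrative NumPy sketch, not from the article): the whitening constraint holds exactly for Y = √T × (top k right singular vectors of X), and objective 3.2 then equals minus the sum of the top k eigenvalues of C:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, k = 8, 300, 2
X = rng.standard_normal((n, T)) * np.arange(n, 0, -1)[:, None]
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = np.sqrt(T) * Vt[:k]                      # PSW output: whitened projections

# Whitening constraint holds: (1/T) Y Y' = I
assert np.allclose(Y @ Y.T / T, np.eye(k))

# Objective 3.2 equals minus the sum of the top-k eigenvalues of C
C = X @ X.T / T
val = -np.trace(X.T @ X @ Y.T @ Y) / T**2
assert np.isclose(val, -np.linalg.eigvalsh(C)[-k:].sum())
```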

Next, we derive a mixed PSW objective from equation 3.2 by introducing two new dynamical variable matrices: the input-output correlation matrix, W = (1/T) Y Xᵀ, and the Lagrange multiplier matrix, M, for the whitening constraint:

min_{Y ∈ R^{k×T}} min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} L_PSW(W, M, Y),   (3.3)

where

L_PSW(W, M, Y) ≡ −(2/T) Tr(XᵀWᵀY) + Tr(WᵀW) + Tr(M((1/T) Y Yᵀ − I)).   (3.4)

To see that equation 3.4 is equivalent to equation 3.2, find the optimal W* = (1/T) Y Xᵀ by setting the corresponding derivatives of objective 3.4 to zero. Then substitute W* into equation 3.4 to obtain the Lagrangian of equation 3.2.

Finally, we exchange the order of minimization with respect to Y and W, as well as the order of minimization with respect to Y and maximization with respect to M in equation 3.4 (see proposition 5 in appendix D for a proof). Then we arrive at the following min-max optimization problem with a mixed objective:

min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} min_{Y ∈ R^{k×T}} L_PSW(W, M, Y),   (3.5)

where L_PSW(W, M, Y) is defined in equation 3.4.

3.2 Offline PSW Algorithm. Next, we give an offline algorithm for the PSW problem, using the alternating optimization procedure as before. We solve equation 3.5 by, first, optimizing with respect to Y for fixed W and M and, second, making a gradient descent-ascent step with respect to W and M while keeping Y fixed.² We arrive at algorithm 3.

Convergence of algorithm 3 requires the input covariance matrix, C, to have at least k nonzero eigenvalues. Otherwise, a consistent solution cannot be found because update 3.7 forces Y to be full rank while equation 3.6 lowers its rank.
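The offline PSW iteration can be sketched analogously to the offline PSP case; the only change is in the M update, which now drives (1/T)YYᵀ toward the identity (illustrative NumPy code; parameters and data are our choices, not the article's):

```python
import numpy as np

def offline_psw(X, k, eta=0.02, tau=0.5, n_iter=4000, seed=0):
    """Offline PSW sketch: same alternating scheme as offline PSP,
    but M is a Lagrange multiplier enforcing whitening."""
    n, T = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)
    for _ in range(n_iter):
        Y = np.linalg.solve(M, W @ X)            # argmin_Y L_PSW = M^{-1} W X
        W += 2 * eta * ((Y @ X.T) / T - W)       # Euler step of the W dynamics
        M += (eta / tau) * ((Y @ Y.T) / T - np.eye(k))  # whitening constraint
    return W, M

# At convergence the output is whitened: (1/T) Y Y' ~ I
rng = np.random.default_rng(5)
X = rng.standard_normal((6, 500)) * np.array([2.0, 1.6, 1.2, 1.0, 0.8, 0.6])[:, None]
X -= X.mean(axis=1, keepdims=True)
W, M = offline_psw(X, k=2)
Y = np.linalg.solve(M, W @ X)
assert np.linalg.norm(Y @ Y.T / X.shape[1] - np.eye(2)) < 0.1
```

Only the M update differs from the PSP version: the identity matrix replaces M as the target's decay term.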

3.3 Linearly Stable Fixed Points of Algorithm 3 Correspond to PSW. Here we claim that convergence of algorithm 3 to fixed W and M implies PSW of X. In the limit of small learning rate, the gradient descent-ascent dynamics can be approximated with the system of differential equations:

Y(t) = M(t)⁻¹ W(t) X,
dW(t)/dt = (2/T) Y(t) Xᵀ − 2 W(t),
τ dM(t)/dt = (1/T) Y(t) Y(t)ᵀ − I,   (3.8)

where t is now the time index for the gradient descent-ascent dynamics. We again define the neural filter matrix F = M⁻¹W.

²This alternating optimization is identical to a gradient descent-ascent (see proposition 2 in appendix B) in W and M on the objective l_PSW(W, M) ≡ min_{Y ∈ R^{k×T}} L_PSW(W, M, Y).

Theorem 2. Fixed points of the dynamical system, equation 3.8, have the following properties:

1. The outputs are whitened: (1/T) Y Yᵀ = I.
2. The neural filters span a k-dimensional subspace in R^n spanned by some k eigenvectors of the input covariance matrix.
3. Stability of the fixed point requires that the neural filters span the principal subspace of X.
4. Suppose the neural filters span the principal subspace. This fixed point is linearly stable if and only if

τ < (σ_i + σ_j)/(2 (σ_i − σ_j)²)   (3.9)

for all (i, j) pairs with i ≠ j. By linear stability, we mean that linear perturbations of W and M converge to a rotation of the original neural filters within the principal subspace.

Proof. See appendix E.

Based on theorem 2, we claim that, provided algorithm 3 converges, the fixed point corresponds to a PSW of the input data. Unlike the PSP case, the neural filters are not orthonormal.

3.4 Online Algorithm for PSW. As before, we start by rewriting the min-max PSW objective, equation 3.5, as a sum of time-separable terms that can be optimized independently:

min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} (1/T) Σ_{t=1}^{T} l_PSW,t(W, M),   (3.10)

where

l_PSW,t(W, M) ≡ Tr(WᵀW) − Tr(M) + (1/2) min_{y_t ∈ R^{k×1}} l_t(W, M, y_t),   (3.11)

and l_t(W, M, y_t) is defined in equation 2.18.

In the online setting, equation 3.10 can be optimized by sequentially minimizing each l_PSW,t. For each t, first, minimize equation 2.18 with respect to y_t for fixed W_t and M_t; second, update W_t and M_t according to a gradient descent-ascent step for fixed y_t:

W_{t+1} = W_t − η_t ∂l_PSW,t(W_t, M_t)/∂W_t,
M_{t+1} = M_t + (η_t/τ) ∂l_PSW,t(W_t, M_t)/∂M_t,   (3.12)

where 0 < η_t < 1 is the W learning rate and τ > 0 is the ratio of the W and M learning rates. As before, proposition 2 (see appendix B) ensures that the alternating optimization (Olshausen & Field, 1996, 1997; Arora et al., 2015) of l_PSW,t follows from gradient descent-ascent.

Algorithm 4 can be implemented by a biologically plausible single-layer neural network with lateral connections, as in algorithm 2 (Figure 1A). The updates to synaptic weights, equation 3.14, are local Hebbian/anti-Hebbian plasticity rules. An autapse-free network architecture, Figure 1B, may be obtained using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage, equation 3.13 (see appendix G).

The lateral connection weights are the Lagrange multipliers introduced in the offline problem, equation 3.4. In the PSP network, by contrast, they resulted from a variable transformation of the output covariance matrix. This difference carries over to the learning rules: in algorithm 4, the lateral learning rule enforces whitening of the output, whereas in algorithm 2, the lateral learning rule sets the lateral weight matrix to the output covariance matrix.
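One step of the resulting online algorithm differs from the PSP step only in the lateral update (our illustrative NumPy sketch; parameter choices are ours):

```python
import numpy as np

def online_psw_step(x, W, M, eta=0.01, tau=0.5, n_dyn=100, dt=0.1):
    """One online PSW step (our sketch). Identical to the online PSP step
    except for the lateral update: M is a Lagrange multiplier, so its
    update pushes the output covariance toward the identity."""
    k = M.shape[0]
    y = np.zeros(k)
    for _ in range(n_dyn):                       # recurrent dynamics -> M^{-1} W x
        y += dt * (W @ x - M @ y)
    W += 2 * eta * (np.outer(y, x) - W)          # Hebbian feedforward rule
    M += (eta / tau) * (np.outer(y, y) - np.eye(k))  # whitening-enforcing lateral rule
    return y, W, M
```

The identity matrix replaces the output-covariance target in the lateral rule, reflecting that M here enforces whitening rather than tracking a transformed output covariance.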


4 Game-Theoretical Interpretation of Hebbian/Anti-Hebbian

Learning

In the original similarity matching objective, equation 2.2, the only variables are neuronal activities, which, at the optimum, represent principal components. In section 2, we rewrote this objective by introducing matrices W and M corresponding to synaptic connection weights, equation 2.5. Here, we eliminate the neural activity variables altogether and arrive at a min-max formulation in terms of the feedforward, W, and lateral, M, connection weight matrices only. This formulation lends itself to a game-theoretical interpretation.

Since in the offline PSP setting the optimal M* in equation 2.6 is an invertible matrix (because M* = (1/T) Y* Y*ᵀ; see also appendix A), we can restrict our optimization to invertible matrices M only. Then we can optimize the objective, equation 2.5, with respect to Y and substitute its optimal value Y* = M⁻¹WX into equations 2.5 and 2.6 to obtain

min_{W ∈ R^{k×n}} max_{M ∈ R^{k×k}} −(2/T) Tr(XᵀWᵀM⁻¹WX) + 2 Tr(WᵀW) − Tr(MᵀM),
s.t. M is invertible.   (4.1)

This min-max objective admits a game-theoretical interpretation in which the feedforward, W, and lateral, M, synaptic weight matrices oppose each other. To reduce the objective, the feedforward synaptic weight vectors of each output neuron attempt to align with the direction of maximum variance of the input data. However, if this were the only driving force, then all output neurons would learn the same synaptic weight vectors and represent the same top principal component. At the same time, linear dependency between different feedforward synaptic weight vectors can be exploited by the lateral synaptic weights to increase the objective by canceling the contributions of different components. To avoid this, the feedforward synaptic weight vectors become linearly independent and span the principal subspace.

A similar interpretation can be given for PSW, where the feedforward, W, and lateral, M, synaptic weight matrices oppose each other adversarially.

5 Novel Formulations of Dimensionality Reduction Using Fractional Exponents

In this section, we point to a new class of dimensionality reduction objective functions that naturally follow from the min-max objectives 2.5 and 2.6. Eliminating both the neural activity variables, Y, and the lateral connection weight matrix, M, we arrive at optimization problems in terms of the feedforward weight matrix, W, only. The rows of the optimal W form a nonorthogonal basis of the principal subspace. Such formulations of principal subspace problems involve fractional exponents of matrices and, to the best of our knowledge, have not been proposed previously.

By replacing the max_M min_Y optimization in the min-max PSP objective, equation 2.6, by its saddle point value (see proposition 1 in appendix A), we find the following objective expressed solely in terms of W:

min_{W ∈ R^{k×n}} Tr(−(3/T^{2/3}) (W X Xᵀ Wᵀ)^{2/3} + 2 W Wᵀ).   (5.1)

The rows of the optimal W are not principal eigenvectors; rather, the row space of W spans the principal subspace.
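The elimination of M behind this objective can be verified numerically: for A = (1/T)WXXᵀWᵀ, the M-dependent part of equation 4.1 is concave in positive-definite M, is maximized at M = A^{1/3}, and attains the value −3 Tr(A^{2/3}). A small NumPy check (ours, illustrative):

```python
import numpy as np

def mat_pow(S, p):
    """Fractional power of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T

rng = np.random.default_rng(4)
n, T, k = 6, 100, 3
X = rng.standard_normal((n, T))
W = rng.standard_normal((k, n))
A = W @ X @ X.T @ W.T / T                  # (1/T) W X X' W', positive definite

def inner(M):
    """M-dependent part of eq. 4.1: -2 Tr(A M^{-1}) - Tr(M M)."""
    return -2 * np.trace(A @ np.linalg.inv(M)) - np.trace(M @ M)

M_opt = mat_pow(A, 1/3)                    # stationary point: M^3 = A
val = -3 * np.trace(mat_pow(A, 2/3))       # saddle value appearing in eq. 5.1

assert np.isclose(inner(M_opt), val)
# inner(.) is concave on positive-definite M, so M_opt is the maximizer
P = rng.standard_normal((k, k)); P = (P + P.T) / 2
assert inner(M_opt + 0.01 * P) <= inner(M_opt) + 1e-9
```

Replacing the exponent 1/3 with 1/2 (and adjusting the prefactors accordingly) gives the analogous check for the PSW objective, equation 5.2.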

Similarly, replacing the max_M min_Y optimization in the min-max PSW objective, equation 3.5, by its optimal value (see proposition 5 in appendix D) yields

min_{W ∈ R^{k×n}} Tr(−(2/T^{1/2}) (W X Xᵀ Wᵀ)^{1/2} + W Wᵀ).   (5.2)

As before, the rows of the optimal W are not principal eigenvectors; rather, the row space of W spans the principal eigenspace.

We observe that the only material difference between equations 5.1 and 5.2 is the value of the fractional exponent. Based on this, we conjecture that any objective function of this form with a fractional exponent from a continuous range is optimized by a W spanning the principal subspace. Such solutions would differ in the eigenvalues associated with the corresponding components.

A supporting argument for our conjecture comes from the work of Miao and Hua (1998), who studied the cost

min_{W ∈ R^{k×n}} Tr(−log(W X Xᵀ Wᵀ) + W Wᵀ).   (5.3)

Equation 5.3 can be seen as a limiting case of our conjecture in which the fractional exponent goes to zero. Indeed, Miao and Hua (1998) proved that the rows of the optimal W are an orthonormal basis for the principal eigenspace.

6 Numerical Experiments

Next, we test our findings using a simple artificial data set. We generated an n = 10 dimensional data set and simulated our offline and online algorithms to reduce this data set to k = 3 dimensions, using different values of the parameter τ. The results are plotted in Figures 2, 3, 4, and 5, along with details of the simulations in the figures' captions.

Consistent with theorems 1 and 2, small perturbations to PSP and PSW fixed points decayed (solid lines) or grew (dashed lines) depending on the value of τ (see Figure 2A). Offline simulations that started from random initial conditions converged to the PSP (or the PSW) solution if the fixed point was linearly stable (see Figure 2B). Interestingly, the online algorithms' performances were very close to those of the offline algorithms (see Figure 2C).

Figure 2: Demonstration of the stability of the PSP (top row) and PSW (bottom row) algorithms. We constructed an n = 10 by T = 2000 data matrix X from its SVD, where the left and right singular vectors are chosen randomly; the top three singular values are set to $\{\sqrt{3T},\sqrt{2T},\sqrt{T}\}$; and the rest of the singular values are chosen uniformly in $[0, 0.1\sqrt{T}]$. Learning rates were $\eta_t = 1/(10^3+t)$. Errors were defined using the deviation of the neural filters from their optimal values (Pehlevan et al., 2015). Let U be the 10 × 3 matrix whose columns are the top three left singular vectors of X. PSP error: $\|F(t)^{\top}F(t)-UU^{\top}\|_F$; PSW error: $\|F(t)^{\top}F(t)-USU^{\top}\|_F$, with S = diag([1/3, 1/2, 1]) in MATLAB notation. Solid (dashed) lines indicate linearly stable (unstable) choices of τ. (A) Small perturbations to the fixed point. W and M matrices were initialized by adding a random gaussian variable, $N(0,10^{-6})$, elementwise to their fixed-point values. (B) Offline algorithm, initialized with random W and M matrices. (C) Online algorithm, initialized with the same initial condition as in panel B. A random column of X is processed at each time step.
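The data matrix and the two error metrics can be generated directly from the recipe in the Figure 2 caption; a sketch (the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, T = 10, 3, 2000

# X built from its SVD: random orthonormal singular vectors, top three
# singular values {sqrt(3T), sqrt(2T), sqrt(T)}, rest uniform in [0, 0.1*sqrt(T)].
U_full, _ = np.linalg.qr(rng.standard_normal((n, n)))
V_full, _ = np.linalg.qr(rng.standard_normal((T, n)))
s = np.concatenate([np.sqrt([3 * T, 2 * T, T]),
                    rng.uniform(0, 0.1 * np.sqrt(T), n - k)])
X = U_full @ np.diag(s) @ V_full.T

U = U_full[:, :k]   # top three left singular vectors of X

def psp_error(F):
    return np.linalg.norm(F.T @ F - U @ U.T)

def psw_error(F, S=np.diag([1 / 3, 1 / 2, 1])):
    return np.linalg.norm(F.T @ F - U @ S @ U.T)

# Sanity check: the exact PSP and PSW solutions have zero error.
err_psp = psp_error(U.T)
err_psw = psw_error(np.diag(1 / np.sqrt([3.0, 2.0, 1.0])) @ U.T)
```

The exact PSW solution rescales each principal filter by one over the square root of the corresponding eigenvalue of $C = XX^{\top}/T$ (approximately 3, 2, 1 here), which is why its Gram matrix matches $U\,\mathrm{diag}([1/3,1/2,1])\,U^{\top}$.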

The error for linearly unstable simulations in Figure 2 saturates rather than blowing up. This may seem at odds with theorems 1 and 2, which state that if there is a stable fixed point of the dynamics, it should be the PSP/PSW solution. A closer look resolves this paradox. In Figure 3, we plot the evolution of an element of the M matrix in the offline algorithms for stable and unstable choices of τ. When the principal subspace is linearly unstable, the synaptic weights exhibit undamped oscillations. The dynamics seems to be confined to a manifold at a fixed distance (in terms of the error metric) from the principal subspace. That the error does not grow to infinity is a result of the stabilizing effect of the min-max antagonism of the synaptic weights. Online algorithms behave similarly.

Figure 3: Evolution of a synaptic weight. The same data set was used as in Figure 2. $\eta = 10^{-3}$.

Figure 4: Effect of τ on performance. Errors after $2\times 10^4$ gradient steps are plotted as a function of different choices of τ. The same data set was used as in Figure 2, with the same network initialization and learning rates. Both curves start from τ = 0.01 and go to the maximum value allowed for linear stability.

Figure 5: Comparison of the online PSP algorithm with the subspace network (Oja, 1989) and the GHA (Sanger, 1989). The data set and the error metric are as in Figure 2. For fairness of comparison, the learning rates in all networks were set to $\eta = 10^{-3}$, and τ = 1/2 for the online PSP algorithm. Feedforward connectivity matrices were initialized randomly. For the online PSP algorithm, the lateral connectivity matrix was initialized to the identity matrix. Curves show averages over 10 trials.

Next, we studied in detail the effect of the τ parameter on convergence (see Figure 4). For the offline algorithms, we plot the error after a fixed number of gradient steps as a function of τ. For PSP, there is an optimal τ. Decreasing τ below the optimal value does not lead to a degradation in performance; however, increasing it leads to a rapid increase in the error. For PSW, there is a plateau of low error for low values of τ but a rapid increase as one approaches the linear instability threshold. Online algorithms behave similarly.

Finally, we compared the performance of our online PSP algorithm to neural PSP algorithms with heuristic learning rules, namely the subspace network (Oja, 1989) and the generalized Hebbian algorithm (GHA) (Sanger, 1989), on the same data set. We found that our algorithm converges much faster (see Figure 5). Previously, the original similarity matching network (Pehlevan et al., 2015), a special case of the online PSP algorithm of this article, was shown to converge faster than the APEX (Kung, Diamantaras, & Taur, 1994) and Földiak's (1989) networks.
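The subspace network baseline is simple to reproduce in outline. Oja's (1989) subspace rule updates $W \leftarrow W + \eta\,(y x^{\top} - y y^{\top} W)$ with $y = Wx$; the sketch below (not the simulation code used for Figure 5) runs it on a Figure 2-style data set and measures the subspace error:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, T = 10, 3, 2000

# Data set in the style of Figure 2.
U_full, _ = np.linalg.qr(rng.standard_normal((n, n)))
V_full, _ = np.linalg.qr(rng.standard_normal((T, n)))
s = np.concatenate([np.sqrt([3 * T, 2 * T, T]),
                    rng.uniform(0, 0.1 * np.sqrt(T), n - k)])
X = U_full @ np.diag(s) @ V_full.T
U = U_full[:, :k]

def subspace_error(W):
    # Distance between the row space of W and the principal subspace.
    Q, _ = np.linalg.qr(W.T)
    return np.linalg.norm(Q @ Q.T - U @ U.T)

# Oja's (1989) subspace rule: W <- W + eta (y x^T - y y^T W), y = W x.
eta = 1e-3
W = 0.1 * rng.standard_normal((k, n))
err_start = subspace_error(W)
for _ in range(10 * T):
    x = X[:, rng.integers(T)]
    y = W @ x
    W += eta * (np.outer(y, x) - np.outer(y, y) @ W)
err_end = subspace_error(W)
```

With well-separated top eigenvalues, the row space of W drifts toward the principal subspace, so `err_end` is far below `err_start`.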

7 Discussion

In this article, through transparent variable substitutions, we demonstrated why biologically plausible neural networks can be derived from similarity matching objectives, mathematically formalized the adversarial relationship between Hebbian feedforward and anti-Hebbian lateral connections using min-max optimization, and formulated dimensionality reduction tasks as optimizations of fractional powers of matrices. The formalism we developed should generalize to unsupervised tasks other than dimensionality reduction and could provide a theoretical foundation for both natural and artificial neural networks.

In comparing our networks with biological ones, most importantly, our networks rely only on local learning rules that can be implemented by synaptic plasticity. While Hebbian learning is famously observed in neural circuits (Bliss & Lømo, 1973; Bliss & Gardner-Medwin, 1973), our networks also require anti-Hebbian learning, which can be interpreted as the long-term potentiation of inhibitory postsynaptic potentials. Experimentally, such long-term potentiation can arise from pairing action potentials in inhibitory neurons with subthreshold depolarization of postsynaptic pyramidal neurons (Komatsu, 1994; Maffei, Nataraj, Nelson, & Turrigiano, 2006). However, plasticity in inhibitory synapses does not have to be Hebbian, that is, depend on the correlation between pre- and postsynaptic activity (Kullmann, Moreau, Bakiri, & Nicholson, 2012).

To make progress, we had to make several simplifications, sacrificing biological realism. In particular, we assumed that neuronal activity is a continuous variable that would correspond to membrane depolarization (in graded potential neurons) or firing rate (in spiking neurons). We ignored the nonlinearity of the neuronal input-output function. Such a linear regime could be implemented via a resting state bias (in graded potential neurons) or resting firing rate (in spiking neurons).

The applicability of our networks as models of biological networks can be judged by experimentally testing the following predictions. First, we predict a relationship between the feedforward and lateral synaptic weight matrices that could be tested using modern connectomics data sets. Second, we suggest that the similarity of output activity matches that of the input, which could be tested by neuronal population activity measurements using calcium imaging.

Often the choice of a learning rate is crucial to the learning performance of neural networks. Here, we encountered a nuanced case where the ratio of the feedforward and lateral weight learning rates, τ, affects the learning performance significantly. First, there is a maximum value of this ratio, beyond which the principal subspace solution is linearly unstable. The maximum value depends on the principal eigenvalues, but for PSP, τ ≤ 1/2 is always linearly stable. For PSW, there is not always a safe choice: having the same learning rates for feedforward and lateral weights, τ = 1, may actually be unstable. Second, linear stability is not the only thing that affects performance. In simulations, we observed for PSP that there is an optimal value of τ. For PSW, decreasing τ seems to increase performance until a plateau is reached. This difference between PSP and PSW may be attributed to the different origins of the lateral connectivity. In the PSW algorithms, lateral weights originate from Lagrange multipliers enforcing an optimization constraint. Low τ, meaning higher lateral learning rates, forces the network to satisfy the constraint during the whole evolution of the algorithm.

Based on these observations, we can make practical suggestions for the τ parameter. For PSP, τ = 1/2 seems to be a good choice, which is also preferred from another derivation of an online similarity matching algorithm (Pehlevan et al., 2015). For PSW, the smaller the τ, the better, although one should make sure that the lateral weight learning rate, η/τ, is still sufficiently small.

Appendix A: Proof of Strong Min-Max Property for PSP Objective

Here we show that minimization with respect to Y and maximization with respect to M can be exchanged in equation 2.5. We will make use of the following min-max theorem (Boyd & Vandenberghe, 2004), for which we give a proof for completeness:

Theorem 3. Let $f:\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$. Suppose the saddle-point property holds, that is, $\exists\, a^{*}\in\mathbb{R}^{n},\, b^{*}\in\mathbb{R}^{m}$ such that $\forall\, a\in\mathbb{R}^{n},\, b\in\mathbb{R}^{m}$,
$$f(a^{*},b)\le f(a^{*},b^{*})\le f(a,b^{*}).\tag{A.1}$$
Then
$$\max_{b}\min_{a} f(a,b)=\min_{a}\max_{b} f(a,b)=f(a^{*},b^{*}).\tag{A.2}$$

Proof. $\forall\, c\in\mathbb{R}^{n}$, $\min_{a}\max_{b} f(a,b)\le\max_{b} f(c,b)$, which implies
$$\min_{a}\max_{b} f(a,b)\le\max_{b} f(a^{*},b)=f(a^{*},b^{*})=\min_{a} f(a,b^{*})\le\max_{b}\min_{a} f(a,b).\tag{A.3}$$
Since $\max_{b}\min_{a} f(a,b)\le\min_{a}\max_{b} f(a,b)$ is always true, we get an equality.

Now, we present the main result of this section.

Proposition 1. Define
$$f(Y,M,A):=\operatorname{Tr}\left(-\frac{4}{T}A^{\top}Y+\frac{2}{T}Y^{\top}MY\right)-\operatorname{Tr} M^{2},\tag{A.4}$$
where Y, M, and A are arbitrarily sized, real-valued matrices. f obeys a strong min-max property:
$$\min_{Y}\max_{M} f(Y,M,A)=\max_{M}\min_{Y} f(Y,M,A)=-\frac{3}{T^{2/3}}\operatorname{Tr}\left(\left(AA^{\top}\right)^{2/3}\right).\tag{A.5}$$

Proof. We will show that the saddle-point property holds for equation A.4. Then the result follows from theorem 3.

If the saddle point exists, it is where $\nabla f=0$:
$$M^{*}=\frac{1}{T}Y^{*}Y^{*\top},\qquad M^{*}Y^{*}=A.\tag{A.6}$$
Note that $M^{*}$ is symmetric and positive semidefinite. Multiplying the first equation by $M^{*}$ on the left and on the right, and using the second equation, we arrive at
$$M^{*3}=\frac{1}{T}AA^{\top}.\tag{A.7}$$
Solutions to equation A.6 are not unique because $M^{*}$ may not be invertible, depending on A. However, all solutions give the same value of f:
$$f(Y^{*},M^{*},A)=\operatorname{Tr}\left(-\frac{4}{T}A^{\top}Y^{*}+\frac{2}{T}Y^{*\top}M^{*}Y^{*}\right)-\operatorname{Tr} M^{*2}
=\operatorname{Tr}\left(-\frac{4}{T}Y^{*\top}M^{*}Y^{*}+\frac{2}{T}Y^{*\top}M^{*}Y^{*}\right)-\operatorname{Tr} M^{*2}
=-3\operatorname{Tr} M^{*2}=-\frac{3}{T^{2/3}}\operatorname{Tr}\left(\left(AA^{\top}\right)^{2/3}\right).\tag{A.8}$$
Now, we check that the saddle-point property, equation A.1, holds. The first inequality is satisfied:
$$f(Y^{*},M^{*},A)-f(Y^{*},M,A)=\operatorname{Tr}\left(\frac{2}{T}Y^{*\top}(M^{*}-M)Y^{*}\right)-\operatorname{Tr} M^{*2}+\operatorname{Tr} M^{2}
=-2\operatorname{Tr}(M^{*}M)+\operatorname{Tr} M^{*2}+\operatorname{Tr} M^{2}=\left\|M^{*}-M\right\|_{F}^{2}\ge 0.\tag{A.9}$$
The second inequality is also satisfied:
$$f(Y,M^{*},A)-f(Y^{*},M^{*},A)=\operatorname{Tr}\left(-\frac{4}{T}A^{\top}(Y-Y^{*})+\frac{2}{T}Y^{\top}M^{*}Y-\frac{2}{T}Y^{*\top}M^{*}Y^{*}\right)
=\operatorname{Tr}\left(-\frac{4}{T}Y^{*\top}M^{*}Y+\frac{2}{T}Y^{\top}M^{*}Y+\frac{2}{T}Y^{*\top}M^{*}Y^{*}\right)
=\frac{2}{T}\operatorname{Tr}\left((Y-Y^{*})^{\top}M^{*}(Y-Y^{*})\right)\ge 0,\tag{A.10}$$
where the last line follows from $M^{*}$ being positive semidefinite.

Equations A.9 and A.10 show that the saddle-point property, equation A.1, holds; therefore, max and min can be exchanged, and the value of f at the saddle point is $f(Y^{*},M^{*},A)=-\frac{3}{T^{2/3}}\operatorname{Tr}\left(\left(AA^{\top}\right)^{2/3}\right)$.
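The saddle point of proposition 1 can be verified numerically: for a generic A, build $M^{*}=(AA^{\top}/T)^{1/3}$ and $Y^{*}=M^{*-1}A$, then check equations A.6, A.5, and the two saddle-point inequalities. A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
k, T = 3, 50
A = rng.standard_normal((k, T))

def f(Y, M):
    # f(Y, M, A) of eq. A.4 (M symmetric here, so Tr(MM) = Tr(M^2)).
    return (np.trace(-(4 / T) * A.T @ Y + (2 / T) * Y.T @ M @ Y)
            - np.trace(M @ M))

# Saddle point, eqs. A.6/A.7: M*^3 = A A^T / T and M* Y* = A.
evals, evecs = np.linalg.eigh(A @ A.T / T)
M_star = evecs @ np.diag(evals ** (1 / 3)) @ evecs.T
Y_star = np.linalg.solve(M_star, A)   # M* is invertible for generic A

saddle = f(Y_star, M_star)
# -(3/T^{2/3}) Tr((A A^T)^{2/3}), written via the eigenvalues of A A^T / T:
formula = -(3 / T ** (2 / 3)) * np.sum((T * evals) ** (2 / 3))
```

Any symmetric perturbation of $M^{*}$ decreases f (equation A.9) and any perturbation of $Y^{*}$ increases it (equation A.10), which the test below also checks.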

Appendix B: Taking a Derivative Using the Chain Rule

Proposition 2. Suppose a differentiable, scalar function $H(a_1,\ldots,a_m)$, where $a_i\in\mathbb{R}^{d_i}$ with arbitrary $d_i$. Assume a finite minimum with respect to $a_m$ exists for a given set $\{a_1,\ldots,a_{m-1}\}$:
$$h(a_1,\ldots,a_{m-1})=\min_{a_m} H(a_1,\ldots,a_m),\tag{B.1}$$
and the optimum $a_m^{*}=\arg\min_{a_m} H(a_1,\ldots,a_m)$ is a stationary point,
$$\left.\frac{\partial H}{\partial a_m}\right|_{\{a_1,\ldots,a_{m-1},a_m^{*}\}}=0.\tag{B.2}$$
Then, for $i=1,\ldots,m-1$,
$$\left.\frac{\partial h}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1}\}}=\left.\frac{\partial H}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1},a_m^{*}\}}.\tag{B.3}$$

Proof. The result follows from application of the chain rule and the stationarity of the minimum:
$$\left.\frac{\partial h}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1}\}}=\left.\frac{\partial H}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1},a_m^{*}\}}+\left.\frac{\partial H}{\partial a_m}\right|_{\{a_1,\ldots,a_{m-1},a_m^{*}\}}\left.\frac{\partial a_m^{*}}{\partial a_i}\right|_{\{a_1,\ldots,a_{m-1}\}},\tag{B.4}$$
where the second term is zero due to equation B.2.
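A one-line numeric illustration of proposition 2 (the envelope theorem), on a toy function of our choosing, $H(a,b)=b^{2}-2ab+3a$, whose inner minimizer is $b^{*}(a)=a$:

```python
def H(a, b):
    # Toy objective: dH/db = 2b - 2a, so argmin_b H(a, b) = a.
    return b ** 2 - 2 * a * b + 3 * a

a0, eps = 1.7, 1e-6
b_star = a0

# Derivative of h(a) = min_b H(a, b) = H(a, a), by central differences:
dh = (H(a0 + eps, a0 + eps) - H(a0 - eps, a0 - eps)) / (2 * eps)

# Partial derivative of H with respect to a at (a0, b*), with b held fixed:
dH_da = (H(a0 + eps, b_star) - H(a0 - eps, b_star)) / (2 * eps)
```

Both derivatives equal $3-2a_0$: differentiating through the minimizer contributes nothing, exactly as equation B.4 states.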

Appendix C: Proof of Theorem 1

Here we prove theorem 1 using methodology from Pehlevan et al. (2015).

The fixed points of equation 2.11 are (using ¯ to denote fixed-point values):
$$\bar W=\bar F C,\qquad \bar M=\bar F C\bar F^{\top},\tag{C.1}$$
where C is the input covariance matrix, defined as in equation 2.1.

C.1 Proof of Item 1. The result follows from equations 2.12 and C.1:
$$I=\bar M^{-1}\bar M=\bar M^{-1}\bar F C\bar F^{\top}=\bar M^{-1}\bar W\bar F^{\top}=\bar F\bar F^{\top}.\tag{C.2}$$

C.2 Proof of Item 2. First note that at fixed points, $\bar F^{\top}\bar F$ and C commute:
$$\bar F^{\top}\bar F C=C\bar F^{\top}\bar F.\tag{C.3}$$
Proof. The result follows from equations 2.12 and C.1:
$$\bar F^{\top}\bar F C=\bar F^{\top}\bar W=\bar F^{\top}\bar M\bar F=\bar W^{\top}\bar F=C\bar F^{\top}\bar F.\tag{C.4}$$
$\bar F^{\top}\bar F$ and C share the same eigenvectors because they commute. Orthonormality of the neural filters, equation C.2, implies that the k rows of $\bar F$ are degenerate eigenvectors of $\bar F^{\top}\bar F$ with unit eigenvalue. To see this, note that $\bar F\bar F^{\top}\bar F=\bar F$. Because the filters are degenerate, the corresponding k shared eigenvectors of C may not be the filters themselves but linear combinations of them. Nevertheless, the shared eigenvectors composed of filters span the same space as the filters.

Since we are interested in PSP, it is desirable that it is the top k eigenvectors of C that span the filter space. A linear stability analysis around the fixed point reveals that any other combination is unstable and that the principal subspace is stable if τ is chosen appropriately.
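The fixed-point identities above (equations C.1 to C.4, together with the eigenvalue claim of equation C.12) are easy to verify numerically for filters spanning the principal subspace; a sketch with an illustrative covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2

# Covariance with a known spectrum.
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
C = V @ np.diag(sigma) @ V.T

# Fixed-point filters: any rotation Q of the top-k eigenvectors.
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
F_bar = Q @ V[:, :k].T          # orthonormal rows
W_bar = F_bar @ C               # eq. C.1
M_bar = F_bar @ C @ F_bar.T

orthonormality = np.linalg.norm(F_bar @ F_bar.T - np.eye(k))            # eq. C.2
commutator = np.linalg.norm(F_bar.T @ F_bar @ C - C @ F_bar.T @ F_bar)  # eq. C.3
eig_M = np.sort(np.linalg.eigvalsh(M_bar))[::-1]                        # eq. C.12
```

Despite the arbitrary rotation Q, the eigenvalues of $\bar M$ come out as the top-k eigenvalues of C, as proved below.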

C.3 Proof of Item 3

C.3.1 Preliminaries. In order to perform a linear stability analysis, we linearize the system of equation 2.11 around the fixed point. Although equation 2.11 depends on W and M, we will find it convenient to change variables and work with F and M instead.

Using the relation $F=M^{-1}W$, one can express linear perturbations of F around its fixed point, δF, in terms of perturbations of W and M:
$$\delta F=\delta(M^{-1})\,\bar W+\bar M^{-1}\delta W=-\bar M^{-1}\delta M\,\bar F+\bar M^{-1}\delta W.\tag{C.5}$$
Linearization of equation 2.11 gives
$$\frac{d\,\delta W}{dt}=2\,\delta F C-2\,\delta W\tag{C.6}$$
and
$$\tau\frac{d\,\delta M}{dt}=\delta F C\bar F^{\top}+\bar F C\,\delta F^{\top}-\delta M.\tag{C.7}$$
Using these, we arrive at
$$\frac{d\,\delta F}{dt}=-\frac{1}{\tau}\bar M^{-1}\left(\delta F C\bar F^{\top}+\bar F C\,\delta F^{\top}+(2\tau-1)\,\delta M\right)\bar F+2\bar M^{-1}\delta F C-2\,\delta F.\tag{C.8}$$
Equations C.7 and C.8 define a closed system of equations.

It will be useful to decompose δF into components:³
$$\delta F=\delta A\,\bar F+\delta S\,\bar F+\delta B\,\bar G,\tag{C.9}$$
where δA is a k × k antisymmetric matrix, δS is a k × k symmetric matrix, and δB is a k × (n − k) matrix. $\bar G$ is an (n − k) × n matrix with orthonormal rows, which are orthogonal to the rows of $\bar F$. δA and δS are perturbations that keep the neural filters within the filter space. Antisymmetric δA corresponds to rotations of the filters within the filter space, preserving orthonormality. Symmetric δS destroys orthonormality. δB is a perturbation that takes the neural filters outside the filter space.

Let $v_1,\ldots,v_n$ be the eigenvectors of C and $\sigma_1,\ldots,\sigma_n$ the corresponding eigenvalues. We label them such that $\bar F$ spans the same space as the space spanned by the first k eigenvectors. We choose the rows of $\bar G$ to be the remaining eigenvectors, $\bar G:=[v_{k+1},\ldots,v_n]^{\top}$. Note that with this choice,
$$\sum_{m} C_{im}\,\bar G^{\top}_{mj}=\sigma_{j+k}\,\bar G^{\top}_{ij}.\tag{C.10}$$

C.3.2 Proof. The proof of item 3 in theorem 1 follows from studying the stability of the δB component.

Multiplying equation C.8 on the right by $\bar G^{\top}$, one arrives at a decoupled equation for δB:
$$\frac{d\,\delta B^{j}_{i}}{dt}=\sum_{m}P^{j}_{im}\,\delta B^{j}_{m},\qquad P^{j}_{im}:=2\left(\bar M^{-1}_{im}\sigma_{j+k}-\delta_{im}\right),\tag{C.11}$$
where, for convenience, we changed our notation to $\delta B_{kj}=\delta B^{j}_{k}$. For each j, the dynamics is linearly stable if all eigenvalues of all $P^{j}$ are negative. In turn, this implies that for stability, the eigenvalues of $\bar M$ should be greater than $\sigma_{k+1},\ldots,\sigma_n$.

³See lemma 3 in Pehlevan et al. (2015) for a proof of why such a decomposition always exists.

The eigenvalues of $\bar M$ are
$$\operatorname{eig}(\bar M)=\{\sigma_1,\ldots,\sigma_k\}.\tag{C.12}$$
Proof. Let (λ, u) be an eigenpair of $\bar M$. The eigenvalue equation,
$$\bar F C\bar F^{\top}u=\lambda u,\tag{C.13}$$
implies that
$$C\bar F^{\top}u=\lambda\,\bar F^{\top}u,\tag{C.14}$$
which can be seen by multiplying equation C.13 on the left by $\bar F^{\top}$, using the commutation of $\bar F^{\top}\bar F$ and C, and the orthonormality of the neural filters. Further, orthonormality of the neural filters implies
$$\bar F^{\top}\bar F\,\bar F^{\top}u=\bar F^{\top}u.\tag{C.15}$$
Then $\bar F^{\top}u$ is a shared eigenvector of C and $\bar F^{\top}\bar F$.⁴ Shared eigenvectors of C with unit eigenvalue in $\bar F^{\top}\bar F$ are $v_1,\ldots,v_k$. Since the eigenvalue of $\bar F^{\top}u$ with respect to $\bar F^{\top}\bar F$ is 1, $\bar F^{\top}u$ must be one of $v_1,\ldots,v_k$. Then equation C.14 implies that $\lambda\in\{\sigma_1,\ldots,\sigma_k\}$ and
$$\operatorname{eig}(\bar M)=\{\sigma_1,\ldots,\sigma_k\}.\tag{C.16}$$
It then follows that linear stability requires
$$\{\sigma_1,\ldots,\sigma_k\}>\{\sigma_{k+1},\ldots,\sigma_n\}.\tag{C.17}$$
This proves our claim that if, at the fixed point, the neural filters span a subspace other than the principal subspace, the fixed point is linearly unstable.

C.4 Proof of Item 4. We now assume that the fixed point is the principal subspace. From item 3, we know that the δB perturbations are stable. The proof of item 4 in theorem 1 follows from the linear stabilities of δA and δS.

Multiplying equation C.8 on the right by $\bar F^{\top}$,
$$\frac{d\,\delta A}{dt}+\frac{d\,\delta S}{dt}=\left(2-\frac{1}{\tau}\right)\left(\bar M^{-1}(\delta A+\delta S)\bar M-\bar M^{-1}\delta M-\delta A\right)-\left(2+\frac{1}{\tau}\right)\delta S.\tag{C.18}$$

⁴One might worry that $\bar F^{\top}u=0$, but this would require $\bar F\bar F^{\top}u=u=0$, which is a contradiction.

Unlike the case of δB, this equation is coupled to δM, whose dynamics, equation C.7, reduces to
$$\tau\frac{d\,\delta M}{dt}=(\delta A+\delta S)\bar M+\bar M(-\delta A+\delta S)-\delta M.\tag{C.19}$$
We will consider only symmetric δM perturbations, although if antisymmetric perturbations were allowed, they would stably decay to zero because the only antisymmetric term on the right-hand side of equation C.19 would come from δM.

From equations C.18 and C.19, it follows that
$$\frac{d}{dt}\left(\delta A+\delta S-(2\tau-1)\,\bar M^{-1}\delta M\right)=-4\,\delta S.\tag{C.20}$$
The right-hand side is symmetric. Therefore, the antisymmetric part of the left-hand side must equal zero. This gives us an integral of the dynamics,
$$\Lambda:=\delta A(t)-\left(\tau-\frac{1}{2}\right)\left(\bar M^{-1}\delta M(t)-\delta M(t)\bar M^{-1}\right),\tag{C.21}$$
where Λ is a constant, skew-symmetric matrix. This reveals an interesting point: after the perturbation, δA and δM will not decay to 0 even if the fixed point is stable. In hindsight, this is expected because, due to the symmetry of the problem, there is a manifold of stable fixed points (bases in the principal subspace), and perturbations within this manifold should not decay. A similar situation was observed in Pehlevan et al. (2015).

The symmetric part of equation C.20 gives
$$\frac{d}{dt}\left(\delta S-\left(\tau-\frac{1}{2}\right)\left(\bar M^{-1}\delta M+\delta M\bar M^{-1}\right)\right)=-4\,\delta S,\tag{C.22}$$
which, using equation C.19, implies
$$\frac{d\,\delta S}{dt}=\left(1-\frac{1}{2\tau}\right)\left(\bar M^{-1}\delta A\bar M-\bar M\delta A\bar M^{-1}\right)+\left(1-\frac{1}{2\tau}\right)\left(\bar M^{-1}\delta S\bar M+\bar M\delta S\bar M^{-1}+2\,\delta S\right)-4\,\delta S-\left(1-\frac{1}{2\tau}\right)\left(\bar M^{-1}\delta M+\delta M\bar M^{-1}\right).\tag{C.23}$$
To summarize, we analyze the linear stability of the system of equations defined by equations C.19, C.21, and C.23.

Next, we change to a basis where $\bar M$ is diagonal. $\bar M$ is symmetric, its eigenvalues are the principal eigenvalues $\{\sigma_1,\ldots,\sigma_k\}$, as shown in appendix C.3, and it has an orthonormal set of eigenvectors. Let U be the matrix that contains the eigenvectors of $\bar M$ in its columns. Define
$$\delta A^{U}:=U^{\top}\delta A\,U,\quad \delta S^{U}:=U^{\top}\delta S\,U,\quad \delta M^{U}:=U^{\top}\delta M\,U,\quad \Lambda^{U}:=U^{\top}\Lambda\,U.\tag{C.24}$$
Expressing equations C.19, C.21, and C.23 in this new basis, in component form, and eliminating $\delta A^{U}_{ij}$:
$$\frac{d}{dt}\begin{pmatrix}\delta M^{U}_{ij}\\ \delta S^{U}_{ij}\end{pmatrix}=H_{ij}\begin{pmatrix}\delta M^{U}_{ij}\\ \delta S^{U}_{ij}\end{pmatrix}+\begin{pmatrix}\frac{1}{\tau}\left(\sigma_j-\sigma_i\right)\\[4pt] \left(1-\frac{1}{2\tau}\right)\left(\frac{\sigma_j}{\sigma_i}-\frac{\sigma_i}{\sigma_j}\right)\end{pmatrix}\Lambda^{U}_{ij},\tag{C.25}$$
where
$$H_{ij}:=\begin{pmatrix}\left(1-\frac{1}{2\tau}\right)(\sigma_j-\sigma_i)\left(\frac{1}{\sigma_i}-\frac{1}{\sigma_j}\right)-\frac{1}{\tau} & \frac{1}{\tau}\left(\sigma_j+\sigma_i\right)\\[6pt] \left(1-\frac{1}{2\tau}\right)\left(\left(\frac{\sigma_j}{\sigma_i}-\frac{\sigma_i}{\sigma_j}\right)\left(\tau-\frac{1}{2}\right)\left(\frac{1}{\sigma_i}-\frac{1}{\sigma_j}\right)-\left(\frac{1}{\sigma_i}+\frac{1}{\sigma_j}\right)\right) & \left(1-\frac{1}{2\tau}\right)\left(\frac{\sigma_j}{\sigma_i}+\frac{\sigma_i}{\sigma_j}+2\right)-4\end{pmatrix}.\tag{C.26}$$

This is a closed system of equations for each (i, j) pair! The fixed point of this system of equations is at
$$\delta S^{U}_{ij}=0,\qquad \delta M^{U}_{ij}=\frac{\Lambda^{U}_{ij}}{\dfrac{1}{\sigma_j-\sigma_i}-\left(\tau-\dfrac{1}{2}\right)\left(\dfrac{1}{\sigma_i}-\dfrac{1}{\sigma_j}\right)}.\tag{C.27}$$
Hence, if the linear perturbations are stable, the perturbations that destroy the orthonormality of the neural filters will decay to zero, and orthonormality will be restored.

The stability of the fixed point is governed by the trace and the determinant of the matrix $H_{ij}$. The trace is
$$\operatorname{Tr}(H_{ij})=-4+\left(2-\frac{1}{\tau}\right)\left(\frac{\sigma_i}{\sigma_j}+\frac{\sigma_j}{\sigma_i}\right)-\frac{1}{\tau},\tag{C.28}$$
and the determinant is
$$\det(H_{ij})=8+\left(\frac{2}{\tau}-4\right)\left(\frac{\sigma_i}{\sigma_j}+\frac{\sigma_j}{\sigma_i}\right).\tag{C.29}$$
The system C.25 is linearly stable if both the trace is negative and the determinant is positive.

Defining the following function of covariance eigenvalues,
$$\gamma_{ij}:=\frac{\sigma_i}{\sigma_j}+\frac{\sigma_j}{\sigma_i}=2+\frac{\left(\sigma_i-\sigma_j\right)^{2}}{\sigma_i\sigma_j},\tag{C.30}$$
the trace is negative if and only if
$$\tau<\frac{1+1/\gamma_{ij}}{2-4/\gamma_{ij}}.\tag{C.31}$$
The determinant is positive if and only if
$$\tau<\frac{1}{2-4/\gamma_{ij}}.\tag{C.32}$$
Since $\gamma_{ij}>0$, equation C.32 implies equation C.31. For stability, equation C.32 has to be satisfied for all (i, j) pairs. When i = j, $\gamma_{ii}=2$, and equation C.32 is satisfied because the right-hand side is infinite. When i ≠ j, equation C.32 is nontrivial and depends on relations between covariance eigenvalues. Since $\gamma_{ij}\ge 2$, τ ≤ 1/2 is always stable.

Collectively, our results prove item 4 of theorem 1.
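The determinant formula and the resulting threshold can be checked numerically by building $H_{ij}$ of equation C.26 directly; a sketch for one illustrative eigenvalue pair:

```python
import numpy as np

def H_psp(tau, s_i, s_j):
    # The 2x2 matrix H_ij of eq. C.26.
    a = 1 - 1 / (2 * tau)
    return np.array([
        [a * (s_j - s_i) * (1 / s_i - 1 / s_j) - 1 / tau,
         (s_j + s_i) / tau],
        [a * ((s_j / s_i - s_i / s_j) * (tau - 0.5) * (1 / s_i - 1 / s_j)
              - (1 / s_i + 1 / s_j)),
         a * (s_j / s_i + s_i / s_j + 2) - 4],
    ])

s_i, s_j = 3.0, 1.0
gamma = s_i / s_j + s_j / s_i          # eq. C.30, here 10/3
tau_max = 1 / (2 - 4 / gamma)          # eq. C.32, here 1.25

# det(H_ij) computed directly versus the closed form of eq. C.29:
det_check = np.linalg.det(H_psp(1.0, s_i, s_j)) - (8 + (2 / 1.0 - 4) * gamma)

stable = np.linalg.eigvals(H_psp(1.0, s_i, s_j)).real.max()    # tau < tau_max
unstable = np.linalg.eigvals(H_psp(1.5, s_i, s_j)).real.max()  # tau > tau_max
```

Crossing `tau_max` flips the sign of the determinant and hence the sign of the largest real part of the eigenvalues of $H_{ij}$.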

Appendix D: Proof of Strong Min-Max Property for PSW Objective

Here we show that minimization with respect to Y and maximization with respect to M can be exchanged in equation 3.4. We do this by explicitly calculating the value of
$$-\frac{2}{T}\operatorname{Tr}\left(X^{\top}W^{\top}Y\right)+\operatorname{Tr}\left(M\left(\frac{1}{T}YY^{\top}-I\right)\right)\tag{D.1}$$
under the min-max and max-min optimizations and showing that the value does not change.

Proposition 3. Let $A\in\mathbb{R}^{k\times T}$ with $k\le T$. Then
$$\min_{Y\in\mathbb{R}^{k\times T}}\max_{M\in\mathbb{R}^{k\times k}}\left[-\frac{2}{T}\operatorname{Tr}\left(A^{\top}Y\right)+\operatorname{Tr}\left(M\left(\frac{1}{T}YY^{\top}-I\right)\right)\right]=-\frac{2}{T^{1/2}}\operatorname{Tr}\left(\left(AA^{\top}\right)^{1/2}\right).\tag{D.2}$$

Proof. The left side of equation D.2 is a constrained optimization problem:
$$\min_{Y\in\mathbb{R}^{k\times T}}-\frac{2}{T}\operatorname{Tr}\left(A^{\top}Y\right)\quad\text{s.t.}\quad\frac{1}{T}YY^{\top}=I.\tag{D.3}$$
Suppose an SVD of $A=\sum_{i=1}^{k}\sigma_{A,i}\,u_{A,i}v_{A,i}^{\top}$ and an SVD of $Y=\sum_{i=1}^{k}\sigma_{Y,i}\,u_{Y,i}v_{Y,i}^{\top}$. The constraint sets $\sigma_{Y,i}=\sqrt T$. Then the optimization problem reduces to
$$\min_{u_{Y,1},\ldots,u_{Y,k},\,v_{Y,1},\ldots,v_{Y,k}}-\frac{2}{\sqrt T}\sum_{i=1}^{k}\sigma_{A,i}\sum_{j=1}^{k}\left(u_{A,i}^{\top}u_{Y,j}\right)\left(v_{A,i}^{\top}v_{Y,j}\right),\quad\text{s.t.}\quad u_{Y,i}^{\top}u_{Y,j}=\delta_{ij},\; v_{Y,i}^{\top}v_{Y,j}=\delta_{ij}.\tag{D.4}$$
Note that $\sum_{j=1}^{k}(u_{A,i}^{\top}u_{Y,j})(v_{A,i}^{\top}v_{Y,j})\le 1$,⁵ and therefore the cost is lower-bounded by $-\frac{2}{\sqrt T}\sum_{i=1}^{k}\sigma_{A,i}$. The lower bound is achieved when $u_{A,i}=u_{Y,i}$ and $v_{A,i}=v_{Y,i}$, with the optimal value of the objective $-\frac{2}{\sqrt T}\sum_{i=1}^{k}\sigma_{A,i}=-\frac{2}{\sqrt T}\operatorname{Tr}\left(\left(AA^{\top}\right)^{1/2}\right)$.

Proposition 4. Let $A\in\mathbb{R}^{k\times T}$ with $k\le T$. Then
$$\max_{M\in\mathbb{R}^{k\times k}}\min_{Y\in\mathbb{R}^{k\times T}}\left[-\frac{2}{T}\operatorname{Tr}\left(A^{\top}Y\right)+\operatorname{Tr}\left(M\left(\frac{1}{T}YY^{\top}-I\right)\right)\right]=-\frac{2}{T^{1/2}}\operatorname{Tr}\left(\left(AA^{\top}\right)^{1/2}\right).\tag{D.5}$$
Proof. Note that we need to consider only the symmetric part of M, because its antisymmetric component does not contribute to the cost. Below, we use M to mean its symmetric part. We will evaluate the value of the objective
$$-\frac{2}{T}\operatorname{Tr}\left(A^{\top}Y\right)+\operatorname{Tr}\left(M\left(\frac{1}{T}YY^{\top}-I\right)\right)\tag{D.6}$$

⁵Define $\alpha_j:=u_{A,i}^{\top}u_{Y,j}$ and $\beta_j:=v_{A,i}^{\top}v_{Y,j}$. Because $u_{Y,i}^{\top}u_{Y,j}=v_{Y,i}^{\top}v_{Y,j}=\delta_{ij}$, it follows that $\sum_{j=1}^{k}\alpha_j^{2}=1$ and $\sum_{j=1}^{k}\beta_j^{2}\le 1$. The sum in question is $\sum_{j=1}^{k}\alpha_j\beta_j$, which is an inner product of a unit vector and a vector with magnitude less than or equal to 1. Hence, the maximal inner product can be 1.

considering the following cases:

1. A = 0. In this case, the first term in equation D.6 drops. Minimization of the second term with respect to Y gives −∞ if M has a negative eigenvalue, or 0 if M is positive semidefinite. Hence, the max-min objective is zero, and the proposition holds.

2. A ≠ 0 and A is full rank.
a. M has at least one negative eigenvalue. Then minimization of equation D.6 with respect to Y gives −∞.
b. M is positive semidefinite and has at least one zero eigenvalue. Then minimization of equation D.6 with respect to Y gives −∞. To achieve this solution, one chooses all columns of Y to be one of the zero eigenvectors. The sign of the eigenvector is chosen such that $\operatorname{Tr}(A^{\top}Y)$ is positive. Multiplying Y by a positive scalar, one can reduce the objective indefinitely.
c. M is positive definite. Then $Y^{*}=M^{-1}A$ minimizes equation D.6 with respect to Y. Plugging this back into equation D.6, we get the objective
$$-\frac{1}{T}\operatorname{Tr}\left(A^{\top}M^{-1}A\right)-\operatorname{Tr}(M).\tag{D.7}$$
The positive-definite M that maximizes equation D.7 can be found by setting its derivative to zero:
$$M^{*2}=\frac{1}{T}AA^{\top}.\tag{D.8}$$
Plugging this back into equation D.7, one gets the objective
$$-\frac{2}{\sqrt T}\operatorname{Tr}\left(\left(AA^{\top}\right)^{1/2}\right),\tag{D.9}$$
which is maximal with respect to all possible M. Therefore, the proposition holds.

3. A ≠ 0 and A has rank r < k.
a. M has at least one negative eigenvalue. Then minimization of equation D.6 with respect to Y gives −∞, as before.
b. M is positive semidefinite and has at least one zero eigenvalue.
i. If at least one of the zero eigenvectors of M is not a left zero-singular vector of A, then minimization of equation D.6 with respect to Y gives −∞. To achieve this solution, one chooses all columns of Y to be the zero eigenvector of M that is not a left zero-singular vector of A. The sign of the eigenvector is chosen such that $\operatorname{Tr}(A^{\top}Y)$ is positive. Multiplying Y by a positive scalar, one can reduce the objective indefinitely.
ii. If all of the zero eigenvectors of M are also left zero-singular vectors of A, then equation D.6 can be reformulated in the subspace spanned by the top r eigenvectors of M. Suppose an SVD of $A=\sum_{i=1}^{r}\sigma_{A,i}u_{A,i}v_{A,i}^{\top}$ with $\sigma_{A,1}\ge\sigma_{A,2}\ge\ldots\ge\sigma_{A,r}$. One can decompose $Y=Y_A+Y_{\perp}$, where the columns of $Y_{\perp}$ are perpendicular to the space spanned by $\{u_{A,1},\ldots,u_{A,r}\}$. Then the value of the objective, equation D.6, depends only on $Y_A$. Defining new matrices $\tilde A_{i,:}:=u_{A,i}^{\top}A$, $\tilde Y_{i,:}:=u_{A,i}^{\top}Y_A$, and $\tilde M_{ij}:=u_{A,i}^{\top}M u_{A,j}$, where $i,j=1,\ldots,r$, we can rewrite equation D.6 as
$$-\frac{2}{T}\operatorname{Tr}\left(\tilde A^{\top}\tilde Y\right)+\operatorname{Tr}\left(\tilde M\left(\frac{1}{T}\tilde Y\tilde Y^{\top}-I\right)\right).\tag{D.10}$$
Now $\tilde A$ is full rank and $\tilde M$ is positive definite. As in case 2c, the objective, which is maximal with respect to positive-definite $\tilde M$ matrices, is
$$-\frac{2}{\sqrt T}\operatorname{Tr}\left(\left(\tilde A\tilde A^{\top}\right)^{1/2}\right)=-\frac{2}{\sqrt T}\operatorname{Tr}\left(\left(AA^{\top}\right)^{1/2}\right).\tag{D.11}$$
c. M is positive definite. As in case 2c, the objective, which is maximal with respect to positive-definite M matrices, is
$$-\frac{2}{\sqrt T}\operatorname{Tr}\left(\left(AA^{\top}\right)^{1/2}\right).\tag{D.12}$$
This is also maximal with respect to all possible M. Therefore, the proposition holds.

Collectively, these arguments prove equation D.5.

Propositions 3 and 4 imply the strong min-max property for the PSW cost.

Proposition 5. The strong min-max property for the PSW cost:
$$\min_{Y\in\mathbb{R}^{k\times T}}\max_{M\in\mathbb{R}^{k\times k}}\left[-\frac{2}{T}\operatorname{Tr}\left(X^{\top}W^{\top}Y\right)+\operatorname{Tr}\left(M\left(\frac{1}{T}YY^{\top}-I\right)\right)\right]
=\max_{M\in\mathbb{R}^{k\times k}}\min_{Y\in\mathbb{R}^{k\times T}}\left[-\frac{2}{T}\operatorname{Tr}\left(X^{\top}W^{\top}Y\right)+\operatorname{Tr}\left(M\left(\frac{1}{T}YY^{\top}-I\right)\right)\right]
=-\frac{2}{T^{1/2}}\operatorname{Tr}\left(\left(WXX^{\top}W^{\top}\right)^{1/2}\right).\tag{D.13}$$
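Both propositions can be checked numerically for a generic full-rank A: the Procrustes-type minimizer $Y^{*}=\sqrt T\,U_A V_A^{\top}$ of proposition 3 and the maximizer $M^{*}=(AA^{\top}/T)^{1/2}$ of proposition 4 (case 2c) give the same value, $-(2/\sqrt T)\operatorname{Tr}((AA^{\top})^{1/2})$. A sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
k, T = 3, 40
A = rng.standard_normal((k, T))

Ua, sa, Vat = np.linalg.svd(A, full_matrices=False)
target = -(2 / np.sqrt(T)) * sa.sum()     # -(2/sqrt(T)) Tr((A A^T)^{1/2})

# Proposition 3: constrained minimum over Y with (1/T) Y Y^T = I.
Y_star = np.sqrt(T) * Ua @ Vat
minmax = -(2 / T) * np.trace(A.T @ Y_star)

# Proposition 4, case 2c: M* = (A A^T / T)^{1/2}, objective D.7 at M*.
evals, evecs = np.linalg.eigh(A @ A.T / T)
M_star = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
maxmin = (-(1 / T) * np.trace(A.T @ np.linalg.solve(M_star, A))
          - np.trace(M_star))
```

At $M^{*}$ the two terms of equation D.7 are equal, each contributing $-\operatorname{Tr}(M^{*})$, which is how the factor of 2 in the optimal value arises.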

Appendix E: Proof of Theorem 2

Here we prove theorem 2.

E.1 Proof of Item 1. Item 1 directly follows from the fixed-point equations of the dynamical system 3.8 (using ¯ to denote fixed-point values):
$$\bar W=\frac{1}{T}\bar Y X^{\top}=\bar F C,\qquad I=\frac{1}{T}\bar Y\bar Y^{\top}=\bar F C\bar F^{\top}.\tag{E.1}$$

E.2 Proof of Item 2. We will prove item 2 making use of the normalized neural filters,
$$R:=FC^{1/2},\tag{E.2}$$
where the input covariance matrix C is defined as in equation 2.1. At the fixed point, the normalized neural filters are orthonormal:
$$\bar R\bar R^{\top}=\bar F C\bar F^{\top}=\frac{1}{T}\bar Y\bar Y^{\top}=I.\tag{E.3}$$
Normalized filters commute with the covariance matrix:
$$\bar R^{\top}\bar R C=C\bar R^{\top}\bar R.\tag{E.4}$$
Proof.
$$\bar R^{\top}\bar R C=C^{1/2}\bar F^{\top}\bar F C^{3/2}=C^{1/2}\bar F^{\top}\bar W C^{1/2}=C^{1/2}\bar F^{\top}\bar M\bar F C^{1/2}
=C^{1/2}\bar W^{\top}\bar F C^{1/2}=C C^{1/2}\bar F^{\top}\bar F C^{1/2}=C\bar R^{\top}\bar R.\tag{E.5}$$
Therefore, as argued in section C.2, the rows of $\bar R$ span a subspace spanned by some k eigenvectors of C. If C is invertible, the row space of $\bar F$ is the same as that of $\bar R$ (this follows from equation E.2), and item 2 follows.

E.3 Proof of Item 3

E.3.1 Preliminaries. In order to perform a linear stability analysis, we linearize the system of equation 3.8 around the fixed point. The evolution of the W and M perturbations follows from linearization of equation 3.8:
$$\tau\frac{d\,\delta M}{dt}=\delta R\,\bar R^{\top}+\bar R\,\delta R^{\top},\qquad \frac{d\,\delta W}{dt}=2\,\delta R\,C^{1/2}-2\,\delta W.\tag{E.6}$$
Although equation 3.8 depends on W and M, we will find it convenient to change variables and work with R, as defined in equation E.2, and M instead. Since R, W, and M are interdependent, we express the perturbations of R in terms of the W and M perturbations:
$$\delta R=\delta(M^{-1})\,\bar W C^{1/2}+\bar M^{-1}\delta W C^{1/2}=-\bar M^{-1}\delta M\,\bar R+\bar M^{-1}\delta W C^{1/2},\tag{E.7}$$
which implies that
$$\frac{d\,\delta R}{dt}=-\bar M^{-1}\frac{d\,\delta M}{dt}\bar R+\bar M^{-1}\frac{d\,\delta W}{dt}C^{1/2}.\tag{E.8}$$
Plugging these in and eliminating δW, we arrive at a linearized equation for δR:
$$\frac{d\,\delta R}{dt}=-\frac{1}{\tau}\bar M^{-1}\left(\delta R\,\bar R^{\top}+\bar R\,\delta R^{\top}+2\tau\,\delta M\right)\bar R+2\bar M^{-1}\delta R\,C-2\,\delta R.\tag{E.9}$$
To assess the stability of δR, we expand it as in section C.3:
$$\delta R=\delta A\,\bar R+\delta S\,\bar R+\delta B\,\bar G,\tag{E.10}$$
where δA is a k × k skew-symmetric matrix, δS is a k × k symmetric matrix, and δB is a k × (n − k) matrix. $\bar G$ is an (n − k) × n matrix with orthonormal rows. These rows are chosen to be orthogonal to the rows of $\bar R$. As before, skew-symmetric δA corresponds to rotations of the filters within the normalized filter space, symmetric δS keeps the normalized filter space invariant but destroys orthonormality, and δB is a perturbation that takes the normalized neural filters outside the filter space.

Let $v_1,\ldots,v_n$ be the eigenvectors of C and $\sigma_1,\ldots,\sigma_n$ the corresponding eigenvalues. We label them such that $\bar R$ spans the same space as the space spanned by the first k eigenvectors. We choose the rows of $\bar G$ to be the remaining eigenvectors, $\bar G:=[v_{k+1},\ldots,v_n]^{\top}$.

E.3.2 Proof. The proof of item 3 of theorem 2 follows from studying the stability of the δB component. Multiplying equation E.9 on the right by $\bar G^{\top}$, we arrive at a decoupled evolution equation:
$$\frac{d\,\delta B^{j}_{i}}{dt}=\sum_{m}P^{j}_{im}\,\delta B^{j}_{m},\qquad P^{j}_{im}:=2\left(\bar M^{-1}_{im}\sigma_{j+k}-\delta_{im}\right),\tag{E.11}$$
where, for convenience, we change our notation to $\delta B_{kj}=\delta B^{j}_{k}$.

Equations E.1 and E.3 imply $\bar M^{2}=\bar W C\bar W^{\top}=\bar R C^{2}\bar R^{\top}$, and hence
$$\bar M=\bar R C\bar R^{\top}.\tag{E.12}$$
Taking into account equations E.3 and E.4, the case at hand reduces to the proof presented in section C.3: stable solutions are those for which
$$\{\sigma_1,\ldots,\sigma_k\}>\{\sigma_{k+1},\ldots,\sigma_n\}.\tag{E.13}$$
This proves that if, at the fixed point, the normalized neural filters span a subspace other than the principal subspace, the fixed point is linearly unstable. Since the span of the normalized neural filters is that of the neural filters, item 3 follows.

E.4 Proof of Item 4. The proof of item 4 follows from the linear stabilities of δA and δS. Multiplying equation E.9 on the right by $\bar R^{\top}$ and separating the resulting equation into its symmetric and antisymmetric parts, we arrive at
$$\frac{d\,\delta A}{dt}=-\frac{1}{\tau}\left(\bar M^{-1}\delta S-\delta S\bar M^{-1}\right)-\left(\bar M^{-1}\delta M-\delta M\bar M^{-1}\right)-2\,\delta A+\bar M^{-1}\delta A\bar M+\bar M\delta A\bar M^{-1}+\bar M^{-1}\delta S\bar M-\bar M\delta S\bar M^{-1},$$
$$\frac{d\,\delta S}{dt}=-\frac{1}{\tau}\left(\bar M^{-1}\delta S+\delta S\bar M^{-1}\right)-\left(\bar M^{-1}\delta M+\delta M\bar M^{-1}\right)-2\,\delta S+\bar M^{-1}\delta A\bar M-\bar M\delta A\bar M^{-1}+\bar M^{-1}\delta S\bar M+\bar M\delta S\bar M^{-1}.\tag{E.14}$$
To obtain a closed set of equations, we complement these equations with the δM evolution, which we obtain by plugging the expansion E.10 into equation E.6:
$$\tau\frac{d\,\delta M}{dt}=2\,\delta S.\tag{E.15}$$
We consider only symmetric δM below, since our algorithm preserves the symmetry of M during runtime.

We now change to a basis where $\bar M$ is diagonal. $\bar M$ is symmetric and has an orthonormal set of eigenvectors. Its eigenvalues are the principal eigenvalues $\{\sigma_1,\ldots,\sigma_k\}$ (from section C.3). Let U be the matrix that contains the eigenvectors of $\bar M$ in its columns. Define
$$\delta A^{U}:=U^{\top}\delta A\,U,\quad \delta S^{U}:=U^{\top}\delta S\,U,\quad \delta M^{U}:=U^{\top}\delta M\,U.\tag{E.16}$$

In this new basis, the linearized equations, in component form, become
$$\frac{d}{dt}\begin{pmatrix}\delta M^{U}_{ij}\\ \delta A^{U}_{ij}\\ \delta S^{U}_{ij}\end{pmatrix}=H_{ij}\begin{pmatrix}\delta M^{U}_{ij}\\ \delta A^{U}_{ij}\\ \delta S^{U}_{ij}\end{pmatrix},\tag{E.17}$$
where
$$H_{ij}:=\begin{pmatrix}0 & 0 & \frac{2}{\tau}\\[6pt] \frac{1}{\sigma_j}-\frac{1}{\sigma_i} & -2+\frac{\sigma_j}{\sigma_i}+\frac{\sigma_i}{\sigma_j} & -\frac{1}{\tau}\left(\frac{1}{\sigma_i}-\frac{1}{\sigma_j}\right)+\frac{\sigma_j}{\sigma_i}-\frac{\sigma_i}{\sigma_j}\\[6pt] -\left(\frac{1}{\sigma_j}+\frac{1}{\sigma_i}\right) & \frac{\sigma_j}{\sigma_i}-\frac{\sigma_i}{\sigma_j} & -\frac{1}{\tau}\left(\frac{1}{\sigma_i}+\frac{1}{\sigma_j}\right)+\frac{\sigma_j}{\sigma_i}+\frac{\sigma_i}{\sigma_j}-2\end{pmatrix}.\tag{E.18}$$

Linear stability is governed by the three eigenvalues of Hij. One of the

eigenvalues is 0 due to the existence of the rotational symmetry in the prob-

lem. The corresponding eigenvector is σj−σi,1,0. Note that the third

element of the eigenvector is zero, showing that the orthogonality of the

normalized neural lters is not spoiled even in this mode.

For stability of the principal subspace, the other two eigenvalues must

be negative, which means their sum should be negative and their multipli-

cation should be positive. It is easy to show that both the negativity of the

summation and the positivity of the multiplication hold if and only if for

all (i,j) pairs with i= j,

τ< σi+σj

2σi−σj2.(E.19)

Hence we have shown that linear perturbations of the fixed-point weights decay to a configuration in which the normalized neural filters are rotations of the original normalized neural filters within the principal subspace. It follows from equation E.2 that the same holds for the neural filters.
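As a numerical sanity check, the matrix of equation E.18 can be built for one pair of principal eigenvalues and its spectrum inspected. The sketch below (plain NumPy; the values $\sigma_i = 2$, $\sigma_j = 1$ are arbitrary choices, not taken from the text) verifies the zero mode with eigenvector $(\sigma_j-\sigma_i, 1, 0)$ and that the remaining eigenvalues change sign exactly at the bound of equation E.19.

```python
import numpy as np

def H_matrix(si, sj, tau):
    """Linearization matrix H_ij of equation E.18 for a pair of principal
    eigenvalues (sigma_i, sigma_j) and descent-ascent ratio tau."""
    r, q = sj / si, si / sj  # sigma_j/sigma_i and its inverse
    return np.array([
        [0.0, 0.0, 2.0 / tau],
        [1/sj - 1/si, -2 + r + q, -(1/si - 1/sj) / tau + r - q],
        [-1/sj - 1/si, r - q, -(1/si + 1/sj) / tau + r + q - 2],
    ])

si, sj = 2.0, 1.0
bound = (si + sj) / (2 * (si - sj) ** 2)  # stability bound of equation E.19

# The rotational zero mode: H_ij annihilates (sigma_j - sigma_i, 1, 0).
v = np.array([sj - si, 1.0, 0.0])
assert np.allclose(H_matrix(si, sj, 1.0) @ v, 0.0)

# Below the bound, the two nonzero eigenvalues have negative real part;
# above it, at least one crosses into the right half-plane.
def nonzero_real_parts(tau):
    lams = np.linalg.eigvals(H_matrix(si, sj, tau))
    return sorted(l.real for l in lams if abs(l) > 1e-9)

assert max(nonzero_real_parts(0.9 * bound)) < 0
assert max(nonzero_real_parts(1.1 * bound)) > 0
```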

Appendix F: Autapse-Free Similarity Matching Network with Asymmetric Lateral Connectivity

Here, we derive an alternative neural network algorithm for PSP that does not feature autaptic connections and has asymmetric lateral connections. To this end, we replace the gradient descent neural dynamics defined by equation 2.20 with coordinate descent dynamics.

In the coordinate descent approach, at every step one finds the optimal value of one component of $\mathbf{y}_t$ while keeping the rest fixed. By taking the derivative of the cost $-4\,\mathbf{x}_t^{\top}\mathbf{W}_t^{\top}\mathbf{y}_t + 2\,\mathbf{y}_t^{\top}\mathbf{M}_t\mathbf{y}_t$ with respect to $y_{t,i}$ and setting it to zero, we find

$$y_{t,i} = \sum_{j} \frac{W_{t,ij}}{M_{t,ii}}\,x_{t,j} - \sum_{j \neq i} \frac{M_{t,ij}}{M_{t,ii}}\,y_{t,j}. \qquad \text{(F.1)}$$


The components can be cycled through in any order until the iteration converges to a fixed point. The iteration is guaranteed to converge under very mild assumptions: the diagonal elements of $M$ have to be positive (Luo & Tseng, 1991), which is satisfied if $M$ is initialized that way (see equation 2.21). Finally, equation F.1 can be interpreted as a Gauss-Seidel iteration, and generalizations to other iterative schemes are possible (see Pehlevan et al., 2015).
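The sweep of equation F.1 is, in effect, a Gauss-Seidel solve of $M\mathbf{y} = W\mathbf{x}$. A minimal sketch, assuming a randomly generated symmetric positive-definite $M$ (which guarantees positive diagonals and convergence); the dimensions and the number of sweeps are placeholder choices:

```python
import numpy as np

def coordinate_descent(W, M, x, n_sweeps=2000):
    """Cycle through the components of y, setting each to its optimal value
    given the others (equation F.1); a Gauss-Seidel solve of M y = W x."""
    y = np.zeros(W.shape[0])
    b = W @ x  # feedforward input, fixed during the iteration
    for _ in range(n_sweeps):
        for i in range(len(y)):
            y[i] = (b[i] - M[i] @ y + M[i, i] * y[i]) / M[i, i]
    return y

rng = np.random.default_rng(0)
k, n = 3, 10
W = rng.standard_normal((k, n))
A = rng.standard_normal((k, k))
M = A @ A.T + np.eye(k)  # symmetric positive definite: positive diagonal
x = rng.standard_normal(n)

y = coordinate_descent(W, M, x)
assert np.allclose(y, np.linalg.solve(M, W @ x))  # fixed point: M y = W x
```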

The coordinate descent iteration, equation F.1, can be interpreted as the dynamics of an asynchronous autapse-free neural network (see Figure 1B), where the synaptic weights are

$$\tilde{W}_{t,ij} = \frac{W_{t,ij}}{M_{t,ii}}, \qquad \tilde{M}_{t,ij} = \frac{M_{t,ij}}{M_{t,ii}}, \qquad \tilde{M}_{t,ii} = 0. \qquad \text{(F.2)}$$

With this definition, the lateral weights are now asymmetric because $M_{t,ii} \neq M_{t,jj}$ if $i \neq j$.

We can derive updates for these synaptic weights from the updates for $W_t$ and $M_t$ (see equation 2.21). By defining another scalar state variable for each $i$th neuron, $\tilde{D}_{t,i} := \tau M_{t,ii}/\eta_{t-1}$, we arrive at⁶

$$\tilde{D}_{t+1,i} = \frac{\eta_{t-1}}{\eta_t}\left(1-\frac{\eta_t}{\tau}\right)\tilde{D}_{t,i} + y_{t,i}^2,$$

$$\tilde{W}_{t+1,ij} = \frac{1-2\eta_t}{1-\eta_t/\tau}\,\tilde{W}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(2\tau\, y_{t,i}x_{t,j} - \frac{1-2\eta_t}{1-\eta_t/\tau}\, y_{t,i}^2\,\tilde{W}_{t,ij}\right),$$

$$\tilde{M}_{t+1,i,j\neq i} = \tilde{M}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j} - y_{t,i}^2\,\tilde{M}_{t,ij}\right),$$

$$\tilde{M}_{t+1,ii} = 0. \qquad \text{(F.3)}$$

Here, in addition to synaptic weights, the neurons need to keep track of a postsynaptic activity-dependent variable $\tilde{D}_{t,i}$ and the gradient descent-ascent learning rate parameters $\eta_t$, $\eta_{t-1}$, and $\tau$. The updates are local.

For the special case of $\tau = 1/2$ and $\eta_t = \eta/2$, these plasticity rules simplify to

$$\tilde{D}_{t+1,i} = (1-\eta)\,\tilde{D}_{t,i} + y_{t,i}^2,$$

⁶These update rules can be derived as follows. Start with the definition of the synaptic weights, equation F.2: $M_{t+1,ii}\,\tilde{M}_{t+1,ij} = M_{t+1,ij}$. By the gradient-descent update, equation 2.21, $M_{t+1,ij} = \left(1-\frac{\eta_t}{\tau}\right)M_{t,ij} + \frac{\eta_t}{\tau}\,y_{t,i}y_{t,j} = \left(1-\frac{\eta_t}{\tau}\right)\tilde{M}_{t,ij}M_{t,ii} + \frac{\eta_t}{\tau}\,y_{t,i}y_{t,j}$, where in the second equality we again used equation F.2. But note that $\left(1-\frac{\eta_t}{\tau}\right)M_{t,ii} = M_{t+1,ii} - \frac{\eta_t}{\tau}\,y_{t,i}^2$, from equation 2.21. Combining all of these, $\tilde{M}_{t+1,ij} = \tilde{M}_{t,ij} + \frac{\eta_t}{\tau M_{t+1,ii}}\left(y_{t,i}y_{t,j} - y_{t,i}^2\,\tilde{M}_{t,ij}\right)$. A similar derivation can be given for the feedforward updates.


$$\tilde{W}_{t+1,ij} = \tilde{W}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}x_{t,j} - y_{t,i}^2\,\tilde{W}_{t,ij}\right),$$

$$\tilde{M}_{t+1,i,j\neq i} = \tilde{M}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j} - y_{t,i}^2\,\tilde{M}_{t,ij}\right),$$

$$\tilde{M}_{t+1,ii} = 0, \qquad \text{(F.4)}$$

which is precisely the neural online similarity matching algorithm we previously gave in Pehlevan et al. (2015). Both the feedforward and lateral updates have the same form as Oja's single-neuron rule (Oja, 1982).

Note that the algorithm derived above is essentially the same as the one in the main text: given the same initial conditions and the same inputs, $\mathbf{x}_t$, the two will produce the same outputs, $\mathbf{y}_t$. The only difference is a rearrangement of synaptic weights in the neural network implementation.
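A single step of the resulting online algorithm can be sketched as follows: the neural dynamics of equation F.1 are run to a fixed point, and the weights are then updated by equation F.4. Here `W_t` and `M_t` denote the tilde (normalized, autapse-free) weights; the dimensions, initialization scale, and learning rate `eta` are illustrative assumptions, not values from the text.

```python
import numpy as np

def neural_dynamics(W_t, M_t, x, n_sweeps=200):
    """Asynchronous autapse-free dynamics, equation F.1, run to a fixed point.
    W_t: k x n feedforward weights; M_t: k x k lateral weights, zero diagonal."""
    y = np.zeros(W_t.shape[0])
    b = W_t @ x
    for _ in range(n_sweeps):
        for i in range(len(y)):
            y[i] = b[i] - M_t[i] @ y  # M_t[i, i] == 0, so no autapse term
    return y

def online_step(W_t, M_t, D, x, eta=0.05):
    """One step of the plasticity rules F.4 (tau = 1/2, eta_t = eta/2)."""
    y = neural_dynamics(W_t, M_t, x)
    D_new = (1 - eta) * D + y ** 2
    post = (y ** 2)[:, None]  # postsynaptic decay factor y_i^2
    W_new = W_t + (np.outer(y, x) - post * W_t) / D_new[:, None]
    M_new = M_t + (np.outer(y, y) - post * M_t) / D_new[:, None]
    np.fill_diagonal(M_new, 0.0)  # keep the network autapse-free
    return W_new, M_new, D_new, y

rng = np.random.default_rng(1)
k, n = 2, 5
W, M = 0.1 * rng.standard_normal((k, n)), np.zeros((k, k))
D = np.ones(k)
for _ in range(10):
    W, M, D, y = online_step(W, M, D, rng.standard_normal(n))
assert np.allclose(np.diag(M), 0.0) and np.all(D > 0)
```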

Appendix G: Autapse-Free Constrained Similarity Matching Network

with Asymmetric Lateral Connectivity

Following steps similar to those in appendix F, we derive an autapse-free PSW neural algorithm with asymmetric lateral connections. We replace the gradient descent neural dynamics defined by equation 3.13 with coordinate descent dynamics in which, at every step, one finds the optimal value of one component of $\mathbf{y}_t$ while keeping the rest fixed:

$$y_{t,i} = \sum_{j} \frac{W_{t,ij}}{M_{t,ii}}\,x_{t,j} - \sum_{j \neq i} \frac{M_{t,ij}}{M_{t,ii}}\,y_{t,j}. \qquad \text{(G.1)}$$

The components can be cycled through in any order until the iteration converges to a fixed point.

The coordinate descent iteration, equation G.1, can be interpreted as the dynamics of an asynchronous autapse-free neural network (see Figure 1B) with synaptic weights

$$\tilde{W}_{t,ij} = \frac{W_{t,ij}}{M_{t,ii}}, \qquad \tilde{M}_{t,ij} = \frac{M_{t,ij}}{M_{t,ii}}, \qquad \tilde{M}_{t,ii} = 0. \qquad \text{(G.2)}$$

As in appendix F, the new lateral weights are asymmetric.

Updates for these synaptic weights can be derived from the updates for $W_t$ and $M_t$ (see equation 3.14). Defining another scalar state variable for each $i$th neuron, $\tilde{D}_{t,i} := \tau M_{t,ii}/\eta_{t-1}$, we arrive at

$$\tilde{D}_{t+1,i} = \frac{\eta_{t-1}}{\eta_t}\left(1-\frac{\eta_t}{\tau}\right)\tilde{D}_{t,i} + y_{t,i}^2 - 1,$$

$$\tilde{W}_{t+1,ij} = (1-2\eta_t)\,\tilde{W}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(2\tau\, y_{t,i}x_{t,j} - (1-2\eta_t)\left(y_{t,i}^2-1\right)\tilde{W}_{t,ij}\right),$$

$$\tilde{M}_{t+1,i,j\neq i} = \tilde{M}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j} - \left(y_{t,i}^2-1\right)\tilde{M}_{t,ij}\right),$$

$$\tilde{M}_{t+1,ii} = 0. \qquad \text{(G.3)}$$

As in appendix F, in addition to synaptic weights, the neurons need to keep track of a postsynaptic activity-dependent variable $\tilde{D}_{t,i}$ and the gradient descent-ascent learning rate parameters $\eta_{W,t}$, $\eta_{M,t}$, and $\eta_{M,t-1}$.
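For concreteness, one application of the updates in equation G.3 can be sketched as follows; compared with F.3, the postsynaptic factor $y_{t,i}^2$ is replaced by $y_{t,i}^2 - 1$ throughout. The rate values, the large initial $\tilde{D}$ (keeping it positive), and the fixed output vector `y` (which would normally come from iterating equation G.1) are all illustrative assumptions.

```python
import numpy as np

def psw_update(W_t, M_t, D, x, y, eta=0.05, eta_prev=0.05, tau=0.5):
    """One application of the plasticity rules G.3. Relative to F.3, the
    postsynaptic factor y_i**2 becomes y_i**2 - 1 throughout."""
    D_new = (eta_prev / eta) * (1 - eta / tau) * D + y ** 2 - 1
    decay = 1 - 2 * eta
    post = (y ** 2 - 1)[:, None]
    W_new = decay * W_t + (2 * tau * np.outer(y, x) - decay * post * W_t) / D_new[:, None]
    M_new = M_t + (np.outer(y, y) - post * M_t) / D_new[:, None]
    np.fill_diagonal(M_new, 0.0)  # no autapses, equation G.3 last line
    return W_new, M_new, D_new

k, n = 2, 4
W, M, D = np.zeros((k, n)), np.zeros((k, k)), 10.0 * np.ones(k)
x, y = np.ones(n), np.array([1.0, -1.0])
W, M, D = psw_update(W, M, D, x, y)
assert np.allclose(np.diag(M), 0.0) and np.all(D > 0)
```

With this choice of `y`, the whitening-like factor $y_{t,i}^2 - 1$ vanishes, so only the Hebbian/anti-Hebbian terms move the weights.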

For the special case of $\eta_t = \eta/2$ and $\tau = 1/2$, these plasticity rules simplify to

$$\tilde{D}_{t+1,i} = (1-\eta)\,\tilde{D}_{t,i} + y_{t,i}^2 - 1,$$

$$\tilde{W}_{t+1,ij} = (1-\eta)\,\tilde{W}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}x_{t,j} - (1-\eta)\left(y_{t,i}^2-1\right)\tilde{W}_{t,ij}\right),$$

$$\tilde{M}_{t+1,i,j\neq i} = \tilde{M}_{t,ij} + \frac{1}{\tilde{D}_{t+1,i}}\left(y_{t,i}y_{t,j} - \left(y_{t,i}^2-1\right)\tilde{M}_{t,ij}\right),$$