
LETTER Communicated by Erkki Oja

A Hebbian/Anti-Hebbian Neural Network for Linear Subspace Learning: A Derivation from Multidimensional Scaling of Streaming Data

Cengiz Pehlevan

cpehlevan@simonsfoundation.org

Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA 20147,

and Simons Center for Analysis, Simons Foundation, New York, NY 10010, U.S.A.

Tao Hu

taohu@tees.tamus.edu

Texas A&M University, College Station, TX 77843, U.S.A.

Dmitri B. Chklovskii

dchklovskii@simonsfoundation.org

Simons Center for Analysis, Simons Foundation, New York, NY 10010, U.S.A.

Neural network models of early sensory processing typically reduce the dimensionality of streaming input data. Such networks learn the principal subspace, in the sense of principal component analysis, by adjusting synaptic weights according to activity-dependent learning rules. When derived from a principled cost function, these rules are nonlocal and hence biologically implausible. At the same time, biologically plausible local rules have been postulated rather than derived from a principled cost function. Here, to bridge this gap, we derive a biologically plausible network for subspace learning on streaming data by minimizing a principled cost function. In a departure from previous work, where cost was quantified by the representation, or reconstruction, error, we adopt a multidimensional scaling cost function for streaming data. The resulting algorithm relies only on biologically plausible Hebbian and anti-Hebbian local learning rules. In a stochastic setting, synaptic weights converge to a stationary state, which projects the input data onto the principal subspace. If the data are generated by a nonstationary distribution, the network can track the principal subspace. Thus, our result makes a step toward an algorithmic theory of neural computation.

1 Introduction

Early sensory processing reduces the dimensionality of streamed inputs (Hyvärinen, Hurri, & Hoyer, 2009), as evidenced by a high ratio of input to output nerve fiber counts (Shepherd, 2003). For example, in the human

Neural Computation 27, 1461–1495 (2015) © 2015 Massachusetts Institute of Technology
doi:10.1162/NECO_a_00745

Figure 1: An Oja neuron and our neural network. (A) A single Oja neuron computes the principal component, $y$, of the input data, $\mathbf{x}$, if its synaptic weights follow Hebbian updates. (B) A multineuron network computes the principal subspace of the input if the feedforward connection weight updates follow a Hebbian rule and the lateral connection weight updates follow an anti-Hebbian rule.

retina, information gathered by approximately 125 million photoreceptors is conveyed to the lateral geniculate nucleus through 1 million or so ganglion cells (Hubel, 1995). By learning a lower-dimensional subspace and projecting the streamed data onto that subspace, the nervous system denoises and compresses the data, simplifying further processing. Therefore, a biologically plausible implementation of dimensionality reduction may offer a model of early sensory processing.

For a single neuron, a biologically plausible implementation of dimensionality reduction in the streaming, or online, setting has been proposed in the seminal work of Oja (1982; see Figure 1A). At each time point, $t$, an input vector, $\mathbf{x}_t$, is presented to the neuron, and, in response, it computes a scalar output, $y_t = \mathbf{w}\mathbf{x}_t$, where $\mathbf{w}$ is a row vector of input synaptic weights. Furthermore, the synaptic weights $\mathbf{w}$ are updated according to a version of Hebbian learning called Oja's rule,

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_t \left(\mathbf{x}_t^\top - \mathbf{w}\, y_t\right), \qquad (1.1)$$

where $\eta$ is a learning rate and $^\top$ designates a transpose. Then the neuron's synaptic weight vector converges to the principal eigenvector of the covariance matrix of the streamed data (Oja, 1982). Importantly, Oja's learning rule is local, meaning that synaptic weight updates depend on the activities of only the pre- and postsynaptic neurons accessible to each synapse, and it is therefore biologically plausible.
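Oja's rule, equation 1.1, can be sketched in a few lines of numpy. The synthetic stream, the planted principal direction, and all parameter values below are illustrative assumptions, not taken from the letter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean stream whose covariance has one dominant direction.
n, T = 8, 20000
principal = np.ones(n) / np.sqrt(n)                 # planted top eigenvector
X = rng.normal(size=(T, n)) + 3.0 * rng.normal(size=(T, 1)) * principal

w = rng.normal(size=n)          # row vector of input synaptic weights
eta = 1e-3                      # learning rate

for x in X:
    y = w @ x                   # scalar output y_t = w x_t
    w += eta * y * (x - y * w)  # Oja's rule, equation 1.1

# The weight vector should approach the principal eigenvector (up to sign)
# and unit norm.
cos = abs(w @ principal) / np.linalg.norm(w)
```

Note that the update is local in exactly the sense described above: it uses only the presynaptic activity $x$, the postsynaptic activity $y$, and the synapse's own weight.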

Oja's rule can be derived by an approximate gradient descent of the mean squared representation error (Cichocki & Amari, 2002; Yang, 1995), a so-called synthesis view of principal component analysis (PCA) (Pearson, 1901; Preisendorfer & Mobley, 1988):

$$\min_{\mathbf{w}} \sum_t \left\|\mathbf{x}_t - \mathbf{w}^\top\mathbf{w}\,\mathbf{x}_t\right\|_2^2. \qquad (1.2)$$

Computing principal components beyond the first requires more than one output neuron and has motivated numerous neural networks. Some well-known examples are the generalized Hebbian algorithm (GHA) (Sanger, 1989), Földiak's network (Földiak, 1989), the subspace network (Karhunen & Oja, 1982), Rubner's network (Rubner & Tavan, 1989; Rubner & Schulten, 1990), Leen's minimal coupling and full coupling networks (Leen, 1990, 1991), and the APEX network (Kung & Diamantaras, 1990; Kung, Diamantaras, & Taur, 1994). We refer to Becker and Plumbley (1996), Diamantaras and Kung (1996), and Diamantaras (2002) for a detailed review of these and further developments.

However, none of the previous contributions was able to derive a multineuronal single-layer network with local learning rules by minimizing a principled cost function, in the way that Oja's rule, equation 1.1, was derived for a single neuron. The GHA and the subspace rules rely on nonlocal learning rules: feedforward synaptic updates depend on other neurons' synaptic weights and activities. Leen's minimal network is also nonlocal: feedforward synaptic updates of a neuron depend on its lateral synaptic weights. While Földiak's, Rubner's, and Leen's full coupling networks use local Hebbian and anti-Hebbian rules, they were postulated rather than derived from a principled cost function. The APEX network perhaps comes closest to our criterion: the rule for each neuron can be related separately to a cost function that includes contributions from other neurons. But no cost function describes all the neurons combined.

At the same time, numerous dimensionality-reduction algorithms have been developed for data analysis needs, disregarding the biological plausibility requirement. Perhaps the most common approach is again principal component analysis (PCA), which was originally developed for batch processing (Pearson, 1901) but was later adapted to streaming data (Yang, 1995; Crammer, 2006; Arora, Cotter, Livescu, & Srebro, 2012; Goes, Zhang, Arora, & Lerman, 2014). (For a more detailed collection of references, see, e.g., Balzano, 2012.) These algorithms typically minimize the representation error cost function:

$$\min_{\mathbf{F}} \left\|\mathbf{X} - \mathbf{F}^\top\mathbf{F}\,\mathbf{X}\right\|_F^2, \qquad (1.3)$$

where $\mathbf{X}$ is a data matrix and $\mathbf{F}$ is a wide matrix (for detailed notation, see below). The minimum of equation 1.3 is attained when the rows of $\mathbf{F}$ are orthonormal and span the $m$-dimensional principal subspace, and therefore $\mathbf{F}^\top\mathbf{F}$ is the projection matrix onto the subspace (Yang, 1995).^1

A gradient descent minimization of such a cost function can be approximately implemented by the subspace network (Yang, 1995), which, as pointed out above, requires nonlocal learning rules. While this algorithm can be implemented in a neural network using local learning rules, it requires a second layer of neurons (Oja, 1992), making it less appealing.

In this letter, we derive a single-layer network with local Hebbian and anti-Hebbian learning rules, similar in architecture to Földiak's (1989) (see Figure 1B), from a principled cost function and demonstrate that it recovers a principal subspace from streaming data. The novelty of our approach is that rather than starting with the representation error cost function traditionally used for dimensionality reduction, such as PCA, we use the cost function of classical multidimensional scaling (CMDS), a member of the family of multidimensional scaling (MDS) methods (Cox & Cox, 2000; Mardia, Kent, & Bibby, 1980). Whereas the connection between CMDS and PCA has been pointed out previously (Williams, 2001; Cox & Cox, 2000; Mardia et al., 1980), CMDS is typically performed in the batch setting. Instead, we developed a neural network implementation of CMDS for streaming data.

The rest of the letter is organized as follows. In section 2, by minimizing the CMDS cost function, we derive two online algorithms implementable by a single-layer network, with synchronous and asynchronous synaptic weight updates. In section 3, we demonstrate analytically that the synaptic weights define a principal subspace whose dimension $m$ is determined by the number of output neurons and that the stability of the solution requires that this subspace correspond to the top $m$ principal components. In section 4, we show numerically that our algorithm recovers the principal subspace of a synthetic data set and does so faster than existing algorithms. Finally, in section 5, we consider the case when data are generated by a nonstationary distribution and present a generalization of our algorithm to principal subspace tracking.

2 Derivation of Online Algorithms from the CMDS Cost Function

CMDS represents high-dimensional input data in a lower-dimensional output space while preserving pairwise similarities between samples (Young & Householder, 1938; Torgerson, 1952).^2 Let $T$ centered input data samples in $\mathbb{R}^n$ be represented by column vectors $\mathbf{x}_{t=1,\ldots,T}$ concatenated into an

^1 Recall that, in general, the projection matrix onto the row space of a matrix $\mathbf{P}$ is given by $\mathbf{P}^\top\left(\mathbf{P}\mathbf{P}^\top\right)^{-1}\mathbf{P}$, provided $\mathbf{P}\mathbf{P}^\top$ is full rank (Plumbley, 1995). If the rows of $\mathbf{P}$ are orthonormal, this reduces to $\mathbf{P}^\top\mathbf{P}$.

^2 Whereas MDS in general starts with dissimilarities between samples that may not live in a Euclidean geometry, in CMDS the data are assumed to have a Euclidean representation.

$n\times T$ matrix $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_T]$. The corresponding output representations in $\mathbb{R}^m$, $m\leq n$, are column vectors, $\mathbf{y}_{t=1,\ldots,T}$, concatenated into an $m\times T$ matrix $\mathbf{Y} = [\mathbf{y}_1,\ldots,\mathbf{y}_T]$. Similarities between vectors in Euclidean spaces are captured by their inner products. For the input (output) data, such inner products are assembled into a $T\times T$ Gram matrix $\mathbf{X}^\top\mathbf{X}$ ($\mathbf{Y}^\top\mathbf{Y}$).^3 For a given $\mathbf{X}$, CMDS finds $\mathbf{Y}$ by minimizing the so-called strain cost function (Carroll & Chang, 1972):

$$\min_{\mathbf{Y}} \left\|\mathbf{X}^\top\mathbf{X} - \mathbf{Y}^\top\mathbf{Y}\right\|_F^2. \qquad (2.1)$$
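In the batch setting, a minimizer of the strain cost, equation 2.1, can be built from the top-$m$ eigendecomposition of the Gram matrix, and its Gram matrix coincides with that of the PCA projection. A minimal numpy sketch (all sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 10, 3, 500

X = rng.normal(size=(n, T))
X -= X.mean(axis=1, keepdims=True)       # center the data, as CMDS assumes

# Batch CMDS: build Y from the top-m eigendecomposition of the Gram matrix.
evals, evecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending order
order = np.argsort(evals)[::-1][:m]
Y = np.sqrt(evals[order])[:, None] * evecs[:, order].T   # m x T output

# PCA: project X onto the top-m eigenvectors of the covariance X X^T.
cvals, cvecs = np.linalg.eigh(X @ X.T)
U = cvecs[:, -m:].T                      # rows: principal eigenvectors
Y_pca = U @ X

# Both outputs have the same Gram matrix: they differ only by a left
# orthogonal rotation, the symmetry of the cost noted in the text.
same_gram = np.allclose(Y.T @ Y, Y_pca.T @ Y_pca, atol=1e-6)
```

This is the batch computation that the rest of section 2 replaces with an online, neurally implementable procedure.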

For discovering a low-dimensional subspace, the CMDS cost function, equation 2.1, is a viable alternative to the representation error cost function, equation 1.3, because its solution is related to PCA (Williams, 2001; Cox & Cox, 2000; Mardia et al., 1980). Specifically, $\mathbf{Y}$ is the linear projection of $\mathbf{X}$ onto the (principal sub-)space spanned by the $m$ principal eigenvectors of the sample covariance matrix $\mathbf{C}_T = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t^\top = \frac{1}{T}\mathbf{X}\mathbf{X}^\top$. The CMDS cost function defines a subspace rather than individual eigenvectors because left orthogonal rotations of an optimal $\mathbf{Y}$ stay in the subspace and are also optimal, as is evident from the symmetry of the cost function.

In order to reduce the dimensionality of streaming data, we minimize the CMDS cost function, equation 2.1, in the stochastic online setting. At time $T$, a data sample, $\mathbf{x}_T$, drawn independently from a zero-mean distribution, is presented to the algorithm, which computes a corresponding output, $\mathbf{y}_T$, prior to the presentation of the next data sample. Whereas in the batch setting each data sample affects all outputs, in the online setting past outputs cannot be altered. Thus, at time $T$, the algorithm minimizes the cost depending on all inputs and outputs up to time $T$ with respect to $\mathbf{y}_T$ while keeping all the previous outputs fixed:

$$\mathbf{y}_T = \arg\min_{\mathbf{y}_T} \left\|\mathbf{X}^\top\mathbf{X} - \mathbf{Y}^\top\mathbf{Y}\right\|_F^2 = \arg\min_{\mathbf{y}_T} \sum_{t=1}^{T}\sum_{t'=1}^{T}\left(\mathbf{x}_t^\top\mathbf{x}_{t'} - \mathbf{y}_t^\top\mathbf{y}_{t'}\right)^2, \qquad (2.2)$$

where the last equality follows from the definition of the Frobenius norm.

By keeping only the terms that depend on the current output $\mathbf{y}_T$, we get

$$\mathbf{y}_T = \arg\min_{\mathbf{y}_T}\left[-4\,\mathbf{x}_T^\top\left(\sum_{t=1}^{T-1}\mathbf{x}_t\mathbf{y}_t^\top\right)\mathbf{y}_T + 2\,\mathbf{y}_T^\top\left(\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top\right)\mathbf{y}_T - 2\left\|\mathbf{x}_T\right\|^2\left\|\mathbf{y}_T\right\|^2 + \left\|\mathbf{y}_T\right\|^4\right]. \qquad (2.3)$$

^3 When input data are pairwise Euclidean distances, assembled into a matrix $\mathbf{Q}$, the Gram matrix, $\mathbf{X}^\top\mathbf{X}$, can be constructed from $\mathbf{Q}$ by $\mathbf{H}\mathbf{Z}\mathbf{H}$, where $Z_{ij} = -\frac{1}{2}Q_{ij}^2$, $\mathbf{H} = \mathbf{I}_n - (1/n)\mathbf{1}\mathbf{1}^\top$ is the centering matrix, $\mathbf{1}$ is the vector of $n$ unit components, and $\mathbf{I}_n$ is the $n$-dimensional identity matrix (Cox & Cox, 2000; Mardia et al., 1980).

In the large-$T$ limit, expression 2.3 simplifies further because the first two terms grow linearly with $T$ and therefore dominate over the last two. After dropping the last two terms, we arrive at

$$\mathbf{y}_T = \arg\min_{\mathbf{y}_T}\left[-4\,\mathbf{x}_T^\top\left(\sum_{t=1}^{T-1}\mathbf{x}_t\mathbf{y}_t^\top\right)\mathbf{y}_T + 2\,\mathbf{y}_T^\top\left(\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top\right)\mathbf{y}_T\right]. \qquad (2.4)$$

We term the cost in expression 2.4 the online CMDS cost. Because the online CMDS cost is a positive semidefinite quadratic form in $\mathbf{y}_T$, this optimization problem is convex. While it admits a closed-form analytical solution via matrix inversion, we are interested in biologically plausible algorithms. Next, we consider two algorithms that can be mapped onto single-layer neural networks with local learning rules: coordinate descent, leading to asynchronous updates, and Jacobi iteration, leading to synchronous updates.
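As a sanity check on the convexity claim, the closed-form minimizer of expression 2.4 can be obtained by solving the linear system that results from setting the gradient to zero (the matrix-inversion route the text sets aside as biologically implausible). A sketch with randomly generated stand-in histories:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 6, 3, 200

# Stand-in histories x_1..x_{T-1}, y_1..y_{T-1} (random, for illustration).
X_past = rng.normal(size=(n, T - 1))
Y_past = rng.normal(size=(m, T - 1))
x_T = rng.normal(size=n)

Syx = Y_past @ X_past.T        # sum_t y_t x_t^T
Syy = Y_past @ Y_past.T        # sum_t y_t y_t^T  (positive definite here)

def online_cmds_cost(y):
    # Expression 2.4 as a function of the current output y_T.
    return -4.0 * (x_T @ Syx.T @ y) + 2.0 * (y @ Syy @ y)

# Setting the gradient to zero gives a linear system in y_T.
y_star = np.linalg.solve(Syy, Syx @ x_T)

# Convexity: perturbing the minimizer cannot decrease the cost.
perturbed = [online_cmds_cost(y_star + 0.1 * rng.normal(size=m))
             for _ in range(5)]
is_min = all(online_cmds_cost(y_star) <= c for c in perturbed)
```

The two network algorithms that follow converge to this same minimizer without ever forming or inverting the output covariance matrix explicitly.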

2.1 A Neural Network with Asynchronous Updates. The online CMDS cost function, equation 2.4, can be minimized by coordinate descent, which at every step finds the optimal value of one component of $\mathbf{y}_T$ while keeping the rest fixed. The components can be cycled through in any order until the iteration converges to a fixed point. Such iteration is guaranteed to converge under very mild assumptions: the diagonals of $\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top$ have to be positive (Luo & Tseng, 1991), meaning that each output coordinate has produced at least one nonzero output before the current time step $T$. This condition is almost always satisfied in practice.

The cost to be minimized at each coordinate descent step with respect to the $i$th channel's activity is

$$y_{T,i} = \arg\min_{y_{T,i}}\left[-4\,\mathbf{x}_T^\top\left(\sum_{t=1}^{T-1}\mathbf{x}_t\mathbf{y}_t^\top\right)\mathbf{y}_T + 2\,\mathbf{y}_T^\top\left(\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top\right)\mathbf{y}_T\right].$$

Keeping only those terms that depend on $y_{T,i}$ yields

$$y_{T,i} = \arg\min_{y_{T,i}}\left[-4\sum_{k}x_{T,k}\left(\sum_{t=1}^{T-1}x_{t,k}y_{t,i}\right)y_{T,i} + 4\sum_{j\neq i}y_{T,j}\left(\sum_{t=1}^{T-1}y_{t,j}y_{t,i}\right)y_{T,i} + 2\left(\sum_{t=1}^{T-1}y_{t,i}^2\right)y_{T,i}^2\right].$$

By taking a derivative with respect to $y_{T,i}$ and setting it to zero, we arrive at the following closed-form solution:

$$y_{T,i} = \frac{\sum_{k}\left(\sum_{t=1}^{T-1}y_{t,i}x_{t,k}\right)x_{T,k}}{\sum_{t=1}^{T-1}y_{t,i}^2} - \frac{\sum_{j\neq i}\left(\sum_{t=1}^{T-1}y_{t,i}y_{t,j}\right)y_{T,j}}{\sum_{t=1}^{T-1}y_{t,i}^2}. \qquad (2.5)$$

To implement this algorithm in a neural network, we denote normalized input-output and output-output covariances,

$$W_{T,ik} = \frac{\sum_{t=1}^{T-1}y_{t,i}x_{t,k}}{\sum_{t=1}^{T-1}y_{t,i}^2}, \qquad M_{T,i,j\neq i} = \frac{\sum_{t=1}^{T-1}y_{t,i}y_{t,j}}{\sum_{t=1}^{T-1}y_{t,i}^2}, \qquad M_{T,ii} = 0, \qquad (2.6)$$

allowing us to rewrite the solution, equation 2.5, in a form suggestive of a linear neural network,

$$y_{T,i} \leftarrow \sum_{j=1}^{n} W_{T,ij}\,x_{T,j} - \sum_{j=1}^{m} M_{T,ij}\,y_{T,j}, \qquad (2.7)$$

where $\mathbf{W}_T$ and $\mathbf{M}_T$ represent the synaptic weights of feedforward and lateral connections, respectively (see Figure 1B).

Finally, to formulate a fully online algorithm, we rewrite equation 2.6 in a recursive form. This requires introducing a scalar variable $D_{T,i}$ representing the cumulative squared activity of neuron $i$ up to time $T-1$,

$$D_{T,i} = \sum_{t=1}^{T-1} y_{t,i}^2. \qquad (2.8)$$

Then at each time point, $T$, after the output $\mathbf{y}_T$ is computed by the network, the following updates are performed:

$$\begin{aligned}
D_{T+1,i} &\leftarrow D_{T,i} + y_{T,i}^2,\\
W_{T+1,ij} &\leftarrow W_{T,ij} + y_{T,i}\left(x_{T,j} - W_{T,ij}\,y_{T,i}\right)/D_{T+1,i},\\
M_{T+1,i,j\neq i} &\leftarrow M_{T,ij} + y_{T,i}\left(y_{T,j} - M_{T,ij}\,y_{T,i}\right)/D_{T+1,i}.
\end{aligned} \qquad (2.9)$$

Equations 2.7 and 2.9 define a neural network algorithm that minimizes the online CMDS cost function, equation 2.4, for streaming data by alternating between two phases: neural activity dynamics and synaptic updates. After a data sample is presented at time $T$, in the neuronal activity phase, neuron activities are updated one by one (i.e., asynchronously; see equation 2.7) until the dynamics converges to a fixed point defined by the following equation:

$$\mathbf{y}_T = \mathbf{W}_T\,\mathbf{x}_T - \mathbf{M}_T\,\mathbf{y}_T \quad\Rightarrow\quad \mathbf{y}_T = \left(\mathbf{I}_m + \mathbf{M}_T\right)^{-1}\mathbf{W}_T\,\mathbf{x}_T, \qquad (2.10)$$

where $\mathbf{I}_m$ is the $m$-dimensional identity matrix.

In the second phase of the algorithm, synaptic weights are updated according to a local Hebbian rule, equation 2.9, for feedforward connections and according to a local anti-Hebbian rule (due to the minus sign in equation 2.7) for lateral connections. Interestingly, these updates have the same form as the single-neuron Oja rule, equation 1.1 (Oja, 1982), except that the learning rate is not a free parameter but is determined by the cumulative neuronal activity, $1/D_{T+1,i}$.^4 To the best of our knowledge, such a single-neuron rule (Hu, Towfic, Pehlevan, Genkin, & Chklovskii, 2013) has not been derived in the multineuron case. An alternative derivation of this algorithm is presented in section A.1 in the appendix.
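The two-phase algorithm can be sketched in numpy. For brevity, the sketch jumps directly to the fixed point of the neural dynamics, equation 2.10, instead of cycling coordinate descent (both share the same fixed point); the data, dimensions, and initial value of $D$ are illustrative assumptions, not the letter's simulation settings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, T = 16, 2, 6000

# Synthetic stream with a planted 2-dimensional principal subspace.
basis = np.linalg.qr(rng.normal(size=(n, m)))[0]     # n x m, orthonormal
X = rng.normal(size=(T, n)) + 4.0 * rng.normal(size=(T, m)) @ basis.T

W = rng.normal(size=(m, n)) / np.sqrt(n)   # feedforward weights
M = np.zeros((m, m))                       # lateral weights, zero diagonal
D = np.full(m, 10.0)                       # cumulative squared activities

for x in X:
    # Phase 1: neural dynamics, taken directly to the fixed point (2.10).
    y = np.linalg.solve(np.eye(m) + M, W @ x)
    # Phase 2: local Hebbian/anti-Hebbian updates (2.9).
    D += y**2
    W += (np.outer(y, x) - W * (y**2)[:, None]) / D[:, None]
    M += (np.outer(y, y) - M * (y**2)[:, None]) / D[:, None]
    np.fill_diagonal(M, 0.0)

# Neural filters (rows of (I + M)^{-1} W) should become near-orthonormal
# and span the planted principal subspace.
F = np.linalg.solve(np.eye(m) + M, W)
orthonormality = np.linalg.norm(F @ F.T - np.eye(m))
overlap = np.linalg.norm(F @ basis)        # ~ sqrt(m) when subspaces match
```

Every update touches only quantities available at the corresponding synapse, which is the locality property the derivation was designed to preserve.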

Unlike the representation error cost function, equation 1.3, the CMDS cost function, equation 2.1, is formulated only in terms of input and output activity. Yet the minimization with respect to $\mathbf{Y}$ recovers feedforward and lateral synaptic weights.

2.2 A Neural Network with Synchronous Updates. Here, we present an alternative way to derive a neural network algorithm from the large-$T$ limit of the online CMDS cost function, equation 2.4. By taking a derivative with respect to $\mathbf{y}_T$ and setting it to zero, we arrive at the following linear matrix equation:

$$\left(\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top\right)\mathbf{y}_T = \left(\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{x}_t^\top\right)\mathbf{x}_T. \qquad (2.11)$$

We solve this system of equations using Jacobi iteration (Strang, 2009) by first splitting the output covariance matrix that appears on the left side of equation 2.11 into its diagonal component $\mathbf{D}_T$ and the remainder $\mathbf{R}_T$,

$$\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top = \mathbf{D}_T + \mathbf{R}_T,$$

^4 The single-neuron Oja rule derived from the minimization of a least squares optimization cost function ends up with the identical learning rate (Diamantaras, 2002; Hu et al., 2013). Motivated by this fact, such a learning rate has been argued to be optimal for the APEX network (Diamantaras & Kung, 1996; Diamantaras, 2002) and used by others (Yang, 1995).

where the $i$th diagonal element of $\mathbf{D}_T$ is $D_{T,i} = \sum_{t=1}^{T-1}y_{t,i}^2$, as defined in equation 2.8. Then equation 2.11 is equivalent to

$$\mathbf{y}_T = \mathbf{D}_T^{-1}\left(\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{x}_t^\top\right)\mathbf{x}_T - \mathbf{D}_T^{-1}\mathbf{R}_T\,\mathbf{y}_T.$$

Interestingly, the matrices obtained on the right side are algebraically equivalent to the feedforward and lateral synaptic weight matrices defined in equation 2.6:

$$\mathbf{W}_T = \mathbf{D}_T^{-1}\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{x}_t^\top \quad\text{and}\quad \mathbf{M}_T = \mathbf{D}_T^{-1}\mathbf{R}_T. \qquad (2.12)$$

Hence, the Jacobi iteration for solving equation 2.11,

$$\mathbf{y}_T \leftarrow \mathbf{W}_T\,\mathbf{x}_T - \mathbf{M}_T\,\mathbf{y}_T, \qquad (2.13)$$

converges to the same fixed point as the coordinate descent, equation 2.10.

Iteration 2.13 is naturally implemented by the same single-layer linear neural network as for the asynchronous update (see Figure 1B). For each stimulus presentation, the network goes through two phases. In the first phase, iteration 2.13 is repeated until convergence. Unlike the coordinate descent algorithm, which updated the activity of neurons one after another, here the activities of all neurons are updated synchronously. In the second phase, the synaptic weight matrices are updated according to the same rules as in the asynchronous update algorithm, equation 2.9.

Unlike the asynchronous update, equation 2.7, for which convergence is almost always guaranteed (Luo & Tseng, 1991), convergence of iteration 2.13 is guaranteed only when the spectral radius of $\mathbf{M}$ is less than 1 (Strang, 2009). Whereas we cannot prove that this condition is always met, the synchronous algorithm works well in practice. While in the rest of the letter we consider only the asynchronous updates algorithm, our results hold for the synchronous updates algorithm provided it converges.
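The equivalence of the two fixed points is easy to check directly: iterating equation 2.13 with stand-in weight matrices (randomly generated here, not learned) reproduces the matrix-inversion solution of equation 2.10 whenever the spectral radius condition holds:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 8, 3

# Stand-in weights (randomly generated, for illustration only).
W = rng.normal(size=(m, n)) / np.sqrt(n)
M = 0.1 * rng.normal(size=(m, m))
np.fill_diagonal(M, 0.0)
x = rng.normal(size=n)

# Jacobi convergence requires the spectral radius of M to be below 1.
radius = np.max(np.abs(np.linalg.eigvals(M)))
assert radius < 1.0

# Synchronous (Jacobi) iteration, equation 2.13.
y = np.zeros(m)
for _ in range(200):
    y = W @ x - M @ y

# Fixed point of the neural dynamics, equation 2.10.
y_fp = np.linalg.solve(np.eye(m) + M, W @ x)
agree = np.allclose(y, y_fp)
```

The residual of the iteration shrinks geometrically with the spectral radius, so a few dozen synchronous updates suffice here.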

3 Stationary Synaptic Weights Define a Principal Subspace

What is the nature of the lower-dimensional representation found by our algorithm? In CMDS, outputs $y_{T,i}$ are the Euclidean coordinates in the principal subspace of the input vector $\mathbf{x}_T$ (Cox & Cox, 2000; Mardia et al., 1980). While our algorithm uses the same cost function as CMDS, the minimization is performed in the streaming, or online, setting. Therefore, we cannot take for granted that our algorithm will find the principal subspace of the input. In this section, we provide analytical evidence, by a stability analysis in a stochastic setting, that our algorithm extracts the principal subspace of the input data and projects onto that subspace. We start by previewing our results and method.

Our algorithm performs a linear dimensionality reduction since the transformation between the input and the output is linear. This can be seen from the neural activity fixed point, equation 2.10, which we rewrite as

$$\mathbf{y}_T = \mathbf{F}_T\,\mathbf{x}_T, \qquad (3.1)$$

where $\mathbf{F}_T$ is a matrix defined in terms of the synaptic weight matrices $\mathbf{W}_T$ and $\mathbf{M}_T$:

$$\mathbf{F}_T := \left(\mathbf{I}_m + \mathbf{M}_T\right)^{-1}\mathbf{W}_T. \qquad (3.2)$$

Relation 3.1 shows that the linear filter of a neuron, which we term a neural filter, is the corresponding row of $\mathbf{F}_T$. The space that the neural filters span, the row space of $\mathbf{F}_T$, is termed the filter space.

First, we prove that in the stationary state of our algorithm, the neural filters are indeed orthonormal vectors (see section 3.2, theorem 1). Second, we demonstrate that the orthonormal filters form the basis of a space spanned by some $m$ eigenvectors of the covariance of the inputs, $\mathbf{C}$ (see section 3.3, theorem 2). Third, by analyzing linear perturbations around the stationary state, we find that stability requires these $m$ eigenvectors to be the principal eigenvectors, and therefore the filter space to be the principal subspace (see section 3.4, theorem 3).

These results show that even though our algorithm was derived starting from the CMDS cost function, equation 2.1, $\mathbf{F}_T$ converges to the optimal solution of the representation error cost function, equation 1.3. This correspondence suggests that $\mathbf{F}_T^\top\mathbf{F}_T$ is the algorithm's current estimate of the projection matrix onto the principal subspace. Further, in equation 1.3, the columns of $\mathbf{F}^\top$ are interpreted as data features. Then the columns of $\mathbf{F}_T^\top$, or neural filters, are the algorithm's estimate of such features.

Rigorous stability analyses of PCA neural networks (Oja, 1982, 1992; Oja & Karhunen, 1985; Sanger, 1989; Hornik & Kuan, 1992; Plumbley, 1995) typically use the ODE method (Kushner & Clark, 1978). Using a theorem of stochastic approximation theory (Kushner & Clark, 1978), the convergence properties of the algorithm are determined using a corresponding deterministic differential equation.^5

^5 Application of stochastic approximation theory to PCA neural networks depends on a set of mathematical assumptions. See Zufiria (2002) for a critique of the validity of these assumptions and an alternative approach to stability analysis.

Unfortunately, the ODE method cannot be used for our network. While the method requires learning rates that depend only on time, in our network the learning rates ($1/D_{T+1,i}$) are activity dependent. Therefore we take a different approach. We work directly with the discrete-time system, assume convergence to a stationary state, to be defined below, and study the stability of that stationary state.

3.1 Preliminaries. We adopt a stochastic setting where the input to the network at each time point, $\mathbf{x}_t$, is an $n$-dimensional independent and identically distributed random vector with zero mean, $\langle\mathbf{x}_t\rangle = \mathbf{0}$, where brackets denote an average over the input distribution, and covariance $\mathbf{C} = \langle\mathbf{x}_t\mathbf{x}_t^\top\rangle$.

Our analysis is performed for the stationary state of synaptic weight updates; that is, when averaged over the distribution of input values, the updates on $\mathbf{W}$ and $\mathbf{M}$ average to zero. This is the point of convergence of our algorithm. For the rest of the section, we drop the time index $T$ to denote stationary state variables.

The remaining dynamical variables, the learning rates $1/D_{T+1,i}$, keep decreasing at each time step due to neural activity. We assume that the algorithm has run for a sufficiently long time such that the change in learning rate is small and it can be treated as a constant for a single update. Moreover, we assume that the algorithm converges to a stationary point sufficiently fast such that the following approximation is valid at large $T$:

$$\frac{1}{D_{T+1,i}} = \frac{1}{\sum_{t=1}^{T}y_{t,i}^2} \approx \frac{1}{T\left\langle y_i^2\right\rangle},$$

where $\langle y_i^2\rangle$ is calculated with the stationary state weight matrices.

We collect these assumptions into a definition:

Definition 1 (Stationary State). In the stationary state,

$$\left\langle\Delta W_{ij}\right\rangle = \left\langle\Delta M_{ij}\right\rangle = 0,$$

and

$$\frac{1}{D_i} = \frac{1}{T\left\langle y_i^2\right\rangle},$$

with $T$ large.


The stationary state assumption leads us to define various relations between synaptic weight matrices, summarized in the following corollary:

Corollary 1. In the stationary state,

$$\left\langle y_i x_j\right\rangle = \left\langle y_i^2\right\rangle W_{ij}, \qquad (3.3)$$

and

$$\left\langle y_i y_j\right\rangle = \left\langle y_i^2\right\rangle\left(M_{ij} + \delta_{ij}\right), \qquad (3.4)$$

where $\delta_{ij}$ is the Kronecker delta.

Proof. The stationarity assumption, when applied to the update rule on $\mathbf{W}$, equation 2.9, leads immediately to equation 3.3. The stationarity assumption applied to the update rule on $\mathbf{M}$, equation 2.9, gives

$$\left\langle y_i y_j\right\rangle = \left\langle y_i^2\right\rangle M_{ij}, \quad i\neq j.$$

The last equality does not hold for $i=j$ since the diagonal elements of $\mathbf{M}$ are zero. To cover the case $i=j$, we add an identity matrix to $\mathbf{M}$, and hence one recovers equation 3.4.

Remark. Note that equation 3.4 implies $\langle y_i^2\rangle M_{ij} = \langle y_j^2\rangle M_{ji}$; that is, lateral connection weights are not symmetrical.

3.2 Orthonormality of Neural Filters. Here we prove the orthonormality of neural filters in the stationary state. First, we need the following lemma:

Lemma 1. In the stationary state, the following equality holds:

$$\mathbf{I}_m + \mathbf{M} = \mathbf{W}\mathbf{F}^\top. \qquad (3.5)$$

Proof. By equation 3.4, $\langle y_i^2\rangle\left(M_{ik} + \delta_{ik}\right) = \langle y_i y_k\rangle$. Using $\mathbf{y} = \mathbf{F}\mathbf{x}$, we substitute for $y_k$ on the right-hand side: $\langle y_i^2\rangle\left(M_{ik} + \delta_{ik}\right) = \sum_j F_{kj}\langle y_i x_j\rangle$. Next, the stationarity condition, equation 3.3, yields $\langle y_i^2\rangle\left(M_{ik} + \delta_{ik}\right) = \langle y_i^2\rangle\sum_j F_{kj}W_{ij}$. Canceling $\langle y_i^2\rangle$ on both sides proves the lemma.

Now we can prove our theorem:

Theorem 1. In the stationary state, neural filters are orthonormal:

$$\mathbf{F}\mathbf{F}^\top = \mathbf{I}_m. \qquad (3.6)$$

Proof. First, we substitute for $\mathbf{F}$ (but not for $\mathbf{F}^\top$) its definition (see equation 3.2): $\mathbf{F}\mathbf{F}^\top = \left(\mathbf{I}_m+\mathbf{M}\right)^{-1}\mathbf{W}\mathbf{F}^\top$. Next, using lemma 1, we substitute $\mathbf{W}\mathbf{F}^\top$ by $\left(\mathbf{I}_m+\mathbf{M}\right)$. The right-hand side becomes $\left(\mathbf{I}_m+\mathbf{M}\right)^{-1}\left(\mathbf{I}_m+\mathbf{M}\right) = \mathbf{I}_m$.

Remark. Theorem 1 implies that $\operatorname{rank}(\mathbf{F}) = m$.

3.3 Neural Filters and Their Relationship to the Eigenspace of the Covariance Matrix. How is the filter space related to the input? We partially answer this question in theorem 2, using the following lemma:

Lemma 2. In the stationary state, $\mathbf{F}^\top\mathbf{F}$ and $\mathbf{C}$ commute:

$$\mathbf{F}^\top\mathbf{F}\,\mathbf{C} = \mathbf{C}\,\mathbf{F}^\top\mathbf{F}. \qquad (3.7)$$

Proof. See section A.2.

Now we can state our second theorem.

Theorem 2. In the stationary state, the filter space is an $m$-dimensional subspace in $\mathbb{R}^n$ that is spanned by some $m$ eigenvectors of the covariance matrix.

Proof. Because $\mathbf{F}^\top\mathbf{F}$ and $\mathbf{C}$ commute (see lemma 2), they must share the same eigenvectors. Equation 3.6 of theorem 1 implies that $m$ eigenvalues of $\mathbf{F}^\top\mathbf{F}$ are unity and the rest are zero. The eigenvectors associated with the unit eigenvalues span the row space of $\mathbf{F}$ and are identical to some $m$ eigenvectors of $\mathbf{C}$.^6

Which $m$ eigenvectors of $\mathbf{C}$ span the filter space? To show that these are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{C}$, we perform a linear stability analysis around the stationary point and show that any other combination would be unstable.

3.4 Linear Stability Requires Neural Filters to Span a Principal Subspace. The strategy here is to perturb $\mathbf{F}$ from its equilibrium value and show that the perturbation is linearly stable only if the row space of $\mathbf{F}$ is the space spanned by the eigenvectors corresponding to the $m$ highest eigenvalues of $\mathbf{C}$. To prove this result, we need two more lemmas.

Lemma 3. Let $\mathbf{H}$ be an $m\times n$ real matrix with orthonormal rows and $\mathbf{G}$ an $(n-m)\times n$ real matrix with orthonormal rows, whose rows are chosen to be orthogonal to the rows of $\mathbf{H}$. Any $m\times n$ real matrix $\mathbf{Q}$ can be decomposed as

$$\mathbf{Q} = \mathbf{A}\mathbf{H} + \mathbf{S}\mathbf{H} + \mathbf{B}\mathbf{G},$$

^6 If this fact is not familiar, we recommend Strang's (2009) discussion of the singular value decomposition.

where $\mathbf{A}$ is an $m\times m$ skew-symmetric matrix, $\mathbf{S}$ is an $m\times m$ symmetric matrix, and $\mathbf{B}$ is an $m\times(n-m)$ matrix.

Proof. Define $\mathbf{B} := \mathbf{Q}\mathbf{G}^\top$, $\mathbf{A} := \frac{1}{2}\left(\mathbf{Q}\mathbf{H}^\top - \mathbf{H}\mathbf{Q}^\top\right)$, and $\mathbf{S} := \frac{1}{2}\left(\mathbf{Q}\mathbf{H}^\top + \mathbf{H}\mathbf{Q}^\top\right)$. Then $\mathbf{A}\mathbf{H} + \mathbf{S}\mathbf{H} + \mathbf{B}\mathbf{G} = \mathbf{Q}\left(\mathbf{H}^\top\mathbf{H} + \mathbf{G}^\top\mathbf{G}\right) = \mathbf{Q}$.
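The decomposition in lemma 3 is easy to verify numerically. A sketch with a random orthonormal pair $\mathbf{H}$, $\mathbf{G}$ obtained by QR factorization (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 7, 3

# Random orthonormal frames: H spans an m-dim subspace, G its complement.
full = np.linalg.qr(rng.normal(size=(n, n)))[0]
H, G = full[:, :m].T, full[:, m:].T

Q = rng.normal(size=(m, n))        # arbitrary m x n matrix to decompose

# A, S, B constructed exactly as in the proof of lemma 3.
B = Q @ G.T
A = 0.5 * (Q @ H.T - H @ Q.T)
S = 0.5 * (Q @ H.T + H @ Q.T)

skew = np.allclose(A, -A.T)                     # A is skew-symmetric
symm = np.allclose(S, S.T)                      # S is symmetric
exact = np.allclose(A @ H + S @ H + B @ G, Q)   # decomposition recovers Q
```

The identity $\mathbf{H}^\top\mathbf{H} + \mathbf{G}^\top\mathbf{G} = \mathbf{I}_n$ used in the proof holds because the rows of $\mathbf{H}$ and $\mathbf{G}$ together form an orthonormal basis of $\mathbb{R}^n$.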

We denote an arbitrary perturbation of $\mathbf{F}$ as $\delta\mathbf{F}$, where a small parameter is implied. We can use lemma 3 to decompose $\delta\mathbf{F}$ as

$$\delta\mathbf{F} = \delta\mathbf{A}\,\mathbf{F} + \delta\mathbf{S}\,\mathbf{F} + \delta\mathbf{B}\,\mathbf{G}, \qquad (3.8)$$

where the rows of $\mathbf{G}$ are orthogonal to the rows of $\mathbf{F}$. The skew-symmetric $\delta\mathbf{A}$ corresponds to rotations of filters within the filter space; it keeps the neural filters orthonormal. The symmetric $\delta\mathbf{S}$ keeps the filter space invariant but destroys orthonormality. $\delta\mathbf{B}$ is a perturbation that takes the neural filters outside the filter space.

Next, we calculate how $\delta\mathbf{F}$ evolves under the learning rule.

Lemma 4. A perturbation to the stationary state has the following evolution under the learning rule, to linear order in the perturbation and linear order in $T^{-1}$:

$$\left\langle\Delta\delta F_{ij}\right\rangle = \frac{1}{T}\sum_{k}\frac{\left[\left(\mathbf{I}_m+\mathbf{M}\right)^{-1}\right]_{ik}}{\left\langle y_k^2\right\rangle}\left[\sum_{l}\delta F_{kl}C_{lj} - \sum_{lpr}\delta F_{kl}F_{rp}C_{lp}F_{rj} - \sum_{lpr}F_{kl}\,\delta F_{rp}C_{lp}F_{rj}\right] - \frac{1}{T}\,\delta F_{ij}. \qquad (3.9)$$

Proof. The proof is provided in section A.3.

Now we can state our main result in the following theorem:

Theorem 3. The stationary state of neuronal filters is stable, in the large-$T$ limit, only if the $m$-dimensional filter space is spanned by the eigenvectors of the covariance matrix corresponding to the $m$ highest eigenvalues.

Proof. The full proof is given in section A.4. Here we sketch the proof. To simplify our analysis, we choose a specific $\mathbf{G}$ in lemma 3 without losing generality. Let $\mathbf{v}_{1,\ldots,n}$ be the eigenvectors of $\mathbf{C}$ and $v_{1,\ldots,n}$ be the corresponding eigenvalues, labeled so that the first $m$ eigenvectors span the row space of $\mathbf{F}$ (or filter space). We choose the rows of $\mathbf{G}$ to be the remaining eigenvectors: $\mathbf{G} := [\mathbf{v}_{m+1},\ldots,\mathbf{v}_n]$.

By extracting the evolution of components of $\delta\mathbf{F}$ from equation 3.9 using equation 3.8, we are ready to state the conditions under which perturbations of $\mathbf{F}$ are stable. Multiplying equation 3.9 on the right by $\mathbf{G}^\top$ gives the evolution of $\delta\mathbf{B}$:

$$\left\langle\Delta\delta B^j_i\right\rangle = \sum_{k} P^j_{ik}\,\delta B^j_k, \quad\text{where}\quad P^j_{ik} \equiv \frac{1}{T}\left(\frac{\left[\left(\mathbf{I}_m+\mathbf{M}\right)^{-1}\right]_{ik}}{\left\langle y_k^2\right\rangle}\,v_{j+m} - \delta_{ik}\right).$$

Here we changed our notation to $\delta B_{kj} = \delta B^j_k$ to make it explicit that for each $j$ we have one matrix equation. These equations are stable when all eigenvalues of all $\mathbf{P}^j$ are negative, which requires, as shown in section A.4,

$$v_1,\ldots,v_m > v_{m+1},\ldots,v_n.$$

This result proves that the perturbation is stable only if the filter space is identical to the space spanned by the eigenvectors corresponding to the $m$ highest eigenvalues of $\mathbf{C}$.

It remains to analyze the stability of the $\delta\mathbf{A}$ and $\delta\mathbf{S}$ perturbations. Multiplying equation 3.9 on the right by $\mathbf{F}^\top$ gives

$$\left\langle\Delta\delta A_{ij}\right\rangle = 0 \quad\text{and}\quad \left\langle\Delta\delta S_{ij}\right\rangle = -\frac{2}{T}\,\delta S_{ij}.$$

The $\delta\mathbf{A}$ perturbation, which rotates the neural filters, does not decay. This behavior is inherently related to the discussed symmetry of the strain cost function, equation 2.1, with respect to left rotations of the $\mathbf{Y}$ matrix. Rotated $\mathbf{y}$ vectors are obtained from the input by rotated neural filters, and hence the $\delta\mathbf{A}$ perturbation does not affect the cost. But $\delta\mathbf{S}$ destroys orthonormality, and these perturbations do decay, making the orthonormal solution stable.

To summarize our analysis, if the dynamics converges to a stationary state, the neural filters form an orthonormal basis of the principal subspace.
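The stationary-state claims can be illustrated without running the network: taking $\mathbf{F}$ to be the top-$m$ principal eigenvectors of a randomly generated covariance (the configuration that theorems 1 to 3 single out as the stable one) satisfies the orthonormality and commutation relations:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 9, 4

# A randomly generated covariance matrix (illustrative, not the paper's).
Z = rng.normal(size=(n, n))
C = Z @ Z.T / n

# Stationary-state filters predicted by the analysis: the rows of F are
# the top-m principal eigenvectors of C.
evals, evecs = np.linalg.eigh(C)
F = evecs[:, -m:].T

orthonormal = np.allclose(F @ F.T, np.eye(m))       # theorem 1: F F^T = I_m
commute = np.allclose(F.T @ F @ C, C @ F.T @ F)     # lemma 2: F^T F and C commute
```

Any left orthogonal rotation of $\mathbf{F}$ passes the same checks, which is the non-decaying $\delta\mathbf{A}$ degeneracy noted above.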

4 Numerical Simulations of the Asynchronous Network

Here, we simulate the performance of the network with asynchronous updates, equations 2.7 and 2.9, on synthetic data. The data were generated by a colored gaussian process with an arbitrarily chosen "actual" covariance matrix. We chose the number of input channels, $n = 64$, and the number of output channels, $m = 4$. In the input data, the ratio of the power in the first four principal components to the power in the remaining 60 components was 0.54. $W$ and $M$ were initialized randomly, and the step size of synaptic updates was initialized to $1/D_{0,i} = 0.1$. The coordinate descent step is cycled over neurons until the magnitude of change in $\mathbf{y}_T$ in one cycle is less than $10^{-5}$ times the magnitude of $\mathbf{y}_T$.
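The two-phase procedure just described can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the code used for the figures: the dimensions are reduced, and the input here is plain white gaussian noise rather than the colored process described above.

```python
import numpy as np

def run_dynamics(x, W, M, tol=1e-5, max_iter=1000):
    """Phase 1: asynchronous (coordinate-descent) updates, cycled over
    neurons until y changes by less than tol * ||y|| in one cycle."""
    y = W @ x                       # initial guess before lateral inhibition
    for _ in range(max_iter):
        y_old = y.copy()
        for i in range(len(y)):     # M has zero diagonal, so no self term
            y[i] = W[i] @ x - M[i] @ y
        if np.linalg.norm(y - y_old) <= tol * np.linalg.norm(y):
            break
    return y

def update_weights(x, y, W, M, D):
    """Phase 2: local Hebbian/anti-Hebbian updates with adaptive step 1/D."""
    D += y ** 2                     # cumulative activity of each neuron
    W += (np.outer(y, x) - (y ** 2)[:, None] * W) / D[:, None]
    dM = (np.outer(y, y) - (y ** 2)[:, None] * M) / D[:, None]
    np.fill_diagonal(dM, 0.0)       # lateral weights have no self-coupling
    M += dM

rng = np.random.default_rng(0)
n, m = 8, 2
W = rng.standard_normal((m, n)) / np.sqrt(n)   # random initialization
M = np.zeros((m, m))
D = np.full(m, 10.0)                           # step size starts at 1/D0 = 0.1
for _ in range(500):
    x = rng.standard_normal(n)
    y = run_dynamics(x, W, M)
    update_weights(x, y, W, M, D)
```

At the fixed point of `run_dynamics`, the output satisfies $(I_m + M)\mathbf{y} = W\mathbf{x}$, which is how the sketch can be checked.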

We compared the performance of the asynchronous-updates network, equations 2.7 and 2.9, with two previously proposed networks, APEX (Kung & Diamantaras, 1990; Kung et al., 1994) and Földiák's (1989), on the same data set (see Figure 2). The APEX network uses the same Hebbian and anti-Hebbian learning rules for synaptic weights, but its architecture is slightly different in that the lateral connection matrix, $M$, is lower triangular. Földiák's network has the same architecture as ours (see Figure 1B) and the same learning rules for feedforward connections. However, its learning rule for lateral connections is $\Delta M_{ij} \propto y_i y_j$, unlike equation 2.9. For the sake of fairness, we applied the same adaptive step-size procedure for all networks. As in equation 2.9, the step size for each neuron $i$ at time $T$ was $1/D_{T+1,i}$, with $D_{T+1,i} = D_{T,i} + y_{T,i}^2$. In fact, such a learning rate has been recommended and argued to be optimal for the APEX network (Diamantaras & Kung, 1996; Diamantaras, 2002; see also note 4).

1476 C. Pehlevan, T. Hu, and D. Chklovskii

Figure 2: Performance of the asynchronous neural network compared with existing algorithms. Each algorithm was applied to 40 different random data sets drawn from the same gaussian statistics, described in the text. Weight initializations were random. Solid lines indicate means, and shades indicate standard deviations across 40 runs. All errors are in decibels (dB). For formal metric definitions, see the text. (A) Strain error as a function of data presentations. The dotted line is the best error in the batch setting, calculated using the eigenvalues of the actual covariance matrix. (B) Subspace error as a function of data presentations. (C) Nonorthonormality error as a function of data presentations.

To quantify the performance of these algorithms, we used three different metrics. The first is the strain cost function, equation 2.1, normalized by $T^2$ (see Figure 2A). Such a normalization is chosen because the minimum value of the offline strain cost equals the power contained in the eigenmodes beyond the top $m$: $T^2\sum_{k=m+1}^{n} v_k^2$, where $\{v_1,\ldots,v_n\}$ are the eigenvalues of the sample covariance matrix $C_T$ (Cox & Cox, 2000; Mardia et al., 1980). For each of the three networks, as expected, the strain cost rapidly drops toward its lower bound. As our network was derived from the minimization of the strain cost function, it is not surprising that its cost drops faster than that of the other two.

The second metric quantifies the deviation of the learned subspace from the actual principal subspace. At each $T$, the deviation is $\left\|F_T^\top F_T - V^\top V\right\|_F^2$,


where $V$ is an $m\times n$ matrix whose rows are the principal eigenvectors, $V^\top V$ is the projection matrix onto the principal subspace, $F_T$ is defined in the same way for the APEX and Földiák networks as for ours, and $F_T^\top F_T$ is the learned estimate of the projection matrix onto the principal subspace. This deviation rapidly falls for each network, confirming that all three algorithms learn the principal subspace (see Figure 2B). Again, our algorithm extracts the principal subspace faster than the other two networks.

The third metric measures the degree of nonorthonormality among the computed neural filters: at each $T$, $\left\|F_T F_T^\top - I_m\right\|_F^2$. The nonorthonormality error quickly drops for all networks, confirming that the neural filters converge to orthonormal vectors (see Figure 2C). Yet again, our network orthonormalizes the neural filters much faster than the other two networks.
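The second and third metrics are simple matrix residuals and can be computed directly from the neural filters. A minimal sketch, with a hypothetical random covariance standing in for the actual one used in the simulations; any rotation $Q$ of an orthonormal basis of the principal subspace gives zero error under both metrics:

```python
import numpy as np

def subspace_error(F, V):
    """||F^T F - V^T V||_F^2: deviation of the learned projector from the
    projector onto the principal subspace (rows of V = top eigenvectors)."""
    return np.linalg.norm(F.T @ F - V.T @ V, 'fro') ** 2

def nonorthonormality_error(F):
    """||F F^T - I_m||_F^2: deviation of the filters from orthonormality."""
    return np.linalg.norm(F @ F.T - np.eye(F.shape[0]), 'fro') ** 2

rng = np.random.default_rng(1)
n, m = 6, 2
A = rng.standard_normal((n, n))
C = A @ A.T                           # a full-rank "actual" covariance
w, U = np.linalg.eigh(C)              # eigenvalues in ascending order
V = U[:, ::-1][:, :m].T               # rows: top-m principal eigenvectors
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
F = Q @ V                             # rotated orthonormal basis of the subspace
```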

5 Subspace Tracking Using a Neural Network with Local

Learning Rules

We have demonstrated that our network learns a linear subspace of streaming data generated by a stationary distribution. But what if the data are generated by an evolving distribution and we need to track the corresponding linear subspace? Using algorithm 2.9 would be suboptimal, because the learning rate is adjusted to effectively "remember" the contribution of all the past data points.

A natural way to track an evolving subspace is to "forget" the contribution of older data points (Yang, 1995). In this section, we derive an algorithm with "forgetting" from a principled cost function where errors in the similarity of old data points are discounted:
\[
\mathbf{y}_T = \arg\min_{\mathbf{y}_T} \sum_{t=1}^{T}\sum_{t'=1}^{T} \beta^{2T-t-t'}\left(\mathbf{x}_t^\top\mathbf{x}_{t'} - \mathbf{y}_t^\top\mathbf{y}_{t'}\right)^2, \tag{5.1}
\]
where $\beta$ is a discounting factor, $0 \le \beta \le 1$, with $\beta = 1$ corresponding to our original algorithm, equation 2.2. The effective timescale of forgetting is
\[
\tau := -\frac{1}{\ln\beta}. \tag{5.2}
\]

By introducing a $T\times T$-dimensional diagonal matrix $\beta_T$ with diagonal elements $\beta_{T,ii} = \beta^{T-i}$, we can rewrite equation 5.1 in matrix notation:
\[
\mathbf{y}_T = \arg\min_{\mathbf{y}_T} \left\|\beta_T X^\top X\beta_T - \beta_T Y^\top Y\beta_T\right\|_F^2. \tag{5.3}
\]
Yang (1995) used a similar discounting to derive subspace tracking algorithms from the representation error cost function, equation 1.3.


To derive an online algorithm to solve equation 5.3, we follow the same steps as before. Keeping only the terms that depend on the current output $\mathbf{y}_T$, we get
\[
\mathbf{y}_T = \arg\min_{\mathbf{y}_T}\left[-4\,\mathbf{x}_T^\top\left(\sum_{t=1}^{T-1}\beta^{2(T-t)}\mathbf{x}_t\mathbf{y}_t^\top\right)\mathbf{y}_T + 2\,\mathbf{y}_T^\top\left(\sum_{t=1}^{T-1}\beta^{2(T-t)}\mathbf{y}_t\mathbf{y}_t^\top\right)\mathbf{y}_T - 2\,\|\mathbf{x}_T\|^2\|\mathbf{y}_T\|^2 + \|\mathbf{y}_T\|^4\right]. \tag{5.4}
\]
In equation 5.4, provided that past input-input and input-output outer products are not forgotten for a sufficiently long time (i.e., $\tau \gg 1$), the first two terms dominate over the last two for large $T$. After dropping the last two terms, we arrive at
\[
\mathbf{y}_T = \arg\min_{\mathbf{y}_T}\left[-4\,\mathbf{x}_T^\top\left(\sum_{t=1}^{T-1}\beta^{2(T-t)}\mathbf{x}_t\mathbf{y}_t^\top\right)\mathbf{y}_T + 2\,\mathbf{y}_T^\top\left(\sum_{t=1}^{T-1}\beta^{2(T-t)}\mathbf{y}_t\mathbf{y}_t^\top\right)\mathbf{y}_T\right]. \tag{5.5}
\]

As in the nondiscounted case, minimization of the discounted online CMDS cost function, equation 5.5, by coordinate descent leads to a neural network with asynchronous updates,
\[
y_{T,i} \leftarrow \sum_{j=1}^{n} W^{\beta}_{T,ij}\,x_{T,j} - \sum_{j=1}^{m} M^{\beta}_{T,ij}\,y_{T,j}, \tag{5.6}
\]
and, by a Jacobi iteration, to a neural network with synchronous updates,
\[
\mathbf{y}_T \leftarrow W^{\beta}_T\mathbf{x}_T - M^{\beta}_T\mathbf{y}_T, \tag{5.7}
\]
with synaptic weight matrices in both cases given by
\[
W^{\beta}_{T,ij} = \frac{\sum_{t=1}^{T-1}\beta^{2(T-t)}\,y_{t,i}\,x_{t,j}}{\sum_{t=1}^{T-1}\beta^{2(T-t)}\,y_{t,i}^2}, \qquad M^{\beta}_{T,i,j\neq i} = \frac{\sum_{t=1}^{T-1}\beta^{2(T-t)}\,y_{t,i}\,y_{t,j}}{\sum_{t=1}^{T-1}\beta^{2(T-t)}\,y_{t,i}^2}, \qquad M^{\beta}_{T,ii} = 0. \tag{5.8}
\]

Finally, we rewrite equation 5.8 in a recursive form. As before, we introduce a scalar variable $D^{\beta}_{T,i}$ representing the discounted cumulative activity


of a neuron $i$ up to time $T-1$:
\[
D^{\beta}_{T,i} = \sum_{t=1}^{T-1}\beta^{2(T-t-1)}\,y_{t,i}^2. \tag{5.9}
\]

Then the recursive updates are
\[
\begin{aligned}
D^{\beta}_{T+1,i} &\leftarrow \beta^2 D^{\beta}_{T,i} + y_{T,i}^2,\\
W^{\beta}_{T+1,ij} &\leftarrow W^{\beta}_{T,ij} + y_{T,i}\left(x_{T,j} - W^{\beta}_{T,ij}\,y_{T,i}\right)/D^{\beta}_{T+1,i},\\
M^{\beta}_{T+1,i,j\neq i} &\leftarrow M^{\beta}_{T,ij} + y_{T,i}\left(y_{T,j} - M^{\beta}_{T,ij}\,y_{T,i}\right)/D^{\beta}_{T+1,i}.
\end{aligned} \tag{5.10}
\]

These updates are local and almost identical to the original updates, equation 2.9, except for the $D^{\beta}_{T+1,i}$ update, where the past cumulative activity is discounted by $\beta^2$. For suitably chosen $\beta$, the learning rate, $1/D^{\beta}_{T+1,i}$, stays sufficiently large even at large $T$, allowing the algorithm to react to changes in data statistics.

As before, we have a two-phase algorithm for minimizing the discounted online CMDS cost function, equation 5.5. For each data presentation, the neural network dynamics is first run, using equation 5.6 or 5.7, until it converges to a fixed point. In the second step, synaptic weights are updated using equation 5.10.
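Relative to the stationary algorithm, only the cumulative-activity update changes. A minimal sketch (illustrative NumPy, not the simulation code); the scalar recursion at the end shows why the learning rate $1/D^{\beta}$ stays bounded away from zero:

```python
import numpy as np

def update_weights_discounted(x, y, W, M, D, beta):
    """Discounted local updates (eq. 5.10): identical to the stationary
    rule except that past cumulative activity is discounted by beta**2."""
    D *= beta ** 2                 # forget old activity...
    D += y ** 2                    # ...and add the current one
    W += (np.outer(y, x) - (y ** 2)[:, None] * W) / D[:, None]
    dM = (np.outer(y, y) - (y ** 2)[:, None] * M) / D[:, None]
    np.fill_diagonal(dM, 0.0)      # M_ii stays 0
    M += dM

# With bounded activity (here y_i^2 = 1 at every step), D saturates near
# 1 / (1 - beta**2) instead of growing without bound, so 1/D stays finite.
beta, D_scalar = 0.99, 0.0
for _ in range(3000):
    D_scalar = beta ** 2 * D_scalar + 1.0
```

This bounded-memory behavior is exactly what lets the network track a subspace that changes at $T = 2501$ in the simulation below.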

In Figure 3, we present the results of a numerical simulation of our subspace tracking algorithm with asynchronous updates, similar to that in section 4 but for nonstationary synthetic data. The data are drawn from two different gaussian distributions: from $T = 1$ to $T = 2500$ with covariance $C_1$, and from $T = 2501$ to $T = 5000$ with covariance $C_2$. We ran our algorithm with four different $\beta$ factors: $\beta = 0.998, 0.995, 0.99, 0.98$ ($\tau = 499.5, 199.5, 99.5, 49.5$, respectively).

We evaluate the subspace tracking performance of the algorithm using a modification of the subspace error metric introduced in section 4. From $T = 1$ to $T = 2500$, the error is $\left\|F_T^\top F_T - V_1^\top V_1\right\|_F^2$, where $V_1$ is an $m\times n$ matrix whose rows are the principal eigenvectors of $C_1$. From $T = 2501$ to $T = 5000$, the error is $\left\|F_T^\top F_T - V_2^\top V_2\right\|_F^2$, where $V_2$ is an $m\times n$ matrix whose rows are the principal eigenvectors of $C_2$. Figure 3A plots this modified subspace error. Initially the subspace error decreases, reaching lower values with higher $\beta$: higher $\beta$ allows for smaller learning rates, permitting a fine-tuning of the neural filters and hence lower error. At $T = 2501$, a sudden jump is observed, corresponding to the change in principal subspace. The network rapidly corrects its neural filters to project to the new principal subspace, and the error falls to its before-jump values. It is interesting to note that higher $\beta$ now leads to a slower decay, due to extended memory of the past.


Figure 3: Performance of the subspace tracking asynchronous neural network with nonstationary data. The algorithm with different $\beta$ factors was applied to 40 different random data sets drawn from the same nonstationary statistics, described in the text. Weight initializations were random. Solid lines indicate means, and shades indicate standard deviations. All errors are in decibels (dB). For formal metric definitions, see the text. (A) Subspace error as a function of data presentations. (B) Nonorthonormality error as a function of data presentations.

We also quantify the degree of nonorthonormality of the neural filters, using the nonorthonormality error defined in section 4 (see Figure 3B). Initially the nonorthonormality error decreases, reaching lower values with higher $\beta$. Again, higher $\beta$ allows for smaller learning rates, permitting a fine-tuning of the neural filters. At $T = 2501$, an increase in the nonorthonormality error is observed as the network adjusts its neural filters. Then the error falls to its before-change values, with higher $\beta$ leading to a slower decay due to extended memory of the past.

6 Discussion

In this letter, we made a step toward a mathematically rigorous model of neuronal dimensionality reduction satisfying more biological constraints than was previously possible. Starting with the CMDS cost function, equation 2.1, we derived a single-layer neural network of linear units using only local learning rules. Using a local stability analysis, we showed that our algorithm finds a set of orthonormal neural filters and projects the input data stream onto its principal subspace. We showed that, with a small modification in the learning rate updates, the same algorithm performs subspace tracking.

Our algorithm finds the principal subspace but not necessarily the principal components themselves. This is not a weakness, since both the representation error cost, equation 1.3, and the CMDS cost, equation 2.1, are minimized


by projections onto the principal subspace, and finding the principal components is not necessary.

Our network is most similar to Földiák's (1989) network, which learns feedforward weights by a Hebbian Oja rule and the all-to-all lateral weights by an anti-Hebbian rule. Yet the functional form of the anti-Hebbian learning rule in Földiák's network, $\Delta M_{ij} \propto y_i y_j$, is different from ours, equation 2.9, resulting in three interesting differences. First, because the synaptic weight update rules in Földiák's network are symmetric, if the weights are initialized symmetric (i.e., $M_{ij} = M_{ji}$) and the learning rates are identical for lateral weights, they will stay symmetric. As mentioned above, such symmetry does not exist in our network (see equations 2.9 and 3.4). Second, while in Földiák's network the neural filters need not be orthonormal (Földiák, 1989; Leen, 1991), in our network they will be (see theorem 1). Third, in Földiák's (1989) network, output units are decorrelated, since in its stationary state $\langle y_i y_j\rangle = 0$. This need not be true in our network. Yet correlations among output units do not necessarily mean that information in the output about the input is reduced.$^7$

Our network is similar to the APEX network (Kung & Diamantaras, 1990) in the functional form of both the feedforward and the lateral weight updates. However, the network architecture is different, because the APEX network has a lower-triangular lateral connectivity matrix. This difference in architecture leads to two interesting differences in the APEX network's operation (Diamantaras & Kung, 1996): (1) the outputs converge to the principal components, and (2) the lateral weights decay to zero, so the neural filters are the feedforward weights. In our network, lateral weights do not have to decay to zero, and the neural filters depend on both the feedforward and the lateral weights (see equation 3.2).

In numerical simulations, we observed that our network is faster than the Földiák and APEX networks in minimizing the strain error, finding the principal subspace, and orthonormalizing the neural filters. This result demonstrates the advantage of our principled approach compared to heuristic learning rules.

Our choice of coordinate descent to minimize the cost function in the activity dynamics phase allowed us to circumvent problems associated with matrix inversion: $\mathbf{y} \leftarrow (I_m + M)^{-1}W\mathbf{x}$. Matrix inversion causes problems for neural network implementations because it is a nonlocal operation. In

$^7$ As pointed out before (Linsker, 1988; Plumbley, 1993, 1995; Kung, 2014), PCA maximizes mutual information between a gaussian input, $\mathbf{x}$, and an output, $\mathbf{y} = F\mathbf{x}$, such that the rows of $F$ have unit norms. When the rows of $F$ are principal eigenvectors, the outputs are principal components and are uncorrelated. However, the output can be multiplied by a rotation matrix, $Q$, and the mutual information is unchanged: $\mathbf{y}' = Q\mathbf{y} = QF\mathbf{x}$. Now $\mathbf{y}'$ is a correlated gaussian, and $QF$ still has rows with unit norms. Therefore, one can have correlated outputs with maximal mutual information between input and output as long as the rows of $F$ span the principal subspace.


the absence of a cost function, Földiák (1989) suggested implementing matrix inversion by iterating $\mathbf{y} \leftarrow W\mathbf{x} - M\mathbf{y}$ until convergence. We derived a similar algorithm using a Jacobi iteration. However, in general, such iterative schemes are not guaranteed to converge (Hornik & Kuan, 1992). Our coordinate descent algorithm is almost always guaranteed to converge, because the cost function in the activity dynamics phase, equation 2.4, meets the criteria in Luo and Tseng (1991).

Unfortunately, our treatment still suffers from a problem common to most other biologically plausible neural networks (Hornik & Kuan, 1992): a complete global convergence analysis of the synaptic weights is not yet available. Our stability analysis is local in the sense that it starts by assuming that the synaptic weight dynamics has reached a stationary state and then proves that perturbations around the stationary state are stable. We have not made a theoretical statement on whether this state can ever be reached or how fast it can be reached. Global convergence results using stochastic approximation theory are available for the single-neuron Oja rule (Oja & Karhunen, 1985), its nonlocal generalizations (Plumbley, 1995), and the APEX rule (Diamantaras & Kung, 1996); however, the applicability of stochastic approximation theory has been questioned recently (Zufiria, 2002). Although a neural network implementation is unknown, Warmuth and Kuzmin's (2008) online PCA algorithm stands out as the only algorithm for which a regret bound has been proved. The asymptotic dependence of regret on time can also be interpreted as a convergence speed.

This letter also contributes to the MDS literature by applying the CMDS method to streaming data. However, our method has limitations in that, to derive neural algorithms, we used the strain cost, equation 2.1, of CMDS. This cost is formulated in terms of similarities (inner products, to be exact) between pairs of data vectors, which allowed us to consider a streaming setting where one data vector is revealed at a time. In the most general formulation of MDS, pairwise dissimilarities between data instances are given, rather than the data vectors themselves or similarities between them (Cox & Cox, 2000; Mardia et al., 1980). This generates two immediate problems for a generalization of our approach. First, a mapping to the strain cost function, equation 2.1, is possible only if the dissimilarities are Euclidean distances (see note 3). In general, dissimilarities need not be Euclidean or even metric distances (Cox & Cox, 2000; Mardia et al., 1980), and one cannot start from the strain cost, equation 2.1, to derive a neural algorithm. Second, in the streaming version of the general MDS setting, at each step, dissimilarities between the current and all past data instances are revealed, unlike in our approach, where the data vector itself is revealed. Finding neural implementations in such a generalized setting is a challenging problem for future studies.

The online CMDS cost functions, equations 2.4 and 5.5, should also be valuable for subspace learning and tracking applications where biological plausibility is not a necessity. Minimization of such cost functions could


be performed much more efficiently in the absence of the constraints imposed by biology.$^8$ It remains to be seen how the algorithms presented in this letter and their generalizations compare to state-of-the-art online subspace tracking algorithms from the machine learning literature (Cichocki & Amari, 2002).

Finally, we believe that formulating the cost function in terms of similarities supports the possibility of representation-invariant computations in neural networks.

Appendix: Derivations and Proofs

A.1 Alternative Derivation of an Asynchronous Network. Here, we solve the system of equations 2.11 iteratively (Strang, 2009). First, we split the output covariance matrix that appears on the left-hand side of equation 2.11 into its diagonal component $D_T$, a strictly upper triangular matrix $U_T$, and a strictly lower triangular matrix $L_T$:
\[
\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top = D_T + U_T + L_T. \tag{A.1}
\]

Substituting this into equation 2.11, we get
\[
\left(D_T + \omega L_T\right)\mathbf{y}_T = \left[(1-\omega)D_T - \omega U_T\right]\mathbf{y}_T + \omega\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{x}_t^\top\mathbf{x}_T, \tag{A.2}
\]
where $\omega$ is a parameter. We solve equation 2.11 by iterating
\[
\mathbf{y}_T \leftarrow \left(D_T + \omega L_T\right)^{-1}\left(\left[(1-\omega)D_T - \omega U_T\right]\mathbf{y}_T + \omega\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{x}_t^\top\mathbf{x}_T\right) \tag{A.3}
\]

until convergence. If the symmetric matrix $\sum_{t=1}^{T-1}\mathbf{y}_t\mathbf{y}_t^\top$ is positive definite, convergence is guaranteed for $0 < \omega < 2$ by the Ostrowski-Reich theorem (Reich, 1949; Ostrowski, 1954). When $\omega = 1$, the iteration, equation A.3, corresponds to the Gauss-Seidel method and, when $\omega > 1$, to the successive overrelaxation method. The choice of $\omega$ for fastest convergence depends on

$^8$ For example, matrix equation 2.11 could be solved by a conjugate gradient method instead of iterative methods. The matrices that keep the input-input and output-output correlations in equation 2.11 can be calculated recursively, leading to a truly online method.


the problem, and we will not explore this question here. However, values around 1.9 are generally recommended (Strang, 2009).

Because in equation A.2 the matrix multiplying $\mathbf{y}_T$ on the left is lower triangular and on the right is upper triangular, iteration A.3 can be performed component by component (Strang, 2009):
\[
y_{T,i} \leftarrow (1-\omega)\,y_{T,i} + \omega\,\frac{\sum_k\sum_{t=1}^{T-1}y_{t,i}\,x_{t,k}\,x_{T,k}}{\sum_{t=1}^{T-1}y_{t,i}^2} - \omega\,\frac{\sum_{j\neq i}\sum_{t=1}^{T-1}y_{t,i}\,y_{t,j}\,y_{T,j}}{\sum_{t=1}^{T-1}y_{t,i}^2}. \tag{A.4}
\]
Note that $y_{T,i}$ is replaced with its new value before moving to the next component.

This algorithm can be implemented in a neural network,
\[
y_{T,i} \leftarrow (1-\omega)\,y_{T,i} + \omega\sum_{j=1}^{n}W_{T,ij}\,x_{T,j} - \omega\sum_{j=1}^{m}M_{T,ij}\,y_{T,j}, \tag{A.5}
\]
where $W_T$ and $M_T$, as defined in equation 2.6, represent the synaptic weights of the feedforward and lateral connections, respectively. The case $\omega < 1$ can be implemented by a leaky integrator neuron. The $\omega = 1$ case corresponds to our original asynchronous algorithm, except that now the updates are performed in a particular order. For the $\omega > 1$ case, which may converge faster, we do not see a biologically plausible implementation, since it requires self-inhibition.

Finally, to express the algorithm in a fully online form, we rewrite equation 2.6 via recursive updates, resulting in equation 2.9.
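Iteration A.5 is easy to check numerically. The sketch below is illustrative only: a small symmetric lateral matrix is chosen so that $I_m + M$ is positive definite, as the Ostrowski-Reich theorem requires, and the relaxed dynamics is verified to settle on the solution of $(I_m + M)\mathbf{y} = W\mathbf{x}$:

```python
import numpy as np

def sor_dynamics(x, W, M, omega=1.0, n_iter=200):
    """Relaxed component-by-component dynamics (eq. A.5): omega = 1 is
    Gauss-Seidel; 1 < omega < 2 is successive overrelaxation."""
    y = np.zeros(W.shape[0])
    for _ in range(n_iter):
        for i in range(len(y)):   # in-place update: new y[i] used immediately
            y[i] = (1 - omega) * y[i] + omega * (W[i] @ x - M[i] @ y)
    return y

rng = np.random.default_rng(3)
n, m = 5, 3
W = rng.standard_normal((m, n))
A = 0.1 * rng.standard_normal((m, m))
M = (A + A.T) / 2                 # symmetric lateral weights, small norm
np.fill_diagonal(M, 0.0)          # no self-coupling
x = rng.standard_normal(n)
y = sor_dynamics(x, W, M, omega=1.2)
```

Both $\omega = 1$ and $\omega = 1.2$ converge to the same fixed point; only the convergence speed differs.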

A.2 Proof of Lemma 2

Proof of Lemma 2. In the derivation below, we use results from equations 3.2, 3.3, and 3.4 of the main text:
\[
\begin{aligned}
\left(F^\top F\,C\right)_{ij} &= \sum_{kl}F_{ki}F_{kl}\langle x_l x_j\rangle\\
&= \sum_k F_{ki}\langle y_k x_j\rangle &&\text{(from 3.2)}\\
&= \sum_k F_{ki}\langle y_k^2\rangle W_{kj} &&\text{(from 3.3)}\\
&= \sum_{kp}F_{ki}\langle y_k^2\rangle\left(M_{kp}+\delta_{kp}\right)F_{pj} &&\text{(from 3.2)}\\
&= \sum_{kp}F_{ki}\langle y_p^2\rangle\left(M_{pk}+\delta_{pk}\right)F_{pj} &&\text{(from 3.4)}\\
&= \sum_p W_{pi}\langle y_p^2\rangle F_{pj} &&\text{(from 3.2)}\\
&= \sum_p\langle y_p x_i\rangle F_{pj} &&\text{(from 3.3)}\\
&= \sum_{pk}F_{pk}\langle x_k x_i\rangle F_{pj} = \sum_{pk}\langle x_i x_k\rangle F_{pk}F_{pj} = \left(C\,F^\top F\right)_{ij}. &&\text{(from 3.2)}
\end{aligned}
\]
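The commutation relation can also be checked numerically: at a stationary state, $F^\top F$ is the projector onto an eigenspace of $C$, and any such projector commutes with $C$. A small sketch with a hypothetical random covariance (this special case, filters spanning an exact eigenspace, is an assumption of the check, not of the lemma's derivation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 6, 2
A = rng.standard_normal((n, n))
C = A @ A.T                       # covariance matrix
U = np.linalg.eigh(C)[1]          # eigenvectors, ascending eigenvalue order
F = U[:, ::-1][:, :m].T           # filters spanning the top-m eigenspace of C
P = F.T @ F                       # projector onto that eigenspace
```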

A.3 Proof of Lemma 4. Here we calculate how $\delta F$ evolves under the learning rule, $\langle\delta\dot F\rangle$, and derive equation 3.9.

First, we introduce some new notation to simplify our expressions. We define the lateral synaptic weight matrix $M$ with its diagonal set to 1 as
\[
\hat M := I_m + M. \tag{A.6}
\]

We use a tilde to denote perturbed matrices:
\[
\tilde F := F + \delta F, \qquad \tilde W := W + \delta W, \qquad \tilde M := M + \delta M, \qquad \hat{\tilde M} := I_m + \tilde M = \hat M + \delta M. \tag{A.7}
\]

Note that when the network is run with these perturbed synaptic matrices, for an input $\mathbf{x}$, the network dynamics will settle to the fixed point
\[
\tilde{\mathbf{y}} = \hat{\tilde M}^{-1}\tilde W\mathbf{x} = \tilde F\mathbf{x}, \tag{A.8}
\]
which is different from the fixed point of the stationary network, $\mathbf{y} = \hat M^{-1}W\mathbf{x} = F\mathbf{x}$.

Now we can prove lemma 4.

Proof of Lemma 4. The proof has the following steps.

1. Since our update rules are formulated in terms of $W$ and $M$, it will be helpful to express $\delta F$ in terms of $\delta W$ and $\delta M$. The definition of $F$, equation 3.2, gives us the desired relation:
\[
(\delta\hat M)\,F + \hat M\,(\delta F) = \delta W. \tag{A.9}
\]


2. We show that in the stationary state,
\[
\langle\delta\dot F\rangle = \hat M^{-1}\left(\langle\delta\dot W\rangle - \langle\delta\dot M\rangle F\right) + O\!\left(\frac{1}{T^2}\right). \tag{A.10}
\]
Proof. The average changes due to synaptic updates on both sides of equation A.9 are equal: $\langle(\delta\dot{\hat M})F + \hat M(\delta\dot F)\rangle = \langle\delta\dot W\rangle$. Noting that the unperturbed matrices are stationary, that is, $\langle\dot M\rangle = \langle\dot F\rangle = \langle\dot W\rangle = 0$, one gets $\langle\delta\dot M\rangle F + \hat M\langle\delta\dot F\rangle = \langle\delta\dot W\rangle + O(T^{-2})$, from which equation A.10 follows.

3. We calculate $\langle\delta\dot W\rangle$ and $\langle\delta\dot M\rangle$, using the learning rule, in terms of the matrices $W$, $M$, $C$, $F$, and $\delta F$, and plug the result into equation A.10. This manipulation gives us the evolution of $\delta F$, equation 3.9.

First, $\langle\delta\dot W\rangle$:
\[
\begin{aligned}
\langle\delta\dot W_{ij}\rangle &= \langle\dot{\tilde W}_{ij}\rangle\\
&= \frac{1}{T\langle y_i^2\rangle}\,\langle\tilde y_i x_j - \tilde y_i^2\,\tilde W_{ij}\rangle\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_k\tilde F_{ik}\langle x_k x_j\rangle - \sum_{kl}\tilde F_{ik}\tilde F_{il}\langle x_k x_l\rangle\,\tilde W_{ij}\right] &&\text{(from A.8)}\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_k\tilde F_{ik}C_{kj} - \sum_{kl}\tilde F_{ik}\tilde F_{il}C_{kl}\,\tilde W_{ij}\right]\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_k F_{ik}C_{kj} - \sum_{kl}F_{ik}F_{il}C_{kl}\,W_{ij} + \sum_k\delta F_{ik}C_{kj}\right.\\
&\qquad\left. -\,2\sum_{kl}\delta F_{ik}F_{il}C_{kl}\,W_{ij} - \sum_{kl}F_{ik}F_{il}C_{kl}\,\delta W_{ij}\right] &&\text{(from A.7)}\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_k\delta F_{ik}C_{kj} - 2\sum_{kl}\delta F_{ik}F_{il}C_{kl}\,W_{ij} - \sum_{kl}F_{ik}F_{il}C_{kl}\,\delta W_{ij}\right]. &&\text{(from 3.3)}
\end{aligned}
\]

Next we calculate $\langle\delta\dot M\rangle$:
\[
\begin{aligned}
\langle\delta\dot M_{ij}\rangle &= \langle\dot{\hat{\tilde M}}_{ij}\rangle\\
&= \frac{1}{T\langle y_i^2\rangle}\,\langle\tilde y_i\tilde y_j - \tilde y_i^2\,\tilde M_{ij}\rangle - \frac{1}{D_i}\,\delta_{ij}\langle\tilde y_i^2\rangle\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_{kl}\tilde F_{ik}\tilde F_{jl}\langle x_k x_l\rangle - \sum_{kl}\tilde F_{ik}\tilde F_{il}\langle x_k x_l\rangle\,\tilde M_{ij} - \delta_{ij}\sum_{kl}\tilde F_{ik}\tilde F_{il}\langle x_k x_l\rangle\right] &&\text{(from A.8)}\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_{kl}\tilde F_{ik}\tilde F_{jl}C_{kl} - \sum_{kl}\tilde F_{ik}\tilde F_{il}C_{kl}\,\tilde M_{ij} - \delta_{ij}\sum_{kl}\tilde F_{ik}\tilde F_{il}C_{kl}\right]\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_{kl}F_{ik}F_{jl}C_{kl} - \sum_{kl}F_{ik}F_{il}C_{kl}\,M_{ij} - \delta_{ij}\sum_{kl}F_{ik}F_{il}C_{kl}\right.\\
&\qquad +\sum_{kl}\delta F_{ik}F_{jl}C_{kl} + \sum_{kl}F_{ik}\,\delta F_{jl}C_{kl} - 2\sum_{kl}\delta F_{ik}F_{il}C_{kl}\,M_{ij}\\
&\qquad\left. -\sum_{kl}F_{ik}F_{il}C_{kl}\,\delta M_{ij} - 2\,\delta_{ij}\sum_{kl}\delta F_{ik}F_{il}C_{kl}\right] &&\text{(from A.7)}\\
&= \frac{1}{T\langle y_i^2\rangle}\left[\sum_{kl}\delta F_{ik}F_{jl}C_{kl} + \sum_{kl}F_{ik}\,\delta F_{jl}C_{kl} - 2\sum_{kl}\delta F_{ik}F_{il}C_{kl}\,M_{ij}\right.\\
&\qquad\left. -\sum_{kl}F_{ik}F_{il}C_{kl}\,\delta M_{ij} - 2\,\delta_{ij}\sum_{kl}\delta F_{ik}F_{il}C_{kl}\right]. &&\text{(from 3.4)}
\end{aligned}
\]

Plugging these into equation A.10, we get
\[
\begin{aligned}
\langle\delta\dot F_{ij}\rangle = \sum_k\frac{\hat M^{-1}_{ik}}{T\langle y_k^2\rangle}
&\left[\sum_l\delta F_{kl}C_{lj} - 2\left(\sum_{lp}\delta F_{kl}F_{kp}C_{lp}\right)W_{kj} - \left(\sum_{lp}F_{kl}F_{kp}C_{lp}\right)\delta W_{kj}\right.\\
&\;\; -\sum_{lpr}\delta F_{kl}F_{rp}C_{lp}F_{rj} - \sum_{lpr}F_{kl}\,\delta F_{rp}C_{lp}F_{rj}\\
&\;\; +2\sum_{lpr}\delta F_{kl}F_{kp}C_{lp}M_{kr}F_{rj} + \sum_{lpr}F_{kl}F_{kp}C_{lp}\,\delta M_{kr}F_{rj}\\
&\;\;\left. +2\sum_{lpr}\delta_{kr}\,\delta F_{kl}F_{kp}C_{lp}F_{rj}\right] + O\!\left(\frac{1}{T^2}\right).
\end{aligned}
\]

The $M_{kr}$ and $\delta M_{kr}$ terms can be eliminated using the previously derived relations, equations 3.2 and A.9. This leads to a cancellation of some of the terms given above, and finally we have
\[
\begin{aligned}
\langle\delta\dot F_{ij}\rangle = \sum_k\frac{\hat M^{-1}_{ik}}{T\langle y_k^2\rangle}
&\left[\sum_l\delta F_{kl}C_{lj} - \sum_{lpr}\delta F_{kl}F_{rp}C_{lp}F_{rj}\right.\\
&\;\;\left. -\sum_{lpr}F_{kl}\,\delta F_{rp}C_{lp}F_{rj} - \left(\sum_{lp}F_{kl}F_{kp}C_{lp}\right)\sum_r\hat M_{kr}\,\delta F_{rj}\right] + O\!\left(\frac{1}{T^2}\right).
\end{aligned}
\]

To proceed further, we note that
\[
\langle y_k^2\rangle = \left(FCF^\top\right)_{kk}, \tag{A.11}
\]
which allows us to simplify the last term. Then we get our final result:
\[
\begin{aligned}
\langle\delta\dot F_{ij}\rangle = \frac{1}{T}\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}
&\left[\sum_l\delta F_{kl}C_{lj} - \sum_{lpr}\delta F_{kl}F_{rp}C_{lp}F_{rj} - \sum_{lpr}F_{kl}\,\delta F_{rp}C_{lp}F_{rj}\right]\\
&\; - \frac{1}{T}\,\delta F_{ij} + O\!\left(\frac{1}{T^2}\right).
\end{aligned}
\]

A.4 Proof of Theorem 3. For ease of reference, we recall that in general $\delta F$ can be written as in equation 3.8:
\[
\delta F = \delta A\,F + \delta S\,F + \delta B\,G.
\]
Here, $\delta A$ is an $m\times m$ skew-symmetric matrix, $\delta S$ is an $m\times m$ symmetric matrix, and $\delta B$ is an $m\times(n-m)$ matrix. $G$ is an $(n-m)\times n$ matrix with orthonormal rows, chosen to be orthogonal to the rows of $F$.

Let $\mathbf{v}_1,\ldots,\mathbf{v}_n$ be the eigenvectors of $C$ and $v_1,\ldots,v_n$ the corresponding eigenvalues. We label them such that $F$ spans the same space as that spanned by the first $m$ eigenvectors. We choose the rows of $G$ to be the remaining eigenvectors, $G := [\mathbf{v}_{m+1},\ldots,\mathbf{v}_n]^\top$. Then, for future reference,
\[
FG^\top = 0, \qquad GG^\top = I_{n-m}, \qquad \sum_k C_{ik}\,G^\top_{kj} = \sum_k C_{ik}\left(\mathbf{v}_{j+m}\right)_k = v_{j+m}\,G^\top_{ij}. \tag{A.12}
\]
We also refer to the definition, equation A.6: $\hat M := I_m + M$.

Proof of Theorem 3. Below, we discuss the conditions under which perturbations of $F$ are stable. We work to linear order in $T^{-1}$, as stated in theorem 3. We treat separately the evolution of $\delta A$, $\delta S$, and $\delta B$ under a general perturbation $\delta F$.

1. Stability of δB.

1.1. The evolution of $\delta B$ is given by
\[
\delta\dot B_{ij} = \frac{1}{T}\sum_k\left[\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\,v_{j+m} - \delta_{ik}\right]\delta B_{kj}. \tag{A.13}
\]
Proof. Starting from equation 3.8 and using equation A.12,
\[
\delta\dot B_{ij} = \sum_k\delta\dot F_{ik}\,G^\top_{kj}
= \frac{1}{T}\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\sum_{lp}\delta F_{kl}\,C_{lp}\,G_{jp} - \frac{1}{T}\,\delta B_{ij}.
\]
Here the last line results from equation A.12 applied to equation 3.9. We look at the first term again, using equation A.12 and then equation 3.8:
\[
\frac{1}{T}\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\sum_{lp}\delta F_{kl}\,C_{lp}\,G_{jp}
= \frac{1}{T}\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\sum_l\delta F_{kl}\,v_{j+m}\,G_{jl}
= \frac{1}{T}\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\,v_{j+m}\,\delta B_{kj}.
\]
Combining these gives equation A.13.

1.2. When is equation A.13 stable? Next, we show that stability requires
\[
\{v_1,\ldots,v_m\} > \{v_{m+1},\ldots,v_n\}.
\]
For ease of manipulation, we express equation A.13 as a matrix equation for each column of $\delta B$. For convenience, we change our notation to $\delta B_{kj} = \delta B^j_k$:
\[
\delta\dot B^j_i = \sum_k P^j_{ik}\,\delta B^j_k, \qquad \text{where } P^j_{ik} \equiv \frac{1}{T}\left(O_{ik}\,v_{j+m} - \delta_{ik}\right) \text{ and } O_{ik} \equiv \frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}.
\]
We have one matrix equation for each $j$. These equations are stable if all eigenvalues of all $P^j$ are negative:
\[
\{\mathrm{eig}(P^j)\} < 0 \;\Rightarrow\; \{\mathrm{eig}(O)\} < \frac{1}{v_j},\quad j = m+1,\ldots,n \;\Rightarrow\; \{\mathrm{eig}(O^{-1})\} > v_j,\quad j = m+1,\ldots,n.
\]

1.3. If one could calculate the eigenvalues of $O^{-1}$, the stability condition could be articulated. We start this calculation by noting that
\[
\sum_k O_{ik}\langle y_k y_j\rangle = \sum_k\hat M^{-1}_{ik}\,\frac{\langle y_k y_j\rangle}{\langle y_k^2\rangle} = \sum_k\hat M^{-1}_{ik}\,\hat M_{kj} = \delta_{ij}. \quad \text{(from 3.4)} \tag{A.14}
\]
Therefore,
\[
O^{-1} = \langle\mathbf{y}\mathbf{y}^\top\rangle = FCF^\top. \tag{A.15}
\]
Then we need to calculate the eigenvalues of $FCF^\top$. They are
\[
\mathrm{eig}(O^{-1}) = \{v_1,\ldots,v_m\}.
\]
Proof. We start with the eigenvalue equation for an eigenvector $\mathbf{u}$ with eigenvalue $\lambda$:
\[
FCF^\top\mathbf{u} = \lambda\mathbf{u}.
\]
Multiplying both sides by $F^\top$,
\[
F^\top FCF^\top\mathbf{u} = \lambda F^\top\mathbf{u}.
\]
Next, we use the commutation of $F^\top F$ and $C$, equation 3.7, and the orthonormality of the neural filters, $FF^\top = I_m$, equation 3.6, to simplify the left-hand side:
\[
F^\top FCF^\top\mathbf{u} = CF^\top FF^\top\mathbf{u} = CF^\top\mathbf{u}.
\]
This implies that
\[
C\left(F^\top\mathbf{u}\right) = \lambda\left(F^\top\mathbf{u}\right). \tag{A.16}
\]
Note that, by the orthonormality of the neural filters, the following is also true:
\[
F^\top F\left(F^\top\mathbf{u}\right) = \left(F^\top\mathbf{u}\right). \tag{A.17}
\]
All the relations above would hold trivially if $F^\top\mathbf{u} = 0$, but that would require $F(F^\top\mathbf{u}) = \mathbf{u} = 0$, which is a contradiction. Then equations A.16 and A.17 imply that $F^\top\mathbf{u}$ is a shared eigenvector of $C$ and $F^\top F$. $F^\top F$ and $C$ were shown to commute before, and they share a complete set of eigenvectors. However, $n-m$ eigenvectors of $C$ have zero eigenvalues in $F^\top F$. We had labeled the shared eigenvectors with unit eigenvalue in $F^\top F$ as $\mathbf{v}_1,\ldots,\mathbf{v}_m$. The eigenvalue of $F^\top\mathbf{u}$ with respect to $F^\top F$ is 1; therefore, $F^\top\mathbf{u}$ is one of $\mathbf{v}_1,\ldots,\mathbf{v}_m$. This proves that $\lambda \in \{v_1,\ldots,v_m\}$ and
\[
\mathrm{eig}(O^{-1}) = \{v_1,\ldots,v_m\}.
\]
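The claim $\mathrm{eig}(O^{-1}) = \{v_1,\ldots,v_m\}$ can be checked numerically: for any orthonormal basis $F$ of the top-$m$ eigenspace, including a rotated one, the spectrum of $FCF^\top$ equals the top $m$ eigenvalues of $C$. A sketch with a hypothetical random covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 7, 3
A = rng.standard_normal((n, n))
C = A @ A.T                                  # covariance matrix
w, U = np.linalg.eigh(C)                     # ascending eigenvalue order
top = w[::-1][:m]                            # v_1 >= ... >= v_m
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
F = Q @ U[:, ::-1][:, :m].T                  # rotated orthonormal basis of top-m subspace
spectrum = np.sort(np.linalg.eigvalsh(F @ C @ F.T))[::-1]
```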

1.4. From equation A.15, it follows that stability requires
\[
\{v_1,\ldots,v_m\} > \{v_{m+1},\ldots,v_n\}.
\]

2. Stability of δA and δS. Next, we check the stabilities of $\delta A$ and $\delta S$:
\[
\begin{aligned}
\delta\dot A_{ij} + \delta\dot S_{ij}
&= \sum_k\delta\dot F_{ik}\,F^\top_{kj} &&\text{(from 3.8)}\\
&= -\frac{1}{T}\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\sum_{lm}F_{kl}\,\delta F_{jm}\,C_{lm} - \frac{1}{T}\left(\delta A_{ij} + \delta S_{ij}\right)\\
&= -\frac{1}{T}\left[\sum_k\frac{\hat M^{-1}_{ik}}{\langle y_k^2\rangle}\sum_l\left(FCF^\top\right)_{kl}\left(\delta A^\top_{lj} + \delta S^\top_{lj}\right) + \left(\delta A_{ij} + \delta S_{ij}\right)\right].
\end{aligned} \tag{A.18}
\]
In deriving the last line, we used equations 3.8 and A.12. The $k$ summation was calculated before, in equation A.14. Plugging this into equation A.18, one gets
\[
\delta\dot A_{ij} + \delta\dot S_{ij} = -\frac{1}{T}\left(\delta A_{ij} + \delta A^\top_{ij} + \delta S_{ij} + \delta S_{ij}\right) = -\frac{2}{T}\,\delta S_{ij},
\]
whence
\[
\delta\dot A_{ij} = 0 \quad \text{(from the skew symmetry of }\delta A\text{)} \qquad \text{and} \qquad \delta\dot S_{ij} = -\frac{2}{T}\,\delta S_{ij}.
\]
The $\delta A$ perturbation, which rotates the neural filters to another orthonormal basis within the principal subspace, does not decay. On the other hand, $\delta S$ destroys orthonormality, and these perturbations do decay, making the orthonormal solution stable.

Collectively, the results above prove theorem 3.


A.5 Perturbation of the Stationary State due to Data Presentation. Our discussion of the linear stability of the stationary point assumed general perturbations. Perturbations that arise from data presentation,
\[
\delta F = \Delta F, \tag{A.19}
\]
form a restricted class of the most general case and have special consequences. Focusing on this case, we show that data presentations do not rotate the basis of the extracted subspace in the stationary state.

We calculate perturbations within the extracted subspace. Using equations 3.8 and A.12,
\[
\begin{aligned}
\delta A + \delta S &= \delta F\,F^\top\\
&= \Delta F\,F^\top &&\text{(from A.19)}\\
&= \hat M^{-1}\left(\Delta W - \Delta\hat M\,F\right)F^\top &&\text{(expand 3.2 to first order in }\Delta\text{)}\\
&= \hat M^{-1}\left(\Delta W\,F^\top - \Delta\hat M\right). &&\text{(from 3.6)}
\end{aligned} \tag{A.20}
\]
We look at the $\Delta W\,F^\top$ term more closely:
\[
\begin{aligned}
\left(\Delta W\,F^\top\right)_{ij} &= \sum_k\eta_i\left(y_i x_k - y_i^2 W_{ik}\right)F^\top_{kj}\\
&= \eta_i\left(y_i\sum_k F_{jk}x_k - y_i^2\sum_k W_{ik}F^\top_{kj}\right)\\
&= \eta_i\left(y_i y_j - y_i^2\hat M_{ij}\right)\\
&= \Delta\hat M_{ij}.
\end{aligned}
\]
Plugging this back into equation A.20 gives
\[
\delta A + \delta S = 0 \;\Rightarrow\; \delta A = 0 \;\text{and}\; \delta S = 0. \tag{A.21}
\]
Therefore, perturbations that arise from data presentation do not rotate the neural filter basis within the extracted subspace. This property should increase the stability of the neural filter basis within the extracted subspace.

Acknowledgments

We are grateful to L. Greengard, S. Seung, and M. Warmuth for helpful

discussions.


References

Arora, R., Cotter, A., Livescu, K., & Srebro, N. (2012). Stochastic optimization for

PCA and PLS. In Proceedings of the Allerton Conference on Communication, Control,

and Computing (pp. 861–868). Piscataway, NJ: IEEE.

Balzano, L. K. (2012). Handling missing data in high-dimensional subspace modeling

Doctoral dissertation, University of Wisconsin–Madison.

Becker, S., & Plumbley, M. (1996). Unsupervised neural network learning procedures for feature extraction and classification. Appl. Intell., 6(3), 185–203.

Carroll, J., & Chang, J. (1972). IDIOSCAL (individual differences in orientation scaling): A generalization of INDSCAL allowing idiosyncratic reference systems as well as an analytic approximation to INDSCAL. Paper presented at the Psychometric Meeting, Princeton, NJ.

Cichocki, A., & Amari, S.-I. (2002). Adaptive blind signal and image processing. Hoboken, NJ: Wiley.

Cox, T., & Cox, M. (2000). Multidimensional scaling. Boca Raton, FL: CRC Press.

Crammer, K. (2006). Online tracking of linear subspaces. In G. Lugosi & H. U. Simon (Eds.), Learning theory (pp. 438–452). New York: Springer.

Diamantaras, K. (2002). Neural networks and principal component analysis. In Y.-H. Hu & J.-N. Hwang (Eds.), Handbook of neural networks in signal processing. Boca Raton, FL: CRC Press.

Diamantaras, K., & Kung, S. (1996). Principal component neural networks: Theory and applications. Hoboken, NJ: Wiley.

Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In Proceedings of the International Joint Conference on Neural Networks (pp. 401–405). Piscataway, NJ: IEEE.

Goes, J., Zhang, T., Arora, R., & Lerman, G. (2014). Robust stochastic principal component analysis. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (pp. 266–274). http://jmlr.org/proceedings/papers/v33/

Hornik, K., & Kuan, C.-M. (1992). Convergence analysis of local feature extraction algorithms. Neural Networks, 5, 229–240.

Hu, T., Towfic, Z., Pehlevan, C., Genkin, A., & Chklovskii, D. (2013). A neuron as a signal processing device. In Proceedings of the Asilomar Conference on Signals, Systems and Computers (pp. 362–366). Piscataway, NJ: IEEE.

Hubel, D. H. (1995). Eye, brain, and vision. New York: Scientific American Library/Scientific American Books.

Hyvärinen, A., Hurri, J., & Hoyer, P. O. (2009). Natural image statistics: A probabilistic approach to early computational vision. New York: Springer.

Karhunen, J., & Oja, E. (1982). New methods for stochastic approximation of truncated Karhunen-Loève expansions. In Proc. 6th Int. Conf. on Pattern Recognition (pp. 550–553). New York: Springer-Verlag.

Kung, S.-Y. (2014). Kernel methods and machine learning. Cambridge: Cambridge University Press.

Kung, S., & Diamantaras, K. (1990). A neural network learning algorithm for adaptive principal component extraction (APEX). In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (pp. 861–864). Piscataway, NJ: IEEE.
Kung, S.-Y., Diamantaras, K., & Taur, J.-S. (1994). Adaptive principal component extraction (APEX) and applications. IEEE Trans. Signal Process., 42, 1202–1217.

Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer.

Leen, T. K. (1990). Dynamics of learning in recurrent feature-discovery networks. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 70–76). San Mateo, CA: Morgan Kaufmann.

Leen, T. K. (1991). Dynamics of learning in linear feature-discovery networks. Network, 2(1), 85–105.

Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21, 105–117.

Luo, Z. Q., & Tseng, P. (1991). On the convergence of a matrix splitting algorithm for the symmetric monotone linear complementarity problem. SIAM J. Control Optim., 29, 1037–1060.

Mardia, K., Kent, J., & Bibby, J. (1980). Multivariate analysis. Orlando, FL: Academic Press.

Oja, E. (1982). Simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.

Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5, 927–935.

Oja, E., & Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl., 106(1), 69–84.

Ostrowski, A. M. (1954). On the linear iteration procedures for symmetric matrices. Rend. Mat. Appl., 14, 140–163.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philos. Mag., 2, 559–572.

Plumbley, M. D. (1993). A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proceedings of the International Conference on Artificial Neural Networks (pp. 86–90). Piscataway, NJ: IEEE.

Plumbley, M. D. (1995). Lyapunov functions for convergence of principal component algorithms. Neural Networks, 8(1), 11–23.

Preisendorfer, R., & Mobley, C. (1988). Principal component analysis in meteorology and oceanography. New York: Elsevier Science.

Reich, E. (1949). On the convergence of the classical iterative procedures for symmetric matrices. Ann. Math. Statistics, 20, 448–451.

Rubner, J., & Schulten, K. (1990). Development of feature detectors by self-organization. Biol. Cybern., 62, 193–199.

Rubner, J., & Tavan, P. (1989). A self-organizing network for principal-component analysis. Europhysics Letters, 10(7), 693.

Sanger, T. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6), 459–473.

Shepherd, G. (2003). The synaptic organization of the brain. New York: Oxford University Press.

Strang, G. (2009). Introduction to linear algebra. Wellesley, MA: Wellesley-Cambridge Press.
Torgerson, W. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419.

Warmuth, M., & Kuzmin, D. (2008). Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. J. Mach. Learn. Res., 9(10), 2287–2320.

Williams, C. (2001). On a connection between kernel PCA and metric multidimensional scaling. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 675–681). Cambridge, MA: MIT Press.

Yang, B. (1995). Projection approximation subspace tracking. IEEE Trans. Signal Process., 43, 95–107.

Young, G., & Householder, A. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3(1), 19–22.

Zufiria, P. J. (2002). On the discrete-time dynamics of the basic Hebbian neural network node. IEEE Trans. Neural Netw., 13, 1342–1352.

Received September 24, 2014; accepted February 28, 2015.