Information and topology in attractor neural networks.
ABSTRACT A wide range of networks, including those with small-world topology, can be modeled by the connectivity ratio and randomness of the links. Both learning and attractor abilities of a neural network can be measured by the mutual information (MI) as a function of the load and the overlap between patterns and retrieval states. In this letter, we use MI to search for the optimal topology with regard to the storage and attractor properties of the network in an Amari-Hopfield model. We find that while an optimal storage implies an extremely diluted topology, a large basin of attraction leads to moderate levels of connectivity. This optimal topology is related to the clustering and path length of the network. We also build a diagram for the dynamical phases with random or local initial overlap and show that very diluted networks lose their attractor ability.
arXiv:cond-mat/0506535v1 [cond-mat.dis-nn] 21 Jun 2005
Information and Topology in Attractor Neural Network
David Dominguez^a, Kostadin Koroutchev, Eduardo Serrano and Francisco B. Rodríguez
EPS, Universidad Autonoma de Madrid, Cantoblanco, Madrid, 28049, Spain
Received: February 2, 2008
Abstract. A wide range of networks, including small-world topology, can be modelled by the connectivity
γ, and randomness ω of the links. Both learning and attractor abilities of a neural network can be measured
by the mutual information (MI), as a function of the load rate and overlap between patterns and retrieval
states. We use MI to search for the optimal topology, for storage and attractor properties of the network.
We find that, while the largest storage implies an optimal MI(γ,ω) at γopt(ω) → 0, the largest basin of
attraction leads to an optimal topology at moderate levels of γopt, whenever 0 ≤ ω < 0.3. This γopt is
related to the clustering and path-length of the network. We also build a diagram for the dynamical phases
with random and local initial overlap, and show that very diluted networks lose their attractor ability.
PACS. 87.10.+e General theory and mathematical aspects - 64.60.Cn Order-disorder transformations; statistical mechanics of model systems - 89.70.+c Information theory and communication theory
1 Introduction
Interest in attractor neural networks (ANN), originally dealing with fully-connected architectures, has been renewed with the study of more realistic topologies [2],[3]. Among them, the small-world (SW) graph [4],[2], modelled by only two parameters (γ ≡ K/N, the average rate of links, K, per network size, N; and ω, which controls the rate of random links among all K neighbors), can capture most features of a wide range of networks [5]. The load rate α = P/K (where P is the number of independent patterns) and the overlap m between neuron states and memorized patterns are the most used measures of the retrieval ability of the networks [6].
The overlap as a function of α is plotted in the upper panels of Fig. 1 for fully-connected (FC, left panel), moderately-diluted (MD, central) and extremely-diluted (ED, right) networks. The FC network has a critical load $\alpha_c^{FC} \sim 0.138$ [6], with the overlap $m_c^{FC} \sim 0.97$, and a sharp transition to m → 0 for larger $\alpha \ge \alpha_c$, where it fails to retrieve, as seen in the left panel. However, for ED networks (K ≪ N), the transition is smooth. In particular, the random ED network (RED, ω = 1.0, circles and dashed line) has $\alpha_c^{RED} \sim 0.64$ [7], but the overlap falls continuously to $m_c^{RED} \sim 0$.

Less attention has been paid to the study of the mutual information (MI) between stored patterns and the neural states [8],[9]. The lower panels of Fig. 1 display the information rate, i, evaluated from the conditional probability of neuron states σ given the patterns ξ, for the mean-field (MF) networks we deal with. The FC case shows a critical information of about $i_c^{FC} \sim 0.132$. The RED network has null information at $\alpha_c$, $i_c^{RED} \sim 0.0$. However, if one looks for the value of $\alpha_{max}$ corresponding to the maximal information $i_{max} \equiv i(\alpha_{max})$, instead of $\alpha_c$, one finds $i_{max}^{RED} \sim 0.223$ for $\alpha_{max}^{RED} \sim 0.32$. The FC network has the same $i_{max}^{FC} \equiv i_c^{FC}$.

Fig. 1. The overlap m and the information i vs α for different architectures: fully-connected, $\gamma_{FC} = 1.0$ (left), moderately-diluted, $\gamma_{MD} = 10^{-2}$ (center) and extremely-diluted, $\gamma_{ED} = 10^{-4}$ (right). Symbols represent simulations (t = 200) with initial overlap $m_0 = 1$ and |J| = 40M, with local (stars, ω = 0.0), small-world (filled squares, ω = 0.2), and random (circles, ω = 1.0) connections. Lines are theoretical results: solid, ω = 0.0; dotted, ω = 0.2; and dashed, ω = 1.0. In the left panel, the dashed line is an average of the simulation.

Send offprint requests to:
^a DD. E-mail: david.dominguez@uam.es

We address the problem of searching for the topology which maximizes the MI. Using the graph framework, we built networks with the parameters: connectivity rate, γ, running from the FC (γ = 1) to the ED (γ → 0) networks; and randomness, ω, ranging from local (ω = 0) to random (ω = 1) neighbors. Diluted topologies with ω ∼ 0.1, with large clustering coefficient (C) and small mean path length (L) between neurons, so-called small-world (SW) networks, are rather useful when one needs fast and noise-robust information transmission without spending too much wiring [2]. SW networks may model many biological systems; for instance, in a brain, local connections dominate within the cortex, while there are a few intercortical connections [10].

The right panels of Fig. 1 also plot m and i(α) for a SW ED network (SED, ω = 0.2), with $i_{max}^{SED} = 0.165$, and for the local ED network (LED, ω = 0.0), with $i_{max}^{LED} = 0.0855$; this shows how the information increases with randomness ω. The central panels of Fig. 1 plot MD networks. Comparing different dilution levels, one sees that i increases (decreases) with γ for local (random) networks, and remains about the same for SW topologies. A question arises about the optimal topology: if the randomness ω is fixed (by physical constraints), which is the best connectivity γ? To our knowledge, no previous answer to this question is known. We approach this problem from two scenarios: the stability and the retrieval attractor. We will show that, concerning the stability of a pattern, the RED network performs the best, $\gamma_{opt} \to 0$. However, regarding the attractor basins, the optimal topology holds for MD networks; for instance, ω ∼ 0.1 leads to an optimal $\gamma_{opt} \sim 10^{-2}$.
The structure of the paper is the following: in Sec. 2, we define the topology and neural-dynamics model, and review the information measures used in the calculations. The results are shown in Sec. 3, where we study retrieval by theory and simulation. We present a diagram for the phases with local and random initial conditions, and show a relation between topology and MI. Conclusions are drawn in the last section.
2 The Model
2.1 Topology and Dynamics
The synaptic couplings are $J_{ij} \equiv C_{ij} W_{ij}$, where C is the topology matrix and W holds the learning weights. The topology splits into local and random links, $\{C_{ij} = C^l_{ij} + C^r_{ij}\}$. The local part connects the $K_l$ nearest neighbors, $C^l_{ij} = \sum_{k \in V} \delta(i - j - k)$, with $V = \{1,...,K_l\}$ in the asymmetric case, in a closed ring. The random part consists of independent random variables $\{C^r_{ij}\}$, distributed with probability $p(C^r_{ij} = 1) = c_r$, and $C^r_{ij} = 0$ otherwise, with $c_r = K_r/N$, where $K_r$ is the mean number of random connections of a single neuron. Hence, the neuron connectivity is $K = K_l + K_r$. The network topology is then characterized by two parameters: the connectivity ratio, defined as γ = K/N, and the randomness ratio, $\omega = K_r/K$. The symmetry constraint seems to have only side effects on the information properties. The parameter ω plays the role of the rewiring probability in the small-world (SW) model [5]. Our model was proposed by Newman and Watts [11]; it has the advantage of avoiding disconnecting the graph.

Note that the topology C can be defined by an adjacency list connecting neighbors, $i_k$, k = 1,...,K, with $C_{ij} = 1 : j = i_k$. So the storage cost of this network is |J| = N·K. The learning algorithm updates W according to the Hebb rule

$$W^{\mu}_{ij} = W^{\mu-1}_{ij} + \xi^{\mu}_i \xi^{\mu}_j. \tag{1}$$

The network starts at $W^0_{ij} = 0$, and after μ = P = αK learning steps it reaches the value $W_{ij} = \sum_{\mu}^{P} \xi^{\mu}_i \xi^{\mu}_j$. The learning stage is a slow dynamics, being stationary on the time scale of the much faster retrieval stage, which we define in the following.
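As an illustration of the construction above, the following Python sketch builds the adjacency list and the Hebbian weights of Eq. (1) with NumPy. The function names `build_topology` and `hebb_weights` are hypothetical, and one simplification is assumed: each neuron receives exactly round(ωK) random neighbors drawn uniformly (duplicates and self-links are not filtered), rather than independent links with probability $c_r$.

```python
import numpy as np

def build_topology(n, k, omega, rng):
    """Return an (n, k) array: row i lists the neighbours j with C_ij = 1."""
    k_rand = int(round(omega * k))   # K_r = omega * K random links
    k_loc = k - k_rand               # K_l local links
    i = np.arange(n)[:, None]
    # Local part on a closed ring: C^l_ij = 1 when i - j = 1, ..., K_l (mod n).
    local = (i - np.arange(1, k_loc + 1)[None, :]) % n
    # Random part (assumption: fixed K_r per neuron, drawn uniformly).
    random = rng.integers(0, n, size=(n, k_rand))
    return np.concatenate([local, random], axis=1)

def hebb_weights(neighbors, patterns):
    """W_ij = sum_mu xi_i^mu xi_j^mu, stored only for the existing links."""
    # patterns has shape (P, n); patterns[:, neighbors] has shape (P, n, k).
    return np.einsum("mi,mik->ik", patterns, patterns[:, neighbors])

rng = np.random.default_rng(0)
neighbors = build_topology(n=200, k=10, omega=0.2, rng=rng)
patterns = rng.choice([-1, 1], size=(5, 200))   # P = 5 random patterns
weights = hebb_weights(neighbors, patterns)     # same shape as neighbors
```

Storing only the N × K arrays of neighbor indices and weights, instead of a dense N × N matrix, matches the storage cost |J| = N·K quoted above.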
The neural states, $\sigma^t_i \in \{\pm 1\}$, are updated according to the dynamics:

$$\sigma^{t+1}_i = \mathrm{sign}(h^t_i), \qquad h^t_i \equiv \sum_j J_{ij}\sigma^t_j, \qquad i = 1...N. \tag{2}$$
In the case of symmetric synaptic couplings, $J_{ij} = J_{ji}$, an energy function $H_s = -\sum_{(i,j)} J_{ij}\sigma_i\sigma_j$ can be defined, whose minima are the stable states of the dynamics Eq. (2).

In the present paper, we work out the asymmetric network by simulation (no constraint $J_{ij} = J_{ji}$). The theory was carried out for symmetric networks. Biological networks are usually asymmetric [10], but this feature does not allow any thermodynamic approach. As seen in Fig. 1, theory and simulation show similar results, except for local networks (the theory underestimates $\alpha_{max}$), where the symmetry may play some role. A stochastic macro-dynamics takes place due to the extensive learning of P = αK patterns.
2.2 The Information Measures
The network state at a given time t is defined by a set of binary neurons, $\sigma^t = \{\sigma^t_i \in \{\pm 1\}, i = 1,...,N\}$. Accordingly, each pattern $\xi^\mu = \{\xi^\mu_i \in \{\pm 1\}, i = 1,...,N\}$ is a set of site-independent random variables, binary and uniformly distributed: $p(\xi^\mu_i = \pm 1) = 1/2$. The network learns a set of independent patterns $\{\xi^\mu, \mu = 1,...,P\}$.

The task of the neural channel is to retrieve a pattern (say, $\xi \equiv \xi^\mu$) starting from a neuron state $\sigma^0$ which is inside its attractor basin. This is achieved through a network dynamics coupling neurons through a synaptic matrix $J \equiv \{J_{ij}\}$ with cardinality |J| = N × K. The relevant order parameter is the overlap between the neural states and the pattern:

$$m^t_N \equiv \frac{1}{N}\sum_i \xi_i \sigma^t_i, \tag{3}$$
at the time step t. Together with the overlap, one needs a measure of the load, which is the rate of pattern bits per synapse used to store them. Since the synapses and patterns are independent, the load is given by $\alpha = |\{\xi^\mu\}|/|J| \equiv P/K$.
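A minimal sketch of the parallel retrieval dynamics of Eq. (2) and the overlap of Eq. (3), assuming the adjacency-list layout of Sec. 2.1 (an N × K array of neighbor indices plus an array of the same shape holding the couplings). The names `overlap` and `retrieve` are illustrative, and ties h = 0 are broken toward +1, a choice the text does not specify.

```python
import numpy as np

def overlap(xi, sigma):
    """Eq. (3): m = (1/N) sum_i xi_i sigma_i."""
    return float(np.mean(xi * sigma))

def retrieve(sigma0, neighbors, weights, t_max=50):
    """Iterate sigma_i <- sign(sum_j J_ij sigma_j) in parallel, Eq. (2)."""
    sigma = sigma0.copy()
    for _ in range(t_max):
        # Local field: sum over the K stored links of each neuron.
        h = np.sum(weights * sigma[neighbors], axis=1)
        new_sigma = np.where(h >= 0, 1, -1)   # assumption: sign(0) = +1
        if np.array_equal(new_sigma, sigma):  # fixed point reached
            break
        sigma = new_sigma
    return sigma
```

For a fully-connected toy network storing a single pattern, any initial state with positive overlap is mapped onto the pattern in one update, which gives a quick sanity check of both functions.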
We require the interactions J to be long-range, and neglect spatial correlation. Hence, we consider a mean-field network (MFN), in which the distribution of the states is assumed to be site-independent. Therefore, according to the law of large numbers, the overlap can be written, for K, N → ∞, as $m^t = \langle \sigma^t \xi \rangle_{\sigma,\xi}$. The brackets represent an average over the joint distribution p(σ,ξ) for a single neuron, understood as an ensemble distribution for the neuron states $\{\sigma_i\}$ and pattern $\{\xi_i\}$ [9].
This distribution factorizes into the conditional probability $p(\sigma|\xi) = (1 + m\sigma\xi)\,\delta(\sigma^2 - 1)$ [13] and the input probability p(ξ). In p(σ|ξ), all types of noise in the retrieval process are enclosed (both from the environment and from the dynamical process itself). With the above expressions and $p(\sigma) \equiv \delta(\sigma^2 - 1)$, we can calculate the MI [9], a quantity used to measure the prediction that an observer at the output (σ) can make about the input ($\xi^\mu$) (we drop the time index t). It reads MI[σ;ξ] = S[σ] − S[σ|ξ], where the output and conditional entropies are given (in bits) by [13]:

$$S[\sigma|\xi] = -\frac{1+m}{2}\log_2\frac{1+m}{2} - \frac{1-m}{2}\log_2\frac{1-m}{2}, \qquad S[\sigma] = 1\ \mathrm{[bit]}. \tag{4}$$
We define the information rate as

$$i(\alpha, m) = MI[\sigma|\{\xi^\mu\}]/|J| \equiv \alpha\, MI[\sigma;\xi], \tag{5}$$

since for independent neurons and patterns, $MI[\sigma|\{\xi^\mu\}] \equiv \sum_{i\mu} MI[\sigma_i|\xi^\mu_i]$. The information is i = α MI, Eq. (5), where the load rate is scaled as α = P/K.

When the network approaches its saturation limit $\alpha_c$, the neuron states cannot remain close to the patterns, so $m_c$ is usually small. So, while the number of patterns increases, the information per pattern decreases. Therefore, the information i(α,m) is a non-monotonic function of the overlap and load rate (see Fig. 1), which reaches its maximum value $i_{max} = i(\alpha_{max})$ at some value $\alpha_{max} \le \alpha_c$ of the load.
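Equations (4) and (5) translate directly into a few lines of Python. The sketch below uses illustrative function names and treats the overlap m as a given number; it only evaluates the formulas and does not model how m depends on α.

```python
import math

def conditional_entropy(m):
    """S[sigma|xi] in bits, Eq. (4), with the convention 0 * log2(0) = 0."""
    s = 0.0
    for p in ((1.0 + m) / 2.0, (1.0 - m) / 2.0):
        if p > 0.0:
            s -= p * math.log2(p)
    return s

def information_rate(alpha, m):
    """i(alpha, m) = alpha * (S[sigma] - S[sigma|xi]), Eq. (5), with S[sigma] = 1 bit."""
    return alpha * (1.0 - conditional_entropy(m))
```

At perfect retrieval (m = 1) the rate equals the load α, and it vanishes at m = 0 whatever the load; the competition between growing α and shrinking m is what makes i non-monotonic.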
3 Results
We studied the information for the stationary and dynamical states of the network as a function of the topological parameters ω and γ. A sample of the results for simulation and theory is shown in Fig. 1, where the stationary states of the overlap and information are plotted for the FC, MD and ED architectures. It can be seen that the information increases with dilution and with randomness of the network. A reason for this behavior is that dilution decreases the correlation due to the interference between patterns. However, dilution also increases the mean path length of the network; thus, if the connections are local, the information flows slowly over the network. Hence, the neuron states can eventually be trapped in noisy patterns. So, $i_{max}$ is small for ω ∼ 0 even if $\gamma = 10^{-4}$.
3.1 Theory: Storage
The theoretical approach follows the Gardner calculations [7]. The assumption is that the network state is near a given pattern. At temperature T = 0 the MF approximation gives the fixed-point equations:

$$m = \mathrm{erf}(m/\sqrt{r\alpha}), \tag{6}$$

$$\chi = 2\varphi(m/\sqrt{r\alpha})/\sqrt{r\alpha}, \tag{7}$$

$$r = \sum_{k=0}^{\infty} a_k (k+1)\chi^k, \qquad a_k = \gamma\,\mathrm{Tr}[(C/K)^{k+2}], \tag{8}$$

with $\mathrm{erf}(x) \equiv 2\int_0^x \varphi(z)\,dz$ and $\varphi(z) \equiv e^{-z^2/2}/\sqrt{2\pi}$. The parameter $a_k$ is the probability of existence of a cycle of length k + 2 in the graph C. The $a_k$ can be calculated either by using Monte Carlo [14], or by an analytical approach, which gives $a_k \sim \sum_m \int d\theta\, [p(\theta)]^k e^{im\theta}$, where p(θ) is the Fourier transform of the probability of links, $p(C_{ij})$. For RED and FC networks one recovers the known results $r_{RED} = 1$ and $r_{FC} = 1/(1-\chi)^2$, respectively [1].

Fig. 2. Maximal information $i_{max} = i(\alpha_{max})$ vs γ. Theory for the stationary states (solid lines), and simulations with $m_0 = 0.1$, t = 50 (dashed lines), for several values of the randomness, ω = 0.0, 0.1, 0.2, 0.3, 1.0.
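To make Eq. (6) concrete, here is a minimal Python sketch that solves it by plain fixed-point iteration for a constant r. This covers the RED case, where r = 1; for other topologies r depends on χ through Eqs. (7)-(8) and would have to be updated self-consistently, which this sketch does not attempt. The function names are illustrative, and the erf of the text, defined through the Gaussian density φ, is mapped onto the standard error function as erf(x) = `math.erf`(x/√2).

```python
import math

def erf_gauss(x):
    """erf(x) = 2 * int_0^x phi(z) dz with phi the unit Gaussian density."""
    return math.erf(x / math.sqrt(2.0))

def fixed_point_overlap(alpha, r=1.0, m_start=1.0, iters=5000):
    """Iterate m <- erf(m / sqrt(r * alpha)), Eq. (6), from m_start."""
    m = m_start
    for _ in range(iters):
        m = erf_gauss(m / math.sqrt(r * alpha))
    return m

m_low_load = fixed_point_overlap(0.3)    # retrieval solution, m > 0
m_high_load = fixed_point_overlap(0.8)   # only m = 0 survives
```

Linearizing Eq. (6) around m = 0 with r = 1 gives a slope $2\varphi(0)/\sqrt{\alpha} = \sqrt{2/(\pi\alpha)}$, so a solution with m > 0 exists only for α < 2/π ≈ 0.637, consistent with the value $\alpha_c^{RED} \sim 0.64$ quoted in the Introduction.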
The theoretical dependence of the information on the load, for FC, MD and ED networks with local, small-world and random connections, is plotted as lines in Fig. 1. A comparison between theory and simulation is also given in Fig. 1. It can be seen that both results agree for most ω > 0, but the theory fails for ω = 0. One reason is that the theory uses a symmetric constraint, while the simulation was carried out with asymmetric synapses. The solid lines in Fig. 2 show the maxima $i_{max}(\gamma,\omega) \equiv i(\alpha_{max},\gamma,\omega)$ vs. the parameter γ, for several values of ω. It is seen that the thermodynamical optimum $i(\gamma_{opt})$ is at $\omega_{opt} \to 1$, γ → 0. This implies that the best topology for information, with respect to the stationary states, is the RED network. It is worth noting that the simulation converges to the theoretical results if $m_0 = 1.0$ when t → ∞; this means that the theory accounts for the storage capacity of the network. However, a qualitatively different behavior holds for the simulation with low $m_0 = 0.1$; see Fig. 2, which displays optima $i(\gamma_{opt})$ for MD topologies.
3.2 Simulation: Attractors
The theoretical equations for the stationary states, Eqs. (6)-(8), account only for the existence of the retrieval (R) solution m > 0. However, they say nothing about its stability. The zero states (Z), m = 0, are also a solution of Eqs. (6)-(8), so both R and Z may coexist in some region of the topological parameters γ, ω. In order to study the stability of the attractors, we simulated Eq. (2) and checked how the network behaves for different initial conditions.
Both local and random connections are asymmetric. The simulation was carried out with $N \times K = 36 \cdot 10^6$ synapses, storing an adjacency list as the data structure instead of $J_{ij}$. For instance, with γ ≡ K/N = 0.01, we used K = 600, $N = 6 \cdot 10^4$. In [15] the authors use K = 50, $N = 5 \cdot 10^3$, which is far from the asymptotic limit.
To check for the attractor properties of the retrieval, the neuron states start far from a learned pattern, but inside its basin of attraction, $\sigma^0 \in B(\xi^\mu)$. First, we choose an initial configuration given by a random correlation with the patterns, $p(\sigma^0 = \pm\xi^\mu|\xi^\mu) = (1 \pm m_0)/2$, for all neurons (so we avoid a bias between local/random neighbors). We call this the $m_R$ initial overlap. The retrieval dynamics starts with an overlap $m_0 = 0.2$, and stops after $t_f = 50$ steps (unless it converges to a fixed point $m^*$ before $t = t_f$). Usually, $t_f = 20$ parallel (all neurons) updates is a large enough delay for retrieval. The information i(α,m;γ,ω) is calculated. We averaged over a window in the axis of P, usually δP = 25.
The results are depicted in the dashed curves of Fig. 2, where the $i_{max}(\gamma,\omega)$ are plotted against γ. One sees that, unlike the theoretical results for the storage capacity, there are MD topologies which perform the best when the attractor properties are considered. Starting with $m_0 = 0.1$ yields optima $i(\gamma_{opt})$ for moderate dilutions; for instance, with ω = 0.2, it holds that $\gamma_{opt} \sim 10^{-2}$.
Next, we compare $m_R$ with another type of initial distribution. The neurons start with local correlations: $\sigma^0_i = \xi^\mu_i$ for $i = 1,...,(N m_0)$, and random $\sigma^0_i$ otherwise. We call it the $m_L$ initial overlap. The results are shown in Fig. 3. In the lower (upper) panels we see the behavior with the $m_R$ ($m_L$) initial overlap. The first observation is that the maximal information $i_{max}(\gamma;\omega)$ increases with dilution (smaller γ) if the network is more random, ω ≃ 1, while it decreases with dilution if the network is more local, ω ≃ 0. However, there is a moderate $\gamma_{opt}$ for which the information $i(\gamma_{opt})$ is optimized. For instance, with ω = 0.1, starting with initial overlap $m_R$, the optimum is $i(\gamma_{opt}) \sim 0.148$ at $\gamma_{opt} = 10^{-2}$. For $m_L$, the optimum is $i(\gamma_{opt}) \sim 0.138$ at $\gamma_{opt} = 10^{-2}$. We see that the initial $m_R$ allows for an easier retrieval for any ω, but local topologies (ω = 0) are very sensitive to the type of initial overlap, and lose their retrieval abilities if the connectivity is $\gamma_{opt} \le 10^{-2}$. We also observe in Fig. 3 a feature of the $m_L$ condition: the network improves its retrieval ability with learning (m increases with α) before the information reaches its maximum, which resembles a stochastic resonance effect.
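The two initial conditions can be sketched as follows. This is an illustrative Python/NumPy reading of the definitions in the text, with hypothetical function names; in particular, "random otherwise" in the $m_L$ case is assumed to mean unbiased ±1 values.

```python
import numpy as np

def init_random_overlap(xi, m0, rng):
    """m_R: each neuron independently equals xi_i with probability (1 + m0)/2."""
    flip = rng.random(xi.size) >= (1.0 + m0) / 2.0
    return np.where(flip, -xi, xi)

def init_local_overlap(xi, m0, rng):
    """m_L: the first N*m0 neurons copy the pattern, the rest are random."""
    sigma0 = rng.choice([-1, 1], size=xi.size)   # assumption: unbiased +/-1
    n_aligned = int(round(xi.size * m0))
    sigma0[:n_aligned] = xi[:n_aligned]
    return sigma0
```

Both states have an expected overlap $m_0$ with the pattern, but in the $m_L$ case the correlated neurons form one contiguous block on the ring, so whether a neuron sees correlated inputs depends on where its neighbors sit.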
Fig. 3. Information i(α) for the initial conditions $m_R$ (lower panels) and $m_L$ (upper panels), always with $m_0 = 0.2$ and t = 50. Simulations with $N \cdot K = 4 \cdot 10^7$, for $\gamma = 1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}$ (from left to right) and ω = 0.0, 0.1, 0.2 (circles, squares, diamonds).

Fig. 4. Diagram (ω × γ) with the phases R and L, for initial overlap $m_0 = 0.2$ and threshold $i_{max} > 0.05$.
The comparison between the upper ($m_L$) and lower ($m_R$) panels of Fig. 3 shows that the non-monotonic behavior of the information with dilution is stronger for the local than for the random initial overlap. This sensitivity to the initial conditions can be understood in terms of the basins of attraction. Random topologies have very deep attractors, especially if the network is diluted enough, while regular topologies almost lose their retrieval abilities with dilution. However, since the basins become rougher with dilution, the network takes longer to reach the attractor, and can be trapped in metastable states. Hence, the competition between depth and roughness is won by the more robust MD networks.
The retrieval capability of the network when starting from the condition $m_R$ or $m_L$ is plotted in Fig. 4. We denote by R the phase where the retrieval reaches at least the information $i_{max} = 0.05$, starting from $m_0 = 0.2$ with $m_R$. The phase L is the same, but starting with $m_L$. With the $m_L$ condition, good retrieval ($i_{max} \ge 0.05$) is achieved neither for highly connected nor for local diluted topologies.
3.3 Clustering and Mean-Length-Path
We describe here the topological features of the network as a function of its parameters: the clustering coefficient, c, and the mean path length between neurons, l. When γ
is large, the net has c large, c ∼ 1 and l small, l = O(1),
whatever ω used. When γ is small, then if ω ∼ 0, the net
is clustered (c = O(γ)), and has large paths (l ∼ N/K);
if ω ∼ 1, the net becomes random (c ≪ 1 and l ∼ lnN).
However, if the randomness is about ω ∼ 0.1, then c =