The Effect of Hebbian Learning on Optimisation in Hopfield Networks
ABSTRACT In neural networks, two specific dynamical behaviours are well known: 1) Networks naturally find patterns of activation that locally minimise constraints among interactions. This can be understood as the local minimisation of an energy or potential function, or the optimisation of an objective function. 2) In distinct scenarios, Hebbian learning can create new interactions that form associative memories of activation patterns. In this paper we show that these two behaviours have a surprising interaction – that learning of this type significantly improves the ability of a neural network to find configurations that satisfy constraints/perform effective optimisation. Specifically, the network develops a memory of the attractors that it has visited, but importantly, is able to generalise over previously visited attractors to increase the basin of attraction of superior attractors before they are visited. The network is ultimately transformed into a different network that has only one basin of attraction, but this attractor corresponds to a configuration that is very low energy in the original network. The new network thus finds optimised configurations that were unattainable (had exponentially small basins of attraction) in the original network dynamics.
Richard A. Watson, C. L. Buckley, Rob Mills.
In this paper we investigate the interaction of two well-known properties of complex systems that
have each been independently well-studied in neural networks: i) The energy-minimisation behaviour
of dynamical systems (Hopfield 1982) which can be interpreted as a local optimisation of constraints
(Hopfield & Tank 1985, 1986), and ii) Hebbian learning (Hebb 1949) with its capacity to implement
associative memory (Hopfield 1982, Hinton & Sejnowski 1983). Specifically, we show that Hebbian
learning significantly enhances the probability that a dynamical system arrives at low-energy
attractors, and that the attractors thus found optimise constraints to otherwise unattainable levels. We
view the effect as an extension of the ‘emergent collective computational abilities’ (Hopfield 1982)
that come ‘for free’ in physical systems.
Our models employ the Hopfield network (Hopfield 1982) which is an abstract model of
neural networks and a well-understood example of a simple dynamical system that has provided a
vehicle for studying attractor dynamics across many disciplines.
Many natural dynamical systems have behaviours that can be understood as the local minimisation of
an energy or potential function (Strogatz 1994). Hopfield networks, for example, are recurrent neural
networks with symmetric weights and no positive self-recurrent connections; these conditions
guarantee that the dynamics of the network can be described as the minimisation of an energy
function and that the network exhibits only point attractors. Shortly after their initial introduction it
was suggested that Hopfield networks can be used to solve optimisation problems (Hopfield & Tank
1985, 1986), Fig.1.a, and an extensive literature has developed on this, and similarly the optimisation
behaviour of their stochastic counterpart, the Boltzmann machine (Hinton & Sejnowski 1985, Ackley
et al 1985). The energy function simply corresponds to the degree to which internal network
constraints remain unsatisfied – the more unsatisfied constraints, the higher the energy, and a state
change that reduces energy resolves more constraints than it violates. Minima in this function thus
correspond to attractors in the network dynamics that are locally optimal resolutions of these
constraints. However, difficult optimisation problems, or networks with interactions that are difficult
to resolve, have many local optima and this can create a Hopfield network that has a large number of
local attractors. Running the network will obviously not result in optimal solutions in such cases
(Tsirukis et al 1989).
A second well-known neural network behaviour, model induction, is indicated in Fig.1.b.
Training a dynamical system to have a particular energy function may be interpreted as a model
induction process which takes as input a set of points in configuration space, ‘training patterns’, (fig
1.b, left) and returns a model of those points (1.b, centre). The model may act as an associative or
content addressable memory (right) (Hopfield 1982) which takes as input a (possibly partially
specified) input pattern and ‘recalls’ the training pattern that is most representative of that pattern.
Such a memory can be implemented with a dynamical system whose attractors correspond to the
training patterns. Unlike the optimisation scenario, where one would ideally like to avoid local
optima, this use of Hopfield networks exploits the fact that complex networks can have multiple
attractors. A Hopfield network may be trained to implement such a dynamical system with Hebbian
learning. In an associative memory, the intent may be to represent the original training patterns as
accurately as possible, or the training patterns are sometimes interpreted as being a sample of some
underlying distribution of points and the intent is to generalise from the training patterns to estimate
the true distribution.
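As an illustration of the associative-memory use described above, the following is a minimal sketch, not the authors' implementation. It assumes bipolar (+1/-1) units, the standard outer-product Hebbian rule, and asynchronous threshold updates; the function names, the two example patterns, and the 1/n scaling are hypothetical choices made for the example.

```python
import random

def hebbian_weights(patterns):
    """Outer-product Hebbian rule: w_ij accumulates s_i * s_j over the
    training patterns; self-connections are left at zero."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=10, rng=None):
    """Asynchronous threshold updates, a fixed number of passes over all units."""
    rng = rng or random.Random(0)
    s = list(state)
    n = len(s)
    for _ in range(sweeps):
        for i in rng.sample(range(n), n):
            field = sum(w[i][j] * s[j] for j in range(n))
            s[i] = 1 if field >= 0 else -1
    return s

# Two orthogonal training patterns over 8 units (hypothetical example data).
patterns = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
w = hebbian_weights(patterns)

# Corrupt one unit of the first pattern and let the dynamics clean it up.
probe = list(patterns[0])
probe[0] = -probe[0]
recalled = recall(w, probe)
print(recalled == patterns[0])
```

With only two orthogonal patterns the network is far below capacity, so each training pattern is a stable attractor and the corrupted probe falls back into the basin of the pattern it came from.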
For example, in some cases, the learning process may afford some simple forms of
generalisation, such as merging training patterns that are very similar into one class that becomes
represented by an idealised exemplar (solid point in Fig.1.b, centre). An appropriately trained
Hopfield network may thereby both classify patterns into different groups and generalise patterns
within a group – an appropriate balance will produce a general model that is not over-fitted to the
training set. Much is known about the capacity of such networks, i.e. the number of patterns they can
store and their limitations with respect to storing very similar patterns (McEliece et al. 1987). In
particular, the recall of ‘spurious’ patterns – patterns that are substantially different from all patterns
in the training set – is naturally considered to be a problem and something to be avoided in associative
memory (e.g. Gascuel et al 1994). A particularly successful approach for avoiding spurious attractors
uses a combination of Hebbian and anti-Hebbian mechanisms in interleaved phases (Hopfield et al.
1983). The Hebbian learning captures associations in the training set and the anti-Hebbian learning is
used to counteract the development of spurious attractors arising in the inherent dynamics of the
network. This can enable the network to learn models of data where only a subset of the states is
visible to the learning process, and consequently to learn models that are not a simple linear
combination of pair-wise dependencies (Hinton & Sejnowski 1985).
Fig 1. Optimisation and model induction. a) An optimisation process takes as input an implicitly defined
function over a space of configurations (left), perhaps defined by a network of dependencies among a set of
problem variables (above), and returns, in the ideal case, a single point in configuration space that corresponds
to the minimum of that function (centre). Various stochastic local search processes (right) are imperfect methods
for approximating this output: gradient descent (GD), Boltzmann machine (BM), Hopfield network optimisation
(HN). All of these methods suffer the restrictions on energy minimisation imposed by local optima. b) Model
induction is a process that takes as input a set of points in configuration space, ‘training patterns’, (left) and
returns a model of those points (centre). The model may act as an associative or content addressable memory
(right) which takes as input a (possibly partially specified) input pattern and ‘recalls’ the training pattern that is
most representative of that input. Such a memory can be implemented with a dynamical system, such as a
Hopfield network, trained by Hebbian learning to exhibit attractors that correspond to the training patterns
(right) – see text.
Optimisation and model induction form complementary parts of a picture of organismic
behaviour: For example, a neural network may be trained by Hebbian learning to represent a
distribution of stimuli and the subsequent energy minimisation behaviour of this network accesses an
associative memory that interprets a new, perhaps partial, stimulus by resolving constraints among
competing ‘hypotheses’ about that stimulus and ‘recalling’ an exemplar pattern (Hinton & Sejnowski
1985). But the notion that energy minimisation (in the Hopfield network, Hopfield & Tank 1985, for
example) performs effective optimisation is inconsistent with the notion that energy minimisation can
recall local attractors in an associative memory (Hopfield 1982). Specifically, if recall works well it
will find one of many local optima that represent each of the input patterns; but when optimisation
works well it will not return a local optimum but the globally-minimum-energy optimum. If Hopfield
networks were effective optimisers then in memory terms it would mean that all stimuli appeared to
be the same pattern. In practice this is not a problem because, in fact, the optimisation afforded by
dynamical energy minimisation is, put bluntly, not a very effective optimisation process.
Gradient descent (GD) (fig.1.a, right), the most basic form of local search, will necessarily
find a local minimum in an energy function; the Boltzmann machine (BM) descends the energy
surface but with a non-zero probability of admitting energy increases that may enable escape from
local optima; the use of Hopfield networks (HN) for optimisation may find superior minima in some
cases by allowing movements in a continuous space between the points of the original discrete
configuration space (indicated by a trajectory which commences from a point that is not on the
original energy surface) (Hopfield & Tank 1985). However, all of these methods suffer the
restrictions on energy minimisation imposed by local optima.
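The restriction imposed by local optima can be made concrete with a small sketch of deterministic energy descent in a discrete network. This is an illustrative assumption rather than the model used in the paper: units are bipolar (+1/-1), the energy is taken to be E(s) = -1/2 Σ w_ij s_i s_j, and the test problem (random symmetric +/-1 weights) and all function names are hypothetical.

```python
import random

def random_symmetric_weights(n, rng):
    """Hypothetical test problem: symmetric +/-1 weights, zero self-connections."""
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.choice((-1, 1))
    return w

def energy(w, s):
    """E(s) = -1/2 * sum_ij w_ij * s_i * s_j (lower means fewer violated constraints)."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def local_field(w, s, i):
    return sum(w[i][j] * s[j] for j in range(len(s)))

def relax(w, s, rng, max_sweeps=1000):
    """Visit units in random order and flip any unit whose flip strictly lowers E."""
    s = list(s)
    for _ in range(max_sweeps):
        flipped = False
        for i in rng.sample(range(len(s)), len(s)):
            if s[i] * local_field(w, s, i) < 0:  # flipping unit i lowers the energy
                s[i] = -s[i]
                flipped = True
        if not flipped:
            break  # no single flip helps: a local minimum (point attractor)
    return s

def is_local_minimum(w, s):
    return all(s[i] * local_field(w, s, i) >= 0 for i in range(len(s)))

rng = random.Random(1)
w = random_symmetric_weights(20, rng)
start = [rng.choice((-1, 1)) for _ in range(20)]
end = relax(w, start, rng)
print(energy(w, start), "->", energy(w, end))
```

The descent stops as soon as no single-unit flip lowers the energy, which says nothing about whether a lower minimum exists elsewhere in configuration space.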
Although the Boltzmann machine is proven to asymptotically approach the global optimum of
an energy function if its ‘temperature’ (a parameter indirectly controlling the likelihood of escaping
local minima) is appropriately annealed (Hinton & Sejnowski 1983, Kirkpatrick et al 1983), it
is nonetheless still a stochastic local search method. Low-energy local optima that distract from the
globally-minimal optimum are still problematic. In short, there is an inevitable trade-off that any local
search method must suffer; to the extent that local gradients are misleading they must be ignored (by
allowing energy increases), and to the extent that local gradients are ignored, the time to find low
energy states is increased. In the Boltzmann machine this trade-off is very obvious; low temperatures
or quickly annealed temperatures find sub-optimal solutions quickly, slowly annealed temperatures
can (in the limit) find optimal solutions but require time exponential in the size of the problem to do
so. Many modifications and enhancements to the original Hopfield network have been proposed for
optimisation purposes but in most cases the basic behaviour of the network remains – a relaxation to a
local minimum.
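The temperature trade-off can be sketched as follows. This is a hedged illustration, not a reproduction of any cited algorithm: it assumes bipolar units, a logistic flip probability 1 / (1 + exp(ΔE / T)) in the style of the Boltzmann machine, and a geometric cooling schedule; the schedule, parameter values, and function names are hypothetical.

```python
import math
import random

def random_symmetric_weights(n, rng):
    """Hypothetical test problem: symmetric +/-1 weights, zero self-connections."""
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.choice((-1, 1))
    return w

def energy(w, s):
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def anneal(w, rng, t_start, t_end, n_sweeps):
    """Stochastic unit updates while the temperature is lowered geometrically.
    Returns the lowest-energy state seen and its energy."""
    n = len(w)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    best, best_e = list(s), energy(w, s)
    cooling = (t_end / t_start) ** (1.0 / max(1, n_sweeps - 1))
    t = t_start
    for _ in range(n_sweeps):
        for i in rng.sample(range(n), n):
            field = sum(w[i][j] * s[j] for j in range(n))
            delta = 2 * s[i] * field  # energy change if unit i flips
            x = delta / t
            # Logistic flip probability; the guard avoids overflow in exp().
            p_flip = 0.0 if x > 700 else 1.0 / (1.0 + math.exp(x))
            if rng.random() < p_flip:
                s[i] = -s[i]
        e = energy(w, s)
        if e < best_e:
            best, best_e = list(s), e
        t *= cooling
    return best, best_e

w = random_symmetric_weights(20, random.Random(2))
fast_state, fast_e = anneal(w, random.Random(5), t_start=2.0, t_end=0.1, n_sweeps=5)
slow_state, slow_e = anneal(w, random.Random(5), t_start=2.0, t_end=0.1, n_sweeps=200)
print("fast schedule:", fast_e, " slow schedule:", slow_e)
```

The only difference between the two calls is the number of sweeps over which the temperature falls, which is the quantity that the trade-off described above is about; no particular outcome is guaranteed for a single run.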
These two different uses of Hopfield networks (and Boltzmann machines) have an extensive and
intertwined literature, but these two uses have apparently incommensurate objectives. It should be
clear that model induction tasks, where Hebbian learning has been widespread, do not perform
optimisation. Indeed, optimisation takes as input a network that represents a set of constraints or
dependencies among problem variables, and if it works well, returns a single point in configuration
space that minimises the conflicts or costs of unsatisfied dependencies. In contrast, model induction
takes as input a distribution of training samples, and if it works well, recreates that distribution,
potentially capturing regularities and substructures inherent in the training set.
Note that optimising a model – optimising the goodness of fit between a model and the
training data – is the problem of model induction, not optimisation: it is not the problem of finding the
minimum energy state of a dynamical system. Hebbian learning is known to be effective at optimising
a model (Ackley et al 1983), but the intent of this process is to output a network that has a specific
(multi-attractor) energy function, not to output the minimum-energy configuration of an existing
network (sensu Hopfield & Tank). Similarly, optimising a desired input-output mapping in a feed-
forward network, e.g. with back-propagation or gradient descent (Rumelhart & McClelland 1986),
may be assisted by Hebbian learning (e.g. the ‘Leabra’ algorithm O'Reilly & Munakata, 2000). But
again this is a different objective from finding the minimum-energy state of a dynamical system, i.e. it
aims to output a network that implements an input-output mapping, not a state configuration. In
general, the use of Hebbian learning to identify and amplify the principal components of a training set
(Linsker 1988) as a pre-processing stage for learning an input-output mapping is also common. The
underlying reasons for the improvements in energy minimisation that we demonstrate below are
related at a deep level to those demonstrated in optimising models and learning feed-forward
networks. But in none of these prior works is an associative memory model (in the style of Hopfield
1982) developed within a Hopfield network that is simultaneously performing optimisation (in the
style of Hopfield and Tank 1985, 1986). This is somewhat surprising, given how well-known each of
these behaviours is – but perhaps understandable given their incommensurate objectives.
Despite their apparent incongruence, we find that bringing these two behaviours (optimisation
via energy minimisation and induction of an associative memory model via Hebbian learning)
together in the same network has surprising consequences that are very significant for the ability of
dynamical systems to find low-energy attractors. The dynamical machinery involved is both that of
model induction via Hebbian learning and of optimisation, but the outcome is an optimisation process
because the ‘model’ that is induced is only a model of the lowest energy configurations. Specifically,
a given energy function is transformed into a different energy function such that low-energy
configurations, possibly the globally minimal energy configuration of the original system, are easily
retrieved.
The basic protocol that we investigate (with variants) is as follows: A network is repeatedly
run for some time from different arbitrary initial conditions. At all time steps, Hebbian learning is
applied to the weights of the network. Accordingly, this alters the energy function of the network and
potentially alters the dynamics of the network considerably. From an optimisation point of view, it
might seem that altering the energy function away from something that represents the true objective
function cannot be a good thing to do. But the application of Hebbian learning in this manner has
systematic and predictable consequences on the energy minimisation behaviour of the network.
Specifically, assuming that the duration of each run of the minimisation process is long compared to
the time to find a local optimum, most learning occurs at local optima. This causes the system to
develop a memory of the attractors that it visits. A learning network capable of inducing an
associative memory is thus ‘turned upon itself’ – augmenting its behaviour with an induced model of
its own behaviour. In so doing the intrinsic behaviour of the network is modified, thus altering future
behaviour and future learning, and so on. On the face of it, developing an associative memory of
locally optimal attractors seems like it would be fruitless for optimisation. But the generalisation
ability of the learning process, which has arguably been under-appreciated, produces non-trivial
effects.
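The basic protocol can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: units are bipolar, one asynchronous sweep over all units is treated as one time step, the Hebbian increment is rate * s_i * s_j, and the problem instance, learning rate, run counts, and function names are all hypothetical.

```python
import random

def random_symmetric_weights(n, rng):
    """Hypothetical test problem: symmetric +/-1 weights, zero self-connections."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.choice((-1.0, 1.0))
    return w

def energy(w, s):
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def sweep(w, s, rng):
    """One asynchronous pass over all units, in place; True if any unit flipped."""
    flipped = False
    n = len(s)
    for i in rng.sample(range(n), n):
        if s[i] * sum(w[i][j] * s[j] for j in range(n)) < 0:
            s[i] = -s[i]
            flipped = True
    return flipped

def learn_while_relaxing(w_orig, n_runs, sweeps_per_run, rate, rng):
    """Repeatedly relax from random initial states, applying a Hebbian
    update to the working weights after every time step (sweep)."""
    n = len(w_orig)
    w = [row[:] for row in w_orig]  # the original weights are left untouched
    for _ in range(n_runs):
        s = [rng.choice((-1, 1)) for _ in range(n)]
        for _ in range(sweeps_per_run):
            sweep(w, s, rng)
            for i in range(n):
                for j in range(n):
                    if i != j:
                        w[i][j] += rate * s[i] * s[j]
    return w

def relax(w, s, rng, max_sweeps=1000):
    s = list(s)
    for _ in range(max_sweeps):
        if not sweep(w, s, rng):
            break
    return s

rng = random.Random(3)
n = 20
w_orig = random_symmetric_weights(n, rng)
w_learned = learn_while_relaxing(w_orig, n_runs=200, sweeps_per_run=10,
                                 rate=0.002, rng=rng)
final = relax(w_learned, [rng.choice((-1, 1)) for _ in range(n)], rng)
print("energy of the learned attractor in the original network:",
      energy(w_orig, final))
```

Because each run lasts longer than the time typically needed to reach a local optimum, most of the Hebbian increments in this sketch are applied while the state sits at an attractor, which is the condition the protocol relies on. The final line evaluates the learned network's attractor under the unmodified weights, since the original energy function remains the objective.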
The resultant transformation of the energy function is depicted in Fig. 2 and proceeds as
follows: i) the natural energy minimisation behaviour of a system repeatedly samples local optima in
the energy function. These act as ‘training samples’ for the concurrent development of an associative
memory that (imperfectly) models the original function. ii) Simple generalisation resulting from the
training process may cause subsets of similar patterns to be represented by a single idealised exemplar
in the associative memory (solid point). As local sampling continues in the energy function now
augmented by the learned model, a slightly different distribution of local optima determines the
training samples that further update the model. Occasionally, the training process may create
‘spurious attractors’ that do not correspond to any of the training points. iii) As sampling continues on
this modified model, the distribution of local optima used for subsequent training becomes a more and
more degenerate representation of the original function, including points that correspond to ‘spurious’
attractors (shaded). Because we are not using an anti-Hebbian phase or any other mechanism to deter
them, the model increasingly amplifies spurious attractors that, through an (as yet unexplained)
generalisation principle, come to correspond to the lowest energy attractors of the original function.
Ultimately the modified system becomes a ‘model’ of the global optimum of the original function;
that is, its only attractor corresponds to the global optimum of the original function.
Fig 2. Overview of how Hebbian learning modifies the energy function of a dynamical system. We
investigate the ability of associative memory to transform a complex function into a different function which is
easier to optimise. This is achieved via a continuous process of sampling local optima and inducing a model of
them that becomes an increasingly generalised representation of the original function (i-iii), see text. iv)
Ultimately, (through some generalisation principle as yet unexplained), its only attractor corresponds to the
global optimum of the original function – see text.
To clarify, it is worth emphasising what exactly is shown in this effect. We start with a
dynamical system with complex constrained interactions that produces many local optima. In general,
relaxation of the network results in a configuration that is locally optimal but possibly far from the
minimum energy attractor that is possible in this network. The globally minimal configuration of this
network is rarely visited and in a finite sample of initial conditions it may remain unvisited with high
probability. Let us, so to speak, take a copy of that original network and put it to one side. Now we
apply Hebbian learning to the network, as it repeatedly visits different local attractors, as described
above. The result of this is that the network is modified into a new network. This network no longer
exhibits the full range of behaviours that the original network did, and in the limit has only one
attractor. The globally minimal energy configuration of this new network is easy to find – the network
finds this configuration by relaxation from any initial condition. We then compare the one attractor of
this new system to the attractors of the original system we saved earlier. We find that the
configuration found at this one attractor is a configuration that has very low energy in the original
system, and under certain conditions, is actually the configuration that was the globally minimal
energy configuration of the original system. Hebbian learning does not merely modify the network
into a ‘simpler’ network with fewer attractors, but the attractors of the new system have a special
relationship to the attractors of the original system in that they are especially low-energy
configurations.
This seems less mysterious once we note that the lowest-energy attractors of a network like this
tend to be those with the largest basins of attraction (Fontanari 1990). The more an
attractor is visited, the more it is learned; if visitation is proportional to the size of the basin of
attraction and large basins are correlated with low energy, then learning preferentially reinforces the
low-energy attractors. The effect would be interesting even if all it showed was that as a dynamical system
repeatedly sampled its local attractors, the attractors that are visited most often (having the largest
basins of attraction and on average tending to have the lowest energy) become ‘over-learned’ such
that they become the only attractors of the system. This alone would indicate that Hebbian learning
can be used not just to model an energy function or distribution of samples but to modify that energy
function such that the lowest energy patterns are found more quickly and reliably, and that the basin
of attraction for these attractors is enlarged making these configurations more robust to perturbations.
But the effect we illustrate is not just this.
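The relationship between visitation frequency and energy can be probed empirically by relaxing a network from many random initial states and tallying where it ends up. The sketch below is illustrative only: it assumes bipolar units and a hypothetical random problem instance, and it uses visit counts as a sampled estimate of relative basin size. Because a state and its negation have equal energy under this energy function, the two are merged into one entry.

```python
import random
from collections import Counter

def random_symmetric_weights(n, rng):
    """Hypothetical test problem: symmetric +/-1 weights, zero self-connections."""
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.choice((-1, 1))
    return w

def energy(w, s):
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def relax(w, s, rng, max_sweeps=1000):
    """Asynchronous descent until no single flip lowers the energy."""
    s = list(s)
    n = len(s)
    for _ in range(max_sweeps):
        flipped = False
        for i in rng.sample(range(n), n):
            if s[i] * sum(w[i][j] * s[j] for j in range(n)) < 0:
                s[i] = -s[i]
                flipped = True
        if not flipped:
            break
    return s

def sample_attractors(w, n_samples, rng):
    """Count how often each attractor is reached from random initial states."""
    visits = Counter()
    n = len(w)
    for _ in range(n_samples):
        s = relax(w, [rng.choice((-1, 1)) for _ in range(n)], rng)
        if s[0] < 0:  # merge each state with its sign-flipped twin
            s = [-x for x in s]
        visits[tuple(s)] += 1
    return visits

rng = random.Random(4)
w = random_symmetric_weights(16, rng)
visits = sample_attractors(w, 500, rng)
for state, count in visits.most_common(5):
    print("visits:", count, " energy:", energy(w, state))
```

Listing the most visited attractors alongside their energies gives a quick check of whether large basins and low energy coincide for a given instance; attractors whose basins are too small to be hit in the sample will simply not appear in the tally.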
The more surprising aspect of the effect is that the point attractors in the trained system
correspond to point attractors of the original system that would not have been sampled on this
timescale without the learning process. That is, although these point attractors existed in the original
system, and they had larger basins than other attractors, their basins of attraction were actually very
small – so small that, on average, they would not have been visited. If this were not the case, they