Abstraction for Genetics-Based Reinforcement Learning
ABSTRACT
Abstraction may appear a trivial task for humans, and the positive results from this work intuitive, but abstraction has not been routinely used in genetics-based reinforcement learning. One reason is that the time each iteration requires is an important consideration, and abstraction increases the time for each iteration. Typically, XCS takes 20 minutes to play 1000 games (and remains constant), mXCS with abstraction takes 20 minutes for 100 games (although this can vary greatly depending on the choice of parameters), and the Q-Learning algorithm ranges from 5 minutes for 1000 games initially to 90 minutes for 1000 games after 100,000 games of training. However, given a fixed amount of time to train all three algorithms, mXCS with abstraction would perform the best once the initial base rules were found. The Q-Learning algorithm has to visit every single state at least once in order to form a successful playing strategy. Whilst the Q-Learning system would ultimately play a very good game, weeks of computation failed to achieve the level of success the Abstraction algorithm reached in a very short space of time (hours rather than weeks). Although better Q-learning algorithms (including generalization capabilities) exist (Sutton & Barto, 1998), this choice of benchmark algorithm illustrated the scale of the problem, which is otherwise difficult to quantify.
The improvement in abstraction performance from standard XCS to the modified XCS was due to using a simpler reinforcement learning update. The Widrow-Hoff delta rule converges much faster, which is beneficial in simpler domains that can be solved easily. However, slower and more graceful learning may be required in complex domains when interacting with higher-level features. The abstracted rules allow the system to play on states as a whole, including states that have not been encountered, where these states contain a known pattern. This is useful in data-mining, but carries the inherent dangers of interpolation and extrapolation.
The abstracted rule-base is also compact, as an abstracted rule covers more states than either a generalized LCS rule or a Q-learning state. Unique states may still be covered by the base rules. Abstraction has been shown to give an improvement in a complex, but structured, domain. It is anticipated that the Abstraction algorithm would be suited to other domains containing repeated patterns.
Dr Will Browne, Dan Scott and Charalambos Ioannides
University of Reading
Abstraction is a higher order cognitive ability that facilitates the production of rules that are
independent of their associations.
In standard reinforcement learning it is often expedient to directly associate situations
(states) with actions in order to maximise the environmental reward signal. This may lead to
problems including a lack of generalisation and not utilising higher order patterns in
complex domains. Thus standard Q-learning has been developed to include models or
genetics-based search (Learning Classifier Systems), which improve learning speeds and
generality. In order to extend reinforcement learning techniques to higher-order rules,
abstraction is considered here.
The process of abstraction can be likened to Information Processing Theory (a branch of
Learning Theory) (Miller, 1956), which suggests that humans have the ability to recognize
patterns in data and chunk these patterns into meaningful units. The individual patterns do
not necessarily remain in a memory store due to the holistic nature of the individual
patterns. However, the chunks of meaningful information remain, and become a basic
element of all subsequent analyses.
The need for abstraction arose from the data-mining of rules in the steel industry through
application of the genetics-based machine learning technique of Learning Classifier Systems
(Holland, 1975), which utilise a Q-learning type update for reinforcement learning. It was
noted that many rules had similar patterns. For example, there were many rules of the type
'if side guide setting < width, then poor quality product' due to different product widths.
This resulted in a rule-base that was unnecessarily hard to interpret and slow to learn. The
initial development of the abstraction method was based on the known problem of
Connect 4 due to its vast search space, temporal nature and available patterns.
The contribution of this chapter is that the novel method of abstraction is described and
shown to be effective on a large search space test problem. Abstraction enabled higher order
rules to be learned from base knowledge, which mimic important aspects of human
cognition. Tests showed that the abstracted rules were more compact, had greater utility
and assisted in developmental learning. The emergence of abstracted rules corresponded
with escaping from local minima that would have otherwise trapped basic reinforcement
learning techniques, such as standard Q-learning.
Source: Reinforcement Learning: Theory and Applications, Book edited by Cornelius Weber, Mark Elshaw and Norbert Michael Mayer
ISBN 978-3-902613-14-1, pp.424, January 2008, I-Tech Education and Publishing, Vienna, Austria
During the application of the Genetics-Based Machine Learning technique of Learning
Classifier Systems (LCS) to data-mine rules in the steel industry, Browne noted that many
rules had similar patterns (Browne, 2004), such as the side guide rules described above. This
resulted in a rule-base that was unnecessarily hard to interpret and slow to learn. A method
is therefore sought to generate higher order (abstracted) rules from the learnt base rules.
A novel Abstraction algorithm has been proposed (see figure 1) to improve the performance
of a reinforcement learning genetics-based machine learning technique in a complex multi-
step problem (Browne & Scott, 2005). It is hoped that this algorithm will help reinforcement
learning techniques identify higher-order patterns inherent in an environment.
Fig. 1. Abstraction from data to higher order rules.
2.1 Test domain
Connect 4 is a turn-based game between two players, each trying to be the first to achieve
four counters in a row (horizontally, vertically or diagonally). The game takes place on a
7 × 6 board; players take it in turns to drop one of their counters into one of the seven columns.
The counters will drop to the lowest free space in the column. Play continues until the board
is full or one player gets four in a row, see figure 2. Optimum strategies exist (Allis, 1988;
Watkins, 1989), so the problem is both known and bounded.
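To make the domain concrete, the move mechanics described above can be sketched in a few lines. This is an illustrative Python sketch, not the chapter's implementation (which was a Java client-server program); all names and the cell symbols are assumptions.

```python
# Minimal Connect 4 mechanics: 7 columns x 6 rows, counters fall to the
# lowest free space, and a win is four in a row in any direction.
ROWS, COLS = 6, 7
EMPTY = '.'

def new_board():
    return [[EMPTY] * COLS for _ in range(ROWS)]  # row 0 is the top row

def drop(board, col, piece):
    """Drop a counter into `col`; it falls to the lowest free space."""
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == EMPTY:
            board[row][col] = piece
            return row
    raise ValueError("column full")

def four_in_a_row(board, piece):
    """Check horizontal, vertical and both diagonal lines of four."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS
                       and board[rr][cc] == piece for rr, cc in cells):
                    return True
    return False
```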
A client-server program of Connect 4 was written in Java, as Java Applets can easily be
viewed on the internet, allowing a website to be constructed for this project.
A Q-Learning (Sutton & Barto, 1998) approach to the problem is implemented in order to
provide benchmark learning performance. Two different approaches were taken to training
the Q-Learning system. The first progressively trained the algorithm against increasingly
hard opponents, whilst the second trained for the same number of games, but against the
hardest opponent from the outset.
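The benchmark update itself is standard tabular Q-learning; a minimal sketch follows, where the learning rate and discount values are illustrative assumptions rather than the chapter's settings, and states are assumed to be encoded as board strings.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # illustrative values, not from the chapter
Q = defaultdict(float)           # Q-table keyed by (state, action)

def q_update(state, action, reward, next_state, next_actions):
    """One tabular Q-learning step: move Q(s, a) toward
    reward + gamma * max over a' of Q(s', a')."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                   - Q[(state, action)])
```

Because every (state, action) pair has its own table entry, the system must visit each state at least once, which is the scaling problem noted in the abstract.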
Fig. 2. Connect 4 board, black horizontal win
The Abstraction algorithm requires rules in order to perform abstraction. A well-known
LCS, XCS (Butz, 2004), was implemented to create these rules and to provide a second benchmark.
3. Biological inspiration for abstraction
The human brain has inspired artificial intelligence research, such as the development of
Artificial Neural Networks that model aspects of low-level neuronal activity. Higher-level
functional modelling has also been undertaken, see ACT-R and SOAR architectures
(Anderson et al, 2004; Laird et al, 1987). Behavioural studies suggest that pattern
recognition, which includes abstraction, is important to human cognition. Thus this section
considers how the brain abstracts. This includes using the common neuroscience technique
of studying subjects with lesions to specific brain areas.
It has been observed in cases of autism that there is a lack of abstraction. A well studied case
is that of Kim Peek, due to his Savant abilities and popularity as the inspiration for the main
character in the film Rain Man. He was born with macrocephaly (an enlarged head), an
encephalocele (part of one or more of the skull plates did not seal) and agenesis of the
corpus callosum (the bundle of nerves that connects the two hemispheres of the brain is
missing). Brain studies, such as MRI, show that there is also no anterior commissure and
damage to the cerebellum.
Kim has the ability to analyse certain types of information in great detail, e.g. Kim's father
indicates that by the age of 16-20 months Kim was able to memorize every book that was
read to him. It is speculated that neurons have made other connections in the absence of a
corpus callosum, resulting in the increased memory capacity (Treffert & Christensen, 2005).
However, Kim has difficulty with motor skills, such as buttoning a shirt, which is likely to
be caused by the damaged cerebellum as it normally coordinates motor activities. His
general IQ is well below normal, but he scores very highly in some subtests.
An absent corpus callosum (ACC) does not regenerate as no new callosal fibers emerge
during an infant's development. Although people with ACC lead productive and
meaningful lives there are common developmental problems that may occur with disorders
of the corpus callosum (DCC). The National Organization for Disorders of the Corpus
Behaviorally individuals with DCC may fall behind their peers in social and
problem solving skills in elementary school or as they approach adolescence. In
typical development, the fibers of the corpus callosum become more efficient as
children approach adolescence. At that point children with an intact corpus
callosum show rapid gains in abstract reasoning, problem solving, and social
comprehension. Although a child with DCC may have kept up with his or her
peers until this age, as the peer-group begins to make use of an increasingly
efficient corpus callosum, the child with DCC falls behind in mental and social
functioning. In this way, the behavioral challenges for individuals with DCC may
become more evident as they grow into adolescence and young adulthood.
Behavioural characteristics related to DCC include difficulties on multidimensional tasks, such as
using language in social situations (for example, jokes, metaphors), appropriate motor
responses to visual information (for example, stepping on others' toes, handwriting that runs off
the page), and the use of complex reasoning, creativity and problem solving (for example,
coping with maths and science requirements in middle school and high school, budgeting).
The connection between the left and right half of the brain is important as each hemisphere
tends to be specialised on certain tasks. The HERA model asserts that the left pre-frontal
cortex is associated with semantic (meaning) memory, whilst the right is associated with
episodic (temporal) memory (Tulving et al., 1994). Memories themselves are associated with
the hippocampus, which assists in transforming short to long term memory. This is intact in
many savants, such as Kim Peek. Thus, it is postulated here that a link is needed between
the separated episodic and semantic memory areas in order for abstract, higher order,
knowledge to form; it is not sufficient just to create long-term generalised memories.
A caveat of the above analysis is that even with modern behavioural studies, functional
MRI, PET scans and other neurological analysis, the brain/mind is highly complex, plastic
and still not fully understood.
4. Learning classifier systems
This section outlines the architecture of XCS, including the required adjustments for the
Connect 4 domain, so that it may train against a pre-coded expert system. A standard XCS
(Butz, 2004, available from www-illigal.ge.uiuc.edu/) was implemented with the
Abstraction algorithm (see section 5). Following these results, tests were also conducted with
a modified version of XCS (mXCS) that had its reinforcement learning component adjusted
to complement the Abstraction algorithm.
4.1 Setup and board representation
The board representation formed an important part of the LCS. Each space on the board
could be one of three possible states: red, yellow or empty. However, it was considered
useful to split the empty squares further into two categories, playable and unplayable
(unplayable squares lie above the playable squares and become playable as the squares
beneath them are filled).
A two character representation for each space was chosen, leading to an 84 character long
string representing the board (running from top row to bottom row). The encoding for a red
was chosen as “11” and a yellow was “10”, a playable space was “00” whilst an unplayable
was “01”. Mutation may only generalize, replacing specific characters with a “#”; a hash
can stand for either a “1” or a “0”.
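Under this encoding, matching a rule condition against a board state reduces to a per-character comparison. The following sketch is illustrative; the function names and the cell labels are assumptions, not taken from the XCS source.

```python
# Two characters per cell; a full board is 42 cells = 84 characters,
# written from the top row to the bottom row.
CODE = {'red': '11', 'yellow': '10', 'playable': '00', 'unplayable': '01'}

def encode(board_cells):
    """board_cells: cell labels, top row first -> bit-pair string."""
    return ''.join(CODE[c] for c in board_cells)

def matches(condition, state):
    """Per-character match; a '#' in the condition stands for '1' or '0'."""
    return len(condition) == len(state) and all(
        c == '#' or c == s for c, s in zip(condition, state))
```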
4.2 Gameplay and reward
The LCS must decide upon the best move to play at its turn without knowing where its
opponent will play in the subsequent turn. An untrained LCS will often play randomly as it
attempts to learn the best moves to play. After each move has been played by the opponent,
the LCS attempts to match the state of the board to its rules. Attached to each of these
classifiers are three pieces of information: the move that should be played, the win score (the
higher this is the more likely a win will occur) and the accuracy score (accuracy of the win
score). Win scores below 50 indicate a predicted loss; scores above 50, a predicted win.
After matching, an action must be selected through explore, exploit or coverage. Exploring
(which is most likely to happen) uses a weighted roulette wheel based on accuracy to choose
a move. Exploiting chooses the move that has the greatest win score and is used for
performance evaluation. Coverage generates a new rule by simply selecting a random move
to play for the current board position.
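The three routes to an action can be sketched as follows. The rule fields mirror the three pieces of information described above (move, win score, accuracy), but the function itself is an illustrative assumption, not the chapter's implementation.

```python
import random

def select_action(matched_rules, legal_moves, mode, rng=random):
    """Choose a move by coverage, exploitation or exploration."""
    if not matched_rules:                  # coverage: no rule matched,
        return rng.choice(legal_moves)     # pick a random legal move
    if mode == 'exploit':                  # exploit: greatest win score
        return max(matched_rules, key=lambda r: r['win'])['move']
    # explore: accuracy-weighted roulette wheel over the matched rules
    total = sum(r['accuracy'] for r in matched_rules)
    pick = rng.uniform(0, total)
    for r in matched_rules:
        pick -= r['accuracy']
        if pick <= 0:
            return r['move']
    return matched_rules[-1]['move']
```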
The GA threshold, θGA, was set to 1000 games: the GA would run after each set of 1000
games had been played, and the maximum population size was set to 5000. The crossover
operator, χ, was set to generate 500 random crossovers every time the GA was run; of these,
approximately 100 in every GA run passed validity checks and were added to the new
population. The mutation rate, μ, was set so that each rule had a 1% chance of receiving a
mutation, and then each character in that rule a 2% chance of being mutated. Deletion
probabilities (θdel) were based upon tournament selection of rule fitness, and the number of
rules deleted was chosen to keep the population size at 5000.
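The two-level mutation rate can be sketched as follows. This is one illustrative reading of the parameters above, combined with the constraint from section 4.1 that mutation may only generalise a character to “#”.

```python
import random

RULE_MUTATION_RATE = 0.01   # 1% chance that a rule mutates at all
CHAR_MUTATION_RATE = 0.02   # then 2% chance per character

def mutate(condition, rng=random):
    """Generalising mutation: specific characters may become '#'."""
    if rng.random() >= RULE_MUTATION_RATE:
        return condition                   # rule escapes mutation
    return ''.join('#' if c != '#' and rng.random() < CHAR_MUTATION_RATE
                   else c
                   for c in condition)
```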
The standard reinforcement update for LCS is the Widrow-Hoff update (Butz & Wilson,
2002), which is a recency weighted average. A Q-learning type update is used within the
LCS technique for multistep decision problems (Lanzi, 2002).
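Both updates move a rule's prediction toward a target. A sketch follows, where beta is an illustrative value and gamma = 0.71 is a commonly used XCS default rather than a value stated in this chapter.

```python
BETA, GAMMA = 0.2, 0.71   # illustrative learning rate; common XCS discount

def widrow_hoff(prediction, reward):
    """Recency-weighted average: move the prediction toward the reward."""
    return prediction + BETA * (reward - prediction)

def multistep_update(prediction, reward, max_next_prediction):
    """Q-learning-style target used by LCS on multi-step problems:
    the discounted best prediction of the next match set is added."""
    target = reward + GAMMA * max_next_prediction
    return prediction + BETA * (target - prediction)
```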
5. Abstraction algorithm
The Abstraction algorithm was designed to work upon generated rules, e.g. by the LCS.
Abstraction is independent of the data itself. Other methods, such as the standard coverage
operator, depend directly on the data. Crossover and mutation depend indirectly on the
data as they require the fitness of the hypothesized rules, which is dependent on the data.
Abstraction is a higher order method as, once good rules have been discovered, it could
function without the raw data being available.
The abstraction attempts to find patterns in the rules that performed best within the LCS.
Having found a pattern common to two or more of the LCS rules, the Abstraction algorithm
generates a new rule in the abstracted population based solely on this pattern. This
allows the pattern to be matched when it occurs in any state, not just the specific rules that
exist within the LCS.
Not all of the rules generated by the LCS are worthwhile and therefore the Abstraction
algorithm should not be run upon all of the rules within the LCS. The domain is noiseless,
so the criteria chosen to govern the testing of rules for abstraction were that a rule must
have a 100% win score and 100% accuracy. Therefore the rules abstracted
by the Abstraction algorithm should only be rules that lead to winning situations.
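These selection criteria, together with pattern extraction, can be sketched as follows. The positional comparison here is a much-simplified stand-in for the chapter's windowing mechanism; the rule fields and function names are assumptions.

```python
def abstraction_candidates(rules):
    """Only fully accurate, fully winning rules are abstracted (the
    domain is noiseless, so these thresholds are exact)."""
    return [r for r in rules if r['win'] == 100 and r['accuracy'] == 100]

def common_pattern(cond_a, cond_b):
    """Keep positions where two conditions agree on a specific value
    and generalise every other position to '#'."""
    return ''.join(a if a == b and a != '#' else '#'
                   for a, b in zip(cond_a, cond_b))
```

An abstracted rule built this way matches the shared pattern wherever it occurs, not just in the specific states the base rules were learned from.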
The main mechanism that allowed the abstraction to perform was a windowing function
that was used in rule generation as well as rule selection (when it came to choosing an
abstracted rule to play). The windowing function acted as a filter that was passed over the