Attentional Action Selection using
Reinforcement Learning
Dario Di Nocera, Alberto Finzi, Silvia Rossi, and Mariacarla Staffa

Dipartimento di Scienze Fisiche, Dipartimento di Informatica e Sistemistica,
University of Naples "Federico II", Naples, Italy, {finzi,srossi,mariacarla.staffa}
Abstract. We present a reinforcement learning approach to attentional allocation and action selection in a behavior-based robotic system. Reinforcement learning is typically used to model and optimize action selection strategies; in this work we deploy it to optimize attentional allocation strategies, while action selection is obtained as a side effect. We detail our attentional allocation mechanisms, describing the reinforcement learning problem and analysing its performance in a survival domain.

Keywords: attention allocation, reinforcement learning, action selection
1 Introduction
Beyond their role in perception orientation and filtering, attentional mechanisms are considered key mechanisms in sensorimotor coordination and action control. Indeed, in biological systems, executive attention and attention allocation strategies are strictly connected with action selection and execution [5, 7]. In this work we explore this connection in a robotic setting deploying a reinforcement learning framework. More specifically, we propose a reinforcement learning approach to attention allocation and action selection in a behavior-based robotic system. Reinforcement learning (RL) is typically used to model and optimize action selection strategies, both in artificial [10] and biological systems [6, 4]. In contrast, in this work we deploy RL to optimize attention allocation strategies, while action selection is obtained as a side effect of the resulting attentional behavior. Reinforcement learning models for attention allocation have been mainly proposed for visual attention and gaze control [1, 8]; here we apply an analogous approach to executive attention, considering the problem of a supervisory attentional system [7] suitable for monitoring and coordinating multiple parallel tasks.

Our attentional system is obtained as a reactive, behavior-based system, endowed with simple, bottom-up attentional mechanisms capable of monitoring multiple concurrent tasks. We assume a frequency-based model of attention allocation [9]. Specifically, we introduce simple attentional mechanisms regulating sensors' sampling rates and action activations [2, 3]: the higher the attention, the
higher the resolution at which a process is monitored and controlled. In this framework, reinforcement learning is used to select the best regulations for these mechanisms. We detail the approach describing the reinforcement learning problem and analyzing its performance in a simulated survival domain. The collected results show that the approach is feasible and effective in different settings. That is, reinforcement learning applied to attentional allocation allows not only to reduce and focus sensor processing, but also to significantly improve sensorimotor coordination and action selection.
2 Background and Model
2.1 Attentional System
Our attentional system is obtained as a reactive behavior-based system where each behavior is endowed with an attentional mechanism represented by an internal adaptive clock [2].
Fig. 1: Schema theory representation of an attentional behavior.
In Figure 1 we show a schema theory representation of an attentional behavior. This is characterized by a Perceptual Schema (PS), which elaborates sensor data, a Motor Schema (MS), producing the pattern of motor actions, and an attentive control mechanism, called Adaptive Innate Releasing Mechanism (AIRM), based on the combination of a clock and a releaser. The releasing mechanism works as a trigger for the MS activation, while the clock regulates the sensors' sampling rate and the behaviors' activations. The clock regulation mechanism is our frequency-based attentional mechanism: it regulates the resolution at which a behavior is monitored and controlled; moreover, it provides a simple prioritization criterion. This attentional mechanism is characterized by:
- An activation period p_b, ranging in an interval [p_min, p_max], where b is the behavior's identifier.
- A monitoring function f(σ_b(t), p_b(t)) that adjusts the current clock period p_b(t), according to the internal state of the behavior and to the environmental changes.
- A trigger function ρ(t, p_b(t)), which enables/disables the data flow σ_b(t) from the sensors to the PS at each p_b(t) time unit.
- A normalization function φ(f(σ_b(t), p_b(t))) that maps the values returned by f into the allowed range [p_min, p_max].

The clock period at time t is regulated as follows:

p_b(t) = ρ(t, p_b(t−1)) × φ(f(σ_b(t), p_b(t−1))) + (1 − ρ(t, p_b(t−1))) × p_b(t−1)

That is, if the behavior is disabled, the clock period remains unchanged, i.e. p_b(t) = p_b(t−1). Otherwise, when the trigger function is 1, the behavior is activated and the clock period changes according to φ.
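To make the regulation concrete, the update above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names mirror the symbols ρ, φ, and f, while the concrete trigger, clamp, and monitoring definitions used in the usage example are our own assumptions.

```python
def update_clock_period(rho, phi, f, sigma_t, p_prev, t):
    """One step of the AIRM clock regulation (illustrative sketch).

    rho(t, p): trigger function, 1 when the behavior fires at time t, else 0.
    f(sigma, p): monitoring function combining the input signal and the current period.
    phi(v): normalization clamping f's output into the allowed period range.
    """
    trigger = rho(t, p_prev)
    # Triggered: adopt the normalized monitoring output; otherwise keep the old period.
    return trigger * phi(f(sigma_t, p_prev)) + (1 - trigger) * p_prev
```

With a trigger that fires every p_prev time units and a clamp into [1, 8], a triggered step replaces the period while an untriggered step leaves it unchanged, exactly as in the formula.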
2.2 Reinforcement Learning for Attentional Action Selection
Given the attention mechanisms introduced above, our aim is to exploit Rein-
forcement Learning (RL) to regulate the monitoring functions.
Reinforcement learning and Q-learning. RL [10] solves an optimization problem represented as a Markov Decision Problem (MDP) without a model (that is, without the transition and reward functions) and can be used on-line. An MDP is defined by a tuple (S, A, R, P), where S is the set of states, A is the set of actions, R is the reward function R : S × A → R, with R(s, a) the immediate reward in s ∈ S after the execution of a ∈ A; P is the transition function P : S × A × S → [0, 1], with P(s, a, s′) the probability of reaching s′ ∈ S after the execution of a ∈ A in s ∈ S. A solution of an MDP is a policy π : S → A, which maps states into actions. The value function V^π(s) is the cumulated expected reward from the state s ∈ S following π. The q-value Q(s, a) is the expected discounted sum of future payoffs obtained by executing the action a from the state s and following an optimal policy π*, i.e. Q(s, a) = E{r_t + γ V*(s_{t+1}) | s_t = s, a_t = a}, with V* associated to π*. In Q-learning [12] (QL), the Q-values are estimated through the agent's experience after being initialized to arbitrary numbers. For each execution of an action a_t leading from the state s_t to the state s_{t+1}, the agent receives a reward r_t, and the Q-value is updated as follows:

Q(s_t, a_t) ← (1 − α_t) · Q(s_t, a_t) + α_t · (r_t + γ · max_a Q(s_{t+1}, a))

where γ is the discount factor (which determines the importance of future rewards) and α is the learning rate (a factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information). This algorithm converges to the correct Q-values with probability 1, assuming that every action is executed in every state infinitely many times and α is decayed appropriately. RL requires clever exploration mechanisms; we will refer to Softmax, which uses a Boltzmann distribution [10] to balance exploration (random policy) and exploitation (Q(s, a) maximization).
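As a concrete reference, the tabular update rule and Boltzmann exploration can be sketched as follows. This is a generic sketch, not the authors' code; the dictionary-based Q-table and the parameter defaults are assumptions.

```python
import math
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.8, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) <- (1-α)Q(s,a) + α(r + γ max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def softmax_action(Q, s, actions, tau=1.0):
    """Boltzmann (Softmax) exploration: sample an action with P(a) ∝ exp(Q(s,a)/τ)."""
    prefs = [math.exp(Q.get((s, a), 0.0) / tau) for a in actions]
    total = sum(prefs)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for a, w in zip(actions, prefs):
        acc += w
        if pick <= acc:
            return a
    return actions[-1]
```

A high temperature τ makes the policy nearly random (exploration), while τ → 0 approaches greedy Q-maximization (exploitation).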
Q-learning for attentional regulation. Our learning problem can be cast as follows. For each behavior, we introduce a suitable state space S, while the action space A represents a set of possible regulations for its clock. In this paper, we assume that this set spans a discretized set of possible allowed periods P = {p_1, . . . , p_k}, i.e. A coincides with P. Since the current state s ∈ S should track both the attentional state (clock period) and the perceptive state (i.e. the internal and external perceived status), it is represented by a pair s = (p, x), where p ∈ P is the current clock period and x ∈ X is the current perceived status. Then, an attentional allocation policy π : S → P defines a mapping between the current state s and the next attentional period p. Given a reward function R for each behavior, the QL task is to find the optimal attention
allocation policy π: for each state s ∈ S we have to find the activation period p ∈ P that maximizes the behavior's expected reward. Notice that each behavior concurrently runs its own QL algorithm as an independent agent (independent versus cooperative RL is discussed in [11]). We can rely on this model because here the attentional mechanisms are not mutually dependent (only stigmergic interactions occur).
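Putting the formulation together, each behavior's independent learner might look like the following sketch. The class and method names are hypothetical; as in the formulation above, states are (period, interval) pairs and actions are the next clock period.

```python
class AttentionalLearner:
    """Independent tabular Q-learner attached to one behavior (illustrative sketch).

    States are pairs (current period, perceptive interval index);
    actions are the next clock period drawn from the behavior's period set.
    """
    def __init__(self, periods, n_intervals, alpha=0.8, gamma=0.9):
        self.periods = periods
        self.alpha, self.gamma = alpha, gamma
        # Initialize all Q-values to zero over the full state-action table.
        self.Q = {((p, i), a): 0.0
                  for p in periods for i in range(n_intervals) for a in periods}

    def greedy_period(self, state):
        """Exploitation: the period with the highest Q-value in this state."""
        return max(self.periods, key=lambda a: self.Q[(state, a)])

    def learn(self, state, action, reward, next_state):
        """Standard Q-learning backup for this behavior's own reward signal."""
        best = max(self.Q[(next_state, a)] for a in self.periods)
        self.Q[(state, action)] = ((1 - self.alpha) * self.Q[(state, action)]
                                   + self.alpha * (reward + self.gamma * best))
```

Each of Avoid, Recharge, and Escape would own one such learner, updated concurrently and independently of the others.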
3 Case Study
In order to test our approach we consider a Survival Problem: the robot must survive for a predefined amount of time within an environment (Fig. 2), avoiding obstacles (objects, walls, etc.), escaping from possible sources of danger (red objects), and recharging its batteries when necessary. We consider simulated environments of size 16m × 16m.

Fig. 2: Testing Environments.

Obstacles, dangerous, and recharge locations are cubes of size 0.5m × 0.5m × 0.5m, respectively of black, red, and green color (Fig. 2). An experiment ends in a positive way if the robot is able to survive till the end of the test, while it fails in three cases: the robot collides with an obstacle; the recharge value goes under the minimum value established; the robot goes very close to an obstacle. We tested our approach using a simulated Pioneer3-DX mobile robot (using the Player/Stage tool), endowed with a blob camera and 16 sonar sensors.
3.1 Attentional Architecture
In Fig. 3 we illustrate the attentional control system designed for the survival domain. It combines three behaviors: Avoid, Recharge, and Escape, each endowed with its releaser and adaptive clock. In the following we detail these behaviors.
Avoid manages obstacle avoidance. Its input signal σ_a(t) is the distance vector generated by the 8 frontal sonar sensors; its motor schema controls the robot velocity and angular velocity (v(t), ω(t)), generating a movement away from the obstacle. The obstacle avoidance is obtained as follows: v(t) is proportional to the obstacle proximity, i.e. v(t) = v_max × min(σ_a(t))/max_sonar, where v_max, min(σ_a(t)), and max_sonar are, respectively, the maximum velocity, the minimum distance from the obstacle, and the maximum sonar range; ω(t) is obtained as a weighted sum of the angular velocities generated by the active sonars, i.e. ω(t) = Σ_{i∈A(t)} w_i × rot_max, where A(t) is the set of active sonars detecting an obstacle at time t, rot_max is the maximal rotation, and w_i is a suitable weight depending on the sonar position (frontal higher, lateral lower).
Fig. 3: Attentional Architecture Overview.
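A literal reading of the Avoid motor schema can be sketched as follows. The per-sonar weights and the numeric constants are illustrative assumptions, and a sonar is treated as "active" when its reading falls below the maximum range.

```python
def avoid_command(distances, weights, v_max=0.5, max_sonar=1.0, rot_max=1.0):
    """Avoid motor schema (sketch): forward speed scales with the nearest-obstacle
    distance; rotation is a weighted sum over the sonars that detect an obstacle.
    `weights` are assumed per-sonar gains (higher for frontal, lower for lateral)."""
    nearest = min(distances)
    v = v_max * nearest / max_sonar  # slower when an obstacle is close
    active = [i for i, d in enumerate(distances) if d < max_sonar]
    omega = sum(weights[i] * rot_max for i in active)
    return v, omega
```

With all eight readings at maximum range, no sonar is active and the robot drives straight at v_max; a single close reading slows the robot and turns it away.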
Recharge monitors an internal function σ_r(t) representing the energy status. At each execution cycle the energy decreases by one unit. Therefore, Recharge is active when σ_r(t) goes below a suitable threshold. When enabled, if a green blob (representing the energy source) is detected by the camera, the motor schema generates a movement towards it; otherwise it starts looking around for the green blob, generating a random direction.
Escape monitors a function σ_e(t) that represents fear and considers the height (in pixels in the FOV) of a detected red object in the environment as an indirect measure of the distance from the object. The motor schema is enabled whenever σ_e(t) is greater than a suitable threshold and generates a movement away from the red object. In this case, the red object is avoided with an angular velocity proportional to the fear, i.e. ω(t) = α × σ_e(t).

For each behavior, the clock regulation depends on a monitoring function that should be learned at run-time.
3.2 Reinforcement Learning and Attentional Allocation
In the following we formulate the RL problem in the case study. We start for-
malizing the action space and the state space.
Action Space. In the attentional allocation problem, for each behavior, the action space is represented by a set of possible periods {p_1, . . . , p_k} for the adaptive clock. In the case study, assuming the minimum clock period as 1 machine cycle, the possible periods' sets for Avoid, Recharge, and Escape are, respectively: P_a = {1, 2, 4, 8}, P_r = {1, 4, 8, 12}, P_e = {1, 4, 8, 12}.
State Space. We recall that, for a generic behavior, the state s is determined by a pair (p, x), where p represents the current clock period and x is the current perceptive state. For each behavior, the perceptive state is a discretization of its perceptive domain (the range of the input signal). Namely, the domain for Avoid spans the interval [0, max_sonar]; the domain of Recharge is [0, max_charge], where max_charge represents the maximum battery charge; the Escape domain is in [0, max_fear], where max_fear is the maximum height (in pixels) of a red object in the FOV. The perceptive state is obtained as a discretization of the perceptive domain using equidimensional intervals. We tested our system discretizing the perceptive state at different granularities.
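For example, with the 24-state setting adopted below and 4 clock periods per behavior, each perceptive domain would be split into 6 equal-width intervals (4 × 6 = 24); this interval count is our inference from the state-space sizes. The discretization itself amounts to a one-line binning function (a sketch):

```python
def perceptive_interval(x, domain_max, n_intervals):
    """Map a raw reading x in [0, domain_max] to an equal-width interval index."""
    idx = int(x * n_intervals / domain_max)
    return min(idx, n_intervals - 1)  # clamp the top edge into the last bin
```

The full state is then the pair `(current_period, perceptive_interval(x, domain_max, n))`.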
Q-values. The resulting Q-table for a generic behavior is described in Table 1.
Perceptive state   Attentional state   Period 1    ...   Period k
Interval 1         Period 1            Q_{1,1,1}   ...   Q_{1,1,k}
                   ...                 ...         ...   ...
                   Period k            Q_{1,k,1}   ...   Q_{1,k,k}
...                ...                 ...         ...   ...
Interval n         Period 1            Q_{n,1,1}   ...   Q_{n,1,k}
                   ...                 ...         ...   ...
                   Period k            Q_{n,k,1}   ...   Q_{n,k,k}

Table 1: Q-values for a generic behavior. Each row is a state (perceptive interval, current period); each column is an action (next period).
Reward function. We assume the reward is always negative, with a strong penalty (−r_max) if the system cannot survive. For the other cases the penalty is as follows. Concerning Avoid, each activation is penalized with one (R_a = −r_max if x_a ≤ crash and −1 otherwise). As for Recharge, for each activation the penalty is inversely proportional to the current charge (R_r = −r_max if x_r falls under the established minimum and −(1 − x_r/max_charge) otherwise). Finally, each activation of Escape is penalized proportionally to the current amount of fear (R_e = −r_max for x_e ≥ th_fear and −x_e/max_fear otherwise). For our experiments we adopt the following settings:

- r_max: maximum penalty (1400 units of penalty);
- max_time: maximum time allowed to accomplish the task (180 seconds);
- max_sonar: maximum sonar range (1 meter);
- crash: minimum distance under which the robot stops (0.4 meters);
- max_charge: maximum value attainable for the charge (150 units of charge);
- th_charge: minimum value of the charge under which the robot needs to recharge (140 units of charge);
- max_fear: maximum height of a red blob (dangerous object) perceived by the camera (30 pixels);
- th_fear: minimum height of a red blob beyond which the robot does not work (23 pixels).
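Under one reading of these definitions, the three per-behavior rewards can be sketched as below. The branch conditions are partly reconstructed from the text (in particular, the Recharge failure branch is taken to fire when the battery is exhausted), so treat the exact forms as assumptions.

```python
R_MAX = 1400.0          # maximum penalty
MAX_SONAR, CRASH = 1.0, 0.4
MAX_CHARGE = 150.0
MAX_FEAR, TH_FEAR = 30.0, 23.0

def reward_avoid(nearest):
    # Strong penalty on (near-)collision, otherwise a unit cost per activation.
    return -R_MAX if nearest <= CRASH else -1.0

def reward_recharge(charge):
    # Penalty grows as the battery empties; maximal penalty when it runs out.
    return -R_MAX if charge <= 0 else -(1.0 - charge / MAX_CHARGE)

def reward_escape(fear):
    # Penalty proportional to fear; maximal once the danger threshold is crossed.
    return -R_MAX if fear >= TH_FEAR else -fear / MAX_FEAR
```

Since every reward is negative, the learned policies minimize accumulated penalties rather than maximize positive payoffs.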
Setting the state space. First of all, we carried out some tests evaluating the convergence of the Q-learning process, while changing the granularity and dimension of the state space. Each test consists of 5 experiments, each subdivided into 1000 episodes. We set the learning rate at 0.8. We evaluated the system performance on 4 different representations of the state space. Namely, for each behavior, we considered 20, 24, 28, and 32 states, obtained by changing the size of the intervals used to partition the perceptive domain, while using a fixed discretization of the clock periods for each test. In Fig. 4, we illustrate the variation of the fitness values with respect to the state representation. The fitness function evaluates the success percentage, i.e. the number of positive endings.

Fig. 4: Fitness convergence, varying the state space representation.

We observe that for all the state representations we get a good percentage of success (up to 98% of positive endings) after 200 episodes. However, the one with 24 states converges faster, reaching 100% positive endings after 300 episodes. In Fig. 5, we show the accumulated rewards for each representation.

Fig. 5: Rewards with different state space representations.

Also in this case, we obtain the best regulation with the 24-states setting; therefore, we decided to employ this representation for our experiments.
Setting the learning rate. The learning rate α is a crucial parameter that strongly affects Q-learning velocity and convergence. We tested 4 different settings, namely 0.2, 0.4, 0.6, and 0.8. The results are depicted in Fig. 6, where we compare the convergence curves. Here, we obtain the best regulation with α = 0.8. This result is corroborated by the reward values depicted in Fig. 7, where the minimum amount of penalties is associated with α = 0.8.
Fig. 6: Fitness convergence, varying the learning rate parameter.
Fig. 7: Rewards relative to different values of the learning rate.
4 Experiments and Results
We tested the attentional system in 4 environments (see Fig. 2) with incremental complexity in the number and disposition of the objects (red, green, and black cubes). Each experiment starts with the initial values in the Q-tables set to 0.

Fig. 8: Success rate in the survival domain.

In Fig. 8 we show the success rate for each environment. Here, the learning curve always converges to 100%, i.e. during the episodes the system is effective in learning the attention allocation strategies used to select the actions suitable for survival.
Furthermore, we analyzed the reliability, efficiency, and effectiveness of the learned attentional strategies (RL-AIRM), comparing them with the results obtained with manually tuned attentional strategies (AIRM). We tested these two settings in the 4 environments, collecting means and standard deviations over 100 tests. The results are shown in Fig. 9 and Table 2.

Fig. 9: Comparison of architectures. Means collected on 100 validation tests on performance measures.

In Fig. 9 we can see that in almost all the environments RL-AIRM shows a higher success rate and a lower cost (less cost means better performance). Concerning efficiency, in Table 2 we can observe that both RL-AIRM and AIRM are able to reduce and focus the behaviors' activations (i.e. the total number of cycles these behaviors are activated). AIRM seems more efficient, but it is also less reliable and effective (as shown in Fig. 9); hence RL-AIRM seems to provide a better balance of efficiency (minimum activations), reliability (maximum success rate), and effectiveness (minimum cost).
           RL-AIRM                                       AIRM
Data       Env1      Env2      Env3      Env4      Env1      Env2      Env3        Env4
Rewards    −386±7    −492±6    −408±23   −329±28   −350±12   −797±584  −1462±1271  −412±28
Avoid      404±9     404±9     404±10    406±10    320±10    355±20    312±76      394±23
Recharge   224±27    266±43    339±73    270±42    217±45    235±69    252±97      198±24
Escape     192±14    199±37    272±35    234±29    95±1      98±3      90±17       104±3
Survival   180       180       180       180       180       179±4     160±30      180
Failures   0%        0%        0%        0%        0%        6%        28%         0%
Cycles     1135±1    1135±1    1135±1    1135±1    1135±1    1130±20   1000±200    1135±1

Table 2: Comparison of architectures. Means and standard deviations collected on 100 validation tests on performance measures.
Overall, reinforcement learning seems effective in regulating attention allocation strategies and behaviors' activations. The combined use of attentional mechanisms and learning strategies permits good performance in terms of reliability, adaptivity, effectiveness, and efficiency.
5 Conclusions
We presented a RL approach to attentional allocation and action selection in a robotic setting. Differently from classical RL models for action selection, where actions are chosen according to the operative/perceptive contexts, in our case the action selection is mediated by the attentional status of the behavior. In our setting, the learning process adapts and modulates the attentional strategies, while action selection is obtained as a consequence. We discussed the approach considering learning and executive performance in a survival domain. The collected results show that RL is effective in regulating simple attention allocation mechanisms and the associated behaviors' activation strategies.
Acknowledgments. Work supported by the European Community, within the
FP7 ICT-287513 SAPHARI project.
References

1. Bandera, C., Vico, F.J., Bravo, J.M., Harmon, M.E., Baird III, L.C.: Residual Q-learning applied to visual attention. In: ICML-96. pp. 20–27 (1996)
2. Burattini, E., Rossi, S.: Periodic adaptive activation of behaviors in robotic system.
IJPRAI 22(5), 987–999 (2008)
3. Burattini, E., Rossi, S., Finzi, A., Staffa, M.: Attentional modulation of mutually
dependent behaviors. In: Doncieux, S., Girard, B., Guillot, A., Hallam, J., Meyer,
J.A., Mouret, J.B. (eds.) SAB. Lecture Notes in Computer Science, vol. 6226, pp.
283–292. Springer (2010)
4. Houk, J.C., Adams, J.L., Barto, A.G.: A model of how the basal ganglia generate
and use neural signals that predict reinforcement. In: Houk, J.C., Davis, J.L.,
Beiser, D.G. (eds.) Models of Information Processing in the Basal Ganglia, pp.
249–270. MIT Press, Cambridge, MA (1995)
5. Kahneman, D.: Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall (1973)
6. Montague, P.R., Dayan, P., Sejnowski, T.J.: A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience 16(5), 1936–1947 (1996)
7. Norman, D., Shallice, T.: Attention in action: willed and automatic control of
behaviour. Consciousness and Self-regulation: advances in research and theory 4,
1–18 (1986)
8. Paletta, L., Fritz, G., Seifert, C.: Q-learning of sequential attention for visual object
recognition from informative local descriptors. In: ICML-05
9. Senders, J.: The human operator as a monitor and controller of multidegree of freedom systems. IEEE Transactions on Human Factors in Electronics, pp. 2–6 (1964)
10. Sutton, R., Barto, A.: Reinforcement learning: An introduction, vol. 1. Cambridge
Univ Press (1998)
11. Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents.
In: ICML-93. pp. 330–337. Morgan Kaufmann (1993)
12. Watkins, C., Dayan, P.: Q-learning. Machine learning 8(3), 279–292 (1992)
... Typically, within these approaches, RL is used to directly model and generate the action selection strategies. In contrast, we propose a system where RL is deployed to learn attentional allocation and shifting strategies, while action selection emerges from the regulation of attentional monitoring mechanisms (Di Nocera et al., 2012), which can be affected by the intrinsic motivation of curiosity. Our curiosity model is inspired by the interest/deprivation model proposed by Litman (2005), which captures both optimal-arousal and curiosity-driven approaches of curiosity modeling. ...
... Otherwise, when the trigger function is 1, the behavior is activated and the clock period changes according to the φ(f ). In order to learn attentional monitoring strategies, various methods such as Differential Evolution (Burattini et al., 2010) and RL techniques (Di Nocera et al., 2012) have been deployed, respectively for off-line and on-line tuning of the parameters regulating the attentional monitoring functions. In the following sections, we will present an intrinsically motivated RL (IMRL) approach to the attentional allocation problem in our frequency-based model of attention. ...
... Following the approach by Di Nocera et al. (2012), in this paper we exploit a RL algorithm to learn the attention allocation strategies introduced in section 2.1. In Di Nocera et al. (2012), a Q-learning algorithm is used to tune and adapt the frequencies of sensors sampling, while action selection is obtained as a side effect of this attentional regulation. ...
Full-text available
The concepts of attention and intrinsic motivations are of great interest within adaptive robotic systems, and can be exploited in order to guide, activate, and coordinate multiple concurrent behaviors. Attention allocation strategies represent key capabilities of human beings, which are strictly connected with action selection and execution mechanisms, while intrinsic motivations directly affect the allocation of attentional resources. In this paper we propose a model of Reinforcement Learning (RL), where both these capabilities are involved. RL is deployed to learn how to allocate attentional resources in a behavior-based robotic system, while action selection is obtained as a side effect of the resulting motivated attentional behaviors. Moreover, the influence of intrinsic motivations in attention orientation is obtained by introducing rewards associated with curiosity drives. In this way, the learning process is affected not only by goal-specific rewards, but also by intrinsic motivations.
... Suivant les mêmes principes que (Humphrys, 1996), nous employons une architecture de sélection d'action dans laquelle un agent connaît différents comportements de bas niveau (que nous appelons comportements de base) et doit choisir une action d'après l'utilité/la préférence que lui associent ces comportements de base. De précédents travaux dans le domaine de l'A/R pour la sélection d'action se sont concentrés : ...
... -soit sur l'adaptation du processus de combinaison de comportements de base (Buffet et al., 2002), -soit sur l'apprentissage/l'amélioration de comportements de base sélectionnés au préalables (Humphrys, 1996, Buffet et al., 2003. ...
... Dans notre cas, il va falloir identifier dans l'environnement courant les motivations présentes (comme définies dans la section 2.2) pour associer à chacune un comportement de base qui jouera le rôle d'expert. Pour éviter la tâche laborieuse de régler manuellement un grand nombre de paramètres associés à ce processus de pesée et de sélection d'actions, diverses approches d'apprentissage par renforcement (A/R) ont été étudiées (voir (Lin., 1992, Humphrys, 1996 par exemple). Le présent article se situe dans cette lignée, proposant une approche innovante pour atteindre une plus grande automatisation de la conception de l'agent. ...
Full-text available
Ce document présente mon ``projet de recherche'' sur le thème de l'embodiment (``cognition incarnée'') au croisement des sciences cognitives, de l'intelligence artificielle et de la robotique. Plus précisément, je montre comment je compte explorer la façon dont un agent, artificiel ou biologique, élabore des représentations utiles et pertinentes de son environnement. Dans un premier temps, je positionne mes travaux en explicitant notamment les concepts de l'embodiment et de l'apprentissage par renforcement. Je m'attarde notamment sur la problématique de l'apprentissage par renforcement pour des tâches non-Markoviennes qui est une problématique commune aux différents travaux de recherche que j'ai menés au cours des treize dernières années dans des contextes mono et multi-agents, mais aussi robotique. L'analyse de ces travaux et de l'état de l'art du domaine me conforte dans l'idée que la principale difficulté pour l'agent est bien celle de trouver des représentations adaptées, utiles et pertinentes. J'argumente que l'on se retrouve face à une problématique fondamentale de la cognition, intimement liée aux problèmes de ``l'ancrage des symboles'', du ``frame problem'' et du fait ``d'être en situation'' et qu'on ne pourra y apporter des réponses que dans le cadre de l'embodiment. C'est à partir de ce constat que, dans une dernière partie, j'aborde les axes et les approches que je vais suivre pour poursuivre mes travaux en développant des techniques d'apprentissage robotique qui soient incrémentales, holistiques et motivationnelles.
... The vast majority of RL applications make use of a human-designed evaluative feedback system, which gives rise to the credit assignment problem. One alternative is to use evolutionary methods, as demonstrated by Humphrys (1995Humphrys ( , 1996. Here, Q-learning agents (Watkins and Dayan, 1992) with different reward functions encoded in their genomes compete for control of a robot via a process called W-learning. ...
... Creating a correspondence library, particularly for agents with dissimilar embodiments, is the subject of work by Alissandrakis et al. (2002Alissandrakis et al. ( , 2005; Alissandrakis (2003), and the aforementioned CELL system (Roy, 1999;Roy and Pentland, 2002) implements lexical learning which could facilitate instruction. Humphrys (1995Humphrys ( , 1996; Damoulas et al. (2005a,b) show that it is possible to evolve reward functions for Reinforcement Learning using a genetic algorithm. There are many rule induction methods (Cohen, 1995) which could seed the insight learning and perceptual reconfiguration modules, including the use of decision trees (Quinlan, 1992) and neural networks (Omlin and Giles, 1996). ...
... In these last years some researchers started to pay attention to the role of attentional processes in order to achieve an adaptive emergent behavior of robotics systems. In previous papers [12,15,16] , we highlighted the opportunity of managing the frequency of processing the sensors inputs and action activations in an efficient way. This goal was achieved by introducing " internal clocks " in a robotic architecture, to regulate the frequency of sensors readings (see Section 3). ...
... In previous work [12] , we compared 4 possible monitoring policies (continuous , constant periodic, interval reduction with a priori knowledge and adaptive periodic monitoring) by showing that the adaptive periodic one is the most suitable in case of absence of a priori knowledge and a dynamical environment. Moreover, we evaluated the performance of different adaptation strategies [12] and deployed learning algorithms to optimize such strategies [16] with the sults that, even with simple adaptive mechanisms, an increase of performance and a flexibility in the emergent behavior can be reached. However, in previous works, no real evaluation of possible relationship between human beings abilities in monitoring/tracking and the robots behavior strategies was performed on the same task. ...
Conference Paper
Full-text available
Convoy driving requires both the leader and the follower to accomplish the task. Namely, also the leader has to monitor the following agents behavior and to adapt its own in order to not outdistance them. Our working hypothesis is that effective teamwork can be achieved by adapting periodic monitoring strategies. Inspired by the behavior of human beings, we adopted attentional mechanisms for filtering data and actively focusing the monitoring only on relevant information and agent behaviors. The robotic convoy task is accomplished via a behavior-based control architecture endowed with attentional mechanisms producing a variable frequency of the monitoring. In this paper, we consider a convoy task as a benchmark to evaluate and compare human and robot monitoring behaviors. We illustrate the various parts of the control architecture as well as present and discuss the results of experiments performed in a real world scenario with humans and robots.
... Here, a distributed and implicit representation of behavioral modules is employed, instead, following the ATA approach, we assume an explicit (and symbolic) representation of the behaviors and focus on attention-based control mechanisms for behavioral orchestration. Learning methods for bottom-up attentional regulations suitable for reactive robotic control have been proposed by Di Nocera et al. (2012Nocera et al. ( , 2014) exploiting a reinforcement learning approach. In contrast, we provide a learning method for both top-down and bottom-up attentional regulations. ...
We present a framework for robotic cognitive control endowed with adaptive mechanisms for attentional regulation and task execution. In cognitive psychology, cognitive control is the process that orchestrates executive and cognitive processes supporting adaptive responses and complex goal-directed behaviors. Similar mechanisms can be deployed in robotic systems in order to flexibly execute complex structured tasks. In this work, following a supervisory attentional system paradigm, we propose an approach that permits learning how to exploit top-down and bottom-up attentional regulations to guide the execution of hierarchically structured tasks. We present the overall framework, discussing its functioning in a mobile robot case study considering pick-carry-place tasks. In this setting, we show that the proposed system can be trained on-line by a user in order to execute incrementally complex activities.
... Improving Arbitration - The simple approach adopted, of using a flat architecture and choosing on the basis of the highest Q £ R value, does not directly implement any form of learning at a higher level. A more hierarchical architecture could rely on internal agents trained to arbitrate (Dorigo and Colombetti, 1997), or on direct Reinforcement Learning such as W-learning (Humphreys, 1997), especially since the Q-values within the separate agents are already established. Simpler approaches - to realise sequences or combinations of behaviours - could rely on some form of explicit internal memory, which agents would access to provide essentially internal states (along with the external state of the perceived environment). ...
The ability of an animal to generate adaptive behaviour often forms an immediately beneficial response in an unexpected direction. As such, it represents a creative process by which accumulated knowledge of the world is manipulated and exploited in novel ways. This ability would also benefit a simulated animal - an animat - in its interaction with the environment. Presented in this work is a first step at low-level analysis, replication and simulation of such a response, within the framework of behaviour-based Shaped Robotics. This has led to a simple, complete robotic architecture called FACADE, consisting of a number of agents that are implemented as Anticipatory Classifier Systems trained in basic behavioural responses. FACADE is able to arbitrate and combine these agents to form more complex responses, and to generate a dynamic new behaviour by detecting a reduced global level of reinforcement ("hunger"). The learning rate and performance of this new behaviour can be improved by instantiating it with knowledge acquired from the world models of established agents via concept clustering techniques, which are further manipulated to generate novel combinations.
... In this specific application the values of these parameters are chosen experimentally (see Sect. 3.1.1 and Table 1), but they can also be tuned by learning mechanisms, either off-line or on-line, as shown in previous works [12,18]. AVOID supervises human safety during human-robot interaction. ...
Human robot collaborative work requires interactive manipulation and object handover. During the execution of such tasks, the robot should monitor manipulation cues to assess the human intentions and quickly determine the appropriate execution strategies. In this paper, we present a control architecture that combines a supervisory attentional system with a human aware manipulation planner to support effective and safe collaborative manipulation. After detailing the approach, we present experimental results describing the system at work with different manipulation tasks (give, receive, pick, and place).
An integrated celestial navigation scheme for spacecrafts based on an optical interferometer and an ultraviolet Earth sensor is presented in this paper. The optical interferometer is adopted to measure the change in inter-star angles due to stellar aberration, which provides information on the velocity of the spacecraft in the plane perpendicular to the direction of the observed star. In order to enhance the navigation performance, the measurements obtained from the ultraviolet Earth sensor are used to eliminate the unfavorable effect caused by the gravitational deflection of starlight. As the prior knowledge about the optical path delay bias of the optical interferometer may be ambiguous, a Q-learning extended Kalman filter is derived to fuse the two types of measurements and estimate the kinematic state together with the optical path delay bias. The solution of the autonomous navigation system consists of position, velocity and attitude of the spacecraft. Numerical simulation shows that an evident improvement in navigation accuracy can be achieved by introducing the ultraviolet Earth sensor into the navigation system. In addition, it is shown that the Q-learning extended Kalman filter performs better than the traditional extended Kalman filter.
A robotic system that interacts with humans is expected to flexibly execute structured cooperative tasks while reacting to unexpected events and behaviors. In this paper, we face these issues by presenting a framework that integrates cognitive control, executive attention, and hierarchical plan execution. In the proposed approach, the execution of structured tasks is guided by top-down (task-oriented) and bottom-up (stimuli-driven) attentional processes that affect behavior selection and activation, while resolving conflicts and decisional impasses. Specifically, attention is here deployed to stimulate the activations of multiple hierarchical behaviors, orienting them towards the execution of finalized and interactive activities. On the other hand, this framework allows a human to indirectly and smoothly influence the robotic task execution by exploiting attention manipulation. We provide an overview of the overall system architecture, discussing the framework at work in different case studies. In particular, we show that multiple concurrent tasks can be effectively orchestrated and interleaved in a flexible manner; moreover, in a human-robot interaction setting, we test and assess the effectiveness of attention manipulation for interactive plan guidance.
Conference Paper
Human-robot teamwork requires agents to pay attention to both the surrounding environment and their teammates. Bandwidth and computational limitations prevent an agent from continuously executing this monitoring activity. Inspired by the behavior of human beings, who pay frequent attention to timers while approaching deadlines, we provide robots with general monitoring strategies based on attentional mechanisms, for filtering data and actively focusing only on relevant information. We consider a convoy task (led by a human or a robot) as a benchmark to evaluate and compare human and robot monitoring behaviors.
Conference Paper
In this paper, we investigate simple attentional mechanisms suitable for sensing rate regulation and action coordination in the presence of mutually dependent behaviors. We present our architecture along with a case study where a real robotic system is to manage and harmonize conflicting tasks. This research focuses on attentional mechanisms for regulating the frequencies of sensor readings and action activations in a behavior-based robotic system. Such mechanisms are to direct sensors toward the most salient sources of information and filter the available sensory data to prevent unnecessary information processing.
Conference Paper
This work provides a framework for learning sequential attention in real-world visual object recognition, using an architecture of three processing stages. The first stage rejects irrelevant local descriptors based on an information theoretic saliency measure, providing candidates for foci of interest (FOI). The second stage investigates the information in the FOI using a codebook matcher, providing weak object hypotheses. The third stage integrates local information via shifts of attention, resulting in chains of descriptor-action pairs that characterize object discrimination. A Q-learner then adapts from explorative search and evaluative feedback from entropy decreases on the attention sequences, eventually prioritizing shifts that lead to a geometry of descriptor-action scan-paths that is highly discriminative with respect to object recognition. The methodology is successfully evaluated on indoor (COIL-20 database) and outdoor (TSG-20 database) imagery, demonstrating significant impact by learning, outperforming standard local descriptor based methods both in recognition accuracy and processing time.
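The Q-learner described in this abstract follows the standard tabular Q-learning update. The sketch below is a generic illustration, not the authors' code: the state and action names are hypothetical, and the scalar reward stands in for the entropy decrease used as feedback in the paper.

```python
from collections import defaultdict

# Generic tabular Q-learning step: states would correspond to local
# descriptor observations (FOI) and actions to shifts of attention.
def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])

# Tiny usage example: two foci of interest, two possible attention shifts.
Q = defaultdict(float)
actions = ["shift_left", "shift_right"]
q_update(Q, "foi_A", "shift_right", reward=1.0, next_state="foi_B",
         actions=actions)
# After one step from zero initialization:
# Q[("foi_A", "shift_right")] = 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

Over repeated episodes, shifts that consistently reduce entropy accumulate higher Q-values and are prioritized, yielding the discriminative scan-paths the abstract describes.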
We develop a theoretical framework that shows how mesencephalic dopamine systems could distribute to their targets a signal that represents information about future expectations. In particular, we show how activity in the cerebral cortex can make predictions about future receipt of reward and how fluctuations in the activity levels of neurons in diffuse dopamine systems above and below baseline levels would represent errors in these predictions that are delivered to cortical and subcortical targets. We present a model for how such errors could be constructed in a real brain that is consistent with physiological results for a subset of dopaminergic neurons located in the ventral tegmental area and surrounding dopaminergic neurons. The theory also makes testable predictions about human choice behavior on a simple decision-making task. Furthermore, we show that, through a simple influence on synaptic plasticity, fluctuations in dopamine release can act to change the predictions in an appropriate manner.
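The prediction-error signal this abstract associates with dopamine fluctuations is commonly formalized as the temporal-difference (TD) error, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). The following is a minimal numerical sketch with illustrative state names, not code or notation taken from the paper itself.

```python
# TD prediction error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
# Positive delta (reward better than predicted) would correspond to
# dopamine activity above baseline; negative delta to activity below it.
def td_error(reward, v_next, v_current, gamma=0.95):
    return reward + gamma * v_next - v_current

def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.95):
    """Move V(state) toward the TD target by one learning step."""
    delta = td_error(reward, values[next_state], values[state], gamma)
    values[state] += alpha * delta
    return delta

# Hypothetical two-state task: a cue followed by a rewarded state.
values = {"cue": 0.0, "reward_state": 0.0}
delta = td_update(values, "cue", "reward_state", reward=1.0)
# An unexpected reward yields a large positive error (delta = 1.0);
# with training, the cue comes to predict the reward and delta shrinks.
```

This shrinking of the error as predictions improve is exactly the mechanism the abstract invokes to explain how dopamine fluctuations could drive synaptic plasticity appropriately.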
Discusses the operations of 2 posterior brain systems that may be involved in selective attention: 1 is involved with the selection of spatial information, and the other is involved with the selection of semantic information. While dual-task studies have shown some independence between these forms of processing, other evidence indicates that attending to language information can interfere with the processing of spatial cues. Just as posterior attentional systems act to select candidate sensory stimuli, it is possible that higher levels of the system act to prevent sensory events from inappropriate control of performance of many cognitive tasks. Several lines of evidence suggest that the dorsolateral prefrontal cortex may play an important role in this higher level of attention. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
The main goal of our current research is the design of a robotic architecture that has the capability of adapting the robot's behavior to the rate of change of a dynamic environment. We present a model freely inspired by some features of biological clocks. In particular, we associate the concept of Innate Releasing Mechanisms (IRM) with the concept of periodic behavior activation in order to link the variability of the behavior to the circumstances in which it is activated. We propose an architecture in which the frequency of access to the sensory system is modified in accordance with the environmental changes. To this purpose we use the Schema Theory paradigm. Some first experimental results are reported and discussed.