
Attentional Action Selection using Reinforcement Learning

Dario Di Nocera¹, Alberto Finzi¹, Silvia Rossi¹, and Mariacarla Staffa²

¹ Dipartimento di Scienze Fisiche,
² Dipartimento di Informatica e Sistemistica,
University of Naples “Federico II” – Naples, Italy
d.dinocera@studenti.unina.it, {finzi,srossi,mariacarla.staffa}@unina.it

Abstract. We present a reinforcement learning approach to attentional
allocation and action selection in a behavior-based robotic system. Rein-
forcement learning is typically used to model and optimize action selection
strategies; in this work we deploy it to optimize attentional allocation
strategies, while action selection is obtained as a side effect. We detail
our attentional allocation mechanisms, describing the reinforcement
learning problem and analysing its performance in a survival domain.

Keywords: attention allocation, reinforcement learning, action selection

1 Introduction

Beyond their role in perception orientation and filtering, attentional mechanisms
are considered as key mechanisms in sensorimotor coordination and action con-
trol. Indeed, in biological systems, executive attention and attention allocation
strategies are strictly connected with action selection and execution [5, 7]. In
this work we explore this connection in a robotic setting deploying a reinforce-
ment learning framework. More specifically, we propose a reinforcement learning
approach to attention allocation and action selection in a behavior-based robotic
system. Reinforcement learning (RL) is typically used to model and optimize
action selection strategies both in artificial [10] and biological systems [6, 4]. In
contrast, in this work we deploy RL to optimize attention allocation strategies,
while action selection is obtained as a side effect of the resulting attentional be-
havior. Reinforcement learning models for attention allocation have been mainly
proposed for visual attention and gaze control [1, 8]; here we apply an analo-
gous approach to executive attention, considering the problem of a supervisory
attentional system [7] suitable for monitoring and coordinating multiple parallel
behaviors.

Our attentional system is obtained as a reactive, behavior-based system, en-
dowed with simple, bottom-up attentional mechanisms capable of monitoring
multiple concurrent tasks. We assume a frequency-based model of attention al-
location [9]. Specifically, we introduce simple attentional mechanisms regulating
sensor sampling rates and action activations [2, 3]: the higher the attention, the
higher the resolution at which a process is monitored and controlled. In this
framework, reinforcement learning is used to select the best regulations for these
mechanisms. We detail the approach, describing the reinforcement learning prob-
lem and analyzing its performance in a simulated survival domain. The collected
results show that the approach is feasible and effective in different settings. That
is, reinforcement learning applied to attentional allocation not only reduces and
focuses sensor processing, but also significantly improves sensorimotor
coordination and action selection.

2 Background and Model

2.1 Attentional System

Our attentional system is obtained as a reactive behavior-based system where
each behavior is endowed with an attentional mechanism represented by an
internal adaptive clock [2].

Fig. 1: Schema theory representation of an attentional behavior.

In Figure 1 we show a schema theory representation of an attentional be-
havior. This is characterized by a Perceptual Schema (PS), which elaborates
sensor data, a Motor Schema (MS), producing the pattern of motor actions,
and an attentive control mechanism, called Adaptive Innate Releasing Mecha-
nism (AIRM), based on a combination of a clock and a releaser. The releasing
mechanism works as a trigger for the MS activation, while the clock regulates
the sensors' sampling rate and the behaviors' activations. The clock regulation
mechanism is our frequency-based attentional mechanism: it regulates the resolution
at which a behavior is monitored and controlled; moreover, it provides a simple
prioritization criterion. This attentional mechanism is characterized by:

– An activation period p^b ranging in an interval [p^b_min, p^b_max], where b is the
  behavior's identifier.
– A monitoring function f(σ_b(t), p^b_{t−1}) : R^n → R that adjusts the current
  clock period p^b_t, according to the internal state of the behavior and to the
  environmental changes.
– A trigger function ρ(t, p^b_t), which enables/disables the data flow σ_b(t) from
  sensors to PS at each p^b_t time unit.
– Finally, a normalization function φ(f(σ_b(t), p^b_{t−1})) : R → N that maps the
  values returned by f into the allowed range [p^b_min, p^b_max].

The clock period at time t is regulated as follows:

    p^b_t = ρ(t, p^b_{t−1}) × φ(f(σ_b(t), p^b_{t−1})) + (1 − ρ(t, p^b_{t−1})) × p^b_{t−1}    (1)


That is, if the behavior is disabled, the clock period remains unchanged, i.e.
p^b_{t−1}. Otherwise, when the trigger function is 1, the behavior is activated and
the clock period changes according to φ.
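For concreteness, the following Python sketch shows one possible realization of the adaptive clock described above; the class structure, the linear monitoring function, and the clamping-based normalization are illustrative assumptions, not the implementation used in the paper.

class AdaptiveClock:
    """Sketch of a frequency-based attentional clock (AIRM-style)."""

    def __init__(self, p_min, p_max, monitoring_fn):
        self.p_min = p_min              # fastest allowed period (highest attention)
        self.p_max = p_max              # slowest allowed period (lowest attention)
        self.monitoring_fn = monitoring_fn
        self.period = p_max             # start with low attention
        self.last_activation = 0

    def _normalize(self, value):
        # phi: clamp and round the monitoring value into [p_min, p_max]
        return int(min(self.p_max, max(self.p_min, round(value))))

    def trigger(self, t):
        # rho: 1 every `period` time units, 0 otherwise
        return 1 if (t - self.last_activation) >= self.period else 0

    def step(self, t, sigma):
        """Update the clock at machine cycle t given the sensed signal sigma.
        Returns True when the behavior is released (sensors sampled, MS run)."""
        if self.trigger(t):
            self.last_activation = t
            # Eq. (1): new period from monitoring + normalization
            self.period = self._normalize(self.monitoring_fn(sigma, self.period))
            return True
        return False                    # trigger is 0: period unchanged

# Example: attention grows (period shrinks) as an obstacle gets closer
# (the linear monitoring function below is purely illustrative).
avoid_clock = AdaptiveClock(p_min=1, p_max=8,
                            monitoring_fn=lambda dist, p: 8 * dist)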

2.2 Reinforcement Learning for Attentional Action Selection

Given the attention mechanisms introduced above, our aim is to exploit Rein-

forcement Learning (RL) to regulate the monitoring functions.

Reinforcement learning and Q-learning. RL [10] solves an optimization problem
represented as a Markov Decision Problem (MDP) without a model (that is,
without the transition and reward functions) and can be used on-line. An MDP is
defined by a tuple (S, A, R, P), where S is the set of states, A is the set of actions,
R is the reward function R : S × A → ℝ, with R(s, a) the immediate reward in
s ∈ S after the execution of a ∈ A; P is the transition function P : S × A × S →
[0, 1], with P(s, a, s′) the probability of reaching s′ ∈ S after the execution of
a ∈ A in s ∈ S. A solution of an MDP is a policy π : S → A which maps states
into actions. The value function V^π(s) is the cumulative expected reward obtained
from the state s ∈ S following π. The q-value Q(s, a) is the expected discounted
sum of future payoffs obtained by executing the action a from the state s and
following an optimal policy π*, i.e. Q(s, a) = E{R_{t+1} + γ V*(s_{t+1}) | s_t = s,
a_t = a}, with V* the value function associated to π*.

In Q-learning [12] (QL), the Q-values are estimated through the agent's experience
after being initialized to arbitrary numbers. For each execution of an action a_t
leading from the state s_t to the state s_{t+1}, the agent receives a reward r_{t+1},
and the Q-value is updated as follows:

    Q(s_t, a_t) ← (1 − α_t) · Q(s_t, a_t) + α_t (R_{t+1} + γ · max_{a_{t+1} ∈ A} Q(s_{t+1}, a_{t+1})),

where γ is the discount factor (which determines the importance of future
rewards) and α is the learning rate (a factor of 0 will make the agent not learn
anything, while a factor of 1 would make the agent consider only the most recent
information). This algorithm converges to the correct Q-values with probability
1, assuming that every action is executed in every state infinitely many times and
that α is decayed appropriately. RL requires clever exploration mechanisms; we
rely on Softmax, which uses a Boltzmann distribution [10] to balance exploration
(random policy) and exploitation (maximization of Q(s, a)).
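As a reference for the mechanism just described, the sketch below implements a tabular Q-learning backup together with softmax (Boltzmann) action selection; the dictionary-based Q-table, the temperature value, and the discount factor are assumptions not fixed by the paper (only the learning rate of 0.8 is reported later).

import math
import random
from collections import defaultdict

# Q-table: Q[(state, action)] -> value, lazily initialized to 0.0 (arbitrary init)
Q = defaultdict(float)

def softmax_action(state, actions, tau=0.5):
    """Boltzmann exploration: sample an action with probability
    proportional to exp(Q(s,a)/tau); tau is an assumed temperature."""
    prefs = [math.exp(Q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    r, acc = random.random() * total, 0.0
    for a, p in zip(actions, prefs):
        acc += p
        if r <= acc:
            return a
    return actions[-1]

def q_update(s, a, reward, s_next, actions, alpha=0.8, gamma=0.9):
    """One Q-learning backup:
    Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)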

Q-learning for attentional regulation. Our learning problem can be cast as fol-
lows. For each behavior, we introduce a suitable state space S, while the ac-
tion space A represents a set of possible regulations for its clock. In this pa-
per, we assume that this set spans a discretized set of possible allowed peri-
ods P = {p_1, . . . , p_n}, i.e. A coincides with P. Since the current state s ∈ S
should track both the attentional state (clock period) and the perceptive state
(i.e. the internal and external perceived status), it is represented by a pair
s = (p, x), where p ∈ P is the current clock period and x ∈ X is the current
perceived status. Then, an attentional allocation policy π : S → P defines a
mapping between the current state s and the next attentional period p. Given a
reward function R for each behavior, the QL task is to find the optimal attention
allocation policy π: for each state s ∈ S we have to find the activation period
p ∈ P that maximizes the behavior's expected reward. Notice that each behavior
concurrently runs its own QL algorithm as an independent agent (independent
versus cooperative RL is discussed in [11]). We can rely on this model because
here the attentional mechanisms are not mutually dependent (there are only
stigmergic interactions).

3 Case Study

In order to test our approach we consider a Survival Problem: the robot must
survive for a predefined amount of time within an environment (Fig. 2), avoid-
ing obstacles (objects, walls, etc.), escaping from possible sources of danger (red
objects) and recharging its batteries when necessary.

Fig. 2: Testing environments (1)–(4).

We consider simulated environments of size 16m × 16m. Obstacles, dangerous,
and recharge locations are cubes of size 0.5m × 0.5m × 0.5m, respectively of
black, red, and green color (Fig. 2). An experiment ends in a positive way if the
robot is able to survive till the end of the test, while it fails in three cases: the
robot collides with an obstacle; the recharge value goes under the established
minimum; the robot gets very close to an obstacle. We tested our approach
using a simulated Pioneer3-DX mobile robot (using the Player/Stage tool),
endowed with a blob camera and 16 sonar sensors.

3.1 Attentional Architecture

In Fig. 3 we illustrate the attentional control system designed for the survival
domain. It combines three behaviors: Avoid, Recharge, and Escape, each endowed
with its releaser and adaptive clock. In the following we detail these behaviors.

Avoid manages obstacle avoidance. Its input signal σ_a(t) is the distance vector
generated by the 8 frontal sonar sensors; its motor schema controls the robot's
linear and angular velocity (v(t), ω(t)), generating a movement away from the
obstacle. The obstacle avoidance is obtained as follows: v(t) is proportional to
the distance from the closest obstacle, i.e. v(t) = v_max × min(σ_a(t)) / max_sonar,
where v_max, min(σ_a(t)) and max_sonar are, respectively, the maximum velocity,
the minimum distance from the obstacle and the maximum sonar range; ω(t) is
obtained as the weighted sum of the angular velocities generated by the active
sonars, i.e. ω(t) = Σ_{i ∈ A(t)} rot_max × w_i, where A(t) is the set of active sonars
detecting an obstacle at time t, rot_max is the maximal rotation, and w_i is a
suitable weight depending on the sonar position (frontal higher, lateral lower).
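A minimal sketch of how such a motor schema could be computed is given below; the velocity and rotation constants, the weight profile over the sonars, and the criterion for an "active" sonar are assumptions for illustration, not the authors' parameters.

# Illustrative sketch of the Avoid motor schema.
V_MAX = 0.5        # maximum linear velocity (m/s), assumed
ROT_MAX = 1.0      # maximal rotation (rad/s), assumed
MAX_SONAR = 1.0    # maximum sonar range (m), from the paper's settings
# Higher magnitudes for frontal sonars, lower for lateral ones (assumed profile).
WEIGHTS = [0.2, 0.4, 0.8, 1.0, -1.0, -0.8, -0.4, -0.2]

def avoid_motor_schema(sonar_distances):
    """Compute (v, omega) away from the closest obstacle.
    sonar_distances: the 8 frontal sonar readings, in meters."""
    d_min = min(sonar_distances)
    # Linear velocity scales with the distance to the closest obstacle.
    v = V_MAX * d_min / MAX_SONAR
    # Angular velocity: weighted sum over sonars currently detecting an obstacle.
    active = [i for i, d in enumerate(sonar_distances) if d < MAX_SONAR]
    omega = sum(ROT_MAX * WEIGHTS[i] for i in active)
    return v, omega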

Fig. 3: Attentional Architecture Overview.

Recharge monitors an internal function σ_r(t) representing the energy status.
At each execution cycle the energy decreases by one unit. Therefore, Recharge is
active when σ_r(t) goes below a suitable threshold. When enabled, if a green blob
(representing the energy source) is detected by the camera, the motor schema
generates a movement towards it; otherwise the robot starts looking around for
the green blob, generating a random direction.

Escape monitors a function σ_e(t) that represents fear and considers the height
(in pixels in the FOV) of a detected red object in the environment as an indirect
measure of the distance from the object. The motor schema is enabled whenever
σ_e(t) is greater than a suitable threshold and generates a movement away from
the red object. In this case, the red object is avoided with an angular velocity
proportional to the fear, i.e. ω(t) = α × σ_e(t).

For each behavior, the clock regulation depends on a monitoring function
that should be learned at run-time.

3.2 Reinforcement Learning and Attentional Allocation

In the following we formulate the RL problem in the case study. We start for-

malizing the action space and the state space.

Action Space. In the attentional allocation problem, for each behavior, the action
space is represented by a set of possible periods {p_1, . . . , p_n} for the adaptive
clock. In the case study, assuming the minimum clock period as 1 machine cycle,
the possible periods' sets for Avoid, Recharge and Escape are, respectively:
P_a = {1, 2, 4, 8}, P_r = {1, 4, 8, 12}, P_e = {1, 4, 8, 12}.

State Space. We recall that, for a generic behavior, the state s is determined by
a pair (p, x), where p represents the current clock period and x is the current
perceptive state. For each behavior, the perceptive state is a discretization of
its perceptive domain (the range of the input signal). Namely, the domain for
Avoid spans the interval [0, max_sonar]; the domain of Recharge is [0, max_charge],
where max_charge represents the maximum battery charge; the Escape domain
is in [0, max_fear], where max_fear is the maximum height (in pixels) of a
red object in the FOV. The perceptive state is obtained as a discretization of
the perceptive domain using equidimensional intervals. We tested our system
discretizing the perceptive state at different granularities.

Q-values. The resulting Q-table for a generic behavior is described in Table 1.

Perceptive state   Attentional state   Period 1    Period 2    ...   Period k
Interval 1         Period 1            Q_{11,1}    Q_{11,2}    ...   Q_{11,k}
                   ...                 ...         ...         ...   ...
                   Period k            Q_{1k,1}    Q_{1k,2}    ...   Q_{1k,k}
...                ...                 ...         ...         ...   ...
Interval n         Period 1            Q_{n1,1}    Q_{n1,2}    ...   Q_{n1,k}
                   ...                 ...         ...         ...   ...
                   Period k            Q_{nk,1}    Q_{nk,2}    ...   Q_{nk,k}

Table 1: Q-values for a generic behavior.
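To illustrate how such a table is indexed, the sketch below maps a behavior's perceptive value and current clock period to a state; the helper names are assumptions, and the 6 intervals follow from the 24-state configuration (4 periods × 6 intervals) selected later in this section.

# Sketch of the per-behavior attentional state/action representation.
PERIODS = [1, 2, 4, 8]     # e.g. the Avoid period set P_a
N_INTERVALS = 6            # 6 intervals x 4 periods = 24 states (assumed choice)

def perceptive_interval(x, domain_max, n_intervals=N_INTERVALS):
    """Discretize the perceptive value x in [0, domain_max] into an interval index."""
    i = int(x / domain_max * n_intervals)
    return min(i, n_intervals - 1)

def state(x, period, domain_max):
    """Attentional state: (perceptive interval, current clock period)."""
    return (perceptive_interval(x, domain_max), period)

# Selecting an action means selecting the NEXT clock period, e.g. with the
# softmax_action helper sketched in Sect. 2.2:
#   next_period = softmax_action(state(x, period, domain_max), PERIODS)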

Reward function. We assume the reward is always negative, with a strong penalty
(r_max) if the system cannot survive. For the other cases the penalty is as follows.
Concerning Avoid, each activation is penalized by one unit (R^a_t = r_max if
x_t < th_crash, and −1 otherwise). As for Recharge, for each activation the penalty
is inversely proportional to the current charge (R^r_t = r_max if x_t < th_charge, and
(x_t − max_charge) / max_charge otherwise). Finally, each activation of Escape is
penalized proportionally to the current amount of fear (R^e_t = r_max for x_t < th_fear,
and −x_t / max_fear otherwise); a sketch of these reward computations is given after
the parameter list below. For our experiments we adopt the following settings:

– r_max: maximum penalty (−1400 units of penalty);
– max_time: maximum time allowed to accomplish the task (180 seconds);
– max_sonar: maximum sonar range (1 meter);
– th_crash: minimum distance under which the robot stops (0.4 meters);
– max_charge: maximum value attainable for the charge (150 units of charge);
– th_charge: minimum value of the charge under which the robot needs to
  recharge (140 units of charge);
– max_fear: maximum height of a red blob (dangerous object) perceived by
  the camera (30 pixels);
– th_fear: minimum height of a red blob beyond which the robot does not
  work (23 pixels).
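The sketch below mirrors these reward rules, using the constants listed above; the function names are illustrative, and the threshold comparisons follow the text as written.

# Reward rules for the three behaviors; x_t is the behavior's perceptive value.
R_MAX = -1400          # maximum penalty (non-survival)
TH_CRASH = 0.4         # meters
MAX_CHARGE = 150       # charge units
TH_CHARGE = 140        # charge units
MAX_FEAR = 30          # pixels
TH_FEAR = 23           # pixels

def reward_avoid(x_t):
    # Strong penalty when the distance falls below th_crash, else a unit penalty.
    return R_MAX if x_t < TH_CRASH else -1.0

def reward_recharge(x_t):
    # Penalty shrinks as the charge approaches max_charge.
    return R_MAX if x_t < TH_CHARGE else (x_t - MAX_CHARGE) / MAX_CHARGE

def reward_escape(x_t):
    # Penalty grows with the perceived fear (red blob height in pixels).
    return R_MAX if x_t < TH_FEAR else -x_t / MAX_FEAR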

Setting the state space. First of all, we carried out some tests evaluating the
convergence of the Q-learning process while changing the granularity and dimension
of the state space. Each test consists of 5 experiments, each subdivided into 1000
episodes. We set the learning rate at 0.8. We evaluated the system performance
with 4 different representations of the state space. Namely, for each behavior, we
considered 20, 24, 28, and 32 states, obtained by changing the size of the intervals
used to partition the perceptive domain, while using a fixed discretization of
clock periods for each test. In Fig. 4, we illustrate the variation of the fitness
values with respect to the state representation. The fitness function evaluates
the success percentage, i.e. the number of positive endings. We observe that for
all the state representations we get a good percentage of success (up to 98% of
positive endings) after 200 episodes. However, the one with 24 states converges
faster, reaching 100% positive endings after 300 episodes.

Fig. 4: Fitness convergence, varying the state space representation.

In Fig. 5, we show the accumulated rewards for each representation. Also in
this case, we obtain the best regulation with the 24 states setting; therefore, we
decided to employ this representation for our experiments.

Fig. 5: Reward with different state space representations.

Setting the learning rate. The learning rate α is a crucial parameter that strongly
affects Q-learning speed and convergence. We tested 4 different settings, namely
0.2, 0.4, 0.6, and 0.8. The results are depicted in Fig. 6, where we compare the
convergence curves. Here, we obtain the best regulation with α = 0.8. This result
seems corroborated by the reward values depicted in Fig. 7, where the minimum
amount of penalties is associated with α = 0.8.


Fig. 6: Fitness convergence, varying the learning rate parameter.

Fig. 7: Rewards relative to different values of the learning rate.

4 Experiments and Results

We tested the attentional system in 4 environments (see Fig. 2) with incremental
complexity in the number and disposition of the objects (red, green and black
cubes). Each experiment starts with initial values set to 0 in the Q-tables. In Fig. 8
we show the success rate for each environment. Here, the learning curve always
converges to 100%, i.e. during the episodes the system is effective in learning the
attention allocation strategies used to select the actions suitable for survival.

Fig. 8: Success rate in the survival domain.

Furthermore, we analyzed the reliability, efficiency and effectiveness of the
learned attentional strategies (RL-AIRM), comparing them with the results
obtained with manually tuned attentional strategies (AIRM). We tested these
two settings in the 4 environments, collecting means and standard deviations
of 100 tests. The results are shown in Fig. 9 and Table 2.

Fig. 9: Comparison of architectures. Means collected on 100 validation tests on
performance measures.

In Fig. 9 we can see that in almost all the environments RL-AIRM shows a
higher success rate and lower cost (less cost means better performance). Concerning
efficiency, in Table 2 we can observe that both RL-AIRM and AIRM are able to
reduce and focus the behaviors' activations (i.e. the total number of cycles these
behaviors are activated). AIRM seems more efficient, but it is also less reliable
and effective (as shown in Fig. 9); hence RL-AIRM seems to provide a better
balance of efficiency (minimum activations), reliability (maximum success rate),
and effectiveness (minimum cost).

                        RL-AIRM                                       AIRM
Data       Env1       Env2       Env3       Env4       Env1       Env2        Env3         Env4
Rewards    -386±7     -492±6     -408±23    -329±28    -350±12    -797±584    -1462±1271   -412±28
Avoid      404±9      404±9      404±10     406±10     320±10     355±20      312±76       394±23
Recharge   224±27     266±43     339±73     270±42     217±45     235±69      252±97       198±24
Escape     192±14     199±37     272±35     234±29     95±1       98±3        90±17        104±3
Survival   180        180        180        180        180        179±4       160±30       180
Failures   0%         0%         0%         0%         0%         6%          28%          0%
Cycles     1135±1     1135±1     1135±1     1135±1     1135±1     1130±20     1000±200     1135±1

Table 2: Comparison of architectures. Means and standard deviations collected on 100
validation tests on performance measures.

Overall, reinforcement learning seems effective in regulating attention allocation
strategies and behaviors' activations. The combined use of attentional mecha-
nisms and learning strategies permits good performance in terms of reliability,
adaptivity, effectiveness, and efficiency.

5 Conclusions

We presented an RL approach to attentional allocation and action selection in a
robotic setting. Differently from classical RL models for action selection, where
actions are chosen according to the operative/perceptive contexts, in our case
the action selection is mediated by the attentional status of the behavior. In our
setting, the learning process adapts and modulates the attentional strategies,
while action selection is obtained as a consequence. We discussed the approach
considering learning and executive performance in a survival domain. The col-
lected results show that RL is effective in regulating simple attention allocation
mechanisms and the associated behaviors' activation strategies.

Acknowledgments. Work supported by the European Community, within the

FP7 ICT-287513 SAPHARI project.

References

1. Bandera, C., Vico, F.J., Bravo, J.M., Harmon, M.E., Baird III, L.C.: Residual Q-
learning applied to visual attention. In: ICML-96. pp. 20–27 (1996)

2. Burattini, E., Rossi, S.: Periodic adaptive activation of behaviors in robotic system.

IJPRAI 22(5), 987–999 (2008)

3. Burattini, E., Rossi, S., Finzi, A., Staffa, M.: Attentional modulation of mutually

dependent behaviors. In: Doncieux, S., Girard, B., Guillot, A., Hallam, J., Meyer,

J.A., Mouret, J.B. (eds.) SAB. Lecture Notes in Computer Science, vol. 6226, pp.

283–292. Springer (2010)

4. Houk, J.C., Adams, J.L., Barto, A.G.: A model of how the basal ganglia generate

and use neural signals that predict reinforcement. In: Houk, J.C., Davis, J.L.,

Beiser, D.G. (eds.) Models of Information Processing in the Basal Ganglia, pp.

249–270. MIT Press, Cambridge, MA (1995)

5. Kahneman, D.: Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall (1973)

6. Montague, P.R., Dayan, P., Sejnowski, T.J.: A framework for mesencephalic
dopamine systems based on predictive Hebbian learning. J. Neurosci., pp. 1936–1947
(1996)

7. Norman, D., Shallice, T.: Attention in action: willed and automatic control of

behaviour. Consciousness and Self-regulation: advances in research and theory 4,

1–18 (1986)

8. Paletta, L., Fritz, G., Seifert, C.: Q-learning of sequential attention for visual object

recognition from informative local descriptors. In: ICML-05

9. Senders, J.: The human operator as a monitor and controller of multidegree of

freedom systems pp. 2–6 (1964)

10. Sutton, R., Barto, A.: Reinforcement learning: An introduction, vol. 1. Cambridge

Univ Press (1998)

11. Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents.

In: ICML-93. pp. 330–337. Morgan Kaufmann (1993)

12. Watkins, C., Dayan, P.: Q-learning. Machine learning 8(3), 279–292 (1992)