Attentional Action Selection using
Reinforcement Learning
Dario Di Nocera, Alberto Finzi, Silvia Rossi, and Mariacarla Staffa

Dipartimento di Scienze Fisiche, Dipartimento di Informatica e Sistemistica,
University of Naples "Federico II", Naples, Italy, {finzi,srossi,mariacarla.staffa}
Abstract. We present a reinforcement learning approach to attentional allocation and action selection in a behavior-based robotic system. Reinforcement learning is typically used to model and optimize action selection strategies; in this work we deploy it to optimize attentional allocation strategies, while action selection is obtained as a side effect. We detail our attentional allocation mechanisms, describing the reinforcement learning problem and analysing its performance in a survival domain.

Keywords: attention allocation, reinforcement learning, action selection
1 Introduction
Beyond their role in perception orientation and filtering, attentional mechanisms are considered key mechanisms in sensorimotor coordination and action control. Indeed, in biological systems, executive attention and attention allocation strategies are strictly connected with action selection and execution [5, 7]. In this work we explore this connection in a robotic setting deploying a reinforcement learning framework. More specifically, we propose a reinforcement learning approach to attention allocation and action selection in a behavior-based robotic system. Reinforcement learning (RL) is typically used to model and optimize action selection strategies, both in artificial [10] and biological systems [6, 4]. In contrast, in this work we deploy RL to optimize attention allocation strategies, while action selection is obtained as a side effect of the resulting attentional behavior. Reinforcement learning models for attention allocation have been mainly proposed for visual attention and gaze control [1, 8]; here we apply an analogous approach to executive attention, considering the problem of a supervisory attentional system [7] suitable for monitoring and coordinating multiple parallel tasks.

Our attentional system is obtained as a reactive, behavior-based system, endowed with simple, bottom-up attentional mechanisms capable of monitoring multiple concurrent tasks. We assume a frequency-based model of attention allocation [9]. Specifically, we introduce simple attentional mechanisms regulating sensors' sampling rates and action activations [2, 3]: the higher the attention, the
higher the resolution at which a process is monitored and controlled. In this framework, reinforcement learning is used to select the best regulations for these mechanisms. We detail the approach describing the reinforcement learning problem and analyzing its performance in a simulated survival domain. The collected results show that the approach is feasible and effective in different settings. That is, reinforcement learning applied to attentional allocation allows not only to reduce and focus sensor processing, but also to significantly improve sensorimotor coordination and action selection.
2 Background and Model
2.1 Attentional System
Our attentional system is obtained as a reactive behavior-based system where each behavior is endowed with an attentional mechanism represented by an internal adaptive clock [2].
Fig. 1: Schema theory representation of an attentional behavior.
In Figure 1 we show a schema theory representation of an attentional behavior. This is characterized by a Perceptual Schema (PS), which elaborates sensor data, a Motor Schema (MS), producing the pattern of motor actions, and an attentive control mechanism, called Adaptive Innate Releasing Mechanism (AIRM), based on the combination of a clock and a releaser. The releasing mechanism works as a trigger for the MS activation, while the clock regulates the sensors' sampling rate and the behaviors' activations. The clock regulation mechanism is our frequency-based attentional mechanism: it regulates the resolution at which a behavior is monitored and controlled; moreover, it provides a simple prioritization criterion. This attentional mechanism is characterized by:
- An activation period p_b, ranging in an interval [p_min, p_max], where b is the behavior's identifier.
- A monitoring function f(σ_b(t), p_b(t)) that adjusts the current clock period p_b(t), according to the internal state of the behavior and to the environmental changes.
- A trigger function ρ(t, p_b(t)), which enables/disables the data flow σ_b(t) from the sensors to the PS at each p_b(t) time unit.
- A normalization function φ(f(σ_b(t), p_b(t))) that maps the values returned by f into the allowed range [p_min, p_max].

The clock period at time t is regulated as follows:

p_b(t) = ρ(t, p_b(t−1)) × φ(f(σ_b(t), p_b(t−1))) + (1 − ρ(t, p_b(t−1))) × p_b(t−1)

That is, if the behavior is disabled, the clock period remains unchanged, i.e. p_b(t) = p_b(t−1). Otherwise, when the trigger function is 1, the behavior is activated and the clock period changes according to φ.
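To make the regulation concrete, the update above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names mirror the symbols ρ, φ, and f, while the concrete trigger, clamp, and monitoring definitions used in the usage example are our own assumptions.

```python
def update_clock_period(rho, phi, f, sigma_t, p_prev, t):
    """One step of the AIRM clock regulation (illustrative sketch).

    rho(t, p): trigger function, 1 when the behavior fires at time t, else 0.
    f(sigma, p): monitoring function combining the input signal and the current period.
    phi(v): normalization clamping f's output into the allowed period range.
    """
    trigger = rho(t, p_prev)
    # Triggered: adopt the normalized monitoring output; otherwise keep the old period.
    return trigger * phi(f(sigma_t, p_prev)) + (1 - trigger) * p_prev
```

With a trigger that fires every p_prev time units and a clamp into [1, 8], a triggered step replaces the period while an untriggered step leaves it unchanged, exactly as in the formula.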
2.2 Reinforcement Learning for Attentional Action Selection
Given the attention mechanisms introduced above, our aim is to exploit Rein-
forcement Learning (RL) to regulate the monitoring functions.
Reinforcement learning and Q-learning. RL [10] solves an optimization problem represented as a Markov Decision Problem (MDP) without a model (that is, without the transition and reward functions) and can be used on-line. An MDP is defined by a tuple (S, A, R, P), where S is the set of states, A is the set of actions, R is the reward function R : S × A → R, with R(s, a) the immediate reward in s ∈ S after the execution of a ∈ A; P is the transition function P : S × A × S → [0, 1], with P(s, a, s′) the probability of reaching s′ ∈ S after the execution of a ∈ A in s ∈ S. A solution of an MDP is a policy π : S → A, which maps states into actions. The value function V^π(s) is the cumulated expected reward from the state s ∈ S following π. The q-value Q(s, a) is the expected discounted sum of future payoffs obtained by executing the action a from the state s and following an optimal policy π*, i.e. Q(s, a) = E{r_t + γ V*(s_{t+1}) | s_t = s, a_t = a}, with V* associated to π*. In Q-learning [12] (QL), the Q-values are estimated through the agent's experience after being initialized to arbitrary numbers. For each execution of an action a_t leading from the state s_t to the state s_{t+1}, the agent receives a reward r_t, and the Q-value is updated as follows:

Q(s_t, a_t) ← (1 − α_t) · Q(s_t, a_t) + α_t · (r_t + γ · max_a Q(s_{t+1}, a))

where γ is the discount factor (which determines the importance of future rewards) and α is the learning rate (a factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information). This algorithm converges to the correct Q-values with probability 1, assuming that every action is executed in every state infinitely many times and α is decayed appropriately. RL requires clever exploration mechanisms; we will refer to Softmax, which uses a Boltzmann distribution [10] to balance exploration (random policy) and exploitation (Q(s, a) maximization).
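As a concrete reference, the tabular update rule and Boltzmann exploration can be sketched as follows. This is a generic sketch, not the authors' code; the dictionary-based Q-table and the parameter defaults are assumptions.

```python
import math
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.8, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) <- (1-α)Q(s,a) + α(r + γ max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def softmax_action(Q, s, actions, tau=1.0):
    """Boltzmann (Softmax) exploration: sample an action with P(a) ∝ exp(Q(s,a)/τ)."""
    prefs = [math.exp(Q.get((s, a), 0.0) / tau) for a in actions]
    total = sum(prefs)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for a, w in zip(actions, prefs):
        acc += w
        if pick <= acc:
            return a
    return actions[-1]
```

A high temperature τ makes the policy nearly random (exploration), while τ → 0 approaches greedy Q-maximization (exploitation).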
Q-learning for attentional regulation. Our learning problem can be cast as follows. For each behavior, we introduce a suitable state space S, while the action space A represents a set of possible regulations for its clock. In this paper, we assume that this set spans a discretized set of possible allowed periods P = {p_1, . . . , p_k}, i.e. A coincides with P. Since the current state s ∈ S should track both the attentional state (clock period) and the perceptive state (i.e. the internal and external perceived status), it is represented by a pair s = (p, x), where p ∈ P is the current clock period and x ∈ X is the current perceived status. Then, an attentional allocation policy π : S → P defines a mapping between the current state s and the next attentional period p. Given a reward function R for each behavior, the QL task is to find the optimal attention
allocation policy π: for each state s ∈ S we have to find the activation period p ∈ P that maximizes the behavior's expected reward. Notice that each behavior concurrently runs its own QL algorithm as an independent agent (independent versus cooperative RL is discussed in [11]). We can rely on this model because here the attentional mechanisms are not mutually dependent (only stigmergic interactions occur).
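Putting the formulation together, each behavior's independent learner might look like the following sketch. The class and method names are hypothetical; as in the formulation above, states are (period, interval) pairs and actions are the next clock period.

```python
class AttentionalLearner:
    """Independent tabular Q-learner attached to one behavior (illustrative sketch).

    States are pairs (current period, perceptive interval index);
    actions are the next clock period drawn from the behavior's period set.
    """
    def __init__(self, periods, n_intervals, alpha=0.8, gamma=0.9):
        self.periods = periods
        self.alpha, self.gamma = alpha, gamma
        # Initialize all Q-values to zero over the full state-action table.
        self.Q = {((p, i), a): 0.0
                  for p in periods for i in range(n_intervals) for a in periods}

    def greedy_period(self, state):
        """Exploitation: the period with the highest Q-value in this state."""
        return max(self.periods, key=lambda a: self.Q[(state, a)])

    def learn(self, state, action, reward, next_state):
        """Standard Q-learning backup for this behavior's own reward signal."""
        best = max(self.Q[(next_state, a)] for a in self.periods)
        self.Q[(state, action)] = ((1 - self.alpha) * self.Q[(state, action)]
                                   + self.alpha * (reward + self.gamma * best))
```

Each of Avoid, Recharge, and Escape would own one such learner, updated concurrently and independently of the others.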
3 Case Study
In order to test our approach we consider a Survival Problem: the robot must survive for a predefined amount of time within an environment (Fig. 2), avoiding obstacles (objects, walls, etc.), escaping from possible sources of danger (red objects), and recharging its batteries when necessary. We consider simulated environments of size 16m × 16m.

Fig. 2: Testing Environments.

Obstacles, dangerous, and recharge locations are cubes of size 0.5m × 0.5m × 0.5m, respectively of black, red, and green color (Fig. 2). An experiment ends in a positive way if the robot is able to survive till the end of the test, while it fails in three cases: the robot collides with an obstacle; the recharge value goes under the minimum value established; the robot goes very close to an obstacle. We tested our approach using a simulated Pioneer3-DX mobile robot (using the Player/Stage tool), endowed with a blob camera and 16 sonar sensors.
3.1 Attentional Architecture
In Fig. 3 we illustrate the attentional control system designed for the survival domain. It combines three behaviors: Avoid, Recharge, and Escape, each endowed with its releaser and adaptive clock. In the following we detail these behaviors.
Avoid manages obstacle avoidance. Its input signal σ_a(t) is the distance vector generated by the 8 frontal sonar sensors; its motor schema controls the robot velocity and angular velocity (v(t), ω(t)), generating a movement away from the obstacle. The obstacle avoidance is obtained as follows: v(t) is proportional to the obstacle proximity, i.e. v(t) = v_max × min(σ_a(t))/max_sonar, where v_max, min(σ_a(t)), and max_sonar are, respectively, the maximum velocity, the minimum distance from the obstacle, and the maximum sonar range; ω(t) is obtained as a weighted sum of the angular velocities generated by the active sonars, i.e. ω(t) = Σ_{i∈A(t)} w_i × rot_max, where A(t) is the set of active sonars detecting an obstacle at time t, rot_max is the maximal rotation, and w_i is a suitable weight depending on the sonar position (frontal higher, lateral lower).
Fig. 3: Attentional Architecture Overview.
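A literal reading of the Avoid motor schema can be sketched as follows. The per-sonar weights and the numeric constants are illustrative assumptions, and a sonar is treated as "active" when its reading falls below the maximum range.

```python
def avoid_command(distances, weights, v_max=0.5, max_sonar=1.0, rot_max=1.0):
    """Avoid motor schema (sketch): forward speed scales with the nearest-obstacle
    distance; rotation is a weighted sum over the sonars that detect an obstacle.
    `weights` are assumed per-sonar gains (higher for frontal, lower for lateral)."""
    nearest = min(distances)
    v = v_max * nearest / max_sonar  # slower when an obstacle is close
    active = [i for i, d in enumerate(distances) if d < max_sonar]
    omega = sum(weights[i] * rot_max for i in active)
    return v, omega
```

With all eight readings at maximum range, no sonar is active and the robot drives straight at v_max; a single close reading slows the robot and turns it away.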
Recharge monitors an internal function σ_r(t) representing the energy status. At each execution cycle the energy decreases by one unit. Therefore, Recharge is active when σ_r(t) goes below a suitable threshold. When enabled, if a green blob (representing the energy source) is detected by the camera, the motor schema generates a movement towards it; otherwise it starts looking around for the green blob, generating a random direction.
Escape monitors a function σ_e(t) that represents fear and considers the height (in pixels in the FOV) of a detected red object in the environment as an indirect measure of the distance from the object. The motor schema is enabled whenever σ_e(t) is greater than a suitable threshold and generates a movement away from the red object. In this case, the red object is avoided with an angular velocity proportional to the fear, i.e. ω(t) = α × σ_e(t).

For each behavior, the clock regulation depends on a monitoring function that should be learned at run-time.
3.2 Reinforcement Learning and Attentional Allocation
In the following we formulate the RL problem in the case study. We start for-
malizing the action space and the state space.
Action Space. In the attentional allocation problem, for each behavior, the action space is represented by a set of possible periods {p_1, . . . , p_k} for the adaptive clock. In the case study, assuming the minimum clock period as 1 machine cycle, the possible periods' sets for Avoid, Recharge, and Escape are, respectively: P_a = {1, 2, 4, 8}, P_r = {1, 4, 8, 12}, P_e = {1, 4, 8, 12}.
State Space. We recall that, for a generic behavior, the state s is determined by a pair (p, x), where p represents the current clock period and x is the current perceptive state. For each behavior, the perceptive state is a discretization of its perceptive domain (the range of the input signal). Namely, the domain for Avoid spans the interval [0, max_sonar]; the domain of Recharge is [0, max_charge], where max_charge represents the maximum battery charge; the Escape domain is in [0, max_fear], where max_fear is the maximum height (in pixels) of a red object in the FOV. The perceptive state is obtained as a discretization of the perceptive domain using equidimensional intervals. We tested our system discretizing the perceptive state at different granularities.
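For example, with the 24-state setting adopted below and 4 clock periods per behavior, each perceptive domain would be split into 6 equal-width intervals (4 × 6 = 24); this interval count is our inference from the state-space sizes. The discretization itself amounts to a one-line binning function (a sketch):

```python
def perceptive_interval(x, domain_max, n_intervals):
    """Map a raw reading x in [0, domain_max] to an equal-width interval index."""
    idx = int(x * n_intervals / domain_max)
    return min(idx, n_intervals - 1)  # clamp the top edge into the last bin
```

The full state is then the pair `(current_period, perceptive_interval(x, domain_max, n))`.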
Q-values. The resulting Q-table for a generic behavior is described in Table 1.
Perceptive state   Attentional state   Period 1    ...   Period k
Interval 1         Period 1            Q_{1,1,1}   ...   Q_{1,1,k}
                   ...                 ...         ...   ...
                   Period k            Q_{1,k,1}   ...   Q_{1,k,k}
...                ...                 ...         ...   ...
Interval n         Period 1            Q_{n,1,1}   ...   Q_{n,1,k}
                   ...                 ...         ...   ...
                   Period k            Q_{n,k,1}   ...   Q_{n,k,k}

Table 1: Q-values for a generic behavior. Each row is a state (perceptive interval, current period); each column is an action (next period).
Reward function. We assume the reward is always negative, with a strong penalty (−r_max) if the system cannot survive. For the other cases the penalty is as follows. Concerning Avoid, each activation is penalized with one (R_a = −r_max if x_a ≤ crash and −1 otherwise). As for Recharge, for each activation the penalty is inversely proportional to the current charge (R_r = −r_max if x_r falls under the established minimum and −(1 − x_r/max_charge) otherwise). Finally, each activation of Escape is penalized proportionally to the current amount of fear (R_e = −r_max for x_e ≥ th_fear and −x_e/max_fear otherwise). For our experiments we adopt the following settings:

- r_max: maximum penalty (1400 units of penalty);
- max_time: maximum time allowed to accomplish the task (180 seconds);
- max_sonar: maximum sonar range (1 meter);
- crash: minimum distance under which the robot stops (0.4 meters);
- max_charge: maximum value attainable for the charge (150 units of charge);
- th_charge: minimum value of the charge under which the robot needs to recharge (140 units of charge);
- max_fear: maximum height of a red blob (dangerous object) perceived by the camera (30 pixels);
- th_fear: minimum height of a red blob beyond which the robot does not work (23 pixels).
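Under one reading of these definitions, the three per-behavior rewards can be sketched as below. The branch conditions are partly reconstructed from the text (in particular, the Recharge failure branch is taken to fire when the battery is exhausted), so treat the exact forms as assumptions.

```python
R_MAX = 1400.0          # maximum penalty
MAX_SONAR, CRASH = 1.0, 0.4
MAX_CHARGE = 150.0
MAX_FEAR, TH_FEAR = 30.0, 23.0

def reward_avoid(nearest):
    # Strong penalty on (near-)collision, otherwise a unit cost per activation.
    return -R_MAX if nearest <= CRASH else -1.0

def reward_recharge(charge):
    # Penalty grows as the battery empties; maximal penalty when it runs out.
    return -R_MAX if charge <= 0 else -(1.0 - charge / MAX_CHARGE)

def reward_escape(fear):
    # Penalty proportional to fear; maximal once the danger threshold is crossed.
    return -R_MAX if fear >= TH_FEAR else -fear / MAX_FEAR
```

Since every reward is negative, the learned policies minimize accumulated penalties rather than maximize positive payoffs.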
Setting the state space. First of all, we carried out some tests evaluating the convergence of the Q-learning process, while changing the granularity and dimension of the state space. Each test consists of 5 experiments, each subdivided into 1000 episodes. We set the learning rate at 0.8. We evaluated the system performance on 4 different representations of the state space. Namely, for each behavior, we considered 20, 24, 28, and 32 states, obtained by changing the size of the intervals used to partition the perceptive domain, while using a fixed discretization of the clock periods for each test. In Fig. 4, we illustrate the variation of the fitness values with respect to the state representation. The fitness function evaluates the success percentage, i.e. the number of positive endings.

Fig. 4: Fitness convergence, varying the state space representation.

We observe that for all the state representations we get a good percentage of success (up to 98% of positive endings) after 200 episodes. However, the one with 24 states converges faster, reaching 100% positive endings after 300 episodes. In Fig. 5, we show the accumulated rewards for each representation.

Fig. 5: Rewards with different state space representations.

Also in this case, we obtain the best regulation with the 24-states setting; therefore, we decided to employ this representation for our experiments.
Setting the learning rate. The learning rate α is a crucial parameter that strongly affects Q-learning velocity and convergence. We tested 4 different settings, namely 0.2, 0.4, 0.6, and 0.8. The results are depicted in Fig. 6, where we compare the convergence curves. Here, we obtain the best regulation with α = 0.8. This result is corroborated by the reward values depicted in Fig. 7, where the minimum amount of penalties is associated with α = 0.8.
Fig. 6: Fitness convergence, varying the learning rate parameter.
Fig. 7: Rewards relative to different values of the learning rate.
4 Experiments and Results
We tested the attentional system in 4 environments (see Fig. 2) with incremental complexity in the number and disposition of the objects (red, green, and black cubes). Each experiment starts with the initial values in the Q-tables set to 0.

Fig. 8: Success rate in the survival domain.

In Fig. 8 we show the success rate for each environment. Here, the learning curve always converges to 100%, i.e. during the episodes the system is effective in learning the attention allocation strategies used to select the actions suitable for survival.
Furthermore, we analyzed the reliability, efficiency, and effectiveness of the learned attentional strategies (RL-AIRM), comparing them with the results obtained with manually tuned attentional strategies (AIRM). We tested these two settings in the 4 environments, collecting means and standard deviations over 100 tests. The results are shown in Fig. 9 and Table 2.

Fig. 9: Comparison of architectures. Means collected on 100 validation tests on performance measures.

In Fig. 9 we can see that in almost all the environments RL-AIRM shows a higher success rate and a lower cost (less cost means better performance). Concerning efficiency, in Table 2 we can observe that both RL-AIRM and AIRM are able to reduce and focus the behaviors' activations (i.e. the total number of cycles these behaviors are activated). AIRM seems more efficient, but it is also less reliable and effective (as shown in Fig. 9); hence RL-AIRM seems to provide a better balance of efficiency (minimum activations), reliability (maximum success rate), and effectiveness (minimum cost).
           RL-AIRM                                       AIRM
Data       Env1      Env2      Env3      Env4      Env1      Env2      Env3        Env4
Rewards    −386±7    −492±6    −408±23   −329±28   −350±12   −797±584  −1462±1271  −412±28
Avoid      404±9     404±9     404±10    406±10    320±10    355±20    312±76      394±23
Recharge   224±27    266±43    339±73    270±42    217±45    235±69    252±97      198±24
Escape     192±14    199±37    272±35    234±29    95±1      98±3      90±17       104±3
Survival   180       180       180       180       180       179±4     160±30      180
Failures   0%        0%        0%        0%        0%        6%        28%         0%
Cycles     1135±1    1135±1    1135±1    1135±1    1135±1    1130±20   1000±200    1135±1

Table 2: Comparison of architectures. Means and standard deviations collected on 100 validation tests on performance measures.
Overall, reinforcement learning seems effective in regulating attention allocation strategies and behaviors' activations. The combined use of attentional mechanisms and learning strategies permits good performance in terms of reliability, adaptivity, effectiveness, and efficiency.
5 Conclusions
We presented a RL approach to attentional allocation and action selection in a robotic setting. Differently from classical RL models for action selection, where actions are chosen according to the operative/perceptive contexts, in our case the action selection is mediated by the attentional status of the behavior. In our setting, the learning process adapts and modulates the attentional strategies, while action selection is obtained as a consequence. We discussed the approach considering learning and executive performance in a survival domain. The collected results show that RL is effective in regulating simple attention allocation mechanisms and the associated behaviors' activation strategies.
Acknowledgments. Work supported by the European Community, within the
FP7 ICT-287513 SAPHARI project.
References

1. Bandera, C., Vico, F.J., Bravo, J.M., Harmon, M.E., Baird III, L.C.: Residual Q-learning applied to visual attention. In: ICML-96. pp. 20–27 (1996)
2. Burattini, E., Rossi, S.: Periodic adaptive activation of behaviors in robotic system.
IJPRAI 22(5), 987–999 (2008)
3. Burattini, E., Rossi, S., Finzi, A., Staffa, M.: Attentional modulation of mutually
dependent behaviors. In: Doncieux, S., Girard, B., Guillot, A., Hallam, J., Meyer,
J.A., Mouret, J.B. (eds.) SAB. Lecture Notes in Computer Science, vol. 6226, pp.
283–292. Springer (2010)
4. Houk, J.C., Adams, J.L., Barto, A.G.: A model of how the basal ganglia generate
and use neural signals that predict reinforcement. In: Houk, J.C., Davis, J.L.,
Beiser, D.G. (eds.) Models of Information Processing in the Basal Ganglia, pp.
249–270. MIT Press, Cambridge, MA (1995)
5. Kahneman, D.: Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall (1973)
6. Montague, P.R., Dayan, P., Sejnowski, T.J.: A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience 16(5), 1936–1947 (1996)
7. Norman, D., Shallice, T.: Attention in action: willed and automatic control of
behaviour. Consciousness and Self-regulation: advances in research and theory 4,
1–18 (1986)
8. Paletta, L., Fritz, G., Seifert, C.: Q-learning of sequential attention for visual object
recognition from informative local descriptors. In: ICML-05
9. Senders, J.: The human operator as a monitor and controller of multidegree of freedom systems. IEEE Transactions on Human Factors in Electronics, pp. 2–6 (1964)
10. Sutton, R., Barto, A.: Reinforcement learning: An introduction, vol. 1. Cambridge
Univ Press (1998)
11. Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents.
In: ICML-93. pp. 330–337. Morgan Kaufmann (1993)
12. Watkins, C., Dayan, P.: Q-learning. Machine learning 8(3), 279–292 (1992)
... Typically, within these approaches, RL is used to directly model and generate the action selection strategies. In contrast, we propose a system where RL is deployed to learn attentional allocation and shifting strategies, while action selection emerges from the regulation of attentional monitoring mechanisms (Di Nocera et al., 2012), which can be affected by the intrinsic motivation of curiosity. Our curiosity model is inspired by the interest/deprivation model proposed by Litman (2005), which captures both optimal-arousal and curiosity-driven approaches of curiosity modeling. ...
... Otherwise, when the trigger function is 1, the behavior is activated and the clock period changes according to the φ(f ). In order to learn attentional monitoring strategies, various methods such as Differential Evolution (Burattini et al., 2010) and RL techniques (Di Nocera et al., 2012) have been deployed, respectively for off-line and on-line tuning of the parameters regulating the attentional monitoring functions. In the following sections, we will present an intrinsically motivated RL (IMRL) approach to the attentional allocation problem in our frequency-based model of attention. ...
... Following the approach by Di Nocera et al. (2012), in this paper we exploit a RL algorithm to learn the attention allocation strategies introduced in section 2.1. In Di Nocera et al. (2012), a Q-learning algorithm is used to tune and adapt the frequencies of sensors sampling, while action selection is obtained as a side effect of this attentional regulation. ...
Full-text available
The concepts of attention and intrinsic motivations are of great interest within adaptive robotic systems, and can be exploited in order to guide, activate, and coordinate multiple concurrent behaviors. Attention allocation strategies represent key capabilities of human beings, which are strictly connected with action selection and execution mechanisms, while intrinsic motivations directly affect the allocation of attentional resources. In this paper we propose a model of Reinforcement Learning (RL), where both these capabilities are involved. RL is deployed to learn how to allocate attentional resources in a behavior-based robotic system, while action selection is obtained as a side effect of the resulting motivated attentional behaviors. Moreover, the influence of intrinsic motivations in attention orientation is obtained by introducing rewards associated with curiosity drives. In this way, the learning process is affected not only by goal-specific rewards, but also by intrinsic motivations.
... Suivant les mêmes principes que (Humphrys, 1996), nous employons une architecture de sélection d'action dans laquelle un agent connaît différents comportements de bas niveau (que nous appelons comportements de base) et doit choisir une action d'après l'utilité/la préférence que lui associent ces comportements de base. De précédents travaux dans le domaine de l'A/R pour la sélection d'action se sont concentrés : ...
... -soit sur l'adaptation du processus de combinaison de comportements de base (Buffet et al., 2002), -soit sur l'apprentissage/l'amélioration de comportements de base sélectionnés au préalables (Humphrys, 1996, Buffet et al., 2003. ...
... Dans notre cas, il va falloir identifier dans l'environnement courant les motivations présentes (comme définies dans la section 2.2) pour associer à chacune un comportement de base qui jouera le rôle d'expert. Pour éviter la tâche laborieuse de régler manuellement un grand nombre de paramètres associés à ce processus de pesée et de sélection d'actions, diverses approches d'apprentissage par renforcement (A/R) ont été étudiées (voir (Lin., 1992, Humphrys, 1996 par exemple). Le présent article se situe dans cette lignée, proposant une approche innovante pour atteindre une plus grande automatisation de la conception de l'agent. ...
Full-text available
Ce document présente mon ``projet de recherche'' sur le thème de l'embodiment (``cognition incarnée'') au croisement des sciences cognitives, de l'intelligence artificielle et de la robotique. Plus précisément, je montre comment je compte explorer la façon dont un agent, artificiel ou biologique, élabore des représentations utiles et pertinentes de son environnement. Dans un premier temps, je positionne mes travaux en explicitant notamment les concepts de l'embodiment et de l'apprentissage par renforcement. Je m'attarde notamment sur la problématique de l'apprentissage par renforcement pour des tâches non-Markoviennes qui est une problématique commune aux différents travaux de recherche que j'ai menés au cours des treize dernières années dans des contextes mono et multi-agents, mais aussi robotique. L'analyse de ces travaux et de l'état de l'art du domaine me conforte dans l'idée que la principale difficulté pour l'agent est bien celle de trouver des représentations adaptées, utiles et pertinentes. J'argumente que l'on se retrouve face à une problématique fondamentale de la cognition, intimement liée aux problèmes de ``l'ancrage des symboles'', du ``frame problem'' et du fait ``d'être en situation'' et qu'on ne pourra y apporter des réponses que dans le cadre de l'embodiment. C'est à partir de ce constat que, dans une dernière partie, j'aborde les axes et les approches que je vais suivre pour poursuivre mes travaux en développant des techniques d'apprentissage robotique qui soient incrémentales, holistiques et motivationnelles.
... The vast majority of RL applications make use of a human-designed evaluative feedback system, which gives rise to the credit assignment problem. One alternative is to use evolutionary methods, as demonstrated by Humphrys (1995Humphrys ( , 1996. Here, Q-learning agents (Watkins and Dayan, 1992) with different reward functions encoded in their genomes compete for control of a robot via a process called W-learning. ...
... Creating a correspondence library, particularly for agents with dissimilar embodiments, is the subject of work by Alissandrakis et al. (2002Alissandrakis et al. ( , 2005; Alissandrakis (2003), and the aforementioned CELL system (Roy, 1999;Roy and Pentland, 2002) implements lexical learning which could facilitate instruction. Humphrys (1995Humphrys ( , 1996; Damoulas et al. (2005a,b) show that it is possible to evolve reward functions for Reinforcement Learning using a genetic algorithm. There are many rule induction methods (Cohen, 1995) which could seed the insight learning and perceptual reconfiguration modules, including the use of decision trees (Quinlan, 1992) and neural networks (Omlin and Giles, 1996). ...
... In these last years some researchers started to pay attention to the role of attentional processes in order to achieve an adaptive emergent behavior of robotics systems. In previous papers [12,15,16] , we highlighted the opportunity of managing the frequency of processing the sensors inputs and action activations in an efficient way. This goal was achieved by introducing " internal clocks " in a robotic architecture, to regulate the frequency of sensors readings (see Section 3). ...
... In previous work [12] , we compared 4 possible monitoring policies (continuous , constant periodic, interval reduction with a priori knowledge and adaptive periodic monitoring) by showing that the adaptive periodic one is the most suitable in case of absence of a priori knowledge and a dynamical environment. Moreover, we evaluated the performance of different adaptation strategies [12] and deployed learning algorithms to optimize such strategies [16] with the sults that, even with simple adaptive mechanisms, an increase of performance and a flexibility in the emergent behavior can be reached. However, in previous works, no real evaluation of possible relationship between human beings abilities in monitoring/tracking and the robots behavior strategies was performed on the same task. ...
Conference Paper
Full-text available
Convoy driving requires both the leader and the follower to accomplish the task. Namely, also the leader has to monitor the following agents behavior and to adapt its own in order to not outdistance them. Our working hypothesis is that effective teamwork can be achieved by adapting periodic monitoring strategies. Inspired by the behavior of human beings, we adopted attentional mechanisms for filtering data and actively focusing the monitoring only on relevant information and agent behaviors. The robotic convoy task is accomplished via a behavior-based control architecture endowed with attentional mechanisms producing a variable frequency of the monitoring. In this paper, we consider a convoy task as a benchmark to evaluate and compare human and robot monitoring behaviors. We illustrate the various parts of the control architecture as well as present and discuss the results of experiments performed in a real world scenario with humans and robots.
... Here, a distributed and implicit representation of behavioral modules is employed, instead, following the ATA approach, we assume an explicit (and symbolic) representation of the behaviors and focus on attention-based control mechanisms for behavioral orchestration. Learning methods for bottom-up attentional regulations suitable for reactive robotic control have been proposed by Di Nocera et al. (2012Nocera et al. ( , 2014) exploiting a reinforcement learning approach. In contrast, we provide a learning method for both top-down and bottom-up attentional regulations. ...
We present a framework for robotic cognitive control endowed with adaptive mechanisms for attentional regulation and task execution. In cognitive psychology, cognitive control is the process that orchestrates executive and cognitive processes supporting adaptive responses and complex goal-directed behaviors. Similar mechanisms can be deployed in robotic systems in order to flexibly execute complex structured tasks. In this work, following a supervisory attentional system paradigm, we propose an approach that permits learning how to exploit top-down and bottom-up attentional regulations to guide the execution of hierarchically structured tasks. We present the overall framework, discussing its functioning in a mobile robot case study considering pick-carry-place tasks. In this setting, we show that the proposed system can be trained on-line by a user in order to execute incrementally complex activities.
... Improving Arbitration - The simple approach adopted, of using a flat architecture and choosing on the basis of the highest Q £ R value, does not directly implement any form of learning at a higher level. A more hierarchical architecture could rely on internal agents trained to arbitrate (Dorigo and Colombetti, 1997), or on direct Reinforcement Learning such as W-learning (Humphreys, 1997), especially since the Q-values within the separate agents are already established. Simpler approaches - to realise sequences or combinations of behaviours - could rely on some form of explicit internal memory, which agents would access to provide essentially internal states (along with the external state of the perceived environment). ...
The ability of an animal to generate adaptive behaviour often forms an immediately beneficial response in an unexpected direction. As such, it represents a creative process by which accumulated knowledge of the world is manipulated and exploited in novel ways. This ability would also benefit a simulated animal - an animat - in its interaction with the environment. Presented in this work is a first step at low-level analysis, replication and simulation of such a response, within the framework of behaviour-based Shaped Robotics. This has led to a simple, complete robotic architecture called FACADE, consisting of a number of agents that are implemented as Anticipatory Classifier Systems trained in basic behavioural responses. FACADE is able to arbitrate and combine these agents to form more complex responses, and to generate a dynamic new behaviour by detecting a reduced global level of reinforcement ("hunger"). The learning rate and performance of this new behaviour can be improved by instantiating it with knowledge acquired from the world models of established agents via concept clustering techniques, which are further manipulated to generate novel combinations.
... In this specific application the values of these parameters are chosen experimentally (see Sect. 3.1.1 and Table 1), but they can also be tuned by learning mechanisms, either off-line or on-line, as shown in previous works [12,18]. AVOID supervises human safety during human-robot interaction. ...
Human robot collaborative work requires interactive manipulation and object handover. During the execution of such tasks, the robot should monitor manipulation cues to assess the human intentions and quickly determine the appropriate execution strategies. In this paper, we present a control architecture that combines a supervisory attentional system with a human aware manipulation planner to support effective and safe collaborative manipulation. After detailing the approach, we present experimental results describing the system at work with different manipulation tasks (give, receive, pick, and place).
An integrated celestial navigation scheme for spacecrafts based on an optical interferometer and an ultraviolet Earth sensor is presented in this paper. The optical interferometer is adopted to measure the change in inter-star angles due to stellar aberration, which provides information on the velocity of the spacecraft in the plane perpendicular to the direction of the observed star. In order to enhance the navigation performance, the measurements obtained from the ultraviolet Earth sensor are used to eliminate the unfavorable effect caused by the gravitational deflection of starlight. As the prior knowledge about the optical path delay bias of the optical interferometer may be ambiguous, a Q-learning extended Kalman filter is derived to fuse the two types of measurements and estimate the kinematic state together with the optical path delay bias. The solution of the autonomous navigation system consists of position, velocity and attitude of the spacecraft. Numerical simulation shows that an evident improvement in navigation accuracy can be achieved by introducing the ultraviolet Earth sensor into the navigation system. In addition, it is shown that the Q-learning extended Kalman filter performs better than the traditional extended Kalman filter.
A robotic system that interacts with humans is expected to flexibly execute structured cooperative tasks while reacting to unexpected events and behaviors. In this paper, we face these issues by presenting a framework that integrates cognitive control, executive attention, and hierarchical plan execution. In the proposed approach, the execution of structured tasks is guided by top-down (task-oriented) and bottom-up (stimuli-driven) attentional processes that affect behavior selection and activation, while resolving conflicts and decisional impasses. Specifically, attention is here deployed to stimulate the activations of multiple hierarchical behaviors, orienting them towards the execution of finalized and interactive activities. On the other hand, this framework allows a human to indirectly and smoothly influence the robotic task execution by exploiting attention manipulation. We provide an overview of the overall system architecture, discussing the framework at work in different case studies. In particular, we show that multiple concurrent tasks can be effectively orchestrated and interleaved in a flexible manner; moreover, in a human-robot interaction setting, we test and assess the effectiveness of attention manipulation for interactive plan guidance.
Conference Paper
Human-robot teamwork requires agents to pay attention to both the surrounding environment and their teammates. Bandwidth and computational limitations prevent an agent from continuously executing this monitoring activity. Inspired by the behavior of human beings, who pay frequent attention to timers while approaching deadlines, we provide robots with general monitoring strategies based on attentional mechanisms, for filtering data and actively focusing only on relevant information. We consider a convoy task (led by a human or a robot) as a benchmark to evaluate and compare human and robot monitoring behaviors.
Conference Paper
In this paper, we investigate simple attentional mechanisms suitable for sensing rate regulation and action coordination in the presence of mutually dependent behaviors. We present our architecture along with a case study where a real robotic system is to manage and harmonize conflicting tasks. This research focuses on attentional mechanisms for regulating the frequencies of sensor readings and action activations in a behavior-based robotic system. Such mechanisms are to direct sensors toward the most salient sources of information and filter the available sensory data to prevent unnecessary information processing.
Conference Paper
This work provides a framework for learning sequential attention in real-world visual object recognition, using an architecture of three processing stages. The first stage rejects irrelevant local descriptors based on an information theoretic saliency measure, providing candidates for foci of interest (FOI). The second stage investigates the information in the FOI using a codebook matcher, providing weak object hypotheses. The third stage integrates local information via shifts of attention, resulting in chains of descriptor-action pairs that characterize object discrimination. A Q-learner then adapts from explorative search and evaluative feedback from entropy decreases on the attention sequences, eventually prioritizing shifts that lead to a geometry of descriptor-action scan-paths that is highly discriminative with respect to object recognition. The methodology is successfully evaluated on indoor (COIL-20 database) and outdoor (TSG-20 database) imagery, demonstrating significant impact by learning, outperforming standard local descriptor based methods both in recognition accuracy and processing time.
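The Q-learner described in this abstract follows the standard tabular Q-learning update. The sketch below is a generic illustration, not the authors' code: the state and action names are hypothetical, and the scalar reward stands in for the entropy decrease used as feedback in the paper.

```python
from collections import defaultdict

# Generic tabular Q-learning step: states would correspond to local
# descriptor observations (FOI) and actions to shifts of attention.
def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])

# Tiny usage example: two foci of interest, two possible attention shifts.
Q = defaultdict(float)
actions = ["shift_left", "shift_right"]
q_update(Q, "foi_A", "shift_right", reward=1.0, next_state="foi_B",
         actions=actions)
# After one step from zero initialization:
# Q[("foi_A", "shift_right")] = 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

Over repeated episodes, shifts that consistently reduce entropy accumulate higher Q-values and are prioritized, yielding the discriminative scan-paths the abstract describes.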
We develop a theoretical framework that shows how mesencephalic dopamine systems could distribute to their targets a signal that represents information about future expectations. In particular, we show how activity in the cerebral cortex can make predictions about future receipt of reward and how fluctuations in the activity levels of neurons in diffuse dopamine systems above and below baseline levels would represent errors in these predictions that are delivered to cortical and subcortical targets. We present a model for how such errors could be constructed in a real brain that is consistent with physiological results for a subset of dopaminergic neurons located in the ventral tegmental area and surrounding dopaminergic neurons. The theory also makes testable predictions about human choice behavior on a simple decision-making task. Furthermore, we show that, through a simple influence on synaptic plasticity, fluctuations in dopamine release can act to change the predictions in an appropriate manner.
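The prediction-error signal this abstract associates with dopamine fluctuations is commonly formalized as the temporal-difference (TD) error, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). The following is a minimal numerical sketch with illustrative state names, not code or notation taken from the paper itself.

```python
# TD prediction error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
# Positive delta (reward better than predicted) would correspond to
# dopamine activity above baseline; negative delta to activity below it.
def td_error(reward, v_next, v_current, gamma=0.95):
    return reward + gamma * v_next - v_current

def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.95):
    """Move V(state) toward the TD target by one learning step."""
    delta = td_error(reward, values[next_state], values[state], gamma)
    values[state] += alpha * delta
    return delta

# Hypothetical two-state task: a cue followed by a rewarded state.
values = {"cue": 0.0, "reward_state": 0.0}
delta = td_update(values, "cue", "reward_state", reward=1.0)
# An unexpected reward yields a large positive error (delta = 1.0);
# with training, the cue comes to predict the reward and delta shrinks.
```

This shrinking of the error as predictions improve is exactly the mechanism the abstract invokes to explain how dopamine fluctuations could drive synaptic plasticity appropriately.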
Discusses the operations of 2 posterior brain systems that may be involved in selective attention: 1 is involved with the selection of spatial information, and the other is involved with the selection of semantic information. While dual-task studies have shown some independence between these forms of processing, other evidence indicates that attending to language information can interfere with the processing of spatial cues. Just as posterior attentional systems act to select candidate sensory stimuli, it is possible that higher levels of the system act to prevent sensory events from inappropriate control of performance of many cognitive tasks. Several lines of evidence suggest that the dorsolateral prefrontal cortex may play an important role in this higher level of attention. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
The main goal of our current research is the design of a robotic architecture that has the capability of adapting the robot's behavior to the rate of change of a dynamic environment. We present a model freely inspired by some features of biological clocks. In particular, we associate the concept of Innate Releasing Mechanisms (IRM) with the concept of periodic behavior activation in order to link the variability of the behavior to the circumstances in which it is activated. We propose an architecture in which the frequency of access to the sensory system is modified in accordance with the environmental changes. To this purpose we use the Schema Theory paradigm. Some first experimental results are reported and discussed.