Modeling, Identification and Control, Vol. 42, No. 4, 2021, pp. 197–204, ISSN 1890–1328
Model-Free All-Source-All-Destination Learning as
a Model for Biological Reactive Control
M. Knudsen 1, S. Hendseth 1, G. Tufte 2, A. Sandvig 3
1Department of Engineering Cybernetics, Norwegian University of Science and Technology, N-7491 Trondheim,
Norway. E-mail: {Martinius.Knudsen,Sverre.Hendseth}@ntnu.no
2Department of Computer Science, Norwegian University of Science and Technology, N-7491 Trondheim, Norway.
E-mail: {Gunnar.Tufte}@ntnu.no
3Department of Neuromedicine and Movement Science, Norwegian University of Science and Technology, N-7491
Trondheim, Norway. E-mail: {Axel.Sandvig}@ntnu.no
Abstract
We present here a model-free method for learning actions that lead to an all-source-all-destination shortest
path solution. We motivate our approach in the context of biological learning for reactive control. Our
method involves an agent exploring an unknown world with the objective of learning how to get from any
starting state to any goal state in shortest time without having to run a path planning algorithm for each
new goal selection. Using concepts of Lyapunov functions and Bellman’s principle of optimality, our agent
learns universal state-goal distances and best actions that solve this problem.
Keywords: Control, Learning, Shortest path
1 Introduction
We are all naturally born explorers. If a child is not sleeping or feeding, they are most likely moving. In seemingly sporadic ways the baby moves arms and legs without much goal or intent. Movement provides valuable sensory data streams which are essential in developing the child's sensorimotor skills and their knowledge of the body's state-space. The brain is in fact constantly bombarded with some 11 million bits per second of sensory feedback Britannica (2020), providing it with data to keep track of the body's state as well as information about the surrounding environment. All this data enables one of the brain's core functions: to process this input stream and apply motor actions in accordance with the individual's needs and objectives.
Exactly how the brain achieves this feat of learning control policies is still an active research topic. Better understanding in this area may contribute greatly towards novel methods in AI. Such knowledge could also facilitate rehabilitation of patients with stroke, spinal cord injuries and traumatic brain injuries, as well as help restore neurological function to a level that is currently not possible.
Control can largely be divided into prediction and re-
action. Prediction involves forward simulation, which
estimates future outcomes given an understanding of
the system dynamics and the effect of actions, which
together comprise a model of the system. Reactive con-
trol requires no forward prediction, a property which
allows reaction to be computationally more efficient
than prediction. In addition, reaction can be accom-
plished even without a model, as it only requires a
mapping from state to action. In control engineering,
a highly successful predictive method is that of Model-
Predictive Control (MPC) Camacho and Alba (2013).
MPC has the ability to anticipate future events and
thereby take control actions according to the desired
outcome Camacho and Alba (2013). While the design of conventional controllers typically requires a system/plant model in order to design a state-action mapping, intelligent control methods such as reinforcement learning (RL) learn mappings through exploration of and experimentation in the environment Sutton and Barto (2018). It is likely that the brain utilizes both model-based predictive control and model-free reactive control, either for different types of tasks or for different stages of the same task. An example of the latter is
learning to juggle: at first the beginner uses intrin-
sic knowledge of physics to predict where the ball will
land and how hard to toss the ball into the air. At this
stage many mistakes are made and the mental load is
high. As the individual becomes more skilled, mental
load decreases, even though precision increases. This
effect can be observed when comparing EEG signals
of beginners and experts Schiavone et al. (2015). It
would seem that the practitioner moves from prediction-based control towards reaction. Indeed, many experts describe their reactive behavior as intuition Izquierdo-Torres and Di Paolo (2005). This is interesting from a biological point of view, as the brain continuously strives to solve the computational problem at hand as cost-efficiently as possible, using the minimum energy required for the task.
Inspired by the brain's ability to learn through exploration Roussou (2004), we ask how we can implement
more efficient learning by utilizing the feedback we get
about every state-action transition from the environ-
ment. In this paper, instead of defining a single goal
upfront, we treat every state we observe as a poten-
tial goal. This way, should our goal be redefined, we
already have the knowledge to quickly and efficiently
find our way to this newly defined goal. This requires
no additional time-consuming and computationally ex-
pensive training, and also addresses the problem of
catastrophic forgetting McCloskey and Cohen (1989).
Our aim is to explore the environment once and, in doing so, learn how to take actions that lead to any potential goal in the environment. Using principles from Lyapunov
theory and dynamic programming Bertsekas (2016),
we present an algorithm for solving the all-source-all-
destination (ASAD) problem in a model-free way in
an unknown world. With this, we hypothesize that
the brain's reactive control may result from learning
spatio-temporal maps of the distances between states
in state-space. These maps are continuously updated
as new state-space trajectories are explored and faster
paths are found.
The paper is laid out as follows: Section 2 provides
the background of the ASAD problem and the core
principles of the method. In Section 3 we describe the worlds in which we apply our method, as well as the method itself, applied through an agent. In Section 4 we present the results, and we discuss these in Section 5. Finally, we conclude the work in Section 6.
2 Background
2.1 All-pairs shortest path
The ASAD problem is essentially an all-pairs shortest
path problem Deo and Pang (1984). The main difference between our method and the established methods for solving this problem is that the latter assume there exists a model or encoding of the system to which the algorithm can be applied. Most algorithms are given a known graph from the start, while our method is not. Being model-free, we also do not attempt to generate a model of the graph, nor do we store information about the graph structure. In addition, most of the established algorithms, with some exceptions, primarily
aim to find the shortest paths between states, and not
the actual actions that need to be taken at each state
in order to move towards the goal. For our context,
which is to enable an agent to learn to operate in an
unknown world, learning and remembering what the
optimal actions are is essential.
One of the more established algorithms for solving
the all-pairs problem is the Floyd-Warshall (FW) al-
gorithm Floyd (1962). Like the Bellman-Ford algorithm Bellman (1958); Ford Jr (1956) and Dijkstra's algorithm Dijkstra et al. (1959), it computes shortest paths in a graph. However, Bellman-Ford and Dijkstra are both single-source shortest-path algorithms, while FW computes the shortest distances between every pair of vertices in the input graph. FW accomplishes this by incrementally improving an estimate of the shortest path between two vertices, evaluating whether there exist better routes through any of the other vertices in the graph. The algorithm finds the optimal solution in O(Ve^3) time, where Ve is the number of vertices.
While the original FW algorithm only finds the
shortest paths between vertices in the graph, it can
be extended with the possibility for path reconstruc-
tion by also saving the actions during its run. Our
Python code for the extended FW algorithm can be
seen in Listing 1. We will be using this algorithm for
comparison with our own method in the results sec-
tion 4. While the FW method requires the complete
graph up front, we will show how this is not necessary
in our method.
import numpy as np

def floydWarshall(graph):
    # Ve (number of vertices) and INF (no-edge marker) are module-level values.
    # dist[i][j]: current best distance from i to j.
    # action[i][j]: next vertex on the best path from i to j (for path reconstruction).
    dist = np.array(graph)
    action = np.full((Ve, Ve), np.nan)
    for i in range(Ve):
        for j in range(Ve):
            if graph[i][j] != INF:
                action[i][j] = j
    for k in range(Ve):          # intermediate vertex
        for i in range(Ve):      # source
            for j in range(Ve):  # destination
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    action[i][j] = action[i][k]
    return dist, action
Listing 1: Our Python code for the Floyd-Warshall
algorithm with path reconstruction.
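To illustrate how the listing is used, the small example below runs it on a hand-made 4-vertex graph. The adjacency matrix, and the values chosen for the module-level Ve and INF that the listing assumes, are hypothetical and purely illustrative:

INF = float('inf')
Ve = 4  # number of vertices; floydWarshall() reads Ve and INF at module level

# Hypothetical weighted adjacency matrix; INF marks a missing edge.
graph = [[0,   5,   INF, 10 ],
         [INF, 0,   3,   INF],
         [INF, INF, 0,   1  ],
         [INF, INF, INF, 0  ]]

dist, action = floydWarshall(graph)
print(dist[0][3])         # 9.0, via the route 0 -> 1 -> 2 -> 3
print(int(action[0][3]))  # 1, the next vertex to visit from 0 on the way to 3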
2.2 Lyapunov functions
Our method is inspired by the idea of Lyapunov func-
tions Freeman and Primbs (1996), which states that if
you have a function V that returns a positive scalar for all non-goal states x, i.e. V(x_t) ≥ 0, and you can have this function continuously decrease while the system moves towards some goal state, i.e. V̇(x_t) ≤ 0, then you essentially have a Lyapunov function. In control theory, Lyapunov functions are often constructed using a conservation law, but any function satisfying the definition is applicable Freeman and Primbs (1996). If a state is reachable from a set of initial states, then there must exist a Lyapunov function for these initial states. Put more formally:
If, from a state A, a goal B is reachable, then, given some cost metric, there must exist at least one optimal path between A and B. If this optimal path is followed, the associated cost along this path should continuously decrease, while remaining positive for all non-goal states since the cost is assumed positive.
This cost could be a Euclidean distance, a path length, the system's energy or even the time-to-destination. We here use this latter metric by imagining that the Lyapunov function is the number of steps it takes to traverse the optimal path between two states. This time-to-destination is always a positive value, except at the goal state where it is 0. We wish to select actions so as to continuously decrease this time-to-destination, which then satisfies the conditions of a Lyapunov function. Unlike most traditional Lyapunov-based control algorithms, where one finds a control law that makes the system stable for a single given goal (all-source-single-destination, ASSD) Freeman and Primbs (1996), our method finds the set of optimal actions that gets us to all other reachable states in the system.
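As a concrete reading of these conditions in our discrete setting, the short check below (a sketch of our own, with a hypothetical time_to_goal lookup) verifies that a stored time-to-destination behaves like a Lyapunov function along a candidate path: positive before the goal, strictly decreasing at every step, and zero at the goal.

def is_lyapunov_descent(path, goal, time_to_goal):
    # path: sequence of visited states; time_to_goal[s]: stored steps from s to the goal.
    values = [time_to_goal[s] for s in path]
    positive = all(v > 0 for v in values[:-1]) and values[-1] == 0
    decreasing = all(b < a for a, b in zip(values, values[1:]))
    return path[-1] == goal and positive and decreasing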
2.3 Principle of optimal subpaths
Another closely related concept is that of Bellman’s
principle of optimality, which states that for any opti-
mal path between two states, all subpaths of this path
are also optimal Bellman (1952). For us, this property
means that we only need to know the optimal action to
take for the current state we are at, and not the whole
future action trajectory as required in predictive con-
trol. This is because taking the optimal action moves
us to a state along the optimal subpath, where we again
can select the optimal action for this new state. Thus,
we only need to store a single optimal action at each
state. The path reconstruction method for the FW
algorithm utilizes a similar strategy.
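To make this concrete, the action table produced by Listing 1 stores only the next vertex for every source-destination pair, yet the complete optimal path can be recovered by following those entries greedily. The helper below is a sketch of our own and assumes the destination is reachable:

def reconstruct_path(action, src, dst):
    # action[i][j] is the next vertex on the shortest path from i to j,
    # as filled in by floydWarshall() in Listing 1.
    path = [src]
    while src != dst:
        src = int(action[src][dst])
        path.append(src)
    return path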
3 Method
Using the principles of Lyapunov functions and optimal subpaths, we find that only two pieces of information are needed for each state-goal pair: (i) the best action to take in the current state and (ii) a value that encodes the distance between the current state and the goal. We use the state-transition feedback at each timestep in the environment to efficiently update these state-goal actions and values.
3.1 The world
Many environments are discrete, such as board games,
grid worlds, choices, or routing (vehicle routing prob-
lem Toth and Vigo (2002)). In reinforcement learning, discrete worlds have played an important role as an arena for testing out new algorithms, especially tabular RL methods. Almost all the powerful deep RL methods we hear about today have a tabular beginning. Similarly, the worlds in which we demonstrate our method are discrete and finite in both state- and action-space.
3.1.1 Graph world
We shall first demonstrate our method on a hypothet-
ical world which can be represented by a graph like
the one in Figure 1. Here, each node S represents a
state of the system, and each edge A represents an
action. At each state all actions are available for se-
lection. An action performed in a given state at the current timestep results in a transition to another state or back to the same state. Doing nothing is also regarded as an action, since the system evolves with time regardless of any input. Each transition is of step length one. This means that an action is applied at every step, and that the minimal possible path between two sequential states is one timestep/transition.
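For concreteness, such a world can be encoded for the simulator as a transition table tt[s, a] giving the successor state of each state-action pair; this is the tt referred to in Listings 2 and 3. The entries below are a hypothetical 5-state, 2-action example and are not claimed to reproduce Figure 1. The agent itself never sees this table.

import numpy as np

# Hypothetical transition table: states 0-4 stand in for A-E, and the two
# columns correspond to action 1 and action 2. tt[s, a] is the successor state.
tt = np.array([[1, 4],   # from A
               [0, 2],   # from B
               [1, 4],   # from C
               [2, 3],   # from D
               [3, 0]])  # from E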
3.1.2 2-link manipulator
The second world we test our method on is a discrete 2-link manipulator as depicted in Figure 2. This system comprises two joints and accompanying motors to rotate them. The action space is thus made up of 4 possible actions: clockwise and counter-clockwise rotations of motor 1, and similarly for motor 2.
Figure 1: Example graph showing the state-action
transitions. Here, the nodes are the 5 states
A-E and the edges are the actions. There are
2 possible actions, action 1 (red) and action
2 (blue). The agent does not have knowledge
about the transitions prior to exploration.
The state-space consists of the joint angles θ1 and θ2 and is discretized; each single rotation step leads to a unique state. Low-level motor controllers can ensure a realistic implementation of such a system. Within this system, not all states are allowed, as some states violate the operation boundaries either virtually (restricted manipulator configurations) or physically (e.g. obstacles in the workspace Spong et al. (2005)). In this system the goal is to move the manipulator through state-space without violating state constraints. We also want to be able to define new goals from any initial state without having to initiate path planning each time (path planning would also fall into the category of predictive control). The 2-link manipulator task is meant as an illustrative model of how animals may learn to move their limbs in a processing-efficient manner, yet with great precision; a feat that is still quite difficult to achieve in robotics.
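The prohibited states can be identified by checking every discretized (θ1, θ2) configuration for collision before exploration begins, as we later do for the workspace in Section 4.2. The sketch below is a simplification of our own: it samples points along both links via forward kinematics and tests them against a hypothetical list of obstacle points with an assumed clearance radius.

import numpy as np

def build_state_space(n_bins, l1, l2, obstacle_points, clearance=0.5):
    # allowed[i, j] is True when configuration (theta1_i, theta2_j) is collision-free.
    allowed = np.ones((n_bins, n_bins), dtype=bool)
    angles = np.linspace(0.0, 2 * np.pi, n_bins, endpoint=False)
    for i, th1 in enumerate(angles):
        for j, th2 in enumerate(angles):
            elbow = np.array([l1 * np.cos(th1), l1 * np.sin(th1)])
            tip = elbow + [l2 * np.cos(th1 + th2), l2 * np.sin(th1 + th2)]
            # Sample a few points along each link and test them against the obstacles.
            samples = [t * elbow for t in np.linspace(0, 1, 10)]
            samples += [elbow + t * (tip - elbow) for t in np.linspace(0, 1, 10)]
            if any(np.linalg.norm(p - np.asarray(obs)) < clearance
                   for obs in obstacle_points for p in samples):
                allowed[i, j] = False
    return allowed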
3.2 The agent
The agent’s task is to learn how to optimally get to
other states within its world. Upon initialization, the
agent knows nothing about the structure of its environment; it only knows what actions it can take. The
individual states can be learned or known in advance.
The agent goes through two main phases: (i) an ex-
ploration phase and (ii) an exploitation phase. In the
Figure 2: 2-link manipulator
Table 1: Main table for storing state-goal values and
best actions. The table is empty prior to
learning.
State A B C D E
A, shortest time
A, best action
B, shortest time
B, best action
C, shortest time
C, best action
D, shortest time
D, best action
E, shortest time
E, best action
exploration phase, the agent learns the all-to-all op-
timal actions and the path distances, not the actual
paths themselves. The agent does not learn or store
the structure of the graph. The agent’s objective is:
Solve the ASAD problem such that: given any state
Sx and any goal state Sg, apply the appropriate action in Sx that will take the agent to Sg in a minimal number of steps/time.
3.2.1 Storing the best actions and shortest times
For storing the best actions and shortest times, we cre-
ate a table with size given by the number of states and
the action-space. This table could be dynamically gen-
erated through exploration if one does not have prior
information about the number of states beforehand.
No information about the transitions between states is
yet encoded in the table, as this is to be learned. The table is an Ns x Ns table, where Ns is the number of system states. Each row corresponds to a state and records two values for each of the other states in the system (columns): (i) the shortest time found so far to get from the state in row R to the state in column C, and (ii) the best action taken in R that resulted in this shortest time to C. Such a table with 5 states A-E is shown in Table 1.
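The listings in Section 3.3 refer to this structure as table. A minimal sketch of how it could be initialized is given below; the container type and the use of infinity to mean "no path found yet" are our own choices, while the attribute names follow Listings 2 and 3.

import numpy as np

nStates = 5  # number of system states, e.g. the states A-E of Figure 1

class Table:
    def __init__(self, n_states):
        # shortest_time[r, c]: fewest steps found so far from state r to state c.
        self.shortest_time = np.full((n_states, n_states), np.inf)
        # action[r, c]: the action taken in r that achieved that shortest time.
        self.action = np.full((n_states, n_states), np.nan)

table = Table(nStates)  # the table object used in Listings 2 and 3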
Table 2: Memory table.
State Time since state Last action
A
B
C
D
E
3.2.2 Memory: keeping track of previous states
and actions
In addition to the above table that eventually stores
the actual solution to the problem, we also require a
memory table. This table keeps track of (i) the num-
ber of timesteps since a state was last visited, and (ii)
the last action performed in a state. This is illustrated
in Table 2.
It is necessary for the agent to keep track of these two metrics, as they are essential for comparing trajectory lengths against Table 1 and, when a shorter path is found, for recalling the last action that was taken so that Table 1 can be updated.
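Listing 2 refers to this structure as memory; a possible initialization is sketched below. Starting the time since visited counters at infinity is an assumption of our own, chosen so that states that have never been visited cannot trigger an update of Table 1.

import numpy as np

class Memory:
    def __init__(self, n_states):
        # time_since[s]: timesteps since state s was last visited.
        self.time_since = np.full(n_states, np.inf)
        # last_action[s]: the action most recently performed in state s.
        self.last_action = np.full(n_states, np.nan)

memory = Memory(nStates)  # nStates as defined for the Table sketch above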
3.3 Exploration and exploitation phases
3.3.1 Exploration: filling in the table
The agent first explores its environment, performing novel actions in each state in order to discover alternative paths between states. Updating the table
is done through the following steps:
1. Say we are transitioning from state SA to SB. Upon arriving at SB, we iterate through all previously visited states Sx and compare the shortest path time value stored for each (Sx, SB) pair with each Sx's time since visited value. If the latter value is lower than the former, we update (Sx, SB) by replacing its shortest path time with time since visited. At the same time, we also update the shortest path action with the last action performed in state Sx, taken from memory. So, at a given timestep we are not just updating a single cell, we are updating all previously visited cells, which is what makes the algorithm so efficient.
2. Select an action in SB, and reset its associated
time since visited counter and update the last ac-
tion with the action currently being performed.
3. Transition to the next state and repeat step 1 and
2 for this new state.
Example exploration illustrated using the graph in Figure 1: State SC has stored that the fastest way to SE is in 3 steps (path SC−SB−SA−SE), and that the action taken in SC for this to occur was action 1. Later, again in state SC, action 2 is performed (due to random exploration), which takes us directly to state SE. SC now updates the storage table with the new minimum number of steps to get to SE (1 step) and the action taken in SC that accomplished this (action 2). This update is done for all previously visited states. Upon revisiting a state, a different action from the previous one is taken to ensure exploration.
The following is our Python code for the exploration
phase:
def explore():
    # nStates, steps, table, memory and the transition table tt are
    # module-level variables; npr is numpy.random, and action() selects
    # a novel action different from the one last taken in the state.
    s = npr.randint(nStates)
    for _ in range(steps):
        memory.time_since[s] = 0
        for so in range(nStates):
            if memory.time_since[so] < table.shortest_time[so, s]:
                table.shortest_time[so, s] = memory.time_since[so]
                table.action[so, s] = memory.last_action[so]
        for so in range(nStates):
            memory.time_since[so] += 1
        a = action(memory.last_action[s])
        memory.last_action[s] = a
        s = tt[s, a]

Listing 2: Python code for the exploration algorithm; tt is the transition table.
The exploration phase can easily incorporate simulated annealing Van Laarhoven and Aarts (1987), a procedure that decreases the probability of choosing a random action over the current best action as time evolves. This is a commonly used method in RL, shown to make learning more efficient by exploiting learned paths already during the exploration phase.
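A minimal sketch of one such scheme is given below: an epsilon-greedy selection in which a random action is taken with a probability that is decayed over time, and otherwise the best action currently stored for the designated goal is taken. The function name and the decay constants are illustrative choices, not a prescribed schedule.

import numpy as np

def annealed_action(s, goal, table, n_actions, epsilon):
    # With probability epsilon take a random action; otherwise take the best
    # action currently stored for the (s, goal) pair, if one is known.
    best = table.action[s, goal]
    if np.random.rand() < epsilon or np.isnan(best):
        return np.random.randint(n_actions)
    return int(best)

# Example schedule: start fully random and decay geometrically each step,
# e.g. epsilon = max(0.05, epsilon * 0.999).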
3.3.2 Exploitation: selecting optimal actions
When the system has been sufficiently trained, we utilize the resulting storage table (Table 3). In this phase, we no longer perform random actions for exploration purposes, but instead always choose the best action for the given state-goal pair. This is our exploitation code:
def exploit(s, goal):
    # Follow the stored best actions from state s until the goal is reached.
    statePath = []
    statePath.append(s)
    if table.shortest_time[s, goal] < INF:
        while s != goal:
            a = int(table.action[s, goal])
            s = tt[s, a]
            statePath.append(s)
    return statePath
Listing 3: Python code for the exploitation algorithm
4 Results
4.1 Graph world
Using our algorithm to explore the graph world in Fig-
ure 1, we obtain the results found in Table 3. Our
agent was not given the graph, nor information about
the number of states. We compare our method with the
FW algorithm, which we do need to provide with the whole
Table 3: The shortest paths found in the graph world depicted in Figure 1. Our method matches the result from the Floyd-Warshall algorithm, but without knowledge of the graph.
Our algorithm Floyd-Warshall
States A B C D E A B C D E
A 1 4 3 2 1 1 4 3 2 1
B 1 1 4 3 2 1 1 4 3 2
C 2 1 3 2 1 2 1 3 2 1
D 3 2 1 2 1 3 2 1 2 1
E 4 3 2 1 1 4 3 2 1 1
Table 4: The resulting optimal actions found after ex-
ploring the graph world depicted in Figure 1.
States A B C D E
A 1 2 2 2 2
B 1 2 1 1 1
C 1 1 2 2 2
D 1 1 1 2 2
E 1 1 1 1 2
graph. FW is known to find the optimal all-pairs dis-
tances in minimum time. We see that our method and
FW find identical solutions. The FW algorithm solved
the graph in 0.22 ms on a single CPU and our algorithm
consistently found a solution after 25 exploration steps, averaging 0.51 ms. However, the timing comparison is not entirely fair, as solving a known graph and an unknown graph are problems of different complexity.
The resulting best actions for the graph in Figure 1 are shown in Table 4. Given that the paths found above are optimal, these are the optimal actions the agent must take when a goal state is defined.
4.2 2-link manipulator
For the 2-link manipulator, our method efficiently
found the optimal paths and actions between goals.
First, we find the solutions to random 10x10 state-space worlds, as shown in Figure 3. In this environment there are 10x10 = 100 unique states. The optimality of the solution can easily be verified by inspection.
As a more realistic example, we show in Figure 4(a)
a 2-link manipulator in a workspace subject to obsta-
cles. The manipulator has arm lengths 6 and 5, and is
centered in a 24x24 workspace. Any point beyond this
is outside of the manipulator’s reach. We generate a
discretized state-space representation of this workspace
in (b) by checking every possible manipulator config-
uration for collision (both θ1 and θ2 have the range [0, 2π]). We run our exploration algorithm and show in (b) a solution path between two randomly chosen states. Again, the optimality of this path can easily be verified. The solution graph was consistently found after 60,000 steps of our explore() method, which averaged 9.4 CPU seconds. As expected, the larger the problem, the more training steps are required. Any
path between two states can now be easily navigated
using the exploit() method.
5 Discussion
As agents, we learn how to attain goals in our world, and we are quite efficient at doing so. We hypothesize that, through exploration, the brain generates an internal metric that encodes the closeness of states given the available action space. This is motivated by the brain's high degree of neural network recurrency, which enables temporal association Zipser (1991). By means of a simple state-action mapping, efficient reactive control can then be achieved. To overcome narrow AI DeepAI.org (2021), we require methods that learn more generally. We need to utilize all the information about the environment, not just update our knowledge when receiving rewards for a specific task. When the agent here explores its environment, there is no defined goal except to explore as much as possible. Such exploration, where the only goal is to find novel states, has been very successful in RL Huang and Weng (2002); Conti et al. (2018); Krebs et al. (2009), and is also biologically motivated through play Roussou (2004).
Compared to other established all-pairs shortest
path methods, we believe the method presented here
highlights biological motivation for the ASAD problem,
and how animals may use such an algorithm in reactive
control. Existing shortest path methods traditionally
require information about the complete world graph
prior to execution, and are therefore less relevant in
the agent-based context where the world is unknown.
One might be tempted to simply explore and generate
a model of the world and then run a traditional shortest
path method on this model. Such an approach would
be far less efficient as (i) we would have to store infor-
mation about the world graph, and (ii) we would have
to split the process into two steps in an offline manner.
In this case, it is difficult to know when the environ-
ment has been sufficiently explored such that we can
run the path planning algorithm. Also, if we find new
states, we would need to run the whole process over.
With our method we can continuously update the best
paths and actions, and by balancing exploration and
exploitation we can continue to have some level of ex-
ploration indefinitely. This is especially useful for very
Figure 3: Solutions to the 2-link manipulator as shown in Figure 2. Regular states are blue, prohibited states
are dark blue, the starting state is green, the goal is yellow and the path turquoise. θ1 and θ2 are represented by row and column, respectively.
(a) Workspace (b) State-space
Figure 4: Solutions to the 2-link manipulator as shown
in Figure 2. (a) the manipulator (yellow) in
its workspace. Obstacles are turquoise. (b)
the system's state-space, where the y-axis is θ1 and the x-axis is θ2. Dark blue states represent collision with an obstacle and light blue
states are allowable states. The turquoise
path shows the solution when moving from
the green state to the yellow state.
large and even dynamically changing worlds, such as the real world, where we must update our knowledge continuously. Additionally, we can take advantage of
simulated annealing in order to learn faster.
As with all tabular methods, this method has some limitations: (i) the curse of dimensionality and (ii) confinement to discrete systems. However, future investigations will use artificial neural networks to approximate the values and actions, in the same way that such networks have made deep RL algorithms so effective for large and continuous environments.
6 Conclusion
We have presented here an algorithm that efficiently
utilizes state transition feedback from unknown envi-
ronments in order to learn how to solve the ASAD
problem. This was accomplished using ideas from
shortest path solvers, Lyapunov functions and Bell-
man’s principle of optimality to generate a state-goal
value and best action table. We have addressed how RL
methods today often lead to narrow AI, and how our
method can be used for more general learning. Further-
more, we have motivated our approach as relevant for
biological reactive control. Our method enables agents
to efficiently learn optimal solutions.
References
Bellman, R. On the theory of dynamic programming.
Proceedings of the National Academy of Sciences
of the United States of America, 1952. 38(8):716. doi:10.1073/pnas.38.8.716.
Bellman, R. On a routing problem. Quar-
terly of applied mathematics, 1958. 16(1):87–90. doi:10.1090/qam/102435.
Bertsekas, D. P. Dynamic Programming and Optimal
Control. Number 3 in Athena Scientific Optimization
and Computation Series. Athena Scientific, Belmont, Mass, fourth edition, 2016.
Britannica, E. Information theory - Physiology.
https://www.britannica.com/science/information-
theory, 2020.
Camacho, E. F. and Alba, C. B. Model Predictive Con-
trol. Springer Science & Business Media, 2013.
Conti, E., Madhavan, V., Such, F. P., Lehman, J.,
Stanley, K., and Clune, J. Improving exploration
in evolution strategies for deep reinforcement learn-
ing via a population of novelty-seeking agents. In
Advances in Neural Information Processing Systems.
pages 5027–5038, 2018.
DeepAI.org. What is narrow AI? https://deepai.org/machine-learning-glossary-and-terms/narrow-ai, 2021.
Deo, N. and Pang, C.-Y. Shortest-path algo-
rithms: Taxonomy and annotation. Networks: An International Journal, 1984. 14(2):275–323. doi:10.1002/net.3230140208.
Dijkstra, E. W. et al. A note on two problems in con-
nexion with graphs. Numerische mathematik, 1959.
1(1):269–271. doi:10.1007/bf01386390.
Floyd, R. W. Algorithm 97: Shortest path.
Communications of the ACM, 1962. 5(6):345.
doi:10.1145/367766.368168.
Ford Jr, L. R. Network flow theory. Technical report,
Rand Corp Santa Monica Ca, 1956.
Freeman, R. A. and Primbs, J. A. Control Lyapunov
functions: New ideas from an old source. In Pro-
ceedings of 35th IEEE Conference on Decision and
Control, volume 4. IEEE, pages 3926–3931, 1996.
doi:10.1109/cdc.1996.577294.
Huang, X. and Weng, J. Novelty and reinforcement
learning in the value system of developmental robots.
2002.
Izquierdo-Torres, E. and Di Paolo, E. Is an Em-
bodied System Ever Purely Reactive? In M. S.
Capcarrère, A. A. Freitas, P. J. Bentley, C. G.
Johnson, and J. Timmis, editors, Advances in Ar-
tificial Life, Lecture Notes in Computer Science.
Springer, Berlin, Heidelberg, pages 252–261, 2005.
doi:10.1007/11553090_26.
Krebs, R. M., Schott, B. H., Schütze, H., and Düzel, E. The novelty exploration bonus and its attentional modulation. Neuropsychologia, 2009. 47(11):2272–2281. doi:10.1016/j.neuropsychologia.2009.01.015.
McCloskey, M. and Cohen, N. J. Catastrophic In-
terference in Connectionist Networks: The Sequen-
tial Learning Problem. In G. H. Bower, edi-
tor, Psychology of Learning and Motivation, vol-
ume 24, pages 109–165. Academic Press, 1989.
doi:10.1016/S0079-7421(08)60536-8.
Roussou, M. Learning by doing and learning through
play: An exploration of interactivity in virtual envi-
ronments for children. Computers in Entertainment,
2004. 2(1):10. doi:10.1145/973801.973818.
Schiavone, G., Großekathöfer, U., à Campo, S., and Mihajlović, V. Towards real-time visualization of a juggler's brain. Brain-Computer Interfaces, 2015. 2(2-3):90–102. doi:10.1080/2326263X.2015.1101656.
Spong, M., Hutchinson, S., and Vidyasagar, M. Robot
Modeling and Control. Wiley, 2005.
Sutton, R. S. and Barto, A. G. Reinforcement Learn-
ing: An Introduction. Adaptive Computation and
Machine Learning Series. The MIT Press, Cambridge, Massachusetts, second edition, 2018.
Toth, P. and Vigo, D. The Vehicle Routing Problem.
SIAM, 2002.
Van Laarhoven, P. J. and Aarts, E. H. Simu-
lated annealing. In Simulated Annealing: The-
ory and Applications, pages 7–15. Springer, 1987.
doi:10.1007/978-94-015-7744-1_2.
Zipser, D. Recurrent network model of the
neural mechanism of short-term active mem-
ory. Neural Computation, 1991. 3(2):179–193.
doi:10.1162/neco.1991.3.2.179.