ArticlePDF Available

Generation of Compensation Behavior of Autonomous Robot for Uncertainty of Information with Probabilistic Flow Control


Abstract and Figures

We propose a novel decision making method for autonomous robots under uncertainty of state infor-mation. This method is a modification of QMDP value method, which chooses actions of a robot by probabilistic calculations with a probability distribution that represents the uncertainty of state recog-nition. In the proposed method, named probabilistic flow control (PFC) method, the core formula of the QMDP value method is changed to another. The new formula biases the calculation so as to give a priority in the decision making to an important part of the probability distribution. We have applied PFC method to a simulated simple task of a mobile robot. In the task, the robot with PFC method shows tricky behaviors that compensate the uncertainty of state information.
Content may be subject to copyright.
March 9, 2016 Advanced Robotics ar˙prob˙flow
This is an electronic version of an article published in ADVANCED ROBOTICS,
Volume 29, No. 11, pp. 721-734, 2015.
ADVANCED ROBOTICS is available online at: DOI;
Generation of Compensation Behavior of Autonomous Robot for Uncertainty
of Information with Probabilistic Flow Control
Ryuichi Ueda
Advanced Institute on Industrial Technology, 1-10-40 Higashi Ohi, Shinagawa-ku, Tokyo, Japan
(Received 7 Aug. 2014; accepted 13 Jan. 2015; Published online: 27 Mar. 2015)
We exhibit that a robot shows strategic behaviors for overcoming uncertainty of information in navigation
tasks by a modification of the core formula in QMDP value method. The new formula biases the calculation
so as to give a priority the decision making to an important part of the state space. QMDP value method
with the modified formula is named probabilistic flow control (PFC) method. We apply PFC method to a
simulated simple task of a mobile robot. In the task, the robot with PFC method shows tricky behaviors
that compensate the uncertainty of state information.
Keywords: probabilistic flow control method; particle filters; POMDPs
1. Introduction
Uncertainty of recognition is an inevitable problem for autonomous agents in the real world, while
human beings or some animals can choose appropriate actions even if the information for decision
making is quite restricted. For example, a person can go to a bedroom in his/her house even in
darkness. Even if self-localization is impossible, the person will go to the bedroom by walking along
walls. It is interesting that this behavior will be different from that in the daytime.
If autonomous robots have this kind of flexibility, they can execute their tasks in poor information
environments. In robotics, this class of problems have been studied in the framework of POMDPs
(partial observable Markov decision processes) [1]. When we define a state space of a system in
which path planning or optimal control problems are solved, the state in each instant of time
is hidden in a POMDP. Instead, the information of the actual state is given as a probability
distribution in the state space. The distribution is called a belief or a belief state.
The behavior of a robot, argued in the example of the bedroom, has been realized by Roy et al.
[2]. In this study, a robot used range sensors for measurement of the distance from walls in an indoor
environment. The measurement results were used for self-localization. The behavior of a mobile
robot was planned beforehand in the state space spanned by four variables: the position on an X-Y
coordinate system, orientation, and an additional scalar value. The additional one represented the
extent of the uncertainty of the belief. The planned decision making policy has the property that
it kept the additional value small. As the result, the robot kept an appropriate distance form walls
for self-localization in the navigation so as not to strand. The method used in this study is called
the coastal navigation.
Methods with an extended state space, such as the coastal navigation, are called augmented
Markov decision process (AMDP) methods [3]. In an augmented state space, we can apply tradi-
tional planning methods [4–6] if the problem of calculation amount can be overcame.
Corresponding author. Email:
March 9, 2016 Advanced Robotics ar˙prob˙flow
On the other hand, a problem of information uncertainty does not have to be handled with a
planning problem. In the case of AMDP methods and in the case of recent online planning methods
for POMDPs [7–9], however, these two problems are inseparable from each other.
QMDP value method [10], which has been proposed by Littman et al. two decades ago, deals with
these two problems separately. In this method, motion of a robot is previously planned without
consideration of uncertainty of information. Instead, the uncertainty is considered in online cal-
culations. QMDP evaluates candidates of actions by using the planning result and the probability
distribution of the belief.
We have used QMDP and shown its effectiveness in RoboCup (robot soccer world cup). To make
the robot decide its action in real-time, we have utilized a particle filter for the online calculations
of QMDP value method [12, 13]. QMDP value method is suitable for real-time decision making,
once the problem of calculation amount is solved. That is because this method never executes any
planning or searching at online.
On the other hand, QMDP value method does not give such an intelligent behavior shown in the
coastal navigation. Since QMDP value method cannot consider the future obtainment of information,
this method does not give a sequence of actions that has a strategy on observation.
However, we have found this limitation on QMDP value method can be overcome by a small
modification of its core formula. We therefore exhibit that some strategic sequences of actions can
be generated by this modification in this paper. We define probabilistic flow control (PFC) method,
which is a modified QMDP value method with the formula, and apply it to navigation tasks of a
simulated mobile robot. By the simulation, the modification is evaluated, and we can see interesting
behaviors of the robot generated by the modification.
This paper consists of the following sections. The problem is given in Section 2. The modification
and PFC method are proposed in Section 3. In Section 4, we observe and evaluate behaviors of
a robot generated by PFC method by simulation of navigation tasks. Though the tasks are very
simple, the robot shows interesting behaviors in the tasks so as to overcome lack of information
with PFC method. We conclude this paper in Section 5.
2. Problem
We formulate a problem discussed in this paper. An autonomous robot has a task in a time-invariant
system. In the system, the robot knows how to behave if the state of the system is known, while
the information of the state is just given partially.
2.1. State, Action, and State Transition
A state of the system and an action are represented as xand arespectively. We define a set of
states: Xand a set of final states: Xf X . When the state of the system is x∈ Xf, we consider
that the task is successfully finished. We also define a set of actions: A≡{a1, a2, . . . , am}. The
number of actions should be finite in the proposed method.
We define a sequence of discrete time steps: {0,1,2, . . . , tf} ≡ T . Time goes to the next step when
an action is executed, t= 0 and t=tfare the start and the finish time of the task respectively. tf
is not fixed.
When we need to write ‘the state at time t’ or ‘the action chosen at time t’, we write them as x(t)
and a(t) respectively. With this notation, a state transition can be represented by the following
three symbols:
x(t): a prior state at t,
a(t): an action chosen at t, and
x(t+ 1): a posterior state at t+ 1.
Since we handle a time-invariant system, the concrete value of tis not important in many cases
March 9, 2016 Advanced Robotics ar˙prob˙flow
for discussion. In that case, we write x(t), a(t), and x(t+ 1) as x,a, and xrespectively.
We assume that each state transition is noisy. If some trials of state transitions are attempted
from (x, a), the posterior states xare different in each trial. We assume that we can measure the
following probability:
P(X) = P(xX|x, a),
where x X /Xf,a∈ A, and X X . Moreover, we consider a density pa
xx=p(x|x, a)∈ ℜ,
which satisfies
P(X) = xX
2.2. Control
The purpose of the task is given by the following summation:
J[x(0), a(0),x(1), a(1), . . . , x(tf1), a(tf1),x(tf)] = V[x(tf)] +
where ra
xx∈ ℜ is the cost of a state transition. Time tis hidden in this symbol ra
xx. The sequence
of state transitions that minimize Jis regarded as an optimal control. V(x) (x∈ Xf) is a value of a
final state. Jis sometimes defined as it should be maximized. In this paper, however, we minimize
the value. When we want to minimize the total elapsed time to finish the task, for example, ra
will be time of a state transition, and V(xf)0 (xf∈ Xf).
Under some conditions, we can expect the existence of the optimal policy:
Π:X → A (3)
toward Eq. (2). Πgives the best action a= Π(x) at any state xto minimize J. The function
V:X → ℜ that gives the expectation value of Jfor each state is called the optimal value function.
The value V(x) is not changed whether xis the initial state or a halfway state. The values of
final states V(xf) are included in this function.
To solve Πor Vfrom Eq. (2), algorithms classified in dynamic programming [6] have been
utilized. Methods of reinforcement learning [14] and online path planing methods [4] also solve
subsets of this control problem.
In this paper, we assume that the robot knows the optimal value function V. Moreover, it knows
the state transition model pa
xxand the cost ra
xxof the task. In this case, the optimal policy can
be derived as
Π(x) = argmin
a∈A x∈X
xxV(x) + ra
2.3. State Recognition
Though the robot knows the optimal value function V, it cannot observe the actual state x.
Instead of direct observations, the robot has and updates the belief of the state at each time step
[3]. The belief is represented by a probability density function b:X → ℜ. The belief gives a
March 9, 2016 Advanced Robotics ar˙prob˙flow
probability that the actual state is in X⊂ X as
P(xX) = xX
In this equation, time tis hidden and omitted. b(x) means that the density of the probability
distribution at x(t).
We assume that the belief bis calculated by a Bayes filter [3]. In Section 4, a particle filter [15]
is used for the implementation of the Bayes filter.
The Bayes filter changes the belief when the robot takes an action or it obtains information.
After the action is done by the robot, the belief is updated by
b(x) = x∈X
This equation represents the flow of the probability distribution of bby a state transition.
When the robot obtains useful information for the state estimation from a sensor, the function
bis updated based on the Bayes theorem [16]:
b(x|ι) = q(ι|x)b(x)
x′′∈X q(ι|x′′ )b(x′′)dx′′ ,(7)
where ιrepresents some sort of information that contributes to the state estimation. b(x|ι) is a
value of a belief after ιis considered. q(ι|x) is a value of a likelihood function, which translates
information ιto information in the state space. Likelihood functions are implemented to the robot
as previous knowledge. After the calculation of Eq. (7), b(x) is overwritten with b(x|ι) as
2.4. Assumption of Goal Recognition
Here we give an important assumption required for the proposed method. We assume that the
robot can detect whether the task is finished or not without the belief b.
In the case of a robot navigation problem, this assumption means that someone tells the robot
whether the robot is in the goal or not. Otherwise the robot has a special sensor for detecting the
This assumption can be used for state estimation. We can define a likelihood function:
q(not finished|x) = 0 (x∈ Xf)
1 (otherwise) ,(9)
if someone tells the robot whether the task is finished or not accurately.
3. Probabilistic Flow Control
We propose a modification of an important formula used in QMDP value method. Though the
modification is slight, the motion of robots is changed to a large degree.
We name the modified QMDP value method that uses the novel function probabilistic flow control
(PFC) method. That is because this method makes the probability distribution of the belief flow
into the area of final states by using the assumption in Eq. (9).
March 9, 2016 Advanced Robotics ar˙prob˙flow
3.1. QMDP value method
QMDP value method chooses an action from a belief based on the following function:
ΠQMDP (b) = argmin
QMDP(a, b), where (10)
QMDP(a, b) = x∈X
xxV(x) + ra
If bin Eq. (11) is the Dirac delta function, Eq. (11) is identical to Eq. (4). It means that the output
of QMDP value method is identical to that of Eq. (4) when the state is known deterministically.
We show a two dimensional tile world in Fig. 1 as an example for understanding of Eq. (11). The
robot must go to the goal with the minimum number of actions in this task. A robot belong to a
cell in a time step. The robot can move to one of the adjoining cells with one of the four actions:
left, right, up, down as shown in Fig. 1(a). State transitions by the actions are deterministic. It
means that pa
xx1. The task of the robot is given with ra
xx1 and the value of Vis zero at the
goal. The numbers in some calls are the probabilities that the robot exists in each cell. The robot
cannot observe its position directly. Instead, it knows the probabilities. The robot also knows the
value function V, which gives a number of steps from any cell to the goal. Though the robot has
no sensor, it is told whether the task is finished or not after each action.
Figure 1. 2-dimensional tile world
In the case of Fig. 1(a), action down is chosen. Equation (11) gives
QMDP(down, b) =
each cell
probability [(distance to goal after down) + cost]
= 0.25(2 + 1) + 0.25(1 + 1) + 0.25(0 + 1) + 0.25(1 + 1) = 2,(12)
where bmeans the probability distribution in Fig. 1(a). The values with the other actions are
QMDP(lef t, b) = QMDP(up, b) = 4 and QMDP(right, b) = 3.
A drawback of QMDP value method can be explained with the case shown in Fig. 1(b-1)-(b-3).
Both in the three cases, the value of QMDP is minimized not only with right but also with left. If
March 9, 2016 Advanced Robotics ar˙prob˙flow
the action of the robot is chosen randomly from the two actions, the robot uselessly takes random
3.2. Probabilistic flow control method
This kind of random walk as shown in the above example will be broken when the probabilities
are slightly changed to 0.49 and 0.51, for example. Some kinds of heuristics will also be effective.
However, we can remove it with a subtle modification of QMDP value method.
Our idea is that the robot should choose left based on the right cell with probability 0.5 in the
case shown in Fig. 1(b-1) simply because the right cell is nearer to the goal than the left cell with
probability 0.5. The ‘distance’ between each cell to the goal is represented by V. Figure 2 shows a
sequence of state transitions that we want to realize. In (a) and (b), the robot goes to left. These
judgments are wrong on a temporary basis. In (c), however, the probability 0.5 that exists in a
right side cell is changed to zero at the goal by Eq. (9). After that, the probability converges to
the cell where the robot exists. The robot will go to the goal with action right.
Figure 2. Decision making that is realized by the proposed method
To actualize this idea, we only have to modify Eq. (10) and (11) as
ΠPFC(x) = argmin
QPFC(a, b), where
QPFC(a, b) = x∈X /Xf
V(x)Vmin x∈X
xxV(x) + ra
where Vmin = minx∈X V(x) and we assume that x X /XfV(x)> Vmin. We name the decision
making or control method with Eq. (13) probabilistic flow control (PFC) method.
In the case of Fig. 2(a), the values of QPFC are calculated as follows:
QPFC(lef t, b) = 0.5
3(4 + 1) + 0.5
2(1 + 1) = 1.333..., and
QPFC(right, b) = 0.5
3(2 + 1) + 0.5
2(3 + 1) = 1.5.
By the denominator V(x)Vmin, the right cell is given a priority. As the result, action left is
March 9, 2016 Advanced Robotics ar˙prob˙flow
4. Generation of Behavior with PFC method and Its Evaluation in Simulation
We observe behaviors of a robot generated by PFC method with three kinds of tasks in simulation.
In the tasks, the robot must overcome three kinds of information uncertainty respectively.
4.1. Navigation task with only one point landmark
We prepare an environment shown in Fig. 3. Since there is only one point landmark, the robot
cannot calculate its pose deterministically. The width of this square environment is 4000[mm]. We
define an X-Y coordinate system, Σ, whose origin is in the center of the room. Its x-axis and y-axis
are parallel to the walls of the room.
The shape of the robot is a circle whose radius is 50[mm]. The pose of the robot is represented
as (x, y, θ). (x, y ) is the position of the center of the robot on Σ, and θis its direction.
A landmark is placed on the origin of Σ. The robot can run through the landmark. A goal point
is set at (x, y) = (0,200)[mm]. The task is finished when the robot touches the goal point with its
Figure 3. Environment
4.1.1. Motion
The robot chooses one action from the following action set A={ccw, cw, f w}, where
ccw: rotation of (5 + 0.5σ)[deg],
cw: rotation of (5 + 0.5σ)[deg], and
fw: forward movement of (10 + σ)[mm] to θdirection.
σis a noise generated from a normal distribution N(0,1). We write this choice as σ N (0,1).
N(µ, Σ) denotes a normal distribution whose average and variance are µand Σ respectively. We
assume that the robot knows this motion model.
March 9, 2016 Advanced Robotics ar˙prob˙flow
4.1.2. Representation of Belief: Particle Filter
The robot has a particle filter for self-localization. The particle filter has a particle set:
Ξ = {ξ(i)= (x(i), w(i))|i= 1,2, . . . , N }.
x(i)= (x(i), y(i), θ(i)) and w(i)are the state and the weight of a particle ξ(i)respectively. Nis the
number of particles, which is fixed to 1000.
From the set of particles, Eq. (13) is calculated as:
QPFC(a, b) =
V(x(i))Vmin V(x) + ra
where xpa
A distribution of these particles approximates a probability distribution of a belief. We show an
example in Fig. 4. Each dot is a particle, which has its pose and weight. The poses of particles can
be regarded as candidates of the actual pose of the robot. This distribution is obtained at a trial
in Sec. 4.1.7.
Figure 4. A distribution of particles shown in a trial
At the beginning of each trial, the particle are placed randomly in the state space. It means that
the robot has no information of its pose at t= 0.
4.1.3. Update of Particles on Action
Just after a state transition from time tto time t+1, the particle filter updates Ξ so as to represent
Eq. (6). When the new and old particle sets are denoted by Ξ(t+ 1) and Ξ(t) respectively, the
procedure is as follows for i= 1,2, . . . , N :
nparticles are chosen from Ξ(t) with probabilities proportional to their weights,
the particles are moved based on the state transition model,
their weights are set to 1/N , and
March 9, 2016 Advanced Robotics ar˙prob˙flow
they are added to Ξ(t+ 1).
When the number of particles in Ξ(i+ 1) is smaller than N, particles that have random poses
and weight 1/N are added to Ξ(i+ 1). This implementation is an irregular way compared to
the implementations in [3]. Though we thought that the random particles made the particle filter
robust, they did no harm nor good. That is why random particles are shown in some figures of this
4.1.4. Observations given by the simulator and procedures on the robot
When information is obtained, the weight of each particle is overwritten by the following procedure:
w(i)η1w(i)q(ι|x(i)), where (15)
This procedure is an approximation of Eq. (7). q(ι|x(i)) is replaced to one of the likelihood functions
in Eq. (17) or Eq. (18), which are explained below.
The robot observes the point landmark at every five time steps. The simulator gives a measure-
ment ([mm], φ[deg]) by one observation. and φare the distance and the direction of the landmark
from the robot respectively. They are generated with noises based on the following model:
N (,(0.1)2) and φ N (φ,102),
where and φare the actual distance and direction respectively. The robot cannot observe the
landmark when <50[mm].
Each measurement is reflected to particles through a likelihood function:
q(ℓ, φ|x(i)) = N((i),(0.1)2)N(φφ(i),102),(17)
where ((i), φ(i)) are the direction and the orientation of the landmark calculated from x(i), which
is the state of a particle.
Moreover, the simulator gives information whether the task is finished or not to the robot after
every step. Based on Eq. (9), the robot can use the following likelihood function:
q(not finished|x(i)) = 105(x(i)∈ Xf)
1 (otherwise) .(18)
This equation means that weights of particles at final states are reduced to nearly zero. We do
not make the weights zero since some parts of code of the particle filter are complicated by the
existence of zero weight particles.
4.1.5. Resetting
Generally, particles frequently converges around some points which are not the actual state. Since
Bayes theorem does not have the ability to recover this situation, resetting algorithms are used for
the recover [17, 19].
In this simulation, we use the sensor resetting algorithm [17]. When ηin Eq. (16) is smaller than
a threshold value ηth after a measurement of the landmark, all particles are replaced in the state
space to represent a probability distribution that is proportional to the right side of Eq. (17). We
have found that ηth = 106is an appropriate value through some test trials, and we use this value.
March 9, 2016 Advanced Robotics ar˙prob˙flow
4.1.6. Value Function and Penalty
We define that a cost ra
xx= 1 is given at each state transition. To minimize the sum of this penalty,
the robot must finish the task as small number of steps as possible.
A value function that is given to the robot is given as
V(x) =
0 (otherwise)
where φGand Gare the direction and the distance of the goal from the robot respectively. 50 is
the radius of the robot. This value function is quasi-optimal. The actions obtained from this value
function will be slightly different from the actions obtained by the optimal one. However, we think
that this difference is trivial in the simulation.
4.1.7. Generation of a kind of search behavior
At first, we observe the behavior of the robot with PFC method. 10 trials are tried from an initial
pose: (1000[mm],0[mm],90[deg]) . We show a trajectory of the 9th trial in Fig. 5. The other
trajectories are shown in Fig. 6.
In the 9th trial and in some trials in Fig. 6, the robot turns around the landmark before it enters
the goal. This behavior is entirely different from the behavior derived from the value function in
Eq. (19).
Figure 5. A trajectory obtained by PFC method (the 9th trial)
We explain why this behavior is generated in the 9th trial with Fig. 7. The poses of the robot
and the distributions of particles in some time steps are illustrated. We give a mark ‘*’ to an area
of particles that are given large priority for decision making by Eq. (13).
In Fig. 7(a), the robot goes to the direction of the landmark since the value of QPFC simply
reduces with this behavior at this stage. This tendency changes in Fig. (b). The area in which
particles have priority appears in the bottom left of the figure. The particles in this area gravitate
toward the goal. From (c) to (g), the actions of the robot are chosen based on the particles in
the left side area of the goal. The particles near the goal go into the goal and are erased by the
likelihood function in Eq. (18). The circular behavior of the robot is generated by this repetition
of erasing. If the actual pose of the robot is not apart from the distribution of particles, the pose
of the robot reaches to the goal with the particles around its pose within some time steps. In
(h), a sensor resetting happens. Though the particles are distributed, the robot can still choose
appropriate actions. That is because the area that has ’*’ is not changed.
March 9, 2016 Advanced Robotics ar˙prob˙flow
Figure 6. Trajectories obtained by PFC method
Figure 7. Distributions of particles in the 9th trial
4.1.8. Comparison between QPFC and QMDP
We then evaluate QPFC method quantitatively. This evaluation aims to compare the formulae in
Eq. (11) and Eq. (13) under the same conditions. Though we compare PFC method to QMDP value
method, it does not only mean that our method is compared to an old method. The idea contained
in Eq. (11) has a possibility to be utilized with the other POMDP methods.
We have 100 trials with each method and count the number of trials in which the robot can go
to the goal within 1000[step]. In some preliminary trials, the robot is trapped by one or some fatal
deadlocks if the task is not finished within 1000[step]. The word deadlock means that the robot
alternately chooses action cw and ccw in many time steps. 100 initial poses of the robot are chosen
randomly. Each method is evaluated 100 times with these initial poses.
Moreover, we have the other sets of 100 trials with the following conditions.
March 9, 2016 Advanced Robotics ar˙prob˙flow
The robot knows the actual pose xand uses xfor decision making.
The robot calculates the average pose ¯
x= (¯x, ¯y, ¯
θ) of particles, and uses ¯
xfor decision
The result is shown in Table 1. The robot can reach to the goal in all trials with the actual pose
x. On the other hand, all of the trials are failed with the average pose ¯
x. That is because the
average pose of the particles shown in Fig. 4, for example, makes no sense.
QMDP value method can make effective decisions compared to the decision making with ¯
x. The
success rate is 27 [%]. In almost failure trials with QMDP value method, the robot is trapped by a
deadlock near the landmark.
PFC method substantially improves the success rate to 96[%]. The mechanism as we have shown
in Fig. 7 works effective in this simulation. When we consider each particle as a hypothesis, PFC
method has a tendency to try the hypotheses from a convenient one to finish the task. QMDP value
method does not have this kind of ordering ability. In the trials with QMDP method, the robot
never takes the cyclic trajectory around the landmark.
Table 1. Comparison of PFC, QMDP, and some other situations
methods successful avg. numbers of steps avg. numbers of steps
trials in successful trials in all trials
PFC 96% 392[step] 416[step]
QMDP 27% 361[step] 828[step]
decision based on x100% 176[step] 176[step]
decision based on ¯
x0% - -
Figure 8. A deadlock in a trial with PFC method
Not all of the trials are successful with PFC method. We show a deadlock invoked by PFC
method in Fig. 8. Deadlocks with PFC method occur with asymmetric distributions of particles
as shown in this figure. Actually, short-time deadlocks have occurred frequently in the trials with
PFC method. However, they have been solved in many cases when distributions of particles have
been slightly changed by measurements of the landmark. It seems that the deadlocks caused by
PFC method are more unstable than those by QMDP value method at least in this task.
4.2. Evaluation without landmark
We evaluate PFC method under another kind of uncertainty. The task of the robot in this simulation
is the same the previous one. However, the robot cannot observe the landmark. It means that
Eq. (17) is not applied to the particles. Instead, the particles are placed around the actual pose of
the robot at the start of a trial, as shown in Fig. 9 (step 1). When the robot gets near to the goal,
March 9, 2016 Advanced Robotics ar˙prob˙flow
the distribution of the particles is larger than the area of the goal as shown in Fig. 9 (step 90). The
robot must search the goal though it does not have enough information to reach the goal directly.
In each trial of this simulation, the robot starts from (x, y, θ) = (1000[mm],0[mm],90[deg]).
Each particle is initially placed at (x, y, θ) that is chosen by the following procedure:
(x, y, θ) = (1000 + Rcos ω[mm], R sin ω[mm],90 + δ[deg]) (20)
where, R[0, r), δ (0.1r, 0.1r),and ω[0,360).
These distributions in Fig. 9 are obtained by r= 200.
Figure 9. Distributions of particles in a trial (r= 200)
We choose r= 50,100,200,300 and 400. We count the number of successful trials in 100 trials
for each value of rand for each method. 500[step] is the cutoff point.
Figure 10. Success rates in the simulation without landmark observation
Figure 10 shows the result. In the range 100 r500, PFC method shows a clear advantage
over QMDP value method. In many cases with QMDP method, deadlocks happen when the robot is
near the goal. Figure 11 shows a distribution of particles and the robot in a deadlock. If the robot
chooses action fw, it can comes closer to the goal. However, action fw is given a low evaluations by
Eq. (11) due to the particles in the upper part of this figure.
We show the first six trajectories of simulation with PFC method and r= 200 in Fig. 12. When
r= 200, the success rate was 95[%]. In the 4 6th trials, the robot traces spiral patterns so as to
sweep the particles.
March 9, 2016 Advanced Robotics ar˙prob˙flow
Figure 11. Deadlock in front of the goal (r= 200, QMDP value method)
Figure 12. Trajectories generated by PFC method (r= 200)
4.3. Evaluation in an corridor environment that causes perceptual aliasing
Finally, we evaluate the ability of PFC method toward perceptual aliasing [3] in a symmetric
environment. The environment is illustrated in Fig. 13(a). The landmark is removed from the
environment in Fig. 3, and six walls are placed. The goal area is changed to a 100[mm] ×100[mm]
square. The robot cannot come within 50[mm] of the wall due to its radius.
As shown in Fig. 13(b), we divide the environment into eight ‘rooms.’ The rooms are classified
into three types from their shapes. Every five steps, the robot obtains a measurement xT, which
denotes a relative pose at a ‘type T room.’ When we represent the room where the robot exists
as R, the robot cannot know Rfrom one measurement. Instead, the robot knows whether T is
A, B, or C. The standard deviations of the noises on xTare 100[mm] on the position (x, y) and
10[deg] on the orientation θ.
The motion of the robot is based on the definition in Sec. 4.1.1. When the robot collides with a
wall or the edge of the environment, the robot is replaced to the pose before the motion. This is a
rule for simplification.
4.3.1. Modification of particle filter
The number of particles are increased from N= 1000 to N= 10000. In this environment, a cluster
of particles is easily disappeared by some resampling procedures when N= 1000. We can choose
others implementations that are suitable for tracking multiple hypotheses (e.g. [18]). In this case,
there is a possibility that PFC method can deal with more severe perceptual aliasing. It should
be tried in future. In this paper, we compare methods with the native particle filter. We show
processing time in each procedure in Table 2 for reference.
At the start of a trial, the particles are placed into all rooms that have the identical type with
March 9, 2016 Advanced Robotics ar˙prob˙flow
Figure 13. An environment with corridors in which perceptual aliasing happens
Table 2. Time for processing each procedure
(N= 10000, on an Intel Xeon CPU E5-2670
procedure time[ms]
observation 3.3
resampling 0.4
calculation of QPFC or QMDP 2.0
reset 6.83
R. An example of the initial placement of particles is shown in Fig. 13(a). Noises are added to
their poses in a like manner in Eq. (20) with r= 100.
After a measurement of xT, the following likelihood function is applied to particles:
q(xT|x(i)) = 0 (x(i)does not belong to type T room)
N(epos,1002)N(eθ,102) (otherwise),(21)
where epos[mm] and eθ[deg] are difference of position and orientation between xTand x(i).
Equation (18) is applied to the particles that enter in the goal at every step. In this simulation,
we have used zero instead of 105in Eq. (18). We use another likelihood function when the robot
chooses fw. When the robot moves with action fw and does not collide with a wall or an edge of
the environment, zero is multiplied to the weights of particles that collide with one of them. When
the robot cannot move due to a collision, the particles are not moved.
The threshold for a reset is set to ηth = 1030. The error of xTreaches about one meter on XY-
space when ηbecomes smaller than 1030. By a reset, the particles are replaced with the procedure
at the initial placement of particles.
4.3.2. Discretization for motion planning and calculation of QPFC
To obtain the optimal value function of this task, we use value iteration [6]. We divide the state
space into 115,200 cells, as shown in Table 3. A cell is enumerated as sm(m=k+ 72j+ 40 ·72i)
March 9, 2016 Advanced Robotics ar˙prob˙flow
with the indexes in Table 3.
At value iteration, the values of the cells in the final state region are fixed to zero. Then the
following calculation is repeatedly applied to the other cells:
ss[V(s) + 1],(22)
where 1 is the cost for an action, and Pa
ssis a state transition probability that a continuous state
xmoves from sto sby action a. After the convergence, each value of V(s) becomes the expected
number of steps to the goal.
As the above procedure suggests, we must translate pa
xxin the continuous space to Pa
ssin the
discrete state space. We solve every Pa
sswith the following geometric calculation:
ss=volume of (ssa)
volume of s,(23)
where sais a posterior region of sthat is moved by an action ain the continuous space X. In this
calculation, we do not consider the noise of state transitions for simplicity. The state transitions
that has more than zero possibility of a collision with the walls or the edge of the environment are
removed from the set of state transitions.
We show a part of the calculation result in Fig. 14. In this figure, the larger the value of Vis,
the darker the cell is drawn. The calculation time is 209[s] with a simple implementation of value
iteration on an Intel Xeon CPU E5-2670 (2.50[GHz]). If we want to reduce this calculation time,
other methods for Markov decision processes should be applied to the problem. PFC method does
not restrict methods for obtaining V.
Table 3. Discretization of the state space
axis intervals number of intervals
x[1000i2000,1000i1900)[mm] (i= 0,1,...,39) 40
y[1000j2000,1000j1900)[mm] (j= 0,1,...,39) 40
θ[k, k + 5)[deg] (k= 0,1,...,71) 72
Figure 14. A part of Value Function (cells with 0 θ < 5[deg])
March 9, 2016 Advanced Robotics ar˙prob˙flow
We modify the calculation procedure of QPFC because the value function Vis represented on a
discrete state space. We use the following calculation:
QPFC(a, b) =
s(i)sV(s) + 1,(24)
where s(i)is a discrete state to which x(i)belongs.
4.3.3. Comparison of success rate and steps
We have compared the methods that are evaluated in Sec. 4.1.8, and have obtained the result
shown in Table 4. 100 initial poses of the robot are chosen randomly before trials and each method
is evaluated 100 times with these initial poses.
As shown in Table 4, PFC method has yielded a better result than QMDP value method both
in the success rate and the number of steps. We show the first five trials of the two methods in
Fig. 15. In some trials with both of the methods, we observe that the robot moves from room to
room (trial 1, 3, 4 with PFC method and trial 2, 3 with QMDP value method). As shown in Fig. 16,
the robot takes actions based on one of clusters. Each cluster of particles can be regarded as a
hypothesis in the context of multiple hypothesis tracking [18]. When a hypothesis is discarded,
another is checked. This procedure generates the patrol behavior.
In trials with QMDP value method, the robot sometimes steps away from the goal even though it
is near the goal (trial 1, 2, and 5 in Fig. 15). This ineffective behavior makes the average number of
steps increase. This goal avoidance behavior happens when the cluster around the goal is reduced
by Eq. (18). Since the cluster around the goal loses its weights, the robot takes actions based on
the particles in other clusters and leaves from the goal.
Figure 15. Trajectories obtained by PFC and Q-MDP
4.4. Discussion
The behaviors of the robot generated by PFC method are evaluated with three types of tasks in
Sec. 4.1-4.3. In every task, PFC method yields a better result than QMDP value method due to the
ability of Eq. (13). Here we discuss what has been overcome by Eq. (13) in each task. From the
discussion, we refer to the possibilities of Eq. (13) in other cases that are not tried in this paper.
March 9, 2016 Advanced Robotics ar˙prob˙flow
Figure 16. Trajectory on 4th trial of PFC method. The dotted arrows show the motion of the robot roughly. Particles bounded
by a circle have large weights for decision making.
Table 4. Comparison of PFC, QMDP, and some other situations in the corridor environment
methods successful trials avg. numbers of steps avg. numbers of steps
(cutoff: 5000[step]) in successful trials in all trials
PFC 91% 832[step] 1208[step]
QMDP 80% 1659[step] 2328[step]
decision based on x99% 268[step] 317[step]
decision based on ¯
x25% 641[step] 3910[step]
In Sec. 4.1, the robot can finish the task that is defined in three dimensional space though the self-
localization result has only two dimensional information during a certain time of the task. The lack
of one dimensional information is covered by the cyclic behavior of the robot. If this phenomenon
is shown in more high-dimensional cases, Eq. (13) will be effective for multi-agent systems that are
handled in the Dec-POMDP (decentralized POMDP) framework [20], for example. Since each agent
does not have information of others in this framework, the dimension of the available information
has a smaller dimension than that of the state space. However, we need further experiments so as
to investigate this possibility.
In Sec. 4.2, we evaluate the methods with the case where the uncertainty of self-localization is
beyond the requirement to reach the goal. In [11], it is reported that QMDP value method can make
a goalkeeper robot for RoboCup wait for additional information on self-localization when the robot
does not have enough information to enter its goal without collision. The deadlocks by QMDP value
method in Sec. 4.2 correspond to this waiting behavior. A robot with PFC method is more active.
It takes a goal search behavior even if the information is insufficient. A lack of information near an
end of a task sometimes happens in applications of actual robots. PFC method or Eq. (13) will be
effective in this case.
Another possible application of Eq. (13) is the utilization for some learning methods as Sarsa(λ)
or Q(λ) in [14] when these methods are applied to POMDPs. In these methods, a robot chooses its
action based on an action-value function at leaning phases. Eq. (13) can be applied to the decision
making even if the action-value function is not converged.
In Sec. 4.3, PFC method with Eq. (13) makes the robot move from room to room so as to check
more than one hypotheses of the poses sequentially. PFC method gives a simple but appropriate
criterion when one hypothesis should be discarded and another should be verified. It is significant
that we can represent this criterion only with one addition of division by Vto Eq. (11).
March 9, 2016 Advanced Robotics ar˙prob˙flow
5. Conclusion
We have modified a formula used in QMDP value method, and proposed probabilistic flow control
(PFC) method. PFC method has been applied to simulated POMDP problems in which the state
of the robot has been essentially uncertain. We have found the following knowledge.
PFC method, which uses Eq. (13), is more effective than QMDP method in the three simulated
conditions. The success rates are 96[%] with PFC method and 27[%] with QMDP value method
in the simulation with one landmark. In the simulation without landmark, the limit value
of QPFC method to keep more than 50[%] of success is r= 500, while that of QMDP value
method is r= 100. In the simulation with the corridor environment, the average number of
steps with PFC method is half of that with QMDP method.
The robot with PFC method shows tricky behaviors that are never intended in a value
function. In the metaphorical sense of the word, the robot changes its behavior in accord
with the uncertainty of information.
As a future work, uses of Eq. (13) with various POMDP methods should be investigated as men-
tioned in Sec. 4.4.
[1] Kaelbling L, et al. Planning and acting in partially observable stochastic domains. Artificial Intelligence.
[2] Roy N, et al. Coastal Navigation - Mobile Robot Navigation with Uncertainty in Dynamic Environ-
ments. In: Proc. of IEEE ICRA; 1999. p. 35–40.
[3] Thrun S, et al. Probabilistic ROBOTICS. MIT Press; 2005.
[4] Latombe J.C. Robot Motion Planning. Boston, MA: Kluwer Academic Publishers; 1991.
[5] LaValle S.M., Kuffner J.J. Randomized Kinodynamic Planning. In: Proc. of IEEE International Con-
ference on Robotics and Automation; 1999. p. 473–479.
[6] Bellman R. Dynamic Programming. Princeton, NJ: Princeton University Press; 1957.
[7] Silver D, Veness J. Monte-Carlo Planning in Large POMDPs. In: NIPS; Vol. 23; 2010. p. 2164–2172.
[8] Bonet B, Geffner H. Solving pomdps: Rtdp-bel vs. point-based algorithms. In: IJCAI; 2009. p. 1641–
[9] Ong S.C., et al. Planning under uncertainty for robotic tasks with mixed observability. The International
Journal of Robotics Research. 2010;29(8):1053–1068.
[10] Littman M.L., et al. Learning Policies for Partially Observable Environments: Scaling Up. In: Proceed-
ings of International Conference on Machine Learning; 1995. p. 362–370.
[11] Ueda R, et al. Mobile Robot Navigation based on Expected State Value under Uncertainty of Self-
localization. In: Proc. of IROS; 2003.
[12] Ueda R, et al. Real-Time Decision Making with State-Value Function under Uncertainty of State
Estimation. In: Proc. of ICRA; 2005.
[13] Jitsukawa Y, et al. Fast Decision Making of Autonomous Robot under Dynamic Environment by Sam-
pling Real-Time Q-MDP Value Method. In: Proc. of IROS; 2007. p. 1644–1650.
[14] Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press;
[15] Fox D, et al. Particle Filters for Mobile Robot Localization. A Doucet, N de Freitas, and N Gordon,
editors, Sequential Monte Carlo Methods in Practice; 2000. p. 470–498.
[16] Doucet A, et al. Sequential Monte Carlo Methods in Practice. New York, NY: Springer-Verlag; 2001.
[17] Lenser S, Veloso M. Sensor resetting localization for poorly modelled robots. In: Proc. of IEEE ICRA;
2000. p. 1225–1232.
[18] Jensfelt P, Kristensen S. Active Global Localization for a Mobile Robot Using Multiple Hypothesis
Tracking. IEEE Trans. on Robotics and Automation. 2001;17(5):748–760.
[19] Ueda R, et al. Expansion Resetting for Recovery from Fatal Error in Monte Carlo Localization –
Comparison with Sensor Resetting Methods. In: Proc. of IROS; 2004. p. 2481–2486.
[20] A¸sık O, Akın L. Solving Multi-agent Decision Problems Modeled as Dec-POMDP: A Robot Soccer
March 9, 2016 Advanced Robotics ar˙prob˙flow
Case Study. RoboCup 2012: Robot Soccer World Cup XVI Lecture Notes in Computer Science Volume
7500; 2013. p. 130-140.
Conference Paper
We propose a novel method, a particle filter on episode, for decision makings of agents in the real world. This method is used for simulating behavioral experiments of rodents as a workable model, and for decision making of actual robots. Recent studies on neuroscience suggest that hippocampus and its surroundings in brains of mammals are related to solve navigation problems, which are also essential in robotics. The hippocampus also handle memories and some parts of a brain utilize them for decision. The particle filter gives a calculation model of decision making based on memories. In this paper, we have verified that this method learns two kinds of tasks that have been frequently examined in behavioral experiments of rodents. Though the tasks have been different in character from each other, the algorithm has been able to make an actual robot take appropriate behavior in the both tasks with an identical parameter set.
Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of machine learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics -- trial-and-error search and delayed reward -- are the most important distinguishing features of reinforcement learning. Reinforcement learning is both a new and a very old topic in AI. The term appears to have been coined by Minsk (1961), and independently in control theory by Walz and Fu (1965). The earliest machine learning research now viewed as directly relevant was Samuel's (1959) checker player, which used temporal-difference learning to manage delayed reward much as it is used today. Of course learning and reinforcement have been studied in psychology for almost a century, and that work has had a very strong impact on the AI/engineering work. One could in fact consider all of reinforcement learning to be simply the reverse engineering of certain psychological learning processes (e.g. operant conditioning and secondary reinforcement). Reinforcement Learning is an edited volume of original research, comprising seven invited contributions by leading researchers.
In Chapter 1 we introduced configuration space as a space in which the robot maps to a point. The mathematical structure of this space, however, is not completely straightforward, and deserves some specific consideration. The purpose of this chapter and the next one is to provide the reader with a general understanding of this structure when the robot is a rigid object not constrained by any kinematic or dynamic constraint. This chapter mainly focuses on topological and differential properties of the configuration space. More detailed algebraic and geometric properties related to the mapping of the obstacles into configuration space will be investigated in Chapter 3.
Obstacles in the workspace W map in the configuration space C to regions called C-obstacles. In Chapter 2 we defined the C-obstacle CB corresponding to a workspace obstacle B as the following region in C:$$CB = \{ q \in C/A(q) \cap B \ne 0\} $$.
Robot soccer is one of the major domains for studying the coordination of multi-robot teams. Decentralized Partially Observable Markov Decision Process (Dec-POMDP) is a recent mathematical framework which has been used to model multi-agent coordination. In this work, we model simple robot soccer as Dec-POMDP and solve it using an algorithm which is based on the approach detailed in [1]. This algorithm uses finite state controllers to represent policies and searches the policy space with genetic algorithms. We use the TeamBots simulation environment. We use score difference of a game as a fitness and try to estimate it by running many simulations. We show that it is possible to model a robot soccer game as a Dec-POMDP and achieve satisfactory results. The trained policy wins almost all of the games against the standard TeamBots teams, and a reinforcement learning based team developed elsewhere.
1 Introduction and Overview.- 2 Configuration Space of a Rigid Object.- 3 Obstacles in Configuration Space.- 4 Roadmap Methods.- 5 Exact Cell Decomposition.- 6 Approximate Cell Decomposition.- 7 Potential Field Methods.- 8 Multiple Moving Objects.- 9 Kinematic Constraints.- 10 Dealing with Uncertainty.- 11 Movable Objects.- Prospects.- Appendix A Basic Mathematics.- Appendix B Computational Complexity.- Appendix C Graph Searching.- Appendix D Sweep-Line Algorithm.- References.
Conference Paper
We have proposed the real-time QMDP method for decision making of a robot under uncertain state recognition. This method evaluates every action and chooses the best one with a particle filter for estimation and a state-value function of dynamic programming. Different from our past work, this paper applies it to a complicated decision making task that yields local maxima and discontinuity on the state-value function. We then verify whether the method can choose proper actions or not in such a condition. As an example, total behavior of a goalkeeper for robot soccer is planned by using value iteration. This task contains three strategies, which are related to three kinds of local maxima respectively. Simulations, experiments and actual games have suggested that the method can decide actions effectively according as uncertain result of state estimation.
Conference Paper
In this paper, the sampling real-time Q<sub>MDP</sub> value method is proposed and applied to a goalkeeper task of a soccer robot. Soccer is a challenging task for an autonomous robot that has poor computing resources and sensors, and a good subject in the study of decision making in dynamic environments. A robot frequently decides its behavior without enough observation of the environment. In the proposed method, the risk of a score is solved beforehand toward every set of position of the goalkeeper and motion of the ball. The shortage of observation is calculated and represented by particle filters. The proposed method uses the risk function and the particle filters so as to choose appropriate behavior of the goalkeeper. The experiment with an actual robot suggests that the method can decide actions reflexively toward the motion of the ball.
In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. We begin by introducing the theory of Markov decision processes (mdps) and partially observable MDPs (pomdps). We then outline a novel algorithm for solving pomdps off line and show how, in some cases, a finite-memory controller can be extracted from the solution to a POMDP. We conclude with a discussion of how our approach relates to previous work, the complexity of finding exact solutions to pomdps, and of some possibilities for finding approximate solutions.