This is an electronic version of an article published in ADVANCED ROBOTICS, Volume 29, No. 11, pp. 721-734, 2015. ADVANCED ROBOTICS is available online at www.tandfonline.com. Article DOI: 10.1080/01691864.2015.1009943.

FULL PAPER

Generation of Compensation Behavior of Autonomous Robot for Uncertainty of Information with Probabilistic Flow Control

Ryuichi Ueda∗

Advanced Institute on Industrial Technology, 1-10-40 Higashi Ohi, Shinagawa-ku, Tokyo, Japan

(Received 7 Aug. 2014; accepted 13 Jan. 2015; Published online: 27 Mar. 2015)

We show that a robot can exhibit strategic behaviors for overcoming uncertainty of information in navigation tasks through a modification of the core formula of the QMDP value method. The new formula biases the calculation so that decision making gives priority to an important part of the state space. The QMDP value method with the modified formula is named the probabilistic flow control (PFC) method. We apply the PFC method to a simple simulated task of a mobile robot. In the task, the robot with the PFC method shows tricky behaviors that compensate for the uncertainty of state information.

Keywords: probabilistic flow control method; particle filters; POMDPs

1. Introduction

Uncertainty of recognition is an inevitable problem for autonomous agents in the real world, whereas human beings and some animals can choose appropriate actions even when the information available for decision making is quite restricted. For example, a person can go to a bedroom in his/her house even in darkness. Even if self-localization is impossible, the person will reach the bedroom by walking along the walls. It is interesting that this behavior differs from the behavior in the daytime.

If autonomous robots had this kind of flexibility, they could execute their tasks in information-poor environments. In robotics, this class of problems has been studied in the framework of POMDPs (partially observable Markov decision processes) [1]. When we define a state space of a system in which path planning or optimal control problems are solved, the state at each instant of time is hidden in a POMDP. Instead, the information about the actual state is given as a probability distribution over the state space. This distribution is called a belief or a belief state.

The behavior described in the bedroom example has been realized for a robot by Roy et al. [2]. In their study, a robot used range sensors to measure the distance from walls in an indoor environment, and the measurement results were used for self-localization. The behavior of a mobile robot was planned beforehand in a state space spanned by four variables: the position on an X-Y coordinate system, the orientation, and an additional scalar value. The additional value represented the extent of the uncertainty of the belief. The planned decision making policy had the property that it kept this additional value small. As a result, the robot kept an appropriate distance from walls for self-localization during navigation, so as not to become stranded. The method used in that study is called coastal navigation.

Methods with an extended state space, such as coastal navigation, are called augmented Markov decision process (AMDP) methods [3]. In an augmented state space, we can apply traditional planning methods [4-6] if the problem of computational cost can be overcome.

∗Corresponding author. Email: ueda-ryuichi@aiit.ac.jp


On the other hand, the problem of information uncertainty does not have to be handled together with the planning problem. In AMDP methods and in recent online planning methods for POMDPs [7-9], however, these two problems are inseparable from each other.

The QMDP value method [10], which was proposed by Littman et al. two decades ago, deals with these two problems separately. In this method, the motion of a robot is planned in advance without consideration of the uncertainty of information. Instead, the uncertainty is considered in online calculations. The QMDP value method evaluates candidate actions by using the planning result and the probability distribution of the belief.

We have used the QMDP value method and shown its effectiveness in RoboCup (the robot soccer world cup). To make the robot decide its action in real time, we utilized a particle filter for the online calculations of the QMDP value method [12, 13]. The QMDP value method is suitable for real-time decision making once the problem of computational cost is solved, because this method never executes any planning or searching online.

On the other hand, the QMDP value method does not produce the kind of intelligent behavior shown in coastal navigation. Since the QMDP value method cannot consider the future acquisition of information, it does not generate a sequence of actions that embodies a strategy for observation.

However, we have found that this limitation of the QMDP value method can be overcome by a small modification of its core formula. In this paper, we therefore show that some strategic sequences of actions can be generated by this modification. We define the probabilistic flow control (PFC) method, which is the QMDP value method with the modified formula, and apply it to navigation tasks of a simulated mobile robot. Through the simulations, the modification is evaluated, and we can observe interesting behaviors of the robot generated by the modification.

This paper consists of the following sections. The problem is defined in Section 2. The modification and the PFC method are proposed in Section 3. In Section 4, we observe and evaluate the behaviors of a robot generated by the PFC method in simulations of navigation tasks. Though the tasks are very simple, the robot with the PFC method shows interesting behaviors in order to overcome the lack of information. We conclude this paper in Section 5.

2. Problem

We formulate the problem discussed in this paper. An autonomous robot has a task in a time-invariant system. In this system, the robot knows how to behave if the state of the system is known, while the information about the state is only given partially.

2.1. State, Action, and State Transition

A state of the system and an action are represented as $x$ and $a$ respectively. We define a set of states $\mathcal{X}$ and a set of final states $\mathcal{X}_f \subset \mathcal{X}$. When the state of the system is $x \in \mathcal{X}_f$, we consider that the task is successfully finished. We also define a set of actions $\mathcal{A} \equiv \{a_1, a_2, \dots, a_m\}$. The number of actions should be finite in the proposed method.

We define a sequence of discrete time steps $\{0, 1, 2, \dots, t_f\} \equiv \mathcal{T}$. Time advances to the next step when an action is executed; $t = 0$ and $t = t_f$ are the start and finish times of the task respectively. $t_f$ is not fixed.

When we need to write 'the state at time $t$' or 'the action chosen at time $t$', we write them as $x(t)$ and $a(t)$ respectively. With this notation, a state transition can be represented by the following three symbols:

• $x(t)$: the prior state at $t$,
• $a(t)$: the action chosen at $t$, and
• $x(t+1)$: the posterior state at $t+1$.

Since we handle a time-invariant system, the concrete value of $t$ is not important in many cases for discussion. In such cases, we write $x(t)$, $a(t)$, and $x(t+1)$ as $x$, $a$, and $x'$ respectively.

We assume that each state transition is noisy. If several state transitions are attempted from the same $(x, a)$, the posterior states $x'$ differ in each trial. We assume that we can measure the following probability:

$$P(X') = P(x' \in X' \mid x, a),$$

where $\forall x \in \mathcal{X} \setminus \mathcal{X}_f$, $\forall a \in \mathcal{A}$, and $\forall X' \subset \mathcal{X}$. Moreover, we consider a density $p^a_{xx'} = p(x' \mid x, a) \in \Re$, which satisfies

$$P(X') = \int_{x' \in X'} p^a_{xx'} \, dx'. \tag{1}$$

2.2. Control

The purpose of the task is given by the following summation:

$$J[x(0), a(0), x(1), a(1), \dots, x(t_f - 1), a(t_f - 1), x(t_f)] = V[x(t_f)] + \sum_{t=0}^{t_f - 1} r^a_{xx'}, \tag{2}$$

where $r^a_{xx'} \in \Re$ is the cost of a state transition. Time $t$ is hidden in the symbol $r^a_{xx'}$. The sequence of state transitions that minimizes $J$ is regarded as an optimal control. $V(x)$ $(x \in \mathcal{X}_f)$ is the value of a final state. $J$ is sometimes defined so that it should be maximized; in this paper, however, we minimize it. When we want to minimize the total elapsed time to finish the task, for example, $r^a_{xx'}$ will be the time of a state transition, and $V(x_f) \equiv 0$ $(\forall x_f \in \mathcal{X}_f)$.

Under some conditions, we can expect the existence of the optimal policy

$$\Pi^* : \mathcal{X} \to \mathcal{A} \tag{3}$$

for Eq. (2). $\Pi^*$ gives the best action $a^* = \Pi^*(x)$ at any state $x$ to minimize $J$. The function $V^* : \mathcal{X} \to \Re$ that gives the expected value of $J$ for each state is called the optimal value function. The value $V^*(x)$ does not change whether $x$ is the initial state or a halfway state. The values of the final states $V(x_f)$ are included in this function.

To obtain $\Pi^*$ or $V^*$ from Eq. (2), algorithms classified as dynamic programming [6] have been utilized. Methods of reinforcement learning [14] and online path planning methods [4] also solve subsets of this control problem.

In this paper, we assume that the robot knows the optimal value function $V^*$. Moreover, it knows the state transition model $p^a_{xx'}$ and the cost $r^a_{xx'}$ of the task. In this case, the optimal policy can be derived as

$$\Pi^*(x) = \operatorname*{argmin}_{a \in \mathcal{A}} \int_{x' \in \mathcal{X}} p^a_{xx'} \left[ V(x') + r^a_{xx'} \right] dx'. \tag{4}$$
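For illustration, the following Python sketch approximates the integral in Eq. (4) by Monte Carlo over sampled successor states. The helper names (`sample_transition`, `cost`) and the sample count are assumptions of this example, not part of the paper.

```python
import numpy as np

def optimal_action(x, actions, sample_transition, cost, V, n_samples=100):
    """Approximate Eq. (4): for each action a, estimate the integral of
    p^a_{xx'} [V(x') + r^a_{xx'}] by averaging over sampled successors,
    then return the action with the smallest estimate."""
    best_a, best_q = None, float("inf")
    for a in actions:
        samples = [sample_transition(x, a) for _ in range(n_samples)]
        q = np.mean([V(xp) + cost(x, a, xp) for xp in samples])
        if q < best_q:
            best_a, best_q = a, q
    return best_a
```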

2.3. State Recognition

Though the robot knows the optimal value function $V^*$, it cannot observe the actual state $x$. Instead of direct observations, the robot has and updates a belief of the state at each time step [3]. The belief is represented by a probability density function $b : \mathcal{X} \to \Re$. The belief gives the probability that the actual state is in $\forall X \subset \mathcal{X}$ as

$$P(x \in X) = \int_{x \in X} b(x) \, dx. \tag{5}$$

In this equation, time $t$ is hidden and omitted; $b(x)$ is the density of the probability distribution at $x(t)$.

We assume that the belief $b$ is calculated by a Bayes filter [3]. In Section 4, a particle filter [15] is used as the implementation of the Bayes filter.

The Bayes filter changes the belief when the robot takes an action or obtains information. After an action is executed by the robot, the belief is updated by

$$b(x') = \int_{x \in \mathcal{X}} p^a_{xx'} b(x) \, dx. \tag{6}$$

This equation represents the flow of the probability distribution of $b$ caused by a state transition.

When the robot obtains useful information for state estimation from a sensor, the function $b$ is updated based on Bayes' theorem [16]:

$$b(x \mid \iota) = \frac{q(\iota \mid x) \, b(x)}{\int_{x'' \in \mathcal{X}} q(\iota \mid x'') \, b(x'') \, dx''}, \tag{7}$$

where $\iota$ represents some sort of information that contributes to the state estimation. $b(x \mid \iota)$ is the value of the belief after $\iota$ is considered. $q(\iota \mid x)$ is the value of a likelihood function, which translates the information $\iota$ into information in the state space. Likelihood functions are implemented in the robot as prior knowledge. After the calculation of Eq. (7), $b(x)$ is overwritten with $b(x \mid \iota)$ as

$$b(x) \longleftarrow b(x \mid \iota). \tag{8}$$
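As a concrete reference, here is a minimal sketch of the two belief updates on a discretized state space; the array layout (`p_trans[a]` as a row-stochastic matrix over discrete states) is an assumption of this example.

```python
import numpy as np

def predict(b, p_trans, a):
    """Eq. (6): push the belief through the state transition model.
    b: length-S belief vector; p_trans[a][s, s'] = P(s' | s, a)."""
    return b @ p_trans[a]

def correct(b, likelihood):
    """Eqs. (7)-(8): multiply by the likelihood q(iota | x) elementwise
    and renormalize so the belief sums to one."""
    posterior = likelihood * b
    return posterior / posterior.sum()
```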

2.4. Assumption of Goal Recognition

Here we give an important assumption required by the proposed method. We assume that the robot can detect whether the task is finished or not without using the belief $b$.

In the case of a robot navigation problem, this assumption means that someone tells the robot whether it is at the goal or not, or that the robot has a special sensor for detecting the goal.

This assumption can be used for state estimation. If someone tells the robot accurately whether the task is finished or not, we can define a likelihood function:

$$q(\text{not finished} \mid x) = \begin{cases} 0 & (x \in \mathcal{X}_f) \\ 1 & (\text{otherwise}). \end{cases} \tag{9}$$

3. Probabilistic Flow Control

We propose a modification of an important formula used in the QMDP value method. Though the modification is slight, the motion of robots changes to a large degree.

We name the modified QMDP value method that uses the new formula the probabilistic flow control (PFC) method, because this method makes the probability distribution of the belief flow into the area of final states by using the assumption in Eq. (9).


3.1. QMDP value method

The QMDP value method chooses an action from a belief based on the following function:

$$\Pi_{\mathrm{QMDP}}(b) = \operatorname*{argmin}_{a \in \mathcal{A}} Q_{\mathrm{MDP}}(a, b), \quad \text{where} \tag{10}$$

$$Q_{\mathrm{MDP}}(a, b) = \int_{x \in \mathcal{X}} b(x) \int_{x' \in \mathcal{X}} p^a_{xx'} \left[ V(x') + r^a_{xx'} \right] dx' \, dx. \tag{11}$$

If $b$ in Eq. (11) is the Dirac delta function, Eq. (11) is identical to Eq. (4). This means that the output of the QMDP value method is identical to that of Eq. (4) when the state is known deterministically.

We show a two-dimensional tile world in Fig. 1 as an example for understanding Eq. (11). The robot must reach the goal with the minimum number of actions in this task. The robot belongs to one cell at each time step and can move to one of the adjoining cells with one of the four actions: left, right, up, and down, as shown in Fig. 1(a). State transitions by the actions are deterministic, which means that $p^a_{xx'} \equiv 1$. The task of the robot is given with $r^a_{xx'} \equiv 1$, and the value of $V$ is zero at the goal. The numbers in some cells are the probabilities that the robot exists in each cell. The robot cannot observe its position directly; instead, it knows these probabilities. The robot also knows the value function $V$, which gives the number of steps from any cell to the goal. Though the robot has no sensor, it is told whether the task is finished or not after each action.

Figure 1. 2-dimensional tile world

In the case of Fig. 1(a), action down is chosen. Equation (11) gives

$$Q_{\mathrm{MDP}}(\mathit{down}, b) = \sum_{\text{each cell}} \text{probability} \cdot \left[ (\text{distance to goal after } \mathit{down}) + \text{cost} \right]$$
$$= 0.25(2+1) + 0.25(1+1) + 0.25(0+1) + 0.25(1+1) = 2, \tag{12}$$

where $b$ denotes the probability distribution in Fig. 1(a). The values for the other actions are $Q_{\mathrm{MDP}}(\mathit{left}, b) = Q_{\mathrm{MDP}}(\mathit{up}, b) = 4$ and $Q_{\mathrm{MDP}}(\mathit{right}, b) = 3$.
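The arithmetic of Eq. (12) can be checked with a few lines of Python; the per-cell distances below are read off Fig. 1(a) and the function name is a choice of this sketch.

```python
def q_mdp_tile(belief, dist_after_action, cost=1.0):
    """Eq. (11) on the tile world: sum over cells of
    probability * [(distance to goal after the action) + cost]."""
    return sum(p * (d + cost) for p, d in zip(belief, dist_after_action))

belief = [0.25, 0.25, 0.25, 0.25]
print(q_mdp_tile(belief, [2, 1, 0, 1]))  # down -> 2.0, as in Eq. (12)
```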

A drawback of the QMDP value method can be explained with the cases shown in Fig. 1(b-1)-(b-3). In all three cases, the value of $Q_{\mathrm{MDP}}$ is minimized not only by right but also by left. If the action of the robot is chosen randomly from these two actions, the robot uselessly performs a random walk.

3.2. Probabilistic flow control method

This kind of random walk will be broken when the probabilities are slightly changed, for example to 0.49 and 0.51. Some kinds of heuristics will also be effective. However, we can remove it with a subtle modification of the QMDP value method.

Our idea is that, in the case shown in Fig. 1(b-1), the robot should choose left based on the right cell, which holds probability 0.5, simply because the right cell is nearer to the goal than the left cell. The 'distance' from each cell to the goal is represented by $V$. Figure 2 shows a sequence of state transitions that we want to realize. In (a) and (b), the robot goes left. These judgments are temporarily wrong. In (c), however, the probability 0.5 that exists in the right-side cell is changed to zero at the goal by Eq. (9). After that, the probability converges to the cell where the robot actually exists, and the robot will reach the goal with action right.

Figure 2. Decision making that is realized by the proposed method

To actualize this idea, we only have to modify Eqs. (10) and (11) as

$$\Pi_{\mathrm{PFC}}(b) = \operatorname*{argmin}_{a \in \mathcal{A}} Q_{\mathrm{PFC}}(a, b), \quad \text{where}$$

$$Q_{\mathrm{PFC}}(a, b) = \int_{x \in \mathcal{X} \setminus \mathcal{X}_f} \frac{b(x)}{V(x) - V_{\min}} \int_{x' \in \mathcal{X}} p^a_{xx'} \left[ V(x') + r^a_{xx'} \right] dx' \, dx, \tag{13}$$

where $V_{\min} = \min_{x \in \mathcal{X}} V(x)$ and we assume that $\forall x \in \mathcal{X} \setminus \mathcal{X}_f \Rightarrow V(x) > V_{\min}$. We name the decision making or control method with Eq. (13) the probabilistic flow control (PFC) method.

In the case of Fig. 2(a), the values of $Q_{\mathrm{PFC}}$ are calculated as follows:

$$Q_{\mathrm{PFC}}(\mathit{left}, b) = \frac{0.5}{3}(4+1) + \frac{0.5}{2}(1+1) = 1.333\ldots, \quad \text{and}$$
$$Q_{\mathrm{PFC}}(\mathit{right}, b) = \frac{0.5}{3}(2+1) + \frac{0.5}{2}(3+1) = 1.5.$$

Owing to the denominator $V(x) - V_{\min}$, the right cell is given priority. As a result, action left is chosen.
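The same numbers can be reproduced with a direct transcription of Eq. (13) for the two occupied cells of Fig. 2(a); the values and distances are read off the figure, and the function name is a choice of this sketch.

```python
def q_pfc_tile(belief, V_now, dist_after_action, V_min=0.0, cost=1.0):
    """Eq. (13) on the tile world: each cell's contribution is divided
    by V(x) - Vmin, so hypotheses closer to the goal dominate."""
    return sum(p / (v - V_min) * (d + cost)
               for p, v, d in zip(belief, V_now, dist_after_action))

belief, V_now = [0.5, 0.5], [3.0, 2.0]     # the two cells of Fig. 2(a)
print(q_pfc_tile(belief, V_now, [4, 1]))   # left  -> 1.333...
print(q_pfc_tile(belief, V_now, [2, 3]))   # right -> 1.5
```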


4. Generation of Behavior with PFC method and Its Evaluation in Simulation

We observe the behaviors of a robot generated by the PFC method in three kinds of simulated tasks. In each task, the robot must overcome a different kind of information uncertainty.

4.1. Navigation task with only one point landmark

We prepare the environment shown in Fig. 3. Since there is only one point landmark, the robot cannot calculate its pose deterministically. The width of this square environment is 4000[mm]. We define an X-Y coordinate system, Σ, whose origin is at the center of the room and whose x-axis and y-axis are parallel to the walls.

The shape of the robot is a circle whose radius is 50[mm]. The pose of the robot is represented as $(x, y, \theta)$, where $(x, y)$ is the position of the center of the robot on Σ and $\theta$ is its orientation.

A landmark is placed at the origin of Σ. The robot can run through the landmark. A goal point is set at $(x, y) = (0, 200)$[mm]. The task is finished when the robot touches the goal point with its body.

Figure 3. Environment

4.1.1. Motion

The robot chooses one action from the action set $\mathcal{A} = \{\mathit{ccw}, \mathit{cw}, \mathit{fw}\}$, where

• ccw: rotation of $(5 + 0.5\sigma)$[deg],
• cw: rotation of $(-5 + 0.5\sigma)$[deg], and
• fw: forward movement of $(10 + \sigma)$[mm] in the $\theta$ direction.

$\sigma$ is a noise generated from a normal distribution $\mathcal{N}(0, 1)$. We write this choice as $\sigma \sim \mathcal{N}(0, 1)$. $\mathcal{N}(\mu, \Sigma)$ denotes a normal distribution whose mean and variance are $\mu$ and $\Sigma$ respectively. We assume that the robot knows this motion model.
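A minimal sampling routine for this motion model might look as follows; the pose representation and function name are choices of this sketch, not definitions from the paper.

```python
import numpy as np

def step(pose, action, rng=np.random.default_rng()):
    """One noisy transition of Sec. 4.1.1; pose = (x[mm], y[mm], theta[deg])."""
    x, y, theta = pose
    sigma = rng.normal(0.0, 1.0)             # sigma ~ N(0, 1)
    if action == "ccw":
        return (x, y, theta + 5.0 + 0.5 * sigma)
    if action == "cw":
        return (x, y, theta - 5.0 + 0.5 * sigma)
    d = 10.0 + sigma                          # fw: forward (10 + sigma)[mm]
    rad = np.deg2rad(theta)
    return (x + d * np.cos(rad), y + d * np.sin(rad), theta)
```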


4.1.2. Representation of Belief: Particle Filter

The robot has a particle filter for self-localization. The particle filter has a particle set

$$\Xi = \{\xi^{(i)} = (x^{(i)}, w^{(i)}) \mid i = 1, 2, \dots, N\},$$

where $x^{(i)} = (x^{(i)}, y^{(i)}, \theta^{(i)})$ and $w^{(i)}$ are the state and the weight of particle $\xi^{(i)}$ respectively. $N$ is the number of particles, which is fixed to 1000.

From the set of particles, Eq. (13) is calculated as

$$Q_{\mathrm{PFC}}(a, b) = \sum_{x^{(i)} \notin \mathcal{X}_f \ (i = 1, 2, \dots, N)} \frac{w^{(i)}}{V(x^{(i)}) - V_{\min}} \left[ V(x') + r^a_{x^{(i)} x'} \right], \tag{14}$$

where $x' \sim p^a_{xx'}$.
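In code, the particle approximation of Eq. (14) can be sketched as below; `sample_transition` draws one successor per particle and `in_goal` tests membership in the final states, both assumed helpers of this example.

```python
def q_pfc_particles(particles, weights, V, in_goal, sample_transition, a,
                    cost=1.0, V_min=0.0):
    """Eq. (14): sum w_i / (V(x_i) - Vmin) * [V(x') + r] over particles
    outside the final states, with x' ~ p^a_{xx'} (one sample each)."""
    q = 0.0
    for x, w in zip(particles, weights):
        if in_goal(x):                 # particles in X_f are excluded
            continue
        xp = sample_transition(x, a)
        q += w / (V(x) - V_min) * (V(xp) + cost)
    return q
```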

The distribution of these particles approximates the probability distribution of a belief. We show an example in Fig. 4. Each dot is a particle, which has a pose and a weight. The poses of the particles can be regarded as candidates for the actual pose of the robot. This distribution was obtained in a trial in Sec. 4.1.7.

Figure 4. A distribution of particles shown in a trial

At the beginning of each trial, the particles are placed randomly in the state space. This means that the robot has no information about its pose at $t = 0$.

4.1.3. Update of Particles on Action

Just after a state transition from time $t$ to time $t+1$, the particle filter updates $\Xi$ so as to represent Eq. (6). When the new and old particle sets are denoted by $\Xi(t+1)$ and $\Xi(t)$ respectively, the procedure is as follows for $i = 1, 2, \dots, N$:

• $n = \lfloor N w^{(i)} \rfloor$,
• $n$ particles are chosen from $\Xi(t)$ with probabilities proportional to their weights,
• the chosen particles are moved based on the state transition model,
• their weights are set to $1/N$, and
• they are added to $\Xi(t+1)$.

When the number of particles in $\Xi(t+1)$ is smaller than $N$, particles that have random poses and weight $1/N$ are added to $\Xi(t+1)$. This implementation is irregular compared to the implementations in [3]. Though we expected that the random particles would make the particle filter robust, they did neither harm nor good. That is why random particles appear in some figures of this section.
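One plausible reading of this procedure is sketched below: each particle spawns $\lfloor N w^{(i)} \rfloor$ successors moved by the motion model, and the set is padded with uniformly random particles up to $N$. The helpers `step` (the motion sampler of Sec. 4.1.1) and `random_pose` are assumptions of this example, and the per-particle copy count is an interpretation of the text.

```python
import math

def update_on_action(particles, weights, a, step, random_pose):
    """Sketch of the resampling/prediction step of Sec. 4.1.3 (one
    plausible reading). Returns the new particle set and weights."""
    N = len(particles)
    new = []
    for x, w in zip(particles, weights):
        for _ in range(math.floor(N * w)):   # n = floor(N * w_i) copies
            new.append(step(x, a))           # moved by the motion model
    while len(new) < N:                      # pad with random poses
        new.append(random_pose())
    return new, [1.0 / N] * N
```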

4.1.4. Observations given by the simulator and procedures on the robot

When information is obtained, the weight of each particle is overwritten by the following procedure:

$$w^{(i)} \longleftarrow \eta^{-1} w^{(i)} q(\iota \mid x^{(i)}), \quad \text{where} \tag{15}$$

$$\eta = \sum_{i=1}^{N} w^{(i)} q(\iota \mid x^{(i)}). \tag{16}$$

This procedure is an approximation of Eq. (7). $q(\iota \mid x^{(i)})$ is replaced by one of the likelihood functions in Eq. (17) or Eq. (18), which are explained below.

The robot observes the point landmark every five time steps. The simulator gives a measurement $(\ell\,[\mathrm{mm}], \phi\,[\mathrm{deg}])$ for each observation, where $\ell$ and $\phi$ are the distance and the direction of the landmark from the robot respectively. They are generated with noise based on the following model:

$$\ell \sim \mathcal{N}(\ell^*, (0.1\ell^*)^2) \quad \text{and} \quad \phi \sim \mathcal{N}(\phi^*, 10^2),$$

where $\ell^*$ and $\phi^*$ are the actual distance and direction respectively. The robot cannot observe the landmark when $\ell^* < 50$[mm].

Each measurement is reflected in the particles through a likelihood function:

$$q(\ell, \phi \mid x^{(i)}) = \mathcal{N}(\ell - \ell^{(i)}, (0.1\ell)^2) \, \mathcal{N}(\phi - \phi^{(i)}, 10^2), \tag{17}$$

where $(\ell^{(i)}, \phi^{(i)})$ are the distance and the direction of the landmark calculated from $x^{(i)}$, the state of the particle.

Moreover, the simulator tells the robot whether the task is finished or not after every step. Based on Eq. (9), the robot can use the following likelihood function:

$$q(\text{not finished} \mid x^{(i)}) = \begin{cases} 10^{-5} & (x^{(i)} \in \mathcal{X}_f) \\ 1 & (\text{otherwise}). \end{cases} \tag{18}$$

This equation means that the weights of particles at final states are reduced to nearly zero. We do not make the weights exactly zero because some parts of the particle filter code would be complicated by the existence of zero-weight particles.
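The weight update of Eqs. (15)-(16), together with the 'not finished' likelihood of Eq. (18), might be written as follows; `in_goal` is an assumed helper.

```python
import numpy as np

def reweight(weights, likelihoods):
    """Eqs. (15)-(16): multiply and normalize; eta is also returned
    because Sec. 4.1.5 uses it to trigger a sensor reset."""
    w = np.asarray(weights, dtype=float) * np.asarray(likelihoods, dtype=float)
    eta = w.sum()
    return w / eta, eta

def not_finished_likelihood(particles, in_goal):
    """Eq. (18): particles at final states get weight factor 1e-5."""
    return np.array([1e-5 if in_goal(x) else 1.0 for x in particles])
```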

4.1.5. Resetting

In general, particles frequently converge around points that are not the actual state. Since Bayes' theorem does not have the ability to recover from this situation, resetting algorithms are used for the recovery [17, 19].

In this simulation, we use the sensor resetting algorithm [17]. When $\eta$ in Eq. (16) is smaller than a threshold value $\eta_{\mathrm{th}}$ after a measurement of the landmark, all particles are replaced in the state space so as to represent a probability distribution proportional to the right side of Eq. (17). We found through some test trials that $\eta_{\mathrm{th}} = 10^{-6}$ is an appropriate value, and we use it.
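A sketch of the reset test, assuming a helper `draw_from_measurement()` that samples poses proportional to the likelihood of Eq. (17):

```python
def maybe_sensor_reset(particles, eta, draw_from_measurement, eta_th=1e-6):
    """Sensor resetting [17], Sec. 4.1.5: if the normalizer eta after a
    landmark measurement falls below eta_th, redraw every particle from
    a distribution proportional to the likelihood of Eq. (17)."""
    if eta < eta_th:
        return [draw_from_measurement() for _ in particles]
    return particles
```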


4.1.6. Value Function and Penalty

We define that a cost $r^a_{xx'} = 1$ is given at each state transition. To minimize the sum of this penalty, the robot must finish the task in as few steps as possible.

The value function given to the robot is

$$V(x) = \begin{cases} \dfrac{\phi_G}{\omega} + \dfrac{\ell_G - 50}{v} & (\ell_G > 50) \\ 0 & (\text{otherwise}), \end{cases} \tag{19}$$

where $\phi_G$ and $\ell_G$ are the direction and the distance of the goal from the robot respectively, and 50 is the radius of the robot. This value function is quasi-optimal: the actions obtained from it will differ slightly from those obtained from the optimal one. However, we consider this difference trivial in the simulation.
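A direct transcription of Eq. (19) follows, with the caveat that we read $\omega$ and $v$ as the nominal per-step rotation (5[deg]) and translation (10[mm]) of Sec. 4.1.1; the paper does not state their values explicitly, so these defaults are assumptions.

```python
def value(phi_g, ell_g, omega=5.0, v=10.0):
    """Quasi-optimal value of Eq. (19): an estimate of the remaining
    number of steps from the heading error phi_g[deg] and the goal
    distance ell_g[mm]; omega and v are assumed per-step amounts."""
    if ell_g <= 50.0:                  # within the robot's radius: at goal
        return 0.0
    return abs(phi_g) / omega + (ell_g - 50.0) / v
```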

4.1.7. Generation of a kind of search behavior

First, we observe the behavior of the robot with the PFC method. Ten trials are run from the initial pose $(1000[\mathrm{mm}], 0[\mathrm{mm}], -90[\mathrm{deg}])$. We show the trajectory of the 9th trial in Fig. 5 and the other trajectories in Fig. 6.

In the 9th trial and in some trials in Fig. 6, the robot turns around the landmark before it enters the goal. This behavior is entirely different from the behavior derived from the value function in Eq. (19).

Figure 5. A trajectory obtained by PFC method (the 9th trial)

We explain why this behavior is generated in the 9th trial with Fig. 7, which illustrates the poses of the robot and the distributions of particles at some time steps. We give a mark '*' to an area of particles that is given large priority for decision making by Eq. (13).

In Fig. 7(a), the robot heads toward the landmark, since the value of $Q_{\mathrm{PFC}}$ simply decreases with this behavior at this stage. This tendency changes in Fig. 7(b): the area in which particles have priority appears at the bottom left of the figure. The particles in this area gravitate toward the goal. From (c) to (g), the actions of the robot are chosen based on the particles in the area on the left side of the goal. The particles near the goal enter the goal and are erased by the likelihood function in Eq. (18). The circular behavior of the robot is generated by this repetition of erasing. If the actual pose of the robot is not far from the distribution of particles, the robot reaches the goal together with the particles around its pose within some time steps. In (h), a sensor reset happens. Though the particles are dispersed, the robot can still choose appropriate actions, because the area marked with '*' is not changed.


Figure 6. Trajectories obtained by PFC method

Figure 7. Distributions of particles in the 9th trial

4.1.8. Comparison between QPFC and QMDP

We then evaluate the PFC method quantitatively. This evaluation aims to compare the formulae in Eq. (11) and Eq. (13) under the same conditions. Though we compare the PFC method to the QMDP value method, this does not merely mean that our method is compared to an old method; the idea contained in Eq. (13) can potentially be utilized with other POMDP methods.

We run 100 trials with each method and count the number of trials in which the robot reaches the goal within 1000[step]. In some preliminary trials, the robot was trapped by one or more fatal deadlocks whenever the task was not finished within 1000[step]. The word deadlock means that the robot alternately chooses actions cw and ccw over many time steps. 100 initial poses of the robot are chosen randomly, and each method is evaluated 100 times with these initial poses.

Moreover, we run other sets of 100 trials under the following conditions:


• The robot knows the actual pose $x^*$ and uses $x^*$ for decision making.
• The robot calculates the average pose $\bar{x} = (\bar{x}, \bar{y}, \bar{\theta})$ of the particles and uses $\bar{x}$ for decision making.

The result is shown in Table 1. The robot reaches the goal in all trials with the actual pose $x^*$. On the other hand, all trials fail with the average pose $\bar{x}$, because the average pose of particles such as those shown in Fig. 4 makes no sense.

The QMDP value method makes effective decisions compared to the decision making with $\bar{x}$; its success rate is 27[%]. In almost all failure trials with the QMDP value method, the robot is trapped by a deadlock near the landmark.

The PFC method substantially improves the success rate to 96[%]. The mechanism shown in Fig. 7 works effectively in this simulation. When we consider each particle as a hypothesis, the PFC method has a tendency to try the hypotheses in order, starting from one that is convenient for finishing the task. The QMDP value method does not have this kind of ordering ability; in the trials with the QMDP value method, the robot never takes the cyclic trajectory around the landmark.

Table 1. Comparison of PFC, QMDP, and some other situations

| method | successful trials | avg. number of steps in successful trials | avg. number of steps in all trials |
|---|---|---|---|
| PFC | 96% | 392[step] | 416[step] |
| QMDP | 27% | 361[step] | 828[step] |
| decision based on $x^*$ | 100% | 176[step] | 176[step] |
| decision based on $\bar{x}$ | 0% | - | - |

Figure 8. A deadlock in a trial with PFC method

Not all trials with the PFC method are successful. We show a deadlock invoked by the PFC method in Fig. 8. Deadlocks with the PFC method occur with asymmetric distributions of particles, as shown in this figure. In fact, short-lived deadlocks occurred frequently in the trials with the PFC method; however, they were resolved in many cases when the distributions of particles were slightly changed by measurements of the landmark. It seems that the deadlocks caused by the PFC method are more unstable than those caused by the QMDP value method, at least in this task.

4.2. Evaluation without landmark

We evaluate the PFC method under another kind of uncertainty. The task of the robot in this simulation is the same as the previous one; however, the robot cannot observe the landmark, which means that Eq. (17) is not applied to the particles. Instead, the particles are placed around the actual pose of the robot at the start of a trial, as shown in Fig. 9 (step 1). When the robot gets near the goal, the distribution of the particles is larger than the goal area, as shown in Fig. 9 (step 90). The robot must therefore search for the goal even though it does not have enough information to reach it directly.

In each trial of this simulation, the robot starts from $(x, y, \theta) = (1000[\mathrm{mm}], 0[\mathrm{mm}], -90[\mathrm{deg}])$. Each particle is initially placed at an $(x, y, \theta)$ chosen by the following procedure:

$$(x, y, \theta) = (1000 + R\cos\omega\,[\mathrm{mm}],\ R\sin\omega\,[\mathrm{mm}],\ -90 + \delta\,[\mathrm{deg}]), \tag{20}$$

where $R \sim [0, r)$, $\delta \sim (-0.1r, 0.1r)$, and $\omega \sim [0, 360)$. The distributions in Fig. 9 are obtained with $r = 200$.

Figure 9. Distributions of particles in a trial (r= 200)

We choose $r = 50, 100, 200, 300,$ and $400$. We count the number of successful trials out of 100 trials for each value of $r$ and for each method, with 500[step] as the cutoff.

Figure 10. Success rates in the simulation without landmark observation

Figure 10 shows the result. In the range $100 \le r \le 500$, the PFC method shows a clear advantage over the QMDP value method. In many cases with the QMDP value method, deadlocks happen when the robot is near the goal. Figure 11 shows a distribution of particles and the robot in a deadlock. If the robot chose action fw, it would come closer to the goal; however, action fw is given a low evaluation by Eq. (11) due to the particles in the upper part of the figure.

We show the first six trajectories of the simulation with the PFC method and $r = 200$ in Fig. 12. When $r = 200$, the success rate was 95[%]. In the 4th-6th trials, the robot traces spiral patterns so as to sweep the particles.


Figure 11. Deadlock in front of the goal (r= 200, QMDP value method)

Figure 12. Trajectories generated by PFC method (r= 200)

4.3. Evaluation in a corridor environment that causes perceptual aliasing

Finally, we evaluate the ability of the PFC method against perceptual aliasing [3] in a symmetric environment. The environment is illustrated in Fig. 13(a): the landmark is removed from the environment in Fig. 3, and six walls are placed. The goal area is changed to a 100[mm] × 100[mm] square. The robot cannot come within 50[mm] of a wall due to its radius.

As shown in Fig. 13(b), we divide the environment into eight 'rooms.' The rooms are classified into three types by their shapes. Every five steps, the robot obtains a measurement $x_T$, which denotes a relative pose within a 'type T room.' When we represent the room where the robot actually exists as $R^*$, the robot cannot identify $R^*$ from one measurement; instead, it only knows whether T is A, B, or C. The standard deviations of the noise on $x_T$ are 100[mm] for the position $(x, y)$ and 10[deg] for the orientation $\theta$.

The motion of the robot is based on the definition in Sec. 4.1.1. When the robot collides with a wall or the edge of the environment, the robot is returned to its pose before the motion. This is a rule for simplification.

4.3.1. Modification of particle filter

The number of particles is increased from $N = 1000$ to $N = 10000$, because in this environment a cluster of particles easily disappears through resampling procedures when $N = 1000$. We could choose other implementations that are suitable for tracking multiple hypotheses (e.g. [18]), in which case the PFC method might deal with more severe perceptual aliasing; this should be tried in future work. In this paper, we compare the methods with the plain particle filter. We show the processing time of each procedure in Table 2 for reference.


Figure 13. An environment with corridors in which perceptual aliasing happens

Table 2. Time for processing each procedure ($N = 10000$, on an Intel Xeon CPU E5-2670, 2.50[GHz])

| procedure | time[ms] |
|---|---|
| observation | 3.3 |
| resampling | 0.4 |
| calculation of $Q_{\mathrm{PFC}}$ or $Q_{\mathrm{MDP}}$ | 2.0 |
| reset | $6.8 \times 10^{-3}$ |

At the start of a trial, the particles are placed into all rooms that have the same type as $R^*$. An example of the initial placement of particles is shown in Fig. 13(a). Noise is added to their poses in a manner similar to Eq. (20) with $r = 100$.

After a measurement of $x_T$, the following likelihood function is applied to the particles:

$$q(x_T \mid x^{(i)}) = \begin{cases} 0 & (x^{(i)} \text{ does not belong to a type T room}) \\ \mathcal{N}(e_{\mathrm{pos}}, 100^2) \, \mathcal{N}(e_\theta, 10^2) & (\text{otherwise}), \end{cases} \tag{21}$$

where $e_{\mathrm{pos}}$[mm] and $e_\theta$[deg] are the differences in position and orientation between $x_T$ and $x^{(i)}$.

Equation (18) is applied to the particles that enter the goal at every step. In this simulation, we use zero instead of $10^{-5}$ in Eq. (18). We also use another likelihood function when the robot chooses fw: when the robot moves with action fw and does not collide with a wall or an edge of the environment, the weights of particles that do collide with one of them are multiplied by zero. When the robot cannot move due to a collision, the particles are not moved.

The threshold for a reset is set to $\eta_{\mathrm{th}} = 10^{-30}$; the error of $x_T$ reaches about one meter in XY-space when $\eta$ becomes smaller than $10^{-30}$. On a reset, the particles are replaced by the same procedure as the initial placement of particles.

4.3.2. Discretization for motion planning and calculation of QPFC

To obtain the optimal value function of this task, we use value iteration [6]. We divide the state space into 115,200 cells, as shown in Table 3. A cell is enumerated as $s_m$ $(m = k + 72j + 40 \cdot 72i)$ with the indexes in Table 3.

In the value iteration, the values of the cells in the final-state region are fixed to zero. Then the following calculation is repeatedly applied to the other cells:

$$V(s) \longleftarrow \min_{a \in \mathcal{A}} \sum_{s'} P^a_{ss'} \left[ V(s') + 1 \right], \tag{22}$$

where 1 is the cost of an action and $P^a_{ss'}$ is the probability that a continuous state $x$ moves from $s$ to $s'$ by action $a$. After convergence, each value $V(s)$ is the expected number of steps to the goal.
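A compact value iteration sweep over the discretized model might look like this; `P` is assumed to be a list of $S \times S$ transition matrices built from Eq. (23), and `goal` a boolean mask of final-state cells.

```python
import numpy as np

def value_iteration(P, goal, n_sweeps=10000, tol=1e-6):
    """Sketch of Eq. (22): V(s) <- min_a sum_s' P^a_{ss'} [V(s') + 1],
    with final-state cells pinned to zero. Returns expected steps-to-goal."""
    V = np.zeros(P[0].shape[0])
    for _ in range(n_sweeps):
        Q = np.stack([Pa @ (V + 1.0) for Pa in P])   # one row per action
        V_new = np.where(goal, 0.0, Q.min(axis=0))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```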

As the above procedure suggests, we must translate $p^a_{xx'}$ in the continuous space into $P^a_{ss'}$ in the discrete state space. We obtain every $P^a_{ss'}$ with the following geometric calculation:

$$P^a_{ss'} = \frac{\text{volume of } (s' \cap s_a)}{\text{volume of } s}, \tag{23}$$

where $s_a$ is the posterior region of $s$ moved by action $a$ in the continuous space $\mathcal{X}$. In this calculation, we do not consider the noise of state transitions, for simplicity. State transitions that have a nonzero possibility of collision with the walls or the edge of the environment are removed from the set of state transitions.

We show a part of the calculation result in Fig. 14; the larger the value of $V$, the darker the cell is drawn. The calculation time is 209[s] with a simple implementation of value iteration on an Intel Xeon CPU E5-2670 (2.50[GHz]). If we want to reduce this calculation time, other methods for Markov decision processes can be applied to the problem; the PFC method does not restrict the method used for obtaining $V$.

Table 3. Discretization of the state space

| axis | intervals | number of intervals |
|---|---|---|
| $x$ | $[100i - 2000, 100i - 1900)$[mm] $(i = 0, 1, \dots, 39)$ | 40 |
| $y$ | $[100j - 2000, 100j - 1900)$[mm] $(j = 0, 1, \dots, 39)$ | 40 |
| $\theta$ | $[5k, 5k + 5)$[deg] $(k = 0, 1, \dots, 71)$ | 72 |
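The enumeration $m = k + 72j + 40 \cdot 72i$ then maps a continuous pose to its cell index; the 100[mm] and 5[deg] bin widths below follow Table 3 as reconstructed, so they carry the same caveat.

```python
def cell_index(x, y, theta):
    """Map a pose (x[mm], y[mm], theta[deg]) to the cell number
    m = k + 72*j + 40*72*i of Sec. 4.3.2."""
    i = int((x + 2000.0) // 100.0)     # x bin, 0..39
    j = int((y + 2000.0) // 100.0)     # y bin, 0..39
    k = int((theta % 360.0) // 5.0)    # theta bin, 0..71
    return k + 72 * j + 40 * 72 * i
```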

Figure 14. A part of the value function (cells with $0 \le \theta < 5$[deg])


We modify the calculation procedure of $Q_{\mathrm{PFC}}$ because the value function $V^*$ is represented on a discrete state space. We use the following calculation:

$$Q_{\mathrm{PFC}}(a, b) = \sum_{x^{(i)} \notin \mathcal{X}_f \ (i = 1, 2, \dots, N)} \frac{w^{(i)}}{V(s^{(i)})} \left[ \sum_{s'} P^a_{s^{(i)} s'} V(s') + 1 \right], \tag{24}$$

where $s^{(i)}$ is the discrete state to which $x^{(i)}$ belongs.
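On the discrete model, Eq. (24) becomes a sum of matrix-row products over the non-goal particles; `cell_index` is the mapping sketched above, and the remaining helper names are assumptions of this example.

```python
def q_pfc_discrete(particles, weights, a, P, V, cell_index, in_goal):
    """Eq. (24): Q_PFC on the discretized value function. P[a][s] is the
    row of transition probabilities from cell s under action a, and V is
    the vector produced by value iteration (zero at goal cells)."""
    q = 0.0
    for x, w in zip(particles, weights):
        if in_goal(x):                       # particles in X_f are excluded
            continue
        s = cell_index(*x)
        q += w / V[s] * (P[a][s] @ (V + 1.0))   # sum_s' P [V(s') + 1]
    return q
```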

4.3.3. Comparison of success rate and steps

We compared the methods evaluated in Sec. 4.1.8 and obtained the result shown in Table 4. 100 initial poses of the robot were chosen randomly before the trials, and each method was evaluated 100 times with these initial poses.

As shown in Table 4, the PFC method yields a better result than the QMDP value method both in the success rate and in the number of steps. We show the first five trials of the two methods in Fig. 15. In some trials with both methods, we observe that the robot moves from room to room (trials 1, 3, and 4 with the PFC method and trials 2 and 3 with the QMDP value method). As shown in Fig. 16, the robot takes actions based on one of the clusters of particles. Each cluster can be regarded as a hypothesis in the context of multiple hypothesis tracking [18]. When a hypothesis is discarded, another is checked. This procedure generates the patrol behavior.

In trials with the QMDP value method, the robot sometimes steps away from the goal even though it is near the goal (trials 1, 2, and 5 in Fig. 15). This ineffective behavior increases the average number of steps. This goal-avoidance behavior happens when the weight of the cluster around the goal is reduced by Eq. (18): since the cluster around the goal loses its weight, the robot takes actions based on the particles in the other clusters and leaves the goal.

Figure 15. Trajectories obtained by PFC and QMDP

4.4. Discussion

The behaviors of the robot generated by the PFC method were evaluated with three types of tasks in Sec. 4.1-4.3. In every task, the PFC method yields a better result than the QMDP value method owing to the ability of Eq. (13). Here we discuss what is overcome by Eq. (13) in each task and, from the discussion, point to the possibilities of Eq. (13) in other cases that are not tried in this paper.


Figure 16. Trajectory of the 4th trial with the PFC method. The dotted arrows roughly show the motion of the robot. Particles bounded by a circle have large weights for decision making.

Table 4. Comparison of PFC, QMDP, and some other situations in the corridor environment

| method | successful trials (cutoff: 5000[step]) | avg. number of steps in successful trials | avg. number of steps in all trials |
|---|---|---|---|
| PFC | 91% | 832[step] | 1208[step] |
| QMDP | 80% | 1659[step] | 2328[step] |
| decision based on $x^*$ | 99% | 268[step] | 317[step] |
| decision based on $\bar{x}$ | 25% | 641[step] | 3910[step] |

In Sec. 4.1, the robot can finish a task defined in a three-dimensional space even though the self-localization result carries only two-dimensional information during part of the task. The lack of one dimension of information is covered by the cyclic behavior of the robot. If this phenomenon also appears in higher-dimensional cases, Eq. (13) will be effective, for example, for multi-agent systems handled in the Dec-POMDP (decentralized POMDP) framework [20]. Since each agent does not have information about the others in this framework, the available information has a smaller dimension than the state space. However, further experiments are needed to investigate this possibility.

In Sec. 4.2, we evaluate the methods in a case where the uncertainty of self-localization exceeds what is required to reach the goal. In [11], it is reported that the QMDP value method can make a goalkeeper robot for RoboCup wait for additional self-localization information when the robot does not have enough information to enter its goal without collision. The deadlocks of the QMDP value method in Sec. 4.2 correspond to this waiting behavior. A robot with the PFC method is more active: it performs a goal-search behavior even when the information is insufficient. A lack of information near the end of a task sometimes happens in applications of actual robots, and the PFC method, or Eq. (13), will be effective in such cases.

Another possible application of Eq. (13) is its use with learning methods such as Sarsa(λ) or Q(λ) in [14] when these methods are applied to POMDPs. In these methods, a robot chooses its action based on an action-value function in the learning phase. Eq. (13) can be applied to this decision making even if the action-value function has not converged.

In Sec. 4.3, the PFC method with Eq. (13) makes the robot move from room to room so as to check more than one hypothesis of its pose sequentially. The PFC method gives a simple but appropriate criterion for when one hypothesis should be discarded and another verified. It is significant that we can represent this criterion merely by adding a division by $V$ to Eq. (11).


5. Conclusion

We have modified a formula used in the QMDP value method and proposed the probabilistic flow control (PFC) method. The PFC method has been applied to simulated POMDP problems in which the state of the robot is essentially uncertain. Our findings are as follows.

• The PFC method, which uses Eq. (13), is more effective than the QMDP value method in the three simulated conditions. The success rates are 96[%] with the PFC method and 27[%] with the QMDP value method in the simulation with one landmark. In the simulation without a landmark, the limit value of $r$ at which the PFC method keeps a success rate above 50[%] is $r = 500$, while that of the QMDP value method is $r = 100$. In the simulation with the corridor environment, the average number of steps with the PFC method is half of that with the QMDP value method.

• The robot with the PFC method shows tricky behaviors that are never explicitly encoded in the value function. In a metaphorical sense, the robot changes its behavior in accordance with the uncertainty of its information.

As future work, the use of Eq. (13) with various POMDP methods should be investigated, as mentioned in Sec. 4.4.

References

[1] Kaelbling L, et al. Planning and acting in partially observable stochastic domains. Artificial Intelligence. 1998;101(1-2):99-134.
[2] Roy N, et al. Coastal navigation - mobile robot navigation with uncertainty in dynamic environments. In: Proc. of IEEE ICRA; 1999. p. 35-40.
[3] Thrun S, et al. Probabilistic Robotics. MIT Press; 2005.
[4] Latombe J.C. Robot Motion Planning. Boston, MA: Kluwer Academic Publishers; 1991.
[5] LaValle S.M., Kuffner J.J. Randomized kinodynamic planning. In: Proc. of IEEE International Conference on Robotics and Automation; 1999. p. 473-479.
[6] Bellman R. Dynamic Programming. Princeton, NJ: Princeton University Press; 1957.
[7] Silver D, Veness J. Monte-Carlo planning in large POMDPs. In: NIPS; Vol. 23; 2010. p. 2164-2172.
[8] Bonet B, Geffner H. Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI; 2009. p. 1641-1646.
[9] Ong S.C., et al. Planning under uncertainty for robotic tasks with mixed observability. The International Journal of Robotics Research. 2010;29(8):1053-1068.
[10] Littman M.L., et al. Learning policies for partially observable environments: scaling up. In: Proceedings of International Conference on Machine Learning; 1995. p. 362-370.
[11] Ueda R, et al. Mobile robot navigation based on expected state value under uncertainty of self-localization. In: Proc. of IROS; 2003.
[12] Ueda R, et al. Real-time decision making with state-value function under uncertainty of state estimation. In: Proc. of ICRA; 2005.
[13] Jitsukawa Y, et al. Fast decision making of autonomous robot under dynamic environment by sampling real-time Q-MDP value method. In: Proc. of IROS; 2007. p. 1644-1650.
[14] Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press; 1998.
[15] Fox D, et al. Particle filters for mobile robot localization. In: Doucet A, de Freitas N, Gordon N, editors. Sequential Monte Carlo Methods in Practice; 2000. p. 470-498.
[16] Doucet A, et al. Sequential Monte Carlo Methods in Practice. New York, NY: Springer-Verlag; 2001.
[17] Lenser S, Veloso M. Sensor resetting localization for poorly modelled robots. In: Proc. of IEEE ICRA; 2000. p. 1225-1232.
[18] Jensfelt P, Kristensen S. Active global localization for a mobile robot using multiple hypothesis tracking. IEEE Trans. on Robotics and Automation. 2001;17(5):748-760.
[19] Ueda R, et al. Expansion resetting for recovery from fatal error in Monte Carlo localization - comparison with sensor resetting methods. In: Proc. of IROS; 2004. p. 2481-2486.
[20] Aşık O, Akın L. Solving multi-agent decision problems modeled as Dec-POMDP: a robot soccer case study. In: RoboCup 2012: Robot Soccer World Cup XVI. Lecture Notes in Computer Science, Vol. 7500; 2013. p. 130-140.