Equi-Reward Utility Maximizing Design in Stochastic Environments
Sarah Keren, Luis Pineda, Avigdor Gal, Erez Karpas, Shlomo Zilberstein
Technion–Israel Institute of Technology
College of Information and Computer Sciences, University of Massachusetts Amherst
sarahn@technion.ac.il, lpineda@cs.umass.edu, avigal@ie.technion.ac.il, karpase@technion.ac.il, shlomo@cs.umass.edu
Abstract
We present the Equi-Reward Utility Maximizing Design (ER-UMD) problem for redesigning stochastic environments to maximize agent performance. ER-UMD fits well contemporary applications that require offline design of environments where robots and humans act and cooperate. To find an optimal modification sequence we present two novel solution techniques: a compilation that embeds design into a planning problem, allowing use of off-the-shelf solvers to find a solution, and a heuristic search in the modifications space, for which we present an admissible heuristic. Evaluation shows the feasibility of the approach using standard benchmarks from the probabilistic planning competition and a benchmark we created for a vacuum cleaning robot setting.
1 Introduction
We are surrounded by physical and virtual environments with a controllable design. Hospitals are designed to minimize the daily distance covered by staff, computer networks are structured to maximize message throughput, human-robot assembly lines are designed to maximize productivity, etc. Common to all these environments is that they are designed with the intention of maximizing some user benefit while accounting for different forms of uncertainty.
Typically, design is performed manually, often leading to far-from-optimal solutions. We therefore suggest automating the design process and formulate the Equi-Reward Utility Maximizing Design (ER-UMD) problem, in which a system controls the environment by applying a sequence of modifications in order to maximize agent utility.
We assume a fully observable stochastic setting and use Markov decision processes [Bellman, 1957] to model the agent environment. We exploit the alignment of system and agent utility to show a compilation of the design problem into a planning problem and piggyback on the search for an optimal policy to find an optimal sequence of modifications. In addition, we exploit the structure of the offline design process and offer a heuristic search in the modifications space to yield optimal design strategies. We formulate the conditions for heuristic admissibility and propose an admissible heuristic based on environment simplification. Finally, for settings where practicality is prioritized over optimality, we present a way to efficiently acquire sub-optimal solutions.
The contributions of this work are threefold. First, we formulate the ER-UMD problem as a special case of environment design [Zhang et al., 2009]. ER-UMD supports arbitrary modification methods. Particularly, for stochastic settings, we propose modifying probability distributions, an approach which offers a wide range of subtle environment modifications. Second, we present two new approaches for solving ER-UMD problems, specify the conditions for acquiring an optimal solution and present an admissible heuristic to support the solution. Finally, we evaluate our approaches given a design budget, using probabilistic benchmarks from the International Planning Competitions, where a variety of stochastic shortest path MDPs are introduced [Bertsekas, 1995], and on a domain we created for a vacuum cleaning robot. We show how redesign substantially improves expected utility, expressed via reduced cost, achieved with a small modification budget. Moreover, the techniques we develop outperform the exhaustive approach, reducing computation effort by up to 30%.
The remainder of the paper is organized as follows. Section 2 describes the ER-UMD framework. In Section 3, we describe our novel techniques for solving the ER-UMD problem. Section 4 describes an empirical evaluation followed by related work (Section 5) and concluding remarks (Section 6).
2 Equi-Reward Utility Maximizing Design
The equi-reward utility maximizing design (ER-UMD) problem takes as input an environment with stochastic action outcomes, a set of allowed modifications, and a set of constraints, and finds an optimal sequence of modifications (atomic changes such as additions and deletions of environment elements) to apply to the environment for maximizing agent expected utility under the constraints. We refer to sequences rather than sets to support settings where different application orders impact the model differently. Such a setting may involve, for example, modifications that add preconditions necessary for the application of other modifications (e.g., a docking station can only be added after adding a power outlet).
Figure 1: An example of an ER-UMD problem
We consider stochastic environments defined by the quadruple $\epsilon = \langle S, A, f, s_0 \rangle$ with a set of states $S$, a set of actions $A$, a stochastic transition function $f : S \times A \times S \rightarrow [0,1]$ specifying the probability $f(s, a, s')$ of reaching state $s'$ after applying action $a$ in $s \in S$, and an initial state $s_0 \in S$. We let $\mathcal{E}$, $S_\mathcal{E}$ and $A_\mathcal{E}$ denote the set of all environments, states and actions, respectively. Adopting the notation of Zhang and Parkes (2008) for environment design, we define the ER-UMD model as follows.

Definition 1 An equi-reward utility maximizing (ER-UMD) model $\omega$ is a tuple $\langle \epsilon^0_\omega, \mathcal{R}_\omega, \gamma_\omega, \Delta_\omega, \mathcal{F}_\omega, \Phi_\omega \rangle$ where:
• $\epsilon^0_\omega \in \mathcal{E}$ is an initial environment.
• $\mathcal{R}_\omega : S_\mathcal{E} \times A_\mathcal{E} \times S_\mathcal{E} \rightarrow \mathbb{R}$ is a Markovian and stationary reward function, specifying the reward $r(s, a, s')$ an agent gains from transitioning from state $s$ to $s'$ by the execution of $a$.
• $\gamma_\omega$ is a discount factor in $(0,1]$, representing the depreciation of agent rewards over time.
• $\Delta_\omega$ is a finite set of atomic modifications a system can apply. A modification sequence is an ordered set of modifications $\vec{\Delta} = \langle \Delta_1, \ldots, \Delta_n \rangle$ s.t. $\Delta_i \in \Delta_\omega$. We denote by $\vec{\Delta}_\omega$ the set of all such sequences.
• $\mathcal{F}_\omega : \Delta_\omega \times \mathcal{E} \rightarrow \mathcal{E}$ is a deterministic modification transition function, specifying the result of applying a modification to an environment.
• $\Phi_\omega : \vec{\Delta}_\omega \times \mathcal{E} \rightarrow \{0,1\}$ is an indicator that specifies the allowed modification sequences in an environment.

Whenever $\omega$ is clear from the context we use $\epsilon^0$, $\mathcal{R}$, $\gamma$, $\Delta$, $\mathcal{F}$, and $\Phi$. Note that a reward becomes a cost when negative.
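To ground Definition 1, the following minimal Python sketch shows one possible in-memory representation of an ER-UMD model. It is illustrative only, not the authors' implementation; all names and the dictionary-based transition encoding are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Modification = Hashable

@dataclass
class Environment:
    """A stochastic environment epsilon = <S, A, f, s0>."""
    states: List[State]
    actions: List[Action]
    # transitions[(s, a)] is a list of (s', f(s, a, s')) pairs summing to 1
    transitions: Dict[Tuple[State, Action], List[Tuple[State, float]]]
    initial_state: State

@dataclass
class ERUMDModel:
    """An ER-UMD model omega = <eps0, R, gamma, Delta, F, Phi> (Definition 1)."""
    initial_env: Environment                                      # eps0
    reward: Callable[[State, Action, State], float]               # R(s, a, s')
    gamma: float                                                  # discount factor in (0, 1]
    modifications: List[Modification]                             # Delta
    apply_mod: Callable[[Modification, Environment], Environment] # F
    allowed: Callable[[List[Modification], Environment], bool]    # Phi
```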
The reward function $\mathcal{R}$ and discount factor $\gamma$ form, together with an environment $\epsilon \in \mathcal{E}$, an infinite horizon discounted reward Markov decision process (MDP) [Bertsekas, 1995] $\langle S, A, f, s_0, \mathcal{R}, \gamma \rangle$. The solution of an MDP is a control policy $\pi : S \rightarrow A$ describing the appropriate action to perform at each state. We let $\Pi_\epsilon$ represent the set of all possible policies in $\epsilon$. Optimal policies $\Pi^*_\epsilon \subseteq \Pi_\epsilon$ yield maximal expected accumulated reward for every state $s \in S$ [Bellman, 1957]. We assume agents are optimal and let $V^*(\omega)$ represent the discounted expected agent reward of following an optimal policy from the initial state $s_0$ in a model $\omega$.
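For an explicitly enumerated environment, $V^*(\omega)$ can be computed with standard value iteration. The sketch below is a textbook routine included only for concreteness (the paper itself solves its MDPs with dedicated planners); the transition representation follows the assumptions of the previous sketch.

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float]]]

def optimal_value(states: List[State], actions: List[Action], transitions: Transitions,
                  reward: Callable[[State, Action, State], float], s0: State,
                  gamma: float = 0.95, tol: float = 1e-6) -> float:
    """Return V*(s0): the optimal discounted expected reward from the initial state."""
    V: Dict[State, float] = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in states:
            q_values = []
            for a in actions:
                outcomes = transitions.get((s, a))
                if not outcomes:
                    continue  # action a is not applicable in s
                q_values.append(sum(p * (reward(s, a, s2) + gamma * V[s2])
                                    for s2, p in outcomes))
            if q_values:
                best = max(q_values)
                residual = max(residual, abs(best - V[s]))
                V[s] = best
        if residual < tol:
            return V[s0]
```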
Modifications $\Delta \in \Delta_\omega$ can be defined arbitrarily, supporting all the changes applicable to a deterministic environment [Herzig et al., 2014]. For example, we can allow adding a transition between previously disconnected states. Particular to a stochastic environment is the option of modifying the transition function by increasing and decreasing the probability of specific outcomes. Each modification may be associated with a system cost $\mathcal{C} : \Delta \rightarrow \mathbb{R}^+$ and a sequence cost $\mathcal{C}(\vec{\Delta}) = \sum_{\Delta_i \in \vec{\Delta}} \mathcal{C}(\Delta_i)$. Given a sequence $\vec{\Delta}$ such that $\Phi(\vec{\Delta}, \epsilon) = 1$ (i.e., $\vec{\Delta}$ can be applied to $\epsilon \in \mathcal{E}$) we let $\epsilon_{\vec{\Delta}}$ represent the environment that is the result of applying $\vec{\Delta}$ to $\epsilon$, and $\omega_{\vec{\Delta}}$ is the same model with $\epsilon_{\vec{\Delta}}$ as its initial environment.
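In code, the modified environment $\epsilon_{\vec{\Delta}}$ and the sequence cost $\mathcal{C}(\vec{\Delta})$ amount to a simple fold over the sequence, as in the short sketch below; apply_mod and cost are hypothetical callables standing in for $\mathcal{F}$ and $\mathcal{C}$.

```python
from typing import Callable, Hashable, List

Environment = Hashable    # opaque environment description
Modification = Hashable

def modified_environment(env: Environment, seq: List[Modification],
                         apply_mod: Callable[[Modification, Environment], Environment]
                         ) -> Environment:
    """Return epsilon_seq: the environment obtained by applying the modifications in order."""
    for mod in seq:
        env = apply_mod(mod, env)
    return env

def sequence_cost(seq: List[Modification],
                  cost: Callable[[Modification], float]) -> float:
    """C(seq): the sum of the individual modification costs."""
    return sum(cost(mod) for mod in seq)
```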
The solution to an ER-UMD problem is a modification sequence $\vec{\Delta} \in \vec{\Delta}_\omega$ to apply to $\epsilon^0_\omega$ that maximizes agent utility $V^*(\omega_{\vec{\Delta}})$ under the constraints, formulated as follows.

Problem 1 Given a model $\omega = \langle \epsilon^0, \mathcal{R}, \gamma, \Delta, \mathcal{F}, \Phi \rangle$, the ER-UMD problem finds a modification sequence

$\vec{\Delta}^* \in \operatorname{argmax}_{\vec{\Delta} \in \vec{\Delta}_\omega \mid \Phi(\vec{\Delta})=1} V^*(\omega_{\vec{\Delta}})$

We let $\vec{\Delta}^*_\omega$ represent the set of solutions to Problem 1 and $V^*_{max}(\omega) = \max_{\vec{\Delta} \in \vec{\Delta}_\omega \mid \Phi(\vec{\Delta})=1} V^*(\omega_{\vec{\Delta}})$ represent the maximal agent utility achievable via design in $\omega$. In particular, we seek solutions $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$ that minimize design cost $\mathcal{C}(\vec{\Delta}^*)$.
Example 1 As an example of a controllable environment where humans and robots co-exist, consider Figure 1 (left), where a vacuum cleaning robot is placed in a living room. The set $\mathcal{E}$ of possible environments specifies possible room configurations. The robot's utility, expressed via the reward $\mathcal{R}$ and discount factor $\gamma$, may be defined in various ways; it may try to clean an entire room as quickly as possible or cover as much space as possible before its battery runs out. (Re)moving a piece of furniture from or within the room (Figure 1 (center)) may impact the robot's utility. For example, removing a chair from the room may create a shortcut to a specific location but may also create access to a corner the robot may get stuck in. Accounting for uncertainty, there may be locations in which the robot tends to slip, ending up in a different location than intended. Increasing friction, e.g., by introducing a high friction tile (Figure 1 (right)), may reduce the probability of undesired outcomes. All types of modifications, expressed by $\Delta$ and $\mathcal{F}$, are applied offline (since such robots typically perform their task unsupervised) and should be applied economically in order to maintain usability of the environment. These types of constraints are reflected by $\Phi$, which can restrict the design process by a predefined budget or by disallowing specific room configurations.
3 Finding $\vec{\Delta}^*$
A baseline method for finding an optimal modification sequence involves applying an exhaustive best first search (BFS) in the space of allowed sequences and selecting one that maximizes system utility. This approach was used for finding the optimal set of modifications in a goal recognition design setting [Keren et al., 2014; Wayllace et al., 2016]. The state space pruning applied there assumes that disallowing actions is the only allowed modification, making it inapplicable to ER-UMD, which supports arbitrary modification methods. We therefore present two novel techniques for finding an optimal design strategy for ER-UMD.
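A sketch of this exhaustive baseline is given below: it enumerates all modification sequences up to a budget, filters them through $\Phi$, and keeps the one with the best agent utility. The callbacks is_allowed and value_of are assumptions standing in for $\Phi$ and for solving the modified MDP.

```python
from itertools import product
from typing import Callable, Hashable, List, Sequence, Tuple

Modification = Hashable

def exhaustive_design(modifications: Sequence[Modification], budget: int,
                      is_allowed: Callable[[List[Modification]], bool],
                      value_of: Callable[[List[Modification]], float]
                      ) -> Tuple[List[Modification], float]:
    """Brute-force baseline: return the allowed modification sequence (length <= budget)
    maximizing value_of(seq), where value_of(seq) computes V*(omega_seq)."""
    best_seq: List[Modification] = []
    best_value = value_of([])                 # utility of the unmodified environment
    for length in range(1, budget + 1):
        for seq in product(modifications, repeat=length):
            candidate = list(seq)
            if not is_allowed(candidate):     # Phi filters invalid sequences
                continue
            value = value_of(candidate)
            if value > best_value:
                best_seq, best_value = candidate, value
    return best_seq, best_value
```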
3.1 ER-UMD Compilation to Planning
As a first approach, we embed design into a planning problem description. The DesignComp compilation (Definition 2) extends the agent's underlying MDP by adding pre-process operators that modify the environment off-line. After initialization, the agent acts in the new optimized environment.
The compilation uses the PPDDL notation [Younes and Littman, 2004], which uses a factored MDP representation. Accordingly, an environment $\epsilon \in \mathcal{E}$ is represented as a tuple $\langle X, s_0, A \rangle$ with states specified as a combination of state variables $X$ and a transition function embedded in the description of actions. Action $a \in A$ is represented by $\langle prec, \langle p_1, add_1, del_1 \rangle, \ldots, \langle p_m, add_m, del_m \rangle \rangle$, where $prec$ is the set of literals that need to be true as a precondition for applying $a$. The probabilistic effects $\langle p_1, add_1, del_1 \rangle, \ldots, \langle p_m, add_m, del_m \rangle$ are represented by $p_i$, the probability of the $i$-th effect. When outcome $i$ occurs, $add_i$ and $del_i$ are literals, added and removed from the state description, respectively [Mausam and Kolobov, 2012].
The policy of the compiled planning problem has two stages: design, in which the system is modified, and execution, describing the policy agents follow to maximize utility. Accordingly, the compiled domain has two action types: $A_{des}$, corresponding to modifications applied by the design system, and $A_{exe}$, executed by the agent. To separate between the stages we use a fluent execution, initially false to allow the application of $A_{des}$, and a no-cost action $a_{start}$ that sets execution to true, rendering $A_{exe}$ applicable.
The compilation process supports two modification types. Modifications $\Delta_X$ change the initial state by modifying the value of state variables $X_\Delta \subseteq X$. Modifications $\Delta_A$ change the action set by enabling actions $A_\Delta \subseteq A$. Accordingly, the definition includes a set of design actions $A_{des} = A_{des\text{-}s_0} \cup A_{des\text{-}A}$, where $A_{des\text{-}s_0}$ are actions that change the initial value of variables and $A_{des\text{-}A}$ includes actions $A_\Delta$ that are originally disabled but can be enabled in the modified environment. In particular, we include in $A_\Delta$ actions that share the same structure as actions in the original environment except for a modified probability distribution.
The following definition of DesignComp supports a design budget $B$, implemented using a timer mechanism as in [Keren et al., 2015]. The timer advances with up to $B$ design actions that can be applied before performing $a_{start}$. This constraint is represented by $\Phi^B$, which returns $0$ for any modification sequence that exceeds the given budget.
Definition 2 For an ER-UMD problem $\omega = \langle \epsilon^0_\omega, \mathcal{R}_\omega, \gamma_\omega, \Delta_\omega, \mathcal{F}_\omega, \Phi^B_\omega \rangle$, where $\Delta_\omega = \Delta_X \cup \Delta_A$, we create a planning problem $P' = \langle X', s'_0, A', \mathcal{R}', \gamma' \rangle$, where:
• $X' = X_{\epsilon^0_\omega} \cup \{execution\} \cup \{time_t \mid t \in 0, \ldots, B\} \cup \{enabled_a \mid a \in A_\Delta\}$
• $s'_0 = \{s_{0,\epsilon^0_\omega}\} \cup \{time_0\}$
• $A' = A_{exe} \cup A_{des\text{-}s_0} \cup A_{des\text{-}A} \cup \{a_{start}\}$ where
  – $A_{exe} = A_{\epsilon^0} \cup A_\Delta$ s.t. $\{\langle prec(a) \wedge execution, \mathit{eff}(a) \rangle \mid a \in A_{\epsilon^0}\} \cup \{\langle prec(a) \wedge execution \wedge enabled_a, \mathit{eff}(a) \rangle \mid a \in A_\Delta\}$
  – $A_{des\text{-}s_0} = \{\langle \langle \neg execution, time_i \rangle, \langle 1, \langle x, time_{i+1} \rangle, \langle time_i \rangle \rangle \rangle \mid x \in X_\Delta\}$
  – $A_{des\text{-}A} = \{\langle \langle \neg execution, time_i \rangle, \langle 1, \langle enabled_a, time_{i+1} \rangle, \langle time_i \rangle \rangle \rangle \mid a \in A_\Delta\}$
  – $a_{start} = \langle \emptyset, \langle 1, execution, \emptyset \rangle \rangle$
• $\mathcal{R}'(a) = \mathcal{R}(a)$ if $a \in A_{exe}$, and $0$ if $a \in A_{des} \cup \{a_{start}\}$
• $\gamma' = \gamma$
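The following schematic sketch illustrates the two-phase structure that DesignComp induces, using an explicit (non-factored) state encoding: a compiled state carries the execution flag, the timer, the set of enabled actions, and the agent's state. It is a simplified reading of Definition 2 under several assumptions (only $\Delta_A$-style modifications, dictionary-encoded transitions, illustrative names), not the authors' PPDDL encoding.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Tuple

State = Hashable
Action = Hashable
# A compiled state of P': (execution flag, timer value, enabled actions, agent state)
CompiledState = Tuple[bool, int, FrozenSet[Action], State]

def compiled_step(state: CompiledState, action: Action,
                  base_transitions: Dict[Tuple[State, Action], List[Tuple[State, float]]],
                  design_actions: Dict[Action, Action], budget: int
                  ) -> Optional[List[Tuple[CompiledState, float]]]:
    """One transition of the compiled problem P'. design_actions maps each design
    action to the agent action it enables (the Delta_A modifications). Returns the
    outcome distribution, or None if the action is not applicable."""
    execution, t, enabled, s = state

    if not execution:
        # Design phase: apply a design action (advancing the timer) or start execution.
        if action == "a_start":
            return [((True, t, enabled, s), 1.0)]
        if action in design_actions and t < budget:
            return [((False, t + 1, enabled | {design_actions[action]}, s), 1.0)]
        return None

    # Execution phase: agent actions; enabled_a guards the design-enabled actions.
    if action in set(design_actions.values()) and action not in enabled:
        return None
    outcomes = base_transitions.get((s, action))
    if outcomes is None:
        return None
    return [((True, t, enabled, s2), p) for s2, p in outcomes]

def compiled_reward(action: Action, base_reward: Dict[Action, float],
                    design_cost: float = 0.0) -> float:
    """R': original reward for execution actions, zero (or a small cost c_d) otherwise."""
    return base_reward.get(action, -design_cost)
```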
Optimally solving the compiled problem $P'$ yields an optimal policy $\pi^*_{P'}$ with two components, separated by the execution of $a_{start}$. The initialization component consists of a possibly empty sequence of deterministic design actions denoted by $\vec{\Delta}_{P'}$, while the execution component represents the optimal policy in the modified environment.
The next two propositions establish the correctness of the compilation. Proofs are omitted due to space constraints. We first argue that $V^*(P')$, the expected reward from the initial state in the compiled planning problem, is equal to the expected reward in the optimal modified environment.
Lemma 1 Given an ER-UMD problem $\omega$ and an optimal modification sequence $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$, $V^*(P') = V^*(\omega_{\vec{\Delta}^*})$.
An immediate corollary is that the compilation outcome is indeed an optimal sequence of modifications.
Corollary 1 Given an ER-UMD problem $\omega$ and the compiled model $P'$, $\vec{\Delta}_{P'} \in \vec{\Delta}^*_\omega$.
The reward function $\mathcal{R}'$ assigns zero cost to all design actions $A_{des}$. To ensure the compilation not only respects the budget $B$ but also minimizes design cost, we can assign a small cost (negative reward) $c_d$ to design actions $A_{des}$. If $c_d$ is too high, it might lead the solver to omit design actions that improve utility by less than $c_d$. However, the loss of utility will be at most $c_d \cdot B$. Thus, by bounding the minimum improvement in utility from a modification, we can still ensure optimality.
Figure 2: Search space of an ER-UMD problem (design component: system policy; execution component: agent policy)
3.2 Design as Informed Search
The key benefit of compiling ER-UMD to planning is the ability to use any off-the-shelf solver to find a design strategy. However, this approach does not fully exploit the special characteristics of the off-line design setting we address. We therefore observe that embedding design into the definition of a planning problem results in an MDP with a special structure, depicted in Figure 2. The search for an optimal redesign policy is illustrated as a tree comprising two components. The design component, at the top of the figure, describes the deterministic offline design process, with nodes representing the different possibilities of modifying the environment. The execution component, at the bottom of the figure, represents the stochastic modified environments in which agents act.
Each design node represents a different ER-UMD model, characterized by the sequence $\vec{\Delta}$ of modifications that has been applied to the environment and a constraint set $\Phi$ specifying the allowed modifications in the subtree rooted at the node. With the original ER-UMD problem $\omega$ at the root, each successor design node represents a sub-problem $\omega_{\vec{\Delta}}$ of the ancestor ER-UMD problem, accounting for all modification sequences that have $\vec{\Delta}$ as their prefix. The set of constraints of the successors is updated with relation to the parent node. For example, when a design budget is specified, it is reduced when moving down the tree from a node to its successor. When a design node is associated with a valid modification, it is connected to a leaf node representing an ER-UMD model with the environment $\epsilon_{\vec{\Delta}}$ that results from applying the modification. To illustrate, invalid modification sequences are crossed out in Figure 2.
Using this search tree, we propose an informed search in the space of allowed modifications, using heuristic estimations to guide the search more effectively by focusing attention on more promising redesign options.
Algorithm 1 Best First Design (BFD)
BFD($\omega$, $h$)
1: create OPEN list for unexpanded nodes
2: $n_{cur} = \langle design, \vec{\Delta}_\emptyset \rangle$ (initial model)
3: while $n_{cur}$ do
4:   if IsExecution($n_{cur}$) then
5:     return $n_{cur}.\vec{\Delta}$ (best modification found - exit)
6:   end if
7:   for each $n_{suc} \in$ GetSuccessors($n_{cur}, \omega$) do
8:     put $\langle \langle design, n_{suc}.\vec{\Delta} \rangle, h(n_{suc}) \rangle$ in OPEN
9:   end for
10:  if $\Phi_\omega(n_{cur}.\vec{\Delta}) = 1$ then
11:    put $\langle \langle execution, \vec{\Delta}_{new} \rangle, V^*(\omega_{\vec{\Delta}_{new}}) \rangle$ in OPEN
12:  end if
13:  $n_{cur}$ = ExtractMax(OPEN)
14: end while
15: return error
The Best First Design (BFD) algorithm (detailed in Algorithm 1) accepts as input an ER-UMD model $\omega$ and a heuristic function $h$. The algorithm starts by creating an OPEN priority queue (line 1) holding the frontier of unexpanded nodes. In line 2, $n_{cur}$ is assigned the original model, which is represented by a flag design and the empty modification sequence $\vec{\Delta}_\emptyset$.
The iterative exploration of the currently most promising node in the OPEN queue is given in lines 3-14. If the current node represents an execution model (indicated by the execution flag) the search ends successfully in line 5, returning the modification sequence associated with the node. Otherwise, the successor design nodes of the current node are generated by GetSuccessors in line 7. Each successor sub-problem $n_{suc}$ is placed in the OPEN list with its associated heuristic value $h(n_{suc})$ (line 8), to be discussed in detail next. In addition, if the modification sequence $n_{cur}.\vec{\Delta}$ associated with the current node is valid according to $\Phi$, an execution node is generated and assigned a value that corresponds to the actual value $V^*(\omega_{\vec{\Delta}_{new}})$ in the resulting environment (lines 10-12). The next node to explore is extracted from OPEN in line 13.
Both termination and completeness of the algorithm depend on the implementation of GetSuccessors, which controls the graph search strategy by generating the sub-problem design nodes related to the current node. For example, when a modification budget is specified, GetSuccessors generates a sub-problem for every modification that is appended to the sequence $\vec{\Delta}$ of the parent node, discarding sequences that violate the budget and updating it for the valid successors.
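A compact Python sketch of BFD is given below. The priority queue stores both design nodes (valued by the heuristic) and execution nodes (valued by the true utility of the modified model); the budget-based GetSuccessors is hard-coded for simplicity, and heuristic and true_value are assumed callbacks (e.g., the simplified-environment heuristic and an MDP solver).

```python
import heapq
from itertools import count
from typing import Callable, Hashable, List, Sequence, Tuple

Modification = Hashable

def best_first_design(modifications: Sequence[Modification], budget: int,
                      is_allowed: Callable[[List[Modification]], bool],
                      heuristic: Callable[[Tuple[Modification, ...]], float],
                      true_value: Callable[[Tuple[Modification, ...]], float]
                      ) -> Tuple[List[Modification], float]:
    """Sketch of Algorithm 1: best-first search over modification sequences."""
    tie = count()   # tie-breaker so heapq never compares sequences directly
    # OPEN holds (-value, tie, kind, sequence); negation turns the min-heap into ExtractMax.
    open_list = [(-heuristic(()), next(tie), "design", ())]
    while open_list:
        neg_value, _, kind, seq = heapq.heappop(open_list)
        if kind == "execution":
            return list(seq), -neg_value          # lines 4-5: best modification found
        # Lines 7-9: expand successor design nodes (here: append one more modification).
        if len(seq) < budget:
            for mod in modifications:
                successor = seq + (mod,)
                heapq.heappush(open_list,
                               (-heuristic(successor), next(tie), "design", successor))
        # Lines 10-12: valid sequences also spawn an execution node with their true value.
        if is_allowed(list(seq)):
            heapq.heappush(open_list,
                           (-true_value(seq), next(tie), "execution", seq))
    raise RuntimeError("no allowed modification sequence found")   # line 15
```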
For optimality, we require the heuristic function $h$ to be admissible. An admissible estimate of a design node $n$ is one that never underestimates $V^*_{max}(\omega)$, the maximal system utility in the ER-UMD problem $\omega$ represented by $n_{cur}$.¹ Running BFD with an admissible heuristic is guaranteed to yield an optimal modification sequence.
¹ When utility is expressed as a cost, an admissible estimate never overestimates the real cost.
Theorem 1 Given an ER-UMD model $\omega$ and an admissible heuristic $h$, BFD($\omega$, $h$) returns $\vec{\Delta}_\omega \in \vec{\Delta}^*_\omega$.
The proof of Theorem 1 bears similarity to the proof of optimality of A* [Nilsson, 1980] and is omitted here for the sake of brevity.
The simplified-environment heuristic
To produce efficient over-estimations of the maximal system utility $V^*_{max}(\omega)$, we suggest a heuristic that requires a single pre-processing simplification of the original environment, used to produce estimates for the design nodes of the search.
Definition 3 Given an ER-UMD model $\omega$, a function $f : \mathcal{E} \rightarrow \mathcal{E}$ is an environment simplification in $\omega$ if for all $\epsilon, \epsilon' \in \mathcal{E}_\omega$ s.t. $\epsilon' = f(\epsilon)$, $V^*(\omega_\epsilon) \leq V^*(\omega_{f(\epsilon)})$, where $\omega_f$ is the ER-UMD model with $f(\epsilon)$ as its initial environment.
The simplified-environment heuristic, denoted by $h_{sim}$, estimates the value of applying a modification sequence $\vec{\Delta}$ to $\omega$ by the value of applying it to $\omega_f$:
$h_{sim}(\omega_{\vec{\Delta}}) \stackrel{\text{def}}{=} V^*_{max}(\omega^f_{\vec{\Delta}})$   (1)
The search applies modifications on the simplified model and uses its optimal solution as an estimate of the value of applying the modifications in the original setting. In particular, the simplified model can be solved using the DesignComp compilation presented in the previous section.
The literature is rich with simplification approaches, including adding macro actions that do more with the same cost, removing some action preconditions, eliminating negative effects of actions (delete relaxation), or eliminating undesired outcomes [Holte et al., 1996]. Particular to stochastic settings is the commonly used all outcome determinization [Yoon et al., 2007], which creates a deterministic action for each probabilistic outcome of every action.
Lemma 2 Given an ER-UMD model $\omega$, applying the simplified-environment heuristic with $f$ implemented as an all outcome determinization function is admissible.
The proof of Lemma 2, omitted for brevity, uses the observation that $f$ only adds solutions with higher reward (lower cost) to a given problem (either before or after redesign). A similar reasoning can be applied to the commonly used delete relaxation or any other approaches discussed above.
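As an illustration of such a simplification function $f$, the sketch below performs an all outcome determinization on a dictionary-encoded transition function: every probabilistic outcome of an action becomes its own deterministic action, so achievable rewards can only improve, which is the property Lemma 2 relies on. The encoding is an assumption carried over from the earlier sketches.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float]]]

def all_outcomes_determinization(transitions: Transitions) -> Transitions:
    """Return a deterministic transition function in which the i-th outcome of
    action a becomes a new deterministic action (a, i)."""
    det: Transitions = {}
    for (s, a), outcomes in transitions.items():
        for i, (s2, p) in enumerate(outcomes):
            if p > 0.0:
                det[(s, (a, i))] = [(s2, 1.0)]   # the chosen outcome now occurs with certainty
    return det
```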
Note that admissibility of a heuristic function depends on specific characteristics of the ER-UMD setting. In particular, the simplified-environment heuristic is not guaranteed to produce admissible estimates for policy teaching [Zhang and Parkes, 2008] or goal recognition design [Keren et al., 2014; Wayllace et al., 2016], where design is performed to incentivize agents to follow specific policies. This is because the relaxation itself may change the set of optimal agent policies and therefore underestimate the value of a modification.
           change init           reduce probability
BOX        relocate a truck      driving to wrong destination
BLOCK      --                    dropping a block or tower
EX-BLOCK   --                    as for Blocks World
TIRE       add spare tires       having a flat tire
ELEVATOR   add elevator shaft    falling to initial state
VACUUM     (re)move furniture    add high friction tile
Table 1: Allowed modifications for each domain
             B=1             B=2             B=3
          solved  reduc   solved  reduc   solved  reduc
BOX          8      28       8      42       7      44
BLOCK        6      21       3      24       3      24
EX-BLOCK    10      42       9      42       9      42
TIRE         9      44       8      51       6      54
ELEVATOR     9      22       7      24       1      18
VACUUM       8      15       6      17       0      --
Table 2: Utility improvement for optimal solvers
4 Empirical Evaluation
We evaluated the ability to maximize agent utility given a design budget in various ER-UMD problems, as well as the performance of both optimal and approximate techniques.
We used five PPDDL domains from the probabilistic tracks of the sixth and eighth International Planning Competition (http://icaps-conference.org/index.php/main/competitions), IPPC06 and IPPC08, representing stochastic shortest path MDPs with uniform action cost: Box World (IPPC08/BOX), Blocks World (IPPC08/BLOCK), Exploding Blocks World (IPPC08/EX-BLOCK), Triangle Tire (IPPC08/TIRE) and Elevators (IPPC06/ELEVATOR). In addition, we implemented the vacuum cleaning robot setting from Example 1 (VACUUM) as an adaptation of the Taxi domain [Dietterich, 2000]. The robot moves in a grid and collects pieces of dirt. It cannot move through furniture, represented by occupied cells, and may fail to move, remaining in its current position.
In all domains, agent utility is expressed as expected cost and constraints as a design budget. For each domain, we examined at least two possible modifications, including at least one that modifies the probability distribution. Modifications by domain are specified in Table 1, with change init marking modifications of the initial state and probability change marking modifications of the probability function.
4.1 Optimal Solutions
Setup For each domain, we created 10 smaller instances optimally solvable within a time bound of five minutes. Each instance was optimally solved using:
• EX: Exhaustive exploration of possible modifications.
• DC: Solution of the DesignComp compilation.
• BFD: Algorithm 1 with the simplified-environment heuristic, using the delete relaxation to simplify the model and the DesignComp compilation to optimally solve it.
We used a portfolio of three admissible heuristics:
• $h_0$ assigns 0 to all states and serves as a baseline for assessing the value of more informative heuristics.
• $h_{0+}$ assigns 1 to all non-goal states and 0 otherwise.
• $h_{MinMin}$ solves the all outcome determinization using the zero heuristic [Bonet and Geffner, 2005].
             Ex-h0      Ex-h0+     Ex-hMinMin  DC-h0      DC-h0+     DC-hMinMin  BFD-h0     BFD-h0+    BFD-hMinMin
BOX
B=1          158.4(8)   159.0(8)   158.9(8)    163.9(8)   70.7(8)    221.4(8)    157.4(8)   68.2(8)    216.4(8)
B=2          264.7(7)   264.9(7)   267.8(7)    270.6(7)   92.1(8)    332.7(7)    260.8(7)   88.0(8)    325.3(7)
B=3          238.5(4)   236.5(4)   235.6(4)    241.5(4)   73.5(4)    271.7(4)    234.3(4)   70.2(7)    265.94(4)
BLOCKS
B=1          50.5(6)    50.5(6)    50.8(6)     50.7(6)    41.7(6)    77.1(6)     50.3(6)    41.6(6)    74.4(6)
B=2          28.0(4)    28.2(4)    28.0(4)     28.2(4)    17.4(4)    36.4(3)     28.0(4)    17.2(4)    35.4(3)
B=3          348.9(2)   347.3(2)   348.2(2)    354.5(2)   194.6(3)   363.5(2)    352.2(2)   118.2(3)   354.8(2)
EX-BLOCKS
B=1          69.4(9)    70.2(9)    69.9(9)     68.4(9)    38.7(9)    6.7(10)     69.5(9)    40.3(9)    60.3(9)
B=2          161.7(9)   170.9(9)   168.1(9)    153.1(9)   88.2(9)    30.2(10)    153.9(9)   85.6(9)    135.0(9)
B=3          250.7(9)   265.9(9)   292.2(9)    252.5(9)   134.9(9)   88.8(8)     285.9(9)   160.9(9)   237.4(9)
TIRE
B=1          32.9(9)    32.9(9)    33.1(9)     33.3(9)    30.2(9)    36.8(9)     33.0(9)    29.5(9)    36.9(9)
B=2          55.2(7)    55.4(7)    55.0(7)     55.5(7)    51.1(8)    88.8(8)     55.0(7)    50.9(8)    89.1(8)
B=3          270.3(6)   136.5(6)   258.4(6)    269.7(6)   136.5(6)   258.4(6)    267.6(6)   188.3(6)   256.2(6)
ELEVATOR
B=1          300.4(8)   299.6(8)   301.6(8)    301.9(8)   236.2(9)   192.6(9)    302.6(8)   238.3(9)   176.6(9)
B=2          361.8(5)   360.9(5)   366.2(5)    363.4(5)   261.0(5)   243.89(5)   360.8(5)   258.6(5)   231.2(5)
B=3          na         na         na          na         1504.6(1)  1117.4(1)   na         1465.8(1)  1042.5(1)
VACUUM
B=1          0.15(9)    0.16(9)    0.15(9)     0.17(9)    0.099(9)   0.15(9)     0.15(9)    0.096(9)   0.13(9)
B=2          3.6(9)     3.27(9)    3.44(9)     3.25(9)    2.13(9)    2.49(9)     3.39(9)    2.021(9)   2.61(9)
B=3          na         na         na          na         na         na          na         na         na
Table 3: Running time and number of instances solved for optimal solvers
Each problem was tested on an Intel(R) Xeon(R) CPU X5690 machine with a budget of 1, 2, and 3. Design actions were assigned a cost of $10^{-4}$, and problems were solved using LAO* [Hansen and Zilberstein, 1998] with a convergence error bound of $10^{-6}$. Each run had a 30-minute time limit.
Results Separated by domain and budget, Table 2 summarizes the number of solved instances (solved) and the average percentage of expected cost reduction over the instances solved (reduc). In all domains, the complexity brought by an increased budget reduces the number of solved instances, while the actual reduction varies among domains. As for solution improvement, all domains show an improvement of 15% to 54%.
Table 3 compares solution performance. Each column represents a solver and heuristic pair. Results are separated by domain and budget, depicting the average running time for problems solved by all approaches for a given budget and the number of instances solved in parentheses (na indicates no instances were solved within the time limit). The dominating approach for each row (representing a domain and budget) is emphasized in bold. In all cases, the use of informed search outperformed the exhaustive approach.
4.2 Approximate Solutions
Setup For approximate solutions we used an MDP reduced model approach that creates simplified MDPs accounting for the full probabilistic model a bounded number of times (for each execution history), and treats the problem as deterministic afterwards [Pineda and Zilberstein, 2014]. The deterministic problems were solved using the FF classical planner [Hoffmann and Nebel, 2001], as explained in [Pineda and Zilberstein, 2017]. We used Monte-Carlo simulations to evaluate the policies' probability of reaching a goal state and their expected cost. In particular, we gave the planner 20 minutes to solve each problem 50 times. We used the first 10 instances of each competition domain mentioned above, excluding Box World, due to limitations of the planner. For the VACUUM domain we generated ten configurations of up to 5×7 grid size rooms, based on Figure 1.
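The evaluation protocol for the approximate solvers can be mirrored by a simple Monte-Carlo routine like the one below, which estimates a policy's probability of reaching a goal and its average cost over a number of simulated runs; the horizon cap, unit action cost, and function signatures are assumptions, not details taken from the paper.

```python
import random
from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable

def monte_carlo_evaluate(policy: Callable[[State], Action],
                         sample_step: Callable[[State, Action, random.Random], State],
                         s0: State, is_goal: Callable[[State], bool],
                         trials: int = 50, horizon: int = 1000,
                         action_cost: float = 1.0, seed: int = 0) -> Tuple[float, float]:
    """Estimate (probability of reaching a goal, average cost of successful runs)."""
    rng = random.Random(seed)
    successes, costs = 0, []
    for _ in range(trials):
        s, cost = s0, 0.0
        for _ in range(horizon):
            if is_goal(s):
                successes += 1
                costs.append(cost)
                break
            a = policy(s)
            s = sample_step(s, a, rng)      # sample a successor according to f(s, a, .)
            cost += action_cost
    p_goal = successes / trials
    avg_cost = sum(costs) / len(costs) if costs else float("inf")
    return p_goal, avg_cost
```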
Results Table 4 reports three measures (per budget): the number of problems completed within the allocated time (solved), the improved probability of reaching a goal of the resulting policies with respect to the policies obtained without design (δPs), and the average percentage of reduction in expected cost after redesign (reduc). δPs and reduc are averaged only over problems solved 50 times when using both the original and modified model. In general, we observed that redesign enables either improvement in expected cost or in probability of success (and sometimes both), across all budgets. For BLOCK and EX-BLOCK, a budget of 2 yielded the best results, while for ELEVATOR, TIRE, and VACUUM a budget of 3 was better. However, the increased difficulty of the compiled problem sometimes resulted in a lower number of solved problems (e.g., solving only 3 problems on TIRE with a budget of 3). Nevertheless, these results demonstrate the feasibility of obtaining good solutions when compromising optimality.
              B=1                   B=2                   B=3
          solved  δPs    reduc   solved  δPs    reduc   solved  δPs    reduc
BLOCK        8     0     19.1       8     0     21.2       8     0     18.6
EX-BLOCK    10     0.42   0        10     0.50   0        10     0.41   0
TIRE         7     0      6.98      7     0     17.9       3     0     33
ELEVATOR    10    -0.33  25        10     0.1   30        10     0.1   38.3
VACUUM      10     0.2    8.12     10     0.2    4.72     10     0.3    9.72
Table 4: Utility improvement for sub-optimal solver
4.3 Discussion
For all examined domains, results indicate the benefit of using heuristic search over an exhaustive search in the modification space. However, the dominating heuristic approach varied between domains, and for TIRE also between budget allocations. Investigating the reasons for this variance, we note a key difference between BFD and DC. While DC applies modifications to the original model, BFD uses the simplified-environment heuristic, which applies them to a simplified model. Poor performance of BFD can be due either to the minor effect the applied simplifications have on the computational complexity or to an exaggerated effect that may limit the informative value of the heuristic estimations. In particular, this could happen due to the overlap between the design process and the simplification.
To illustrate, by applying the all outcome determinization to the Vacuum domain depicted in Example 1, we ignore the undesired outcome of slipping. Consequently, the heuristic completely overlooks the value of adding high-friction tiles, while providing informative estimations of the value of (re)moving furniture. This observation may explain the poor performance of BFD on EX-BLOCKS, where simplification via the delete relaxation ignores the possibility of blocks exploding and overlooks the value of the proposed modifications. Therefore, estimations of BFD may be improved by developing a heuristic that uses the aggregation of several estimations. Also, when the order of application is immaterial, a closed list may be used for examined sets in the BFD approach but not with DC. Finally, a combination of relaxation approaches may enhance performance of sub-optimal solvers.
5 Related Work
Environment design [Zhang et al., 2009] provides a framework for an interested party (system) to seek an optimal way to modify an environment to maximize some utility. Among the many ways to instantiate the general model, policy teaching [Zhang and Parkes, 2008; Zhang et al., 2009] enables a system to modify the reward function of a stochastic agent to entice it to follow specific policies. We focus on a different special case where the system is altruistic and redesigns the environment in order to maximize agent utility. The techniques used for solving policy teaching do not apply to our setting, which supports arbitrary modifications.
The DesignComp compilation is inspired by the technique of Göbelbecker et al. (2010) of coming up with good excuses for why there is no solution to a planning problem. Our compilation extends the original approach in four directions. First, we move from a deterministic environment to a stochastic one. Second, we maximize agent utility rather than only moving from unsolvable to solvable. Third, we embed support of a design budget. Finally, we support arbitrary modification alternatives, including modifications specific to stochastic settings as well as all those suggested for deterministic settings [Herzig et al., 2014; Menezes et al., 2012].
6 Conclusions
We presented the ER-UMD framework for maximizing agent utility by the off-line design of stochastic environments. We presented two solution approaches: a compilation-based method that embeds design into the definition of a planning problem, and an informed heuristic search in the modification space, for which we provided an admissible heuristic. Our empirical evaluation supports the feasibility of the approaches and shows substantial utility gain on all evaluated domains.
In future work, we will explore creating tailored heuristics to improve planner performance. Also, we will extend the model to deal with partial observability using POMDPs, as well as automatically finding possible modifications, similarly to [Göbelbecker et al., 2010]. In addition, we plan to extend the offline design paradigm by accounting for online design that can be dynamically applied to a model.
Acknowledgements
The work was supported in part by the National Science Foundation grant number IIS-1405550.
References
[Bellman, 1957] Richard Bellman. A Markovian decision process. Indiana University Mathematics Journal, 6:679–684, 1957.
[Bertsekas, 1995] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
[Bonet and Geffner, 2005] Blai Bonet and Héctor Geffner. mGPT: A probabilistic planner based on heuristic search. Journal of Artificial Intelligence Research, 24:933–944, 2005.
[Dietterich, 2000] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[Göbelbecker et al., 2010] Moritz Göbelbecker, Thomas Keller, Patrick Eyerich, Michael Brenner, and Bernhard Nebel. Coming up with good excuses: What to do when no plan can be found. Cognitive Robotics, 2010.
[Hansen and Zilberstein, 1998] Eric A. Hansen and Shlomo Zilberstein. Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 412–418, 1998.
[Herzig et al., 2014] Andreas Herzig, Viviane Menezes, Leliane Nunes de Barros, and Renata Wassermann. On the revision of planning tasks. In Proceedings of the Twenty-First European Conference on Artificial Intelligence (ECAI), pages 435–440, 2014.
[Hoffmann and Nebel, 2001] Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
[Holte et al., 1996] Robert C. Holte, M. B. Perez, R. M. Zimmer, and Alan J. MacDonald. Hierarchical A*: Searching abstraction hierarchies efficiently. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), pages 530–535, 1996.
[Keren et al., 2014] Sarah Keren, Avigdor Gal, and Erez Karpas. Goal recognition design. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS), pages 154–162, 2014.
[Keren et al., 2015] Sarah Keren, Avigdor Gal, and Erez Karpas. Goal recognition design for non-optimal agents. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pages 3298–3304, 2015.
[Mausam and Kolobov, 2012] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI Perspective. Morgan & Claypool Publishers, 2012.
[Menezes et al., 2012] Maria Viviane Menezes, Leliane N. de Barros, and Silvio do Lago Pereira. Planning task validation. In ICAPS Workshop on Scheduling and Planning Applications (SPARK), pages 48–55, 2012.
[Nilsson, 1980] Nils J. Nilsson. Principles of Artificial Intelligence. Tioga Publishers, Palo Alto, California, 1980.
[Pineda and Zilberstein, 2014] Luis Pineda and Shlomo Zilberstein. Planning under uncertainty using reduced models: Revisiting determinization. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS), pages 217–225, 2014.
[Pineda and Zilberstein, 2017] Luis Pineda and Shlomo Zilberstein. Generalizing the role of determinization in probabilistic planning. Technical Report UM-CS-2017-006, University of Massachusetts Amherst, 2017.
[Wayllace et al., 2016] Christabel Wayllace, Ping Hou, William Yeoh, and Tran Cao Son. Goal recognition design with stochastic agent action outcomes. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), pages 3279–3285, 2016.
[Yoon et al., 2007] Sung Wook Yoon, Alan Fern, and Robert Givan. FF-Replan: A baseline for probabilistic planning. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling (ICAPS), pages 352–359, 2007.
[Younes and Littman, 2004] Håkan L. S. Younes and Michael L. Littman. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proceedings of the International Planning Competition (IPC), 2004.
[Zhang and Parkes, 2008] Haoqi Zhang and David Parkes. Value-based policy teaching with active indirect elicitation. In Proceedings of the Twenty-Third Conference on Artificial Intelligence (AAAI), pages 208–214, 2008.
[Zhang et al., 2009] Haoqi Zhang, Yiling Chen, and David Parkes. A general approach to environment design with one agent. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2008, 2009.