
Equi-Reward Utility Maximizing Design in Stochastic Environments

Sarah Keren†, Luis Pineda‡, Avigdor Gal†, Erez Karpas†, Shlomo Zilberstein‡

†Technion–Israel Institute of Technology

‡College of Information and Computer Sciences, University of Massachusetts Amherst

sarahn@technion.ac.il, lpineda@cs.umass.edu, avigal@ie.technion.ac.il, karpase@technion.ac.il, shlomo@cs.umass.edu

Abstract

We present the Equi-Reward Utility Maximizing Design (ER-UMD) problem for redesigning stochastic environments to maximize agent performance. ER-UMD fits well contemporary applications that require offline design of environments where robots and humans act and cooperate. To find an optimal modification sequence we present two novel solution techniques: a compilation that embeds design into a planning problem, allowing use of off-the-shelf solvers to find a solution, and a heuristic search in the modifications space, for which we present an admissible heuristic. Evaluation shows the feasibility of the approach using standard benchmarks from the probabilistic planning competition and a benchmark we created for a vacuum cleaning robot setting.

1 Introduction

We are surrounded by physical and virtual environments with a controllable design. Hospitals are designed to minimize the daily distance covered by staff, computer networks are structured to maximize message throughput, human-robot assembly lines are designed to maximize productivity, etc. Common to all these environments is that they are designed with the intention of maximizing some user benefit while accounting for different forms of uncertainty.

Typically, design is performed manually, often leading to solutions that are far from optimal. We therefore suggest automating the design process and formulate the Equi-Reward Utility Maximizing Design (ER-UMD) problem, where a system controls the environment by applying a sequence of modifications in order to maximize agent utility.

We assume a fully observable stochastic setting and use Markov decision processes [Bellman, 1957] to model the agent environment. We exploit the alignment of system and agent utility to show a compilation of the design problem into a planning problem, and piggyback on the search for an optimal policy to find an optimal sequence of modifications. In addition, we exploit the structure of the offline design process and offer a heuristic search in the modifications space to yield optimal design strategies. We formulate the conditions for heuristic admissibility and propose an admissible heuristic based on environment simplification. Finally, for settings where practicality is prioritized over optimality, we present a way to efficiently acquire sub-optimal solutions.

The contributions of this work are threefold. First, we formulate the ER-UMD problem as a special case of environment design [Zhang et al., 2009]. ER-UMD supports arbitrary modification methods. Particularly, for stochastic settings, we propose modifying probability distributions, an approach which offers a wide range of subtle environment modifications. Second, we present two new approaches for solving ER-UMD problems, specify the conditions for acquiring an optimal solution, and present an admissible heuristic to support the solution. Finally, we evaluate our approaches given a design budget, using probabilistic benchmarks from the International Planning Competitions, where a variety of stochastic shortest path MDPs are introduced [Bertsekas, 1995], and on a domain we created for a vacuum cleaning robot. We show how redesign substantially improves expected utility, expressed via reduced cost, achieved with a small modification budget. Moreover, the techniques we develop outperform the exhaustive approach, reducing calculation effort by up to 30%.

The remainder of the paper is organized as follows. Section 2 describes the ER-UMD framework. In Section 3, we describe our novel techniques for solving the ER-UMD problem. Section 4 describes an empirical evaluation, followed by related work (Section 5) and concluding remarks (Section 6).

2 Equi-Reward Utility Maximizing Design

The equi-reward utility maximizing design (ER-UMD) problem takes as input an environment with stochastic action outcomes, a set of allowed modifications, and a set of constraints, and finds an optimal sequence of modifications (atomic changes such as additions and deletions of environment elements) to apply to the environment for maximizing agent expected utility under the constraints. We refer to sequences rather than sets to support settings where different application orders impact the model differently. Such a setting may involve, for example, modifications that add preconditions necessary for the application of other modifications (e.g., a docking station can only be added after adding a power outlet).

[Figure 1: An example of an ER-UMD problem]

We consider stochastic environments defined by the quadruple $\epsilon = \langle S, A, f, s_0 \rangle$ with a set of states $S$, a set of actions $A$, a stochastic transition function $f : S \times A \times S \to [0,1]$ specifying the probability $f(s, a, s')$ of reaching state $s'$ after applying action $a$ in $s \in S$, and an initial state $s_0 \in S$. We let $\mathcal{E}$, $S_{\mathcal{E}}$ and $A_{\mathcal{E}}$ denote the set of all environments, states and actions, respectively. Adopting the notation of Zhang and Parkes (2008) for environment design, we define the ER-UMD model as follows.

Definition 1 An equi-reward utility maximizing (ER-UMD) model $\omega$ is a tuple $\langle \epsilon^0_\omega, \mathcal{R}_\omega, \gamma_\omega, \Delta_\omega, \mathcal{F}_\omega, \Phi_\omega \rangle$ where:

• $\epsilon^0_\omega \in \mathcal{E}$ is an initial environment.

• $\mathcal{R}_\omega : S_{\mathcal{E}} \times A_{\mathcal{E}} \times S_{\mathcal{E}} \to \mathbb{R}$ is a Markovian and stationary reward function, specifying the reward $r(s, a, s')$ an agent gains from transitioning from state $s$ to $s'$ by the execution of $a$.

• $\gamma_\omega$ is a discount factor in $(0, 1]$, representing the depreciation of agent rewards over time.

• $\Delta_\omega$ is a finite set of atomic modifications a system can apply. A modification sequence is an ordered set of modifications $\vec{\Delta} = \langle \Delta_1, \ldots, \Delta_n \rangle$ s.t. $\Delta_i \in \Delta_\omega$. We denote by $\vec{\Delta}_\omega$ the set of all such sequences.

• $\mathcal{F}_\omega : \Delta_\omega \times \mathcal{E} \to \mathcal{E}$ is a deterministic modification transition function, specifying the result of applying a modification to an environment.

• $\Phi_\omega : \vec{\Delta}_\omega \times \mathcal{E} \to \{0, 1\}$ is an indicator that specifies the allowed modification sequences in an environment.

Whenever $\omega$ is clear from the context we use $\epsilon^0$, $\mathcal{R}$, $\gamma$, $\Delta$, $\mathcal{F}$, and $\Phi$. Note that a reward becomes a cost when negative.

The reward function $\mathcal{R}$ and discount factor $\gamma$ form, together with an environment $\epsilon \in \mathcal{E}$, an infinite horizon discounted reward Markov decision process (MDP) [Bertsekas, 1995] $\langle S, A, f, s_0, \mathcal{R}, \gamma \rangle$. The solution of an MDP is a control policy $\pi : S \to A$ describing the appropriate action to perform at each state. We let $\Pi_\epsilon$ represent the set of all possible policies in $\epsilon$. Optimal policies $\Pi^*_\epsilon \subseteq \Pi_\epsilon$ yield maximal expected accumulated reward for every state $s \in S$ [Bellman, 1957]. We assume agents are optimal and let $\mathcal{V}^*(\omega)$ represent the discounted expected agent reward of following an optimal policy from the initial state $s_0$ in a model $\omega$.

Modifications $\Delta \in \Delta$ can be defined arbitrarily, supporting all the changes applicable to a deterministic environment [Herzig et al., 2014]. For example, we can allow adding a transition between previously disconnected states. Particular to a stochastic environment is the option of modifying the transition function by increasing and decreasing the probability of specific outcomes. Each modification may be associated with a system cost $\mathcal{C} : \Delta \to \mathbb{R}^+$ and a sequence cost $\mathcal{C}(\vec{\Delta}) = \sum_{\Delta_i \in \vec{\Delta}} \mathcal{C}(\Delta_i)$. Given a sequence $\vec{\Delta}$ such that $\Phi(\vec{\Delta}, \epsilon) = 1$ (i.e., $\vec{\Delta}$ can be applied to $\epsilon \in \mathcal{E}$), we let $\epsilon_{\vec{\Delta}}$ represent the environment that is the result of applying $\vec{\Delta}$ to $\epsilon$, and $\omega_{\vec{\Delta}}$ is the same model with $\epsilon_{\vec{\Delta}}$ as its initial environment.
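To ground these definitions, here is a minimal Python sketch of the structures above; the names (Environment, ERUMDModel, apply_sequence, sequence_cost) are ours, not the paper's, and environments are kept as explicit MDPs rather than the factored PPDDL encoding used later:

from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Modification = Hashable

@dataclass
class Environment:
    """An explicit stochastic environment <S, A, f, s0>."""
    states: Tuple[State, ...]
    actions: Tuple[Action, ...]
    # f[(s, a)] is a tuple of (s', probability) pairs summing to 1.
    f: Dict[Tuple[State, Action], Tuple[Tuple[State, float], ...]]
    s0: State

@dataclass
class ERUMDModel:
    """An ER-UMD model <e0, R, gamma, Delta, F, Phi> (Definition 1)."""
    e0: Environment
    R: Callable[[State, Action, State], float]      # reward r(s, a, s')
    gamma: float                                     # discount in (0, 1]
    Delta: List[Modification]                        # atomic modifications
    F: Callable[[Modification, Environment], Environment]
    Phi: Callable[[List[Modification], Environment], bool]
    C: Callable[[Modification], float] = lambda m: 0.0   # system cost

def apply_sequence(model: ERUMDModel, seq: List[Modification]) -> Environment:
    """The environment resulting from applying seq to e0 (must satisfy Phi)."""
    assert model.Phi(seq, model.e0), "sequence violates constraints Phi"
    env = model.e0
    for mod in seq:
        env = model.F(mod, env)
    return env

def sequence_cost(model: ERUMDModel, seq: List[Modification]) -> float:
    """Design cost C(seq): the sum of atomic modification costs."""
    return sum(model.C(mod) for mod in seq)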

The solution to an ER-UMD problem is a modification sequence $\vec{\Delta} \in \vec{\Delta}_\omega$ to apply to $\epsilon^0_\omega$ that maximizes agent utility $\mathcal{V}^*(\omega_{\vec{\Delta}})$ under the constraints, formulated as follows.

Problem 1 Given a model $\omega = \langle \epsilon^0, \mathcal{R}, \gamma, \Delta, \mathcal{F}, \Phi \rangle$, the ER-UMD problem finds a modification sequence

    $\vec{\Delta}^* \in \operatorname{argmax}_{\vec{\Delta} \in \vec{\Delta}_\omega \mid \Phi(\vec{\Delta}) = 1} \mathcal{V}^*(\omega_{\vec{\Delta}})$

We let $\vec{\Delta}^*_\omega$ represent the set of solutions to Problem 1 and $\mathcal{V}^{max}(\omega) = \max_{\vec{\Delta} \in \vec{\Delta}_\omega \mid \Phi(\vec{\Delta}) = 1} \mathcal{V}^*(\omega_{\vec{\Delta}})$ represent the maximal agent utility achievable via design in $\omega$. In particular, we seek solutions $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$ that minimize design cost $\mathcal{C}(\vec{\Delta}^*)$.
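Continuing the sketch above, Problem 1 can be read operationally as: compute $\mathcal{V}^*$ by standard value iteration and take the best allowed sequence. The brute-force v_max below corresponds to the exhaustive baseline discussed in Section 3 and is only an illustration of the notation; it assumes the empty sequence is allowed and that sequences do not repeat modifications:

import itertools

def value_iteration(env: Environment, R, gamma: float,
                    eps: float = 1e-6) -> float:
    """Optimal discounted value V* of the initial state s0 of env."""
    V = {s: 0.0 for s in env.states}
    while True:
        delta = 0.0
        for s in env.states:
            values = [sum(p * (R(s, a, s2) + gamma * V[s2])
                          for s2, p in env.f[(s, a)])
                      for a in env.actions if (s, a) in env.f]
            best = max(values, default=0.0)  # action-less states are absorbing
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V[env.s0]

def v_max(model: ERUMDModel, max_len: int) -> float:
    """Brute-force V^max: best V* over all allowed modification sequences
    of length <= max_len (the exhaustive baseline of Section 3)."""
    best = value_iteration(model.e0, model.R, model.gamma)  # empty sequence
    for n in range(1, max_len + 1):
        for seq in itertools.permutations(model.Delta, n):
            if model.Phi(list(seq), model.e0):
                env = apply_sequence(model, list(seq))
                best = max(best, value_iteration(env, model.R, model.gamma))
    return best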

Example 1 As an example of a controllable environment where humans and robots co-exist, consider Figure 1 (left), where a vacuum cleaning robot is placed in a living room. The set $\mathcal{E}$ of possible environments specifies possible room configurations. The robot's utility, expressed via the reward $\mathcal{R}$ and discount factor $\gamma$, may be defined in various ways; it may try to clean an entire room as quickly as possible or cover as much space as possible before its battery runs out. (Re)moving a piece of furniture from or within the room (Figure 1 (center)) may impact the robot's utility. For example, removing a chair from the room may create a shortcut to a specific location, but may also create access to a corner the robot may get stuck in. Accounting for uncertainty, there may be locations in which the robot tends to slip, ending up in a different location than intended. Increasing friction, e.g., by introducing a high friction tile (Figure 1 (right)), may reduce the probability of undesired outcomes. All types of modifications, expressed by $\Delta$ and $\mathcal{F}$, are applied offline (since such robots typically perform their task unsupervised) and should be applied economically in order to maintain usability of the environment. These types of constraints are reflected by $\Phi$, which can restrict the design process by a predefined budget or by disallowing specific room configurations.

3 Finding $\vec{\Delta}^*$

A baseline method for finding an optimal modification sequence involves applying an exhaustive best first search (BFS) in the space of allowed sequences and selecting one that maximizes system utility. This approach was used for finding the optimal set of modifications in a goal recognition design setting [Keren et al., 2014; Wayllace et al., 2016]. The state space pruning applied there assumes that disallowing actions is the only allowed modification, making it inapplicable to ER-UMD, which supports arbitrary modification methods. We therefore present two novel techniques for finding an optimal design strategy for ER-UMD.

3.1 ER-UMD Compilation to Planning

As a first approach, we embed design into a planning problem description. The DesignComp compilation (Definition 2) extends the agent's underlying MDP by adding pre-process operators that modify the environment off-line. After initialization, the agent acts in the new optimized environment.

The compilation uses the PPDDL notation [Younes and Littman, 2004], which uses a factored MDP representation. Accordingly, an environment $\epsilon \in \mathcal{E}$ is represented as a tuple $\langle X, s_{0,\epsilon}, A \rangle$ with states specified as a combination of state variables $X$ and a transition function embedded in the description of actions. Action $a \in A$ is represented by $\langle prec, \langle p_1, add_1, del_1 \rangle, \ldots, \langle p_m, add_m, del_m \rangle \rangle$, where $prec$ is the set of literals that need to be true as a precondition for applying $a$. The probabilistic effects $\langle p_1, add_1, del_1 \rangle, \ldots, \langle p_m, add_m, del_m \rangle$ are represented by $p_i$, the probability of the $i$-th effect. When outcome $i$ occurs, $add_i$ and $del_i$ are literals, added to and removed from the state description, respectively [Mausam and Kolobov, 2012].
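As a small illustration of this factored representation (our own sketch, using assumed names ProbAction and successor_distribution), a state is a set of literals and an action carries a precondition and probabilistic add/delete effects:

from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Literal = str

@dataclass
class ProbAction:
    """A PPDDL-style action <prec, <p1, add1, del1>, ..., <pm, addm, delm>>."""
    name: str
    prec: FrozenSet[Literal]
    effects: Tuple[Tuple[float, FrozenSet[Literal], FrozenSet[Literal]], ...]

def successor_distribution(state: FrozenSet[Literal], a: ProbAction
                           ) -> List[Tuple[FrozenSet[Literal], float]]:
    """Successors of a factored state: outcome i removes del_i and adds
    add_i with probability p_i; empty if the precondition is unmet."""
    if not a.prec <= state:
        return []
    return [((state - delete) | add, p) for p, add, delete in a.effects]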

The policy of the compiled planning problem has two stages: design, in which the system is modified, and execution, describing the policy agents follow to maximize utility. Accordingly, the compiled domain has two action types: $A_{des}$, corresponding to modifications applied by the design system, and $A_{exe}$, executed by the agent. To separate between the stages we use a fluent execution, initially false to allow the application of $A_{des}$, and a no-cost action $a_{start}$ that sets execution to true, rendering $A_{exe}$ applicable.

The compilation process supports two modification types. Modifications $\Delta_X$ change the initial state by modifying the value of state variables $X_\Delta \subseteq X$. Modifications $\Delta_A$ change the action set by enabling actions $A_\Delta \subseteq A$. Accordingly, the definition includes a set of design actions $A_{des} = A_{des\text{-}s_0} \cup A_{des\text{-}A}$, where $A_{des\text{-}s_0}$ are actions that change the initial value of variables and $A_{des\text{-}A}$ includes actions $A_\Delta$ that are originally disabled but can be enabled in the modified environment. In particular, we include in $A_\Delta$ actions that share the same structure as actions in the original environment except for a modified probability distribution.

The following definition of DesignComp supports a design budget $B$, implemented using a timer mechanism as in [Keren et al., 2015]. The timer advances with up to $B$ design actions that can be applied before performing $a_{start}$. This constraint is represented by $\Phi^B$, which returns 0 for any modification sequence that exceeds the given budget.

Definition 2 For an ER-UMD problem $\omega = \langle \epsilon^0_\omega, \mathcal{R}_\omega, \gamma_\omega, \Delta_\omega, \mathcal{F}_\omega, \Phi^B_\omega \rangle$, where $\Delta_\omega = \Delta_X \cup \Delta_A$, we create a planning problem $P' = \langle X', s'_0, A', \mathcal{R}', \gamma' \rangle$, where:

• $X' = X_{\epsilon^0_\omega} \cup \{execution\} \cup \{time_t \mid t \in 0, \ldots, B\} \cup \{enabled_a \mid a \in A_\Delta\}$

• $s'_0 = s_{0,\epsilon^0_\omega} \cup \{time_0\}$

• $A' = A_{exe} \cup A_{des\text{-}s_0} \cup A_{des\text{-}A} \cup \{a_{start}\}$, where

  – $A_{exe} = A_{\epsilon^0} \cup A_\Delta$ s.t.
    $\{\langle prec(a) \cup execution, \mathit{eff}(a) \rangle \mid a \in A_{\epsilon^0}\}$
    $\{\langle prec(a) \cup execution \cup enabled_a, \mathit{eff}(a) \rangle \mid a \in A_\Delta\}$

  – $A_{des\text{-}s_0} = \{\langle \langle \neg execution, time_i \rangle, \langle 1, \langle x, time_{i+1} \rangle, \langle time_i \rangle \rangle \rangle \mid x \in X_\Delta\}$

  – $A_{des\text{-}A} = \{\langle \langle \neg execution, time_i \rangle, \langle 1, \langle enabled_a, time_{i+1} \rangle, \langle time_i \rangle \rangle \rangle \mid a \in A_\Delta\}$

  – $a_{start} = \langle \langle \neg execution \rangle, \langle 1, \langle execution \rangle, \emptyset \rangle \rangle$

• $\mathcal{R}'(a) = \mathcal{R}(a)$ if $a \in A_{exe}$, and $0$ if $a \in A_{des} \cup \{a_{start}\}$

• $\gamma' = \gamma$
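The following sketch mirrors the spirit of Definition 2 on the explicit (non-factored) structures from the earlier snippets: compiled states carry the execution fluent, the timer, and the current environment; design actions and a_start cost nothing. It is an illustration of the construction under our assumptions, not the PPDDL compilation itself, and only the budget part of $\Phi^B$ is checked:

from typing import List, NamedTuple, Tuple

class CompiledState(NamedTuple):
    executing: bool        # the 'execution' fluent
    t: int                 # timer: number of design actions used so far
    env: Environment       # current (possibly modified) environment
    s: State               # agent state within env

def compiled_transitions(model: ERUMDModel, budget: int,
                         cs: CompiledState, action
                         ) -> List[Tuple[CompiledState, float, float]]:
    """Successors [(state', prob, reward)] in the compiled problem:
    design actions and a_start first, then the agent's own actions."""
    if not cs.executing:
        if action == "a_start":                    # no-cost switch to execution
            return [(cs._replace(executing=True), 1.0, 0.0)]
        if cs.t < budget and action in model.Delta:
            new_env = model.F(action, cs.env)      # deterministic design step
            return [(CompiledState(False, cs.t + 1, new_env, new_env.s0),
                     1.0, 0.0)]                    # design actions cost 0
        return []                                  # not applicable
    # execution phase: follow the (modified) environment's dynamics
    return [(cs._replace(s=s2), p, model.R(cs.s, action, s2))
            for s2, p in cs.env.f.get((cs.s, action), ())]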

Optimally solving the compiled problem $P'$ yields an optimal policy $\pi^*_{P'}$ with two components, separated by the execution of $a_{start}$. The initialization component consists of a possibly empty sequence of deterministic design actions, denoted by $\vec{\Delta}_{P'}$, while the execution component represents the optimal policy in the modified environment.

The next two propositions establish the correctness of the compilation. Proofs are omitted due to space constraints. We first argue that $\mathcal{V}^*(P')$, the expected reward from the initial state in the compiled planning problem, is equal to the expected reward in the optimal modified environment.

Lemma 1 Given an ER-UMD problem $\omega$ and an optimal modification sequence $\vec{\Delta} \in \vec{\Delta}^*_\omega$, $\mathcal{V}^*(P') = \mathcal{V}^*(\omega_{\vec{\Delta}})$.

An immediate corollary is that the compilation outcome is indeed an optimal sequence of modifications.

Corollary 1 Given an ER-UMD problem $\omega$ and the compiled model $P'$, $\vec{\Delta}_{P'} \in \vec{\Delta}^*_\omega$.

The reward function $\mathcal{R}'$ assigns zero cost to all design actions $A_{des}$. To ensure the compilation not only respects the budget $B$ but also minimizes design cost, we can assign a small cost (negative reward) $c_d$ to design actions $A_{des}$. If $c_d$ is too high, it might lead the solver to omit design actions that improve utility by less than $c_d$. However, the loss of utility will be at most $c_d B$. Thus, by bounding the minimum improvement in utility from a modification, we can still ensure optimality: for example, if every beneficial modification is known to improve expected utility by at least $\epsilon > 0$, any $c_d < \epsilon$ penalizes superfluous design actions without sacrificing optimality.

[Figure 2: Search space of an ER-UMD problem. The deterministic design component (system policy) appears above the stochastic execution component (agent policy); invalid modification sequences are crossed out.]

3.2 Design as Informed Search

The key benefit of compiling ER-UMD to planning is the ability to use any off-the-shelf solver to find a design strategy. However, this approach does not fully exploit the special characteristics of the off-line design setting we address. We therefore observe that embedding design into the definition of a planning problem results in an MDP with a special structure, depicted in Figure 2. The search for an optimal redesign policy is illustrated as a tree comprising two components. The design component, at the top of the figure, describes the deterministic offline design process, with nodes representing the different possibilities of modifying the environment. The execution component, at the bottom of the figure, represents the stochastic modified environments in which agents act.

Each design node represents a different ER-UMD model, characterized by the sequence $\vec{\Delta}$ of modifications that has been applied to the environment and a constraints set $\Phi$, specifying the allowed modifications in the subtree rooted at the node. With the original ER-UMD problem $\omega$ at the root, each successor design node represents a sub-problem $\omega_{\vec{\Delta}}$ of the ancestor ER-UMD problem, accounting for all modification sequences that have $\vec{\Delta}$ as their prefix. The set of constraints of the successors is updated with relation to the parent node. For example, when a design budget is specified, it is reduced when moving down the tree from a node to its successor. When a design node is associated with a valid modification, it is connected to a leaf node representing an ER-UMD model with the environment $\epsilon_{\vec{\Delta}}$ that results from applying the modification. To illustrate, invalid modification sequences are crossed out in Figure 2.

Using this search tree, we propose an informed search in the space of allowed modifications, using heuristic estimations to guide the search more effectively by focusing attention on more promising redesign options.

Algorithm 1 Best First Design (BFD)

BFD($\omega$, $h$)
 1: create OPEN list for unexpanded nodes
 2: $n_{cur} = \langle design, \vec{\Delta}_\emptyset \rangle$ (initial model)
 3: while $n_{cur}$ do
 4:   if IsExecution($n_{cur}$) then
 5:     return $n_{cur}.\vec{\Delta}$ (best modification found - exit)
 6:   end if
 7:   for each $n_{suc} \in$ GetSuccessors($n_{cur}$, $\omega$) do
 8:     put $\langle\langle design, n_{suc}.\vec{\Delta}\rangle, h(n_{suc})\rangle$ in OPEN
 9:   end for
10:   if $\Phi(n_{cur}.\vec{\Delta}) = 1$ then
11:     put $\langle\langle execution, n_{cur}.\vec{\Delta}\rangle, \mathcal{V}^*(\omega_{n_{cur}.\vec{\Delta}})\rangle$ in OPEN
12:   end if
13:   $n_{cur}$ = ExtractMax(OPEN)
14: end while
15: return error
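A compact Python rendering of Algorithm 1, under the assumptions of the earlier sketches, might look as follows; heapq serves as the max-priority queue (via negated values), and successor generation is specialized to the budget-based behavior described in the text that follows:

import heapq

def bfd(model: ERUMDModel, h, budget: int) -> List[Modification]:
    """Best First Design (Algorithm 1). h(model, seq) must never
    underestimate the utility achievable below the design node seq."""
    counter = 0                 # tie-breaker; the heap never compares nodes
    open_list = [(-float("inf"), counter, "design", [])]   # root design node
    while open_list:
        _, _, kind, seq = heapq.heappop(open_list)         # ExtractMax
        if kind == "execution":
            return seq          # best modification sequence found - exit
        if len(seq) < budget:   # GetSuccessors: append each modification
            for mod in model.Delta:
                counter += 1
                child = seq + [mod]
                heapq.heappush(open_list,
                               (-h(model, child), counter, "design", child))
        if model.Phi(seq, model.e0):   # valid: execution node, true value
            env = apply_sequence(model, seq)
            v = value_iteration(env, model.R, model.gamma)
            counter += 1
            heapq.heappush(open_list, (-v, counter, "execution", seq))
    raise RuntimeError("no allowed modification sequence")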

The Best First Design (BFD) algorithm (detailed in Algorithm 1) accepts as input an ER-UMD model $\omega$ and a heuristic function $h$. The algorithm starts by creating an OPEN priority queue (line 1) holding the front of unexpanded nodes. In line 2, $n_{cur}$ is assigned the original model, which is represented by a flag design and the empty modification sequence $\vec{\Delta}_\emptyset$.

The iterative exploration of the currently most promising node in the OPEN queue is given in lines 3-14. If the current node represents an execution model (indicated by the execution flag), the search ends successfully in line 5, returning the modification sequence associated with the node. Otherwise, the successor design nodes of the current node are generated by GetSuccessors in line 7. Each successor sub-problem $n_{suc}$ is placed in the OPEN list with its associated heuristic value $h(n_{suc})$ (line 8), to be discussed in detail next. In addition, if the modification sequence $n_{cur}.\vec{\Delta}$ associated with the current node is valid according to $\Phi$, an execution node is generated and assigned a value that corresponds to the actual value $\mathcal{V}^*(\omega_{n_{cur}.\vec{\Delta}})$ in the resulting environment (lines 10-12). The next node to explore is extracted from OPEN in line 13.

Both termination and completeness of the algorithm depend on the implementation of GetSuccessors, which controls the graph search strategy by generating the sub-problem design nodes related to the current node. For example, when a modification budget is specified, GetSuccessors generates a sub-problem for every modification that is appended to the sequence $\vec{\Delta}$ of the parent node, discarding sequences that violate the budget and updating it for the valid successors.

For optimality, we require the heuristic function $h$ to be admissible. An admissible estimation of a design node $n$ is one that never underestimates $\mathcal{V}^{max}(\omega)$, the maximal system utility in the ER-UMD problem $\omega$ represented by $n$.¹ Running BFD with an admissible heuristic is guaranteed to yield an optimal modification sequence.

¹ When utility is cost, it must not overestimate the real cost.


Theorem 1 Given an ER-UMD model $\omega$ and an admissible heuristic $h$, BFD($\omega$, $h$) returns a modification sequence $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$.

The proof of Theorem 1 bears similarity to the proof of optimality of A* [Nilsson, 1980] and is omitted here for the sake of brevity.

The simplified-environment heuristic

To produce efficient over-estimations of the maximal system utility $\mathcal{V}^{max}(\omega)$, we suggest a heuristic that requires a single pre-processing simplification of the original environment, used to produce estimates for the design nodes of the search.

Definition 3 Given an ER-UMD model $\omega$, a function $f : \mathcal{E} \to \mathcal{E}$ is an environment simplification in $\omega$ if, for every $\epsilon \in \mathcal{E}_\omega$, $\mathcal{V}^*(\omega_\epsilon) \leq \mathcal{V}^*(\omega^f_\epsilon)$, where $\omega^f_\epsilon$ is the ER-UMD model with $f(\epsilon)$ as its initial environment.

The simplified-environment heuristic, denoted by $h_{sim}$, estimates the value of applying a modification sequence $\vec{\Delta}$ to $\omega$ by the value of applying it to $\omega^f$:

    $h_{sim}(\omega_{\vec{\Delta}}) \stackrel{\mathrm{def}}{=} \mathcal{V}^{max}(\omega^f_{\vec{\Delta}})$    (1)

The search applies modifications on the simplified model and uses its optimal solution as an estimate of the value of applying the modifications in the original setting. In particular, the simplified model can be solved using the DesignComp compilation presented in the previous section.
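Under the same assumed structures, Equation 1 can be sketched as below; f is any environment simplification (a concrete choice appears after the next paragraph), the remaining design budget stands in for the successors' constraint set, and we assume F and Phi apply unchanged to simplified environments:

def h_sim(model: ERUMDModel, seq: List[Modification], f,
          budget: int) -> float:
    """Simplified-environment heuristic (Eq. 1): estimate the value of the
    design node seq by V^max in the model simplified once by f."""
    simplified = ERUMDModel(f(model.e0), model.R, model.gamma, model.Delta,
                            model.F, model.Phi, model.C)
    env = apply_sequence(simplified, seq)   # modify the simplified model
    rest = ERUMDModel(env, model.R, model.gamma, model.Delta,
                      model.F, model.Phi, model.C)
    return v_max(rest, budget - len(seq))   # remaining design budget

With the earlier BFD sketch, one would call, e.g., bfd(model, lambda m, s: h_sim(m, s, f, B), B).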

The literature is rich with simplification approaches, including adding macro actions that do more with the same cost, removing some action preconditions, eliminating negative effects of actions (delete relaxation) or eliminating undesired outcomes [Holte et al., 1996]. Particular to stochastic settings is the commonly used all outcome determinization [Yoon et al., 2007], which creates a deterministic action for each probabilistic outcome of every action.
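As one concrete instance of such a simplification, the sketch below applies an all outcome determinization to the explicit Environment used in the earlier snippets; determinize and determinized_reward are assumed helper names, not from the paper:

def determinize(env: Environment) -> Environment:
    """All outcome determinization: each probabilistic outcome of each
    action becomes its own deterministic action, so outcomes may be
    'picked' and the optimal value can only improve (cf. Definition 3)."""
    f2, actions2 = {}, []
    for (s, a), outcomes in env.f.items():
        for i, (s2, _p) in enumerate(outcomes):
            det_a = (a, i)                     # the i-th outcome of action a
            if det_a not in actions2:
                actions2.append(det_a)
            f2[(s, det_a)] = ((s2, 1.0),)      # that outcome is now certain
    return Environment(env.states, tuple(actions2), f2, env.s0)

def determinized_reward(R):
    """Adapt a reward function to determinized actions (a, i) -> a."""
    return lambda s, da, s2: R(s, da[0], s2)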

Lemma 2 Given an ER-UMD model $\omega$, applying the simplified-environment heuristic with $f$ implemented as an all outcome determinization function is admissible.

The proof of Lemma 2, omitted for brevity, uses the observation that $f$ only adds solutions with higher reward (lower cost) to a given problem (either before or after redesign). A similar reasoning can be applied to the commonly used delete relaxation or any other approaches discussed above.

Note that admissibility of a heuristic function depends on specific characteristics of the ER-UMD setting. In particular, the simplified-environment heuristic is not guaranteed to produce admissible estimates for policy teaching [Zhang and Parkes, 2008] or goal recognition design [Keren et al., 2014; Wayllace et al., 2016], where design is performed to incentivize agents to follow specific policies. This is because the relaxation itself may change the set of optimal agent policies and therefore underestimate the value of a modification.

Domain     change init          reduce probability
BOX        relocate a truck     driving to wrong destination
BLOCK      —                    dropping a block or tower
EX-BLOCK   —                    as for Blocks World
TIRE       add spare tires      having a flat tire
ELEVATOR   add elevator shaft   falling to initial state
VACUUM     (re)move furniture   add high friction tile

Table 1: Allowed modifications for each domain

            B=1             B=2             B=3
          solved  reduc   solved  reduc   solved  reduc
BOX          8     28        8     42        7     44
BLOCK        6     21        3     24        3     24
EX-BLOCK    10     42        9     42        9     42
TIRE         9     44        8     51        6     54
ELEVATOR     9     22        7     24        1     18
VACUUM       8     15        6     17        0     —

Table 2: Utility improvement for optimal solvers

4 Empirical Evaluation

We evaluated the ability to maximize agent utility given a design budget in various ER-UMD problems, as well as the performance of both optimal and approximate techniques.

We used five PPDDL domains from the probabilistic tracks of the sixth and eighth International Planning Competitions (IPPC06 and IPPC08; http://icaps-conference.org/index.php/main/competitions), representing stochastic shortest path MDPs with uniform action cost: Box World (IPPC08 / BOX), Blocks World (IPPC08 / BLOCK), Exploding Blocks World (IPPC08 / EX-BLOCK), Triangle Tire (IPPC08 / TIRE) and Elevators (IPPC06 / ELEVATOR). In addition, we implemented the vacuum cleaning robot setting from Example 1 (VACUUM) as an adaptation of the Taxi domain [Dietterich, 2000]. The robot moves in a grid and collects pieces of dirt. It cannot move through furniture, represented by occupied cells, and may fail to move, remaining in its current position.

In all domains, agent utility is expressed as expected cost and constraints as a design budget. For each domain, we examined at least two possible modifications, including at least one that modifies the probability distribution. Modifications by domain are specified in Table 1, with change init marking modifications of the initial state and reduce probability marking modifications of the probability function.

4.1 Optimal Solutions

Setup For each domain, we created 10 smaller instances, optimally solvable within a time bound of five minutes. Each instance was optimally solved using:

• EX - exhaustive exploration of possible modifications.

• DC - solution of the DesignComp compilation.

• BFD - Algorithm 1 with the simplified-environment heuristic, using the delete relaxation to simplify the model and the DesignComp compilation to optimally solve it.

We used a portfolio of 3 admissible heuristics:

• $h_0$ assigns 0 to all states and serves as a baseline for assessing the value of more informative heuristics.

• $h_{0+}$ assigns 1 to all non-goal states and 0 otherwise.

• $h_{MinMin}$ solves the all outcome determinization using the zero heuristic [Bonet and Geffner, 2005].

          EX-h0     EX-h0+    EX-hMinMin  DC-h0     DC-h0+     DC-hMinMin  BFD-h0    BFD-h0+    BFD-hMinMin
BOX
B=1       158.4(8)  159.0(8)  158.9(8)    163.9(8)  70.7(8)    221.4(8)    157.4(8)  68.2(8)    216.4(8)
B=2       264.7(7)  264.9(7)  267.8(7)    270.6(7)  92.1(8)    332.7(7)    260.8(7)  88.0(8)    325.3(7)
B=3       238.5(4)  236.5(4)  235.6(4)    241.5(4)  73.5(4)    271.7(4)    234.3(4)  70.2(7)    265.94(4)
BLOCK
B=1       50.5(6)   50.5(6)   50.8(6)     50.7(6)   41.7(6)    77.1(6)     50.3(6)   41.6(6)    74.4(6)
B=2       28.0(4)   28.2(4)   28.0(4)     28.2(4)   17.4(4)    36.4(3)     28.0(4)   17.2(4)    35.4(3)
B=3       348.9(2)  347.3(2)  348.2(2)    354.5(2)  194.6(3)   363.5(2)    352.2(2)  118.2(3)   354.8(2)
EX-BLOCK
B=1       69.4(9)   70.2(9)   69.9(9)     68.4(9)   38.7(9)    6.7(10)     69.5(9)   40.3(9)    60.3(9)
B=2       161.7(9)  170.9(9)  168.1(9)    153.1(9)  88.2(9)    30.2(10)    153.9(9)  85.6(9)    135.0(9)
B=3       250.7(9)  265.9(9)  292.2(9)    252.5(9)  134.9(9)   88.8(8)     285.9(9)  160.9(9)   237.4(9)
TIRE
B=1       32.9(9)   32.9(9)   33.1(9)     33.3(9)   30.2(9)    36.8(9)     33.0(9)   29.5(9)    36.9(9)
B=2       55.2(7)   55.4(7)   55.0(7)     55.5(7)   51.1(8)    88.8(8)     55.0(7)   50.9(8)    89.1(8)
B=3       270.3(6)  136.5(6)  258.4(6)    269.7(6)  136.5(6)   258.4(6)    267.6(6)  188.3(6)   256.2(6)
ELEVATOR
B=1       300.4(8)  299.6(8)  301.6(8)    301.9(8)  236.2(9)   192.6(9)    302.6(8)  238.3(9)   176.6(9)
B=2       361.8(5)  360.9(5)  366.2(5)    363.4(5)  261.0(5)   243.89(5)   360.8(5)  258.6(5)   231.2(5)
B=3       na        na        na          na        1504.6(1)  1117.4(1)   na        1465.8(1)  1042.5(1)
VACUUM
B=1       0.15(9)   0.16(9)   0.15(9)     0.17(9)   0.099(9)   0.15(9)     0.15(9)   0.096(9)   0.13(9)
B=2       3.6(9)    3.27(9)   3.44(9)     3.25(9)   2.13(9)    2.49(9)     3.39(9)   2.021(9)   2.61(9)
B=3       na        na        na          na        na         na          na        na         na

Table 3: Running time and number of instances solved for optimal solvers

Each problem was tested on an Intel(R) Xeon(R) CPU X5690 machine with budgets of 1, 2, and 3. Design actions were assigned a cost of $10^{-4}$, and problems were solved using LAO* [Hansen and Zilberstein, 1998] with a convergence error bound of $10^{-6}$. Each run had a 30-minute time limit.

Results Separated by domain and budget, Table 2 summarizes the number of solved instances (solved) and the average percentage of expected cost reduction over instances solved (reduc). In all domains, the complexity brought by an increased budget reduces the number of solved instances, while the actual reduction varies among domains. As for solution improvement, all domains show an improvement of 15% to 54%.

Table 3 compares solution performance. Each column represents a solver and heuristic pair. Results are separated by domain and budget, depicting the average running time for problems solved by all approaches for a given budget and the number of instances solved in parentheses (na indicates no instances were solved within the time limit). The dominating approach for each row (representing a domain and budget) is emphasized in bold. In all cases, the use of informed search outperformed the exhaustive approach.

4.2 Approximate Solutions

Setup For approximate solutions we used an MDP reduced model approach that creates simplified MDPs accounting for the full probabilistic model a bounded number of times (for each execution history), treating the problem as deterministic afterwards [Pineda and Zilberstein, 2014]. The deterministic problems were solved using the FF classical planner [Hoffmann and Nebel, 2001], as explained in [Pineda and Zilberstein, 2017]. We used Monte-Carlo simulations to evaluate each policy's probability of reaching a goal state and its expected cost. In particular, we gave the planner 20 minutes to solve each problem 50 times. We used the first 10 instances of each competition domain mentioned above, excluding Box World due to limitations of the planner. For the VACUUM domain we generated ten room configurations of up to 5×7 grid size, based on Figure 1.
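This evaluation protocol can be pictured with a short Monte-Carlo sketch over the earlier explicit structures; simulate_policy, its goal test, and the episode/horizon parameters are assumptions of this sketch, not the authors' setup:

import random

def simulate_policy(env: Environment, policy, R, is_goal,
                    episodes: int = 1000, horizon: int = 500):
    """Monte-Carlo estimate of a policy's probability of reaching a goal
    state and its average accumulated cost (cost = negative reward)."""
    successes, total_cost = 0, 0.0
    for _ in range(episodes):
        s, cost = env.s0, 0.0
        for _ in range(horizon):
            if is_goal(s):
                successes += 1
                break
            a = policy(s)
            outcomes = env.f[(s, a)]
            s2 = random.choices([o[0] for o in outcomes],
                                weights=[o[1] for o in outcomes])[0]
            cost += -R(s, a, s2)
            s = s2
        total_cost += cost
    return successes / episodes, total_cost / episodes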

Results Table 4 reports three measures (per budget): the number of problems completed within the allocated time (solved), the improvement in the probability of reaching a goal of the resulting policies with respect to the policies obtained without design (δPs), and the average percentage of reduction in expected cost after redesign (reduc); δPs and reduc are averaged only over problems solved 50 times when using both the original and modified model. In general, we observed that redesign enables improvement either in expected cost or in probability of success (and sometimes both), across all budgets. For BLOCK and EX-BLOCK, a budget of 2 yielded the best results, while for ELEVATOR, TIRE, and VACUUM a budget of 3 was better. However, the increased difficulty of the compiled problem sometimes resulted in a lower number of solved problems (e.g., solving only 3 problems on TIRE with a budget of 3). Nevertheless, these results demonstrate the feasibility of obtaining good solutions when compromising optimality.

             B=1                  B=2                  B=3
          solved  δPs   reduc  solved  δPs   reduc  solved  δPs   reduc
BLOCK        8    0     19.1      8    0     21.2      8    0     18.6
EX-BLOCK    10    0.42   0       10    0.50   0       10    0.41   0
TIRE         7    0      6.98     7    0     17.9      3    0     33
ELEVATOR    10   -0.33  25       10    0.1   30       10    0.1   38.3
VACUUM      10    0.2    8.12    10    0.2    4.72    10    0.3    9.72

Table 4: Utility improvement for sub-optimal solver


4.3 Discussion

For all examined domains, results indicate the benefit of using heuristic search over an exhaustive search in the modification space. However, the dominating heuristic approach varied between domains, and for TIRE also between budget allocations. Investigating the reasons for this variance, we note a key difference between BFD and DC. While DC applies modifications to the original model, BFD uses the simplified-environment heuristic that applies them to a simplified model. Poor performance of BFD can be due either to the minor effect the applied simplifications have on the computational complexity or to an exaggerated effect that may limit the informative value of heuristic estimations. In particular, this could happen due to the overlap between the design process and the simplification.

To illustrate, by applying the all outcome determinization to the Vacuum domain depicted in Example 1, we ignore the undesired outcome of slipping. Consequently, the heuristic completely overlooks the value of adding high-friction tiles, while providing informative estimations of the value of (re)moving furniture. This observation may explain the poor performance of BFD on EX-BLOCK, where simplification via the delete relaxation ignores the possibility of blocks exploding and overlooks the value of the proposed modifications. Therefore, estimations of BFD may be improved by developing a heuristic that uses the aggregation of several estimations. Also, when the order of application is immaterial, a closed list may be used for examined sets in the BFD approach but not with DC. Finally, a combination of relaxation approaches may enhance the performance of sub-optimal solvers.

5 Related Work

Environment design [Zhang et al., 2009] provides a framework for an interested party (system) to seek an optimal way to modify an environment to maximize some utility. Among the many ways to instantiate the general model, policy teaching [Zhang and Parkes, 2008; Zhang et al., 2009] enables a system to modify the reward function of a stochastic agent to entice it to follow specific policies. We focus on a different special case, where the system is altruistic and redesigns the environment in order to maximize agent utility. The techniques used for solving policy teaching do not apply to our setting, which supports arbitrary modifications.

The DesignComp compilation is inspired by the technique of Göbelbecker et al. (2010) of coming up with good excuses for why there is no solution to a planning problem. Our compilation extends the original approach in four directions. First, we move from a deterministic environment to a stochastic one. Second, we maximize agent utility rather than only moving from unsolvable to solvable. Third, we embed support for a design budget. Finally, we support arbitrary modification alternatives, including modifications specific to stochastic settings as well as all those suggested for deterministic settings [Herzig et al., 2014; Menezes et al., 2012].

6 Conclusions

We presented the ER-UMD framework for maximizing agent utility by the off-line design of stochastic environments. We presented two solution approaches: a compilation-based method that embeds design into the definition of a planning problem, and an informed heuristic search in the modification space, for which we provided an admissible heuristic. Our empirical evaluation supports the feasibility of the approaches and shows substantial utility gain on all evaluated domains.

In future work, we will explore creating tailored heuristics to improve planner performance. Also, we will extend the model to deal with partial observability using POMDPs, as well as automatically finding possible modifications, similarly to [Göbelbecker et al., 2010]. In addition, we plan to extend the offline design paradigm by accounting for online design that can be dynamically applied to a model.

Acknowledgements

The work was supported in part by the National Science Foundation grant number IIS-1405550.

References

[Bellman, 1957] Richard Bellman. A Markovian decision process. Indiana University Mathematics Journal, 6:679–684, 1957.

[Bertsekas, 1995] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

[Bonet and Geffner, 2005] Blai Bonet and Héctor Geffner. mGPT: A probabilistic planner based on heuristic search. Journal of Artificial Intelligence Research, 24:933–944, 2005.

[Dietterich, 2000] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[Göbelbecker et al., 2010] Moritz Göbelbecker, Thomas Keller, Patrick Eyerich, Michael Brenner, and Bernhard Nebel. Coming up with good excuses: What to do when no plan can be found. Cognitive Robotics, 2010.

[Hansen and Zilberstein, 1998] Eric A. Hansen and Shlomo Zilberstein. Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 412–418, 1998.

[Herzig et al., 2014] Andreas Herzig, Viviane Menezes, Leliane Nunes de Barros, and Renata Wassermann. On the revision of planning tasks. In Proceedings of the Twenty-First European Conference on Artificial Intelligence (ECAI), pages 435–440, 2014.

[Hoffmann and Nebel, 2001] Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.

[Holte et al., 1996] Robert C. Holte, M. B. Perez, R. M. Zimmer, and Alan J. MacDonald. Hierarchical A*: Searching abstraction hierarchies efficiently. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), pages 530–535, 1996.

[Keren et al., 2014] Sarah Keren, Avigdor Gal, and Erez Karpas. Goal recognition design. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS), pages 154–162, 2014.

[Keren et al., 2015] Sarah Keren, Avigdor Gal, and Erez Karpas. Goal recognition design for non-optimal agents. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pages 3298–3304, 2015.

[Mausam and Kolobov, 2012] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI Perspective. Morgan & Claypool Publishers, 2012.

[Menezes et al., 2012] Maria Viviane Menezes, Leliane N. de Barros, and Silvio do Lago Pereira. Planning task validation. In ICAPS Workshop on Scheduling and Planning Applications (SPARK), pages 48–55, 2012.

[Nilsson, 1980] Nils J. Nilsson. Principles of Artificial Intelligence. Tioga Publishers, Palo Alto, California, 1980.

[Pineda and Zilberstein, 2014] Luis Pineda and Shlomo Zilberstein. Planning under uncertainty using reduced models: Revisiting determinization. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS), pages 217–225, 2014.

[Pineda and Zilberstein, 2017] Luis Pineda and Shlomo Zilberstein. Generalizing the role of determinization in probabilistic planning. Technical Report UM-CS-2017-006, University of Massachusetts Amherst, 2017.

[Wayllace et al., 2016] Christabel Wayllace, Ping Hou, William Yeoh, and Tran Cao Son. Goal recognition design with stochastic agent action outcomes. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), pages 3279–3285, 2016.

[Yoon et al., 2007] Sung Wook Yoon, Alan Fern, and Robert Givan. FF-Replan: A baseline for probabilistic planning. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling (ICAPS), pages 352–359, 2007.

[Younes and Littman, 2004] Håkan L. S. Younes and Michael L. Littman. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proceedings of the International Planning Competition (IPC), 2004.

[Zhang and Parkes, 2008] Haoqi Zhang and David Parkes. Value-based policy teaching with active indirect elicitation. In Proceedings of the Twenty-Third Conference on Artificial Intelligence (AAAI), pages 208–214, 2008.

[Zhang et al., 2009] Haoqi Zhang, Yiling Chen, and David Parkes. A general approach to environment design with one agent. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2008, 2009.