All content in this area was uploaded by Xuesu Xiao on Mar 06, 2021

Team Orienteering Coverage Planning with Uncertain Reward

Bo Liu¹, Xuesu Xiao¹, and Peter Stone¹,²

Abstract— Many municipalities and large organizations have

fleets of vehicles that need to be coordinated for tasks such as garbage collection or infrastructure inspection. Motivated by this need, this paper focuses on the common subproblem in which a team of vehicles needs to plan coordinated routes to patrol an area over iterations while minimizing temporally and spatially dependent costs. In particular, at a specific location (e.g., a vertex on a graph), we assume the cost grows linearly in expectation with an unknown rate, and the cost is reset to zero whenever any vehicle visits the vertex (representing the robot “servicing” the vertex). We formulate this problem in graph terminology and call it Team Orienteering Coverage Planning with Uncertain Reward (TOCPUR). We propose to solve TOCPUR by simultaneously estimating the accumulated cost at every vertex on the graph and solving a novel variant of the Team Orienteering Problem (TOP) iteratively, which we call the Team Orienteering Coverage Problem (TOCP). We provide the first mixed integer programming formulation for the TOCP, as a significant adaptation of the original TOP. We introduce a new benchmark consisting of hundreds of randomly generated graphs for comparing different methods. We show the proposed solution outperforms both the exact TOP solution and a greedy algorithm. In addition, we provide a demo of our method on a team of three physical robots in a real-world environment. The code is publicly available at https://github.com/Cranial-XIX/TOCPUR.git.

I. INTRODUCTION

Mobile agent fleets are now being used for many purposes in our daily life, such as a team of mobile robots delivering food [1], a school bus fleet picking up students, or a garbage truck fleet collecting garbage.

In many such situations, visiting a particular location results in some benefit (e.g., collecting piled-up garbage), which we model as a reward. The overall objective is therefore to collect as much reward as possible, while ensuring that each vehicle’s travel time stays within some budget. This problem can be formulated as the Team Orienteering Problem (TOP) [2]. However, TOP assumes the reward at each location is a known constant before being collected and set to zero. This formulation does not suit problems in which the reward can accumulate over time. For example, consider a garbage truck fleet collecting garbage in a city. The amount of garbage in general grows over time, and it becomes much more beneficial to visit a location that has not been visited for a long time. The expected garbage growth rate might differ across locations and be unknown to the agents beforehand. But whenever an agent visits a location and collects the garbage, it can update its estimate of the growth rate at that location. In addition to assuming a known constant reward, the typical TOP formulation allows each location to be visited only once, which again significantly limits its application.

¹Xuesu Xiao, Bo Liu, and Peter Stone are with the Department of Computer Science, University of Texas at Austin, Austin, TX 78712, {bliu, xiao, pstone}@cs.utexas.edu. ²Peter Stone is also affiliated with Sony AI.

Fig. 1: Real-World Demonstration: From the same black vertex (v1), a fleet of three robots is tasked with visiting all red vertices and as many yellow vertices as possible. Within the same budget, TOCP finds the optimal plan that covers all red and yellow vertices (lower left), while the greedy baseline misses many optional yellow vertices and does a lot of backtracking (lower right).

As a result, in this paper, we introduce a novel coverage planning problem, called Team Orienteering Coverage Planning with Uncertain Reward (TOCPUR). In TOCPUR, a team of mobile agents is tasked to patrol an area over multiple iterations. The goal is to reduce the time- and place-dependent costs (i.e., negative rewards). In particular, we assume the cost grows between iterations and stays constant within an iteration, and each vertex can be visited multiple times in an iteration but the cost is only reduced once (e.g., the garbage is collected during the first visit on that day, where each day is an iteration). Optionally, we allow users to specify a subset of locations that the fleet has to visit. This option is useful when the fleet of agents has a primary task (e.g., routine checks at certain locations) and some secondary tasks (e.g., collecting as much garbage as possible).

Fig. 2: The TOP (left) and the TOCP (right) plans on the same graph instance (legend: edge, planned route with direction, start/end vertex; A marks a labeled vertex). In the standard TOP formulation, an agent cannot traverse a vertex twice. Hence in TOP, it is impossible to visit the nodes on the right of A in a closed route that starts/ends at the leftmost vertex.

We solve TOCPUR by simultaneously estimating the

unknown and growing costs over the area and solving a novel

variant of the TOP, called the Team Orienteering Coverage

Problem (TOCP), which allows multiple visits to the same

nodes. In this paper, we refer to TOCP and TOP plans as

the solutions to their corresponding problems. We introduce

a benchmark of hundreds of randomly generated graphs

and show improved performance using the proposed method

compared to both the TOP solution and a greedy algorithm.

We also demonstrate the proposed planner working on a team

of three physical robots in a real-world environment (Fig. 1).

II. RELATED WORK

In this section, we briefly review prior literature in orienteering problems and coverage planning.

a) Orienteering Problem: The proposed TOCPUR problem is closely related to the team orienteering problem (TOP) [2], which is a multi-tour extension of the orienteering problem (OP) [3]. The OP originates from the Travelling Salesman Problem and is heavily studied in the optimization community. In the standard OP, an agent needs to plan a path under a fixed length constraint that collects as much reward as possible as it traverses the vertices along that path. In the TOP, a team of agents plans consecutively, which is equivalent to a single agent planning multiple times. The major difference between TOCPUR and TOP is that TOCPUR allows each vertex to be visited multiple times, which better resembles many real-world scenarios but makes the problem harder. As a result, the planned route for each agent becomes a walk instead of a path (i.e., a walk that visits no vertex twice) (see Fig. 2). While this modification seems to be a small change, the core constraints in the formulation of the standard TOP can no longer be used. This modification is also the key contribution of our formulation of the TOCPUR problem.

In addition, TOCPUR models cumulative reward over time, where at each time step the reward is sampled independently and identically from a fixed unknown distribution. This setup is similar to the OP with stochastic profits (OPSP) [4]. However, in OPSP, the objective is to maximize the probability that the total collected profit will be greater than a predefined target value. In addition, OPSP does not allow for reward/profit to grow over time.

Besides TOP and OPSP, there exists a big family of variants of the OP. For instance, prior research has considered the OP with time windows (OPTW) [5], where the agent can only visit a vertex within a pre-specified time window. Generalized OP (GOP) extends OP such that the reward at each vertex is a non-linear function with respect to a set of attributes. Multi-agent OP (MAOP) [6] considers a competitive game among multiple agents trying to solve the OP individually. More recently, multi-visit TOP (MVTOP) [7] has been proposed, where each vertex needs to be served by multiple agents with different skills in a certain order. In general, different combinations of having multiple tours, multiple agents, different time windows, and stochastic rewards have been studied. However, in the formulation of all these variants of the OP, each vertex is only allowed to be visited once by a single vehicle in a single tour. TOCPUR, by allowing multiple visits to a single vertex for a single agent, has much broader applications. Gunawan et al. [8] provide a comprehensive overview of existing variants of the OP.

b) Coverage Planning: Coverage planning (CP) is the task in robotics of determining a path to cover all points in an area while avoiding obstacles. Standard CP problems involve both high-level path planning and low-level motion planning. Typical methods will break the free space, i.e., the space free of obstacles, down into simple, non-overlapping regions called cells [9]. Then an exhaustive walk through the graph defined by the decomposed cells is found. TOCPUR is similar to CP problems in that it also attempts to maximize the coverage over all vertices through planned routes. But in contrast, TOCPUR does not require complete coverage of all vertices. In our formulation, TOCPUR primarily focuses on the high-level path planning.

III. METHOD

In this section, we first define the TOCPUR problem using graph terminology. Then we propose to optimize the per-iteration objective while updating the reward estimation simultaneously. The per-iteration objective is then specified by a novel mixed integer programming formulation. In addition, we provide a simple greedy method as a baseline for solving the problem approximately.

A. Notation

We represent the area to be patrolled as a symmetric directed graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ is the set of $N$ vertices and $E$ is the edge set. $e_{ij}$ denotes an edge from $v_i$ to $v_j$, which has a length of $l_{ij}$.¹ In addition, we assume the agent fleet consists of $M$ agents in total, which we call agents $1$ to $M$ when the context is clear. We use $[x]$ to denote $\{1, 2, \ldots, x\}$, and $[a, b]$ to denote $\{a, a+1, \ldots, b\}$.

¹Since $G$ is symmetric directed, if $e_{ij}$ exists, $e_{ji}$ also exists and $l_{ij} = l_{ji}$. The symmetric directed assumption represents bidirectional traffic.

B. Problem Formulation

In this section, we define the TOCPUR problem. We consider $H$ iterations of route planning for a vehicle fleet of size $M$ over a symmetric directed graph $G = (V, E)$. Each vertex $v_i$ accumulates a cost of $\kappa_{i,t}$ right before the $t$-th iteration, where $\mathbb{E}[\kappa_{i,t}] = \mu_i^*$ and $\mu_i^*$ is unknown a priori. In other words, the cost at each vertex grows linearly in expectation at an unknown rate. (In the garbage truck application, think of an iteration as a day of garbage collection and $\kappa_{i,t}$ as the amount of garbage that appears overnight at location $i$ prior to day $t$.) Next, denote $T_{i,t}$ as the most recent iteration prior to $t$ when the vehicle fleet visited $v_i$. Then $v_i$ at iteration $t$ accumulates a total cost of $c_{i,t} = \sum_{k=T_{i,t}+1}^{t} \kappa_{i,k}$. The entire cost over $G$ at iteration $t$ is specified as $c(G, t) = \sum_{i \in [N]} c_{i,t}$. Intuitively, the total cost at the end of an iteration is the sum of the current accumulated costs of all the vertices not visited during that iteration (the current cost of those visited is reset to 0). The goal is then to plan $M$ routes $\{\tau_{m,t}\}_{m=1}^{M}$, each no longer than a maximum length $l_{\max}$, for all vehicles at each iteration $t \in [H]$, such that the total cost on $G$ over the horizon is minimized. Specifically, each route $\tau_{m,t} = (v_1, \ldots, v_1)$ is a sequence of vertices that starts and ends in $v_1$. Optionally, we can include a set of “must visit” vertices $I$ such that the fleet has to visit all vertices in $I$ during each iteration. The whole problem can be described as the following optimization:

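\begin{equation}
\begin{aligned}
\underset{\{\{\tau_{m,t}\}_{m=1}^{M}\}_{t=1}^{H}}{\text{minimize}} \quad & L = \sum_{t=1}^{H} c(G, t), \quad \text{where } c(G, t) = \sum_{i=1}^{N} \sum_{k=T_{i,t}+1}^{t} \kappa_{i,k}, \\
\text{subject to} \quad & \forall m, t, \quad \sum_{e_{ij} \in \tau_{m,t}} l_{ij} \le l_{\max}, \\
& \forall i, t, \quad T_{i,t+1} = \begin{cases} t & \text{if } v_i \in \bigcup_m \tau_{m,t}, \\ T_{i,t} & \text{otherwise,} \end{cases} \\
& \forall i, \quad T_{i,1} = 0, \\
& \text{(optionally)} \quad \forall v_i \in I, \quad v_i \in \bigcup_m \tau_{m,t}.
\end{aligned}
\tag{1}
\end{equation}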

Here, $\bigcup_m \tau_{m,t} = \bigcup_m \{v \mid v \in \tau_{m,t}\}$, and we abuse the notation a bit to write $e_{ij} \in \tau$ if $v_i, v_j$ appear consecutively in $\tau$. This optimization problem (Opt. 1) is NP-hard since the Travelling Salesman Problem can be reduced to it. We emphasize two points: 1) As $\kappa_{i,t}$ is drawn randomly from a distribution with an unknown mean $\mu_i^*$, it is in general impossible to have an optimal open-loop plan for solving Opt. 1. 2) Unlike in TOP, we do not assume that each $v_i$ can only appear once in any $\tau_{m,t}$. However, as we will see in the following, we can safely assume each $e_{ij}$ appears at most once in any $\tau_{m,t}$.
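The objective of Opt. 1 can be evaluated for a fixed visiting schedule with a few lines of bookkeeping. The sketch below is only an illustration under stated assumptions: it abstracts routes into the set of vertices serviced per iteration, measures $c(G, t)$ at the end of each iteration (after resets, following the intuitive description above), and uses 0-based indices. The function name `total_cost`, the six-vertex costs, and the two schedules are hypothetical; the numbers are chosen so that the totals equal the 12 and 9 reported for the example of Fig. 4, without reproducing that figure's actual layout.

```python
from typing import List, Sequence, Set


def total_cost(kappa: Sequence[Sequence[float]],
               visited: Sequence[Set[int]]) -> float:
    """Return L = sum over iterations of c(G, t) for a fixed schedule.

    kappa[t][i]  : cost that appears at vertex i right before iteration t
    visited[t]   : vertices serviced by any vehicle during iteration t
    (iterations and vertices are 0-indexed here)
    """
    n = len(kappa[0])
    pending: List[float] = [0.0] * n   # uncollected cost at each vertex
    total = 0.0
    for growth, served in zip(kappa, visited):
        for i in range(n):
            pending[i] += growth[i]    # new cost arrives before the iteration
        for i in served:
            pending[i] = 0.0           # a visit resets the vertex to zero
        total += sum(pending)          # c(G, t), measured after the resets
    return total


# Hypothetical instance: six vertices, an initial cost only, no later growth.
kappa = [[5.0, 1.0, 4.0, 4.0, 3.0, 2.0], [0.0] * 6]
myopic = [{0, 1, 4, 5}, {2}]           # clears the most cost at t = 1
farsighted = [{0, 1, 2}, {3, 4, 5}]    # clears less at t = 1, all by t = 2
print(total_cost(kappa, myopic), total_cost(kappa, farsighted))
```

The first schedule leaves 8 and then 4 units uncollected (total 12), while the second leaves 9 and then 0 (total 9), which is why a plan that is best for a single iteration need not be best over the horizon.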

C. Reward Estimation and Per-Iteration Planning

Fig. 3: If $e_{ij}$ is visited twice (blue), we can remove both traversals by reversing the direction of another route connecting $v_j$ to $v_i$.

Solving Opt. 1 exactly for large $H$ is computationally intractable. In fact, even when $H = 1$, the problem remains NP-hard. In addition, as the true parameters $\{\mu_i^*\}_{i=1}^{N}$ are hidden, an optimal solution must be a closed-loop plan that takes past observations (e.g., $\kappa$) into consideration. Therefore, we propose to optimize the cost $c(G, t)$ iteratively while simultaneously updating the estimates of $\{\mu_i^*\}_{i=1}^{N}$. Specifically, we keep track of $T_{i,t}$ for each node $v_i$, as well as the total observed cumulative cost at $v_i$, i.e., $C_{i,t} = \sum_{k=1}^{T_{i,t}} \kappa_{i,k}$. The maximum likelihood estimate of $\mu_i^*$ at time $t$ is therefore

\begin{equation}
\hat{\mu}_{i,t} =
\begin{cases}
\dfrac{C_{i,t}}{T_{i,t}} = \dfrac{\sum_{k=1}^{T_{i,t}} \kappa_{i,k}}{T_{i,t}} & T_{i,t} > 0 \\
\mu_{\text{default}} & T_{i,t} = 0,
\end{cases}
\tag{2}
\end{equation}

where $\mu_{\text{default}}$ is a default value or a pre-specified value based on prior knowledge, used when a node has not yet been visited. Therefore, at each iteration $t$, we can use $\{\hat{\mu}_{i,t}\}_{i=1}^{N}$ to approximately estimate $c(G, t)$ as $\hat{c}(G, t) = \sum_{i=1}^{N} \hat{\mu}_{i,t} (t - T_{i,t})$.
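As a small worked instance of Eq. 2 and of the estimate $\hat{c}(G, t)$, the following sketch computes both from the bookkeeping quantities $C_{i,t}$ and $T_{i,t}$. The function names and the three-vertex numbers are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence


def estimate_rates(C: Sequence[float], T: Sequence[int],
                   mu_default: float) -> List[float]:
    """Eq. 2: observed cumulative cost divided by the last visit iteration."""
    return [c / t if t > 0 else mu_default for c, t in zip(C, T)]


def estimated_costs(mu_hat: Sequence[float], T: Sequence[int],
                    t: int) -> List[float]:
    """Estimated accumulated cost of each vertex at iteration t."""
    return [mu * (t - last) for mu, last in zip(mu_hat, T)]


# Hypothetical bookkeeping for three vertices at iteration t = 5.
C = [3.0, 0.0, 1.0]      # cost observed up to each vertex's last visit
T = [4, 0, 2]            # last visit iteration (0 means never visited)
mu_hat = estimate_rates(C, T, mu_default=0.5)
c_hat = estimated_costs(mu_hat, T, t=5)
print(mu_hat, c_hat, sum(c_hat))
```

The never-visited vertex falls back to `mu_default`, and its estimated cost grows with every iteration it remains unvisited, which is what eventually draws a planner toward it.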

Following the convention in orienteering problems [2], we formulate the per-iteration optimization as a mixed integer program (MIP). In the following, we will temporarily ignore the subscript $t$. Let $\{x_{ijm}\}$, $\{y_{im}\}$, and $\{z_i\}$ be the binary decision variables, with $i, j \in [N]$ and $m \in [M]$. $x_{ijm} = 1$ if agent $m$ traverses $e_{ij}$ and $0$ otherwise. Similarly, $y_{im} = 1$ if agent $m$ visits $v_i$ and $z_i = 1$ if any agent visits $v_i$. It is sufficient to assume $x_{ijm}$ is binary, as justified by the following proposition.

Proposition 1. Any optimal solution of Opt. 1 has an equivalent solution where each $e_{ij}$ is traversed at most once.

Proof: Assume otherwise. Let $e_{ij}$ be an edge that is traversed at least twice. If $v_i$ is only visited once, then certainly we can remove one traversal of $e_{ij}$ with no problems. If $v_i$ is visited more than once, then there has to exist a route $p$, i.e., a sequence of vertices, that goes from $v_j$ back to $v_i$. But then, since $G$ is symmetric directed, we can remove both traversals of $e_{ij}$ by reversing the traversal direction of the edges along the route $p$. This change results in visiting the same set of vertices with a shorter total length (see Fig. 3).
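The reversal argument in the proof can be made concrete on a small walk. The sketch below is an illustration under assumptions, not the authors' code: the helper names, the five-edge graph, and its lengths are hypothetical, and walks are plain lists of vertex labels. If a walk contains the segment $v_i, v_j, \ldots, v_i, v_j$, the part from $v_j$ back to $v_i$ is the route $p$; walking $p$ backwards from $v_i$ ends at $v_j$ and makes both traversals of $e_{ij}$ unnecessary.

```python
from typing import Dict, FrozenSet, List


def walk_length(walk: List[int], length: Dict[FrozenSet[int], float]) -> float:
    """Total length of a walk; lengths are symmetric, so keys are unordered."""
    return sum(length[frozenset((a, b))] for a, b in zip(walk, walk[1:]))


def remove_double_traversal(walk: List[int], i: int, j: int) -> List[int]:
    """Remove two traversals of the directed edge (i, j) from a walk."""
    hits = [k for k in range(len(walk) - 1)
            if walk[k] == i and walk[k + 1] == j]
    if len(hits) < 2:
        return list(walk)              # nothing to remove
    a, b = hits[0], hits[1]
    # walk[a + 1 .. b] is the route p: it starts at j and returns to i.
    p = walk[a + 1:b + 1]
    # Enter i, follow p backwards to j, then continue after the second (i, j).
    return walk[:a] + p[::-1] + walk[b + 2:]


# Hypothetical symmetric graph; the walk traverses edge (2, 3) twice.
length = {
    frozenset((1, 2)): 1.0, frozenset((2, 3)): 2.0, frozenset((3, 4)): 1.5,
    frozenset((2, 4)): 1.0, frozenset((1, 3)): 2.5,
}
original = [1, 2, 3, 4, 2, 3, 1]
shorter = remove_double_traversal(original, 2, 3)
print(shorter, walk_length(original, length), walk_length(shorter, length))
```

In this example the new walk visits the same four vertices and is shorter by exactly twice the length of the removed edge.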

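The per-iteration optimization is then formulated as the following MIP, which we call the team orienteering coverage problem (TOCP).²

\begin{align}
\text{maximize} \quad & \sum_{m=1}^{M} \sum_{i=2}^{N} \hat{c}_{i,t}\, z_i \tag{3} \\
\forall m, \quad & \sum_{j=2}^{N} x_{1jm} = \sum_{i=2}^{N} x_{i1m} = 1 \tag{4} \\
\forall m, i, \quad & x_{iim} = 0 \tag{5} \\
\forall m, i, \quad & y_{im} \le \sum_{j=1}^{N} x_{ijm} \le y_{im} \cdot \frac{l_{\max}}{\min_{e_{ij} \in E} l_{ij}} \tag{6} \\
\forall i, \quad & z_i \le \sum_{m} y_{im} \le z_i \cdot M \tag{7} \\
\forall m, i, \quad & \sum_{j=1}^{N} x_{ijm} = \sum_{j=1}^{N} x_{jim} \tag{8} \\
\forall m, \quad & \sum_{i=1}^{N} \sum_{j=1}^{N} l_{ij}\, x_{ijm} \le l_{\max} \tag{9} \\
\forall m, \quad & \sum_{j=1}^{N} u_{1jm} - \sum_{i=1}^{N} u_{i1m} = \sum_{j=2}^{N} y_{jm} \tag{10} \\
\forall m, i > 1, \quad & \sum_{j=1}^{N} u_{ijm} - \sum_{j=1}^{N} u_{jim} = y_{im} \tag{11} \\
\forall m, i, j, \quad & 0 \le u_{ijm} \le N \cdot x_{ijm} \tag{12} \\
\text{(optionally)} \quad \forall v_i \in I, \quad & z_i = 1 \tag{13}
\end{align}

²Without further specification, we assume $m \in [M]$, $i, j \in [N]$.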

Eq. 3 is the objective. Eq. 4 ensures all agents start and end in $v_1$. Eq. 5 eliminates self-loops. Eq. 6-7 enforce the definitions of $y_{im}$ and $z_i$. For instance, $y_{im} = 0$ if and only if $\forall j,\ x_{ijm} = 0$.³ Eq. 8 ensures the conservation of flow. Eq. 9 ensures each agent travels within the length budget $l_{\max}$. Eq. 10-12 ensure the route $\tau_m$ found for each agent $m$ is strongly connected. The variables $\{u_{ijm}\}$ define an amount of flow that begins at $v_1$ and is reduced by 1 at every node $m$ visits in sequence. For instance, the net outflow at $v_1$ for $m$ should be $\sum_{j \in [2, N]} y_{jm}$ (Eq. 10), the number of vertices $m$ visits in $\tau_m$. If $m$’s trajectory were to include two disconnected components, then there would be no way to consume all of the flow (i.e., Eq. 10-12 would be violated). Note that $u_{ijm}$ should only be positive if $m$ traverses $e_{ij}$, which is ensured by Eq. 12. Finally, Eq. 13 optionally ensures that all “must visit” vertices in $I$ are visited. In the above MIP, Eq. 6-7 and Eq. 10-12 are novel constraints designed specifically for TOCP. The whole algorithm for solving TOCPUR is summarized in Alg. 1.
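To see why the degree constraints alone are not enough, consider a candidate edge selection for one agent. The sketch below is a hedged illustration and does not solve the MIP: instead of the flow variables $u_{ijm}$, it checks the property they are meant to guarantee with a plain graph search. The function names and both edge sets are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def is_balanced(edges: Iterable[Edge]) -> bool:
    """Eq. 8: every vertex is entered as often as it is left."""
    edges = list(edges)
    return Counter(a for a, _ in edges) == Counter(b for _, b in edges)


def reachable(edges: Iterable[Edge], start: int = 1) -> Set[int]:
    """Vertices reachable from `start` along the selected directed edges."""
    succ: Dict[int, List[int]] = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    seen, stack = {start}, [start]
    while stack:
        for w in succ.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def has_detached_cycle(edges: List[Edge], start: int = 1) -> bool:
    """True if some visited vertex cannot be reached from the start vertex."""
    visited = {a for a, _ in edges}
    return not visited <= reachable(edges, start)


detached = [(1, 2), (2, 1), (3, 4), (4, 3)]   # balanced, but 3 and 4 float free
revisit = [(1, 2), (2, 3), (3, 2), (2, 1)]    # one walk that passes v_2 twice
print(is_balanced(detached), has_detached_cycle(detached))
print(is_balanced(revisit), has_detached_cycle(revisit))
```

Both selections satisfy the degree balance of Eq. 8, but only the second corresponds to a single closed walk from $v_1$. It also passes through one vertex twice, which a standard TOP formulation would forbid.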

Remark: The sub-optimality of Alg. 1 originates from two sources: 1) the decomposition sub-optimality caused by decomposing the long-horizon planning into per-iteration planning; and 2) the uncertainty sub-optimality caused by inaccurate estimation of $\{\mu_i^*\}$, which decreases as we visit each vertex more often. It is not immediately clear whether the decomposition sub-optimality can ever arise. To illustrate that it can, we provide an example in which, even when the true $\{\mu_i^*\}$ are provided, the per-iteration optimal plans are still not optimal over the horizon. The example is illustrated in Fig. 4 with $H = 2$. In Fig. 4, the leftmost graph shows the initial costs on all vertices. For simplicity of illustration, we assume there is no growing cost (so each vertex has only accumulated an initial cost). Due to the travel length budget $l_{\max} = 5$, the per-iteration optimal plans achieve a total cost of 12, while the optimal plans achieve a cost of 9 over two steps.

³The ideal constraint is $y_{im} = \mathbb{1}\left(\sum_j x_{ijm} \ge 1\right)$. But due to the need for constraints to be linear, we represent the constraint with two inequalities.

Algorithm 1 Reward Estimation and Per-Iteration Planning
1: Maintain: for each $v_i$, we maintain $T_{i,t}$, the most recent iteration when $v_i$ was visited ($T_{i,1} = 0$), and $C_{i,t}$, the observed cumulative cost at $v_i$ up to time $T_{i,t}$.
2: Input: the graph $G = (V, E)$, the “must visit” vertices $I$, the maximum traversal budget $l_{\max}$.
3: for $t = 1$ to $H$ do
4:     $\forall i$, update $\hat{\mu}_{i,t}$ according to Eq. 2.
5:     $\forall i$, $\hat{c}_{i,t} = \hat{\mu}_{i,t} \cdot (t - T_{i,t})$.
6:     Plan $\{\tau_{m,t}\}_{m=1}^{M}$ by solving Opt. 3-13.
7:     $\forall i \in [N]$, $T_{i,t+1} = t$ if $v_i \in \bigcup_m \tau_{m,t}$, and $T_{i,t+1} = T_{i,t}$ otherwise.
8:     $\forall i$, $C_{i,t+1} = \sum_{k=1}^{T_{i,t+1}} \kappa_{i,k}$.
9: end for

Algorithm 2 Greedy Per-Iteration Planning
1: Input: the graph $G = (V, E)$, the estimated cost growth for all vertices $\{\hat{\mu}_{i,t}\}_{i=1}^{N}$, the most recent visit iteration of each node $\{T_{i,t}\}_{i=1}^{N}$, the “must visit” vertices $I$, the maximum traversal budget $l_{\max}$, and $D: V \times V \to \mathbb{R}$, a function that outputs the shortest distance between any pair of vertices (e.g., calculated by Floyd-Warshall).
2: $\forall i \in [N]$, compute $\hat{c}_{i,t} = \hat{\mu}_{i,t} \cdot (t - T_{i,t})$.
3: $A \leftarrow I$ and $B \leftarrow V \setminus I$. (must/optionally visit vertices)
4: $\forall m \in [M]$, $\tau_{m,t} = [\,]$, $f_m = 0$ (whether $m$ finishes), $v_m = v_1$ (current location of $m$), $l_m = 0$ (travelled distance of $m$).
5: while $\sum_m f_m < M$ do
6:     for agent $m \in \{x \mid f_x = 0,\ x \in [M]\}$ do
7:         $Z \leftarrow \{v \mid l_m + D(v_m, v) + D(v, v_1) \le l_{\max}\}$.
8:         if $A \cap Z \ne \emptyset$ then
9:             $v \leftarrow \arg\min_{x \in A \cap Z} D(v_m, x)$; $X \leftarrow A$.
10:        else if $B \cap Z \ne \emptyset$ then
11:            $v \leftarrow \arg\max_{v_i \in B \cap Z} \hat{c}_{i,t} / D(v_m, v_i)$; $X \leftarrow B$.
12:        else
13:            $v \leftarrow v_1$; $f_m = 1$; $X \leftarrow \emptyset$.
14:        end if
15:        $l_m \leftarrow l_m + D(v_m, v)$.
16:        Let $p$ be the shortest path from $v_m$ to $v$.
17:        $X \leftarrow X \setminus p$; append $p$ to $\tau_{m,t}$; $v_m \leftarrow v$.
18:    end for
19: end while
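The estimate-then-plan loop of Alg. 1 can be sketched with the per-iteration planner left pluggable. This is a simplified illustration under assumptions rather than the released implementation: routing is abstracted away, so the planner simply returns the set of vertices to service, and `visit_top_k` is a hypothetical stand-in for solving Opt. 3-13. Costs are measured at the end of each iteration, and vertices are 0-indexed.

```python
from typing import Callable, List, Sequence, Set, Tuple

Planner = Callable[[List[float]], Set[int]]


def run_closed_loop(kappa: Sequence[Sequence[float]], plan: Planner,
                    mu_default: float) -> Tuple[float, List[int], List[float]]:
    """Run the loop for len(kappa) iterations and return (total cost, T, C)."""
    n = len(kappa[0])
    T = [0] * n            # T[i]: last iteration in which vertex i was visited
    C = [0.0] * n          # C[i]: cost observed at vertex i up to iteration T[i]
    pending = [0.0] * n    # true uncollected cost, hidden until a visit
    total = 0.0
    for t in range(1, len(kappa) + 1):
        pending = [p + k for p, k in zip(pending, kappa[t - 1])]
        mu_hat = [C[i] / T[i] if T[i] > 0 else mu_default for i in range(n)]
        c_hat = [mu_hat[i] * (t - T[i]) for i in range(n)]
        for i in plan(c_hat):          # stand-in for solving Opt. 3-13
            C[i] += pending[i]         # the visit reveals the accumulated cost
            pending[i] = 0.0
            T[i] = t
        total += sum(pending)          # cost left on the graph after iteration t
    return total, T, C


def visit_top_k(k: int) -> Planner:
    """Hypothetical planner: service the k vertices with the largest estimate."""
    def plan(c_hat: List[float]) -> Set[int]:
        order = sorted(range(len(c_hat)), key=lambda i: (-c_hat[i], i))
        return set(order[:k])
    return plan


# Two vertices with constant true growth 1.0 and 0.25, three iterations.
kappa = [[1.0, 0.25]] * 3
total, T, C = run_closed_loop(kappa, visit_top_k(1), mu_default=0.5)
print(total, T, C)
```

In this toy run the fast-growing vertex is serviced twice before the estimate for the never-visited vertex, which grows with the time since its last visit, becomes large enough to attract a visit.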

D. Greedy Per Step Planning

In addition to the exact MIP solution from Opt. 3-13, we provide a greedy algorithm that efficiently and approximately solves the per-iteration problem, summarized in Alg. 2. The principal idea is to prioritize “must visit” vertices first. When all “must visit” vertices have been visited, we prioritize vertices $v_i$ with a larger ratio between $\hat{c}_{i,t}$ and the distance from the agent’s current location to $v_i$.
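A compact Python rendering of the greedy procedure may help make the bookkeeping explicit. It is a sketch under assumptions rather than the released code: vertices are numbered from 0 with vertex 0 playing the role of $v_1$, edge lengths are positive and given once per undirected pair, ties are broken arbitrarily, and the function names and the five-vertex example are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def floyd_warshall(n: int, length: Dict[Tuple[int, int], float]):
    """All-pairs shortest distances and next-hop table for vertices 0..n-1."""
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[-1] * n for _ in range(n)]
    for v in range(n):
        dist[v][v], nxt[v][v] = 0.0, v
    for (a, b), l in length.items():
        for u, w in ((a, b), (b, a)):          # symmetric directed graph
            if l < dist[u][w]:
                dist[u][w], nxt[u][w] = l, w
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def shortest_path(nxt, a: int, b: int) -> List[int]:
    """Vertices on a shortest path from a to b, both endpoints included."""
    path = [a]
    while a != b:
        a = nxt[a][b]
        path.append(a)
    return path


def greedy_plan(n, length, c_hat, must_visit, num_agents, l_max):
    """Greedy routes in the spirit of Alg. 2; vertex 0 plays the role of v_1."""
    dist, nxt = floyd_warshall(n, length)
    A: Set[int] = set(must_visit) - {0}        # must-visit vertices left
    B: Set[int] = set(range(1, n)) - A         # optional vertices left
    routes = [[0] for _ in range(num_agents)]
    pos = [0] * num_agents
    travelled = [0.0] * num_agents
    done = [False] * num_agents
    while not all(done):
        for m in range(num_agents):
            if done[m]:
                continue
            here = pos[m]
            # Z: vertices the agent can reach and still return home in budget.
            Z = {v for v in range(n)
                 if travelled[m] + dist[here][v] + dist[v][0] <= l_max}
            if A & Z:
                target, X = min(A & Z, key=lambda v: dist[here][v]), A
            elif B & Z:
                target, X = max(B & Z,
                                key=lambda v: c_hat[v] / dist[here][v]), B
            else:
                target, X, done[m] = 0, set(), True
            travelled[m] += dist[here][target]
            path = shortest_path(nxt, here, target)
            X -= set(path)                     # in place: X aliases A or B
            routes[m].extend(path[1:])
            pos[m] = target
    return routes


# Hypothetical graph: a chain 0-1-2-3 plus a spur 0-4, all edges of length 1.
length = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (0, 4): 1.0}
c_hat = [0.0, 1.0, 5.0, 1.0, 2.0]
routes = greedy_plan(5, length, c_hat, must_visit={4}, num_agents=1, l_max=6.0)
print(routes)
```

With a budget of 6, the single agent first serves the must-visit vertex 4, then heads for vertex 2 because it has the best cost-to-distance ratio among the vertices it can still afford, and finally returns home; vertex 3 is out of reach.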

[Figure 4 panels: the graph with its vertex costs and edge lengths; the per-iteration optimal plans at t = 1 and t = 2 (cost = 4 + 4 + 4 = 12); the optimal plans at t = 1 and t = 2 (cost = 4 + 3 + 2 + 0 = 9). Legend: vertex with cost x, edge with length x, planned routes, start/end vertex; $l_{\max} = 5$.]

Fig. 4: An example problem with $H = 2$ and $M = 1$, where the per-iteration plans are not optimal. Left: the graph $G$. The initial vertex cost is specified on each vertex and the edge length is labeled on each edge. Note that all edges are bidirectional and we assume $l_{\max} = 5$. Middle: the optimal per-iteration plans visit 4 vertices at $t = 1$, and only one of the two unvisited vertices at $t = 2$. Right: the optimal plans visit 3 nodes at each iteration, covering all nodes in the end.

IV. EXPERIMENTS

In this section, we conduct simulated experiments to

evaluate two hypotheses: 1) the proposed iterative method

outperforms an exact TOP method, and 2) solving the MIP in

Opt. 3-13 exactly outperforms the simple greedy algorithm.

A. Simulated Experiments

To compare the proposed method with the exact TOP method and the greedy method, we introduce a new benchmark that consists of 600 randomly generated graphs. For each graph, the number of vertices $N$ is drawn uniformly at random from $\{10, 12, 14, 16, 18, 20\}$. The vertex $v_1$ is positioned at $(0, 0)$ in the standard Euclidean plane. For vertices $v_2, \ldots, v_N$, each vertex’s position is drawn uniformly at random from a $10 \times 10$ square centered at the origin, with its sides parallel to the x- and y-axes. Then each vertex $v_i$ is connected to its closest $n_i$ vertices, where $n_i$ is drawn uniformly at random from $\{3, 4, 5\}$. $l_{\max}$ is drawn from $U(20, 20 + 2N)$, where $U(a, b)$ denotes a uniform distribution from $a$ to $b$. The number of agents $M$ is drawn uniformly at random from $\{2, 3, 4, 5\}$. The number of “must visit” vertices, $N_I$, is set to $\min(X, M)$, where $X$ is drawn uniformly at random from $\{1, 2, 3\}$. Denote $V_{\text{reachable}}$ as all vertices reachable from $v_1$ within the travel budget $l_{\max}$; we sample $N_I$ vertices from $V_{\text{reachable}}$ without replacement to form $I$. For each $v_i$, the expected growth of cost $\mu_i^*$ is drawn from $U(0.1, 0.9)$. For each iteration $t$, the actual growth is $\kappa_{i,t} = \min(\max(x_{i,t}, 0), 1)$, where $x_{i,t} \sim \mathcal{N}(\mu_i^*, 0.1)$. Finally, for each horizon $H$ in $\{2, 4, 6, 8, 10\}$, we generate 120 random graphs with random seeds 1 to 120 following the above procedure, which results in a total of 600 random graphs.
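The generation procedure can be sketched as follows. This is an approximation under assumptions, not the benchmark's actual generator, so it will not reproduce the released graphs. In particular, the text does not say whether 0.1 in $\mathcal{N}(\mu_i^*, 0.1)$ is a variance or a standard deviation (the sketch exposes it as a standard deviation `sigma`), and "reachable within the travel budget" is interpreted here as a round trip from $v_1$ of length at most $l_{\max}$. The function name and the returned dictionary layout are hypothetical; vertex 0 plays the role of $v_1$.

```python
import heapq
import math
import random


def generate_instance(seed: int, horizon: int, sigma: float = 0.1) -> dict:
    """Sample one benchmark-style instance; vertex 0 plays the role of v_1."""
    rng = random.Random(seed)
    n = rng.choice([10, 12, 14, 16, 18, 20])
    # v_1 sits at the origin; the others are uniform in the 10 x 10 square.
    pos = [(0.0, 0.0)] + [(rng.uniform(-5, 5), rng.uniform(-5, 5))
                          for _ in range(n - 1)]
    adj = {i: {} for i in range(n)}
    for i in range(n):
        k = rng.choice([3, 4, 5])
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(pos[i], pos[j]))
        for j in others[:k]:                   # connect to the k closest
            adj[i][j] = adj[j][i] = math.dist(pos[i], pos[j])
    l_max = rng.uniform(20, 20 + 2 * n)
    num_agents = rng.choice([2, 3, 4, 5])
    # Dijkstra from v_1 to find the vertices that fit in a round trip.
    dist = {0: 0.0}
    heap = [(0.0, 0)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, math.inf):
            continue                           # stale heap entry
        for w, l in adj[v].items():
            if d + l < dist.get(w, math.inf):
                dist[w] = d + l
                heapq.heappush(heap, (d + l, w))
    reachable = sorted(v for v, d in dist.items() if v != 0 and 2 * d <= l_max)
    n_must = min(rng.choice([1, 2, 3]), num_agents, len(reachable))
    must_visit = rng.sample(reachable, n_must)
    mu = [rng.uniform(0.1, 0.9) for _ in range(n)]
    # Clipped Gaussian growth for every iteration and vertex.
    kappa = [[min(max(rng.gauss(mu[i], sigma), 0.0), 1.0) for i in range(n)]
             for _ in range(horizon)]
    return {"n": n, "pos": pos, "adj": adj, "l_max": l_max,
            "num_agents": num_agents, "must_visit": must_visit,
            "mu": mu, "kappa": kappa}


instance = generate_instance(seed=1, horizon=4)
print(instance["n"], instance["num_agents"], instance["must_visit"])
```

Seeding a private `random.Random` keeps each instance reproducible and independent of any other use of the global random state.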

We summarize the experiment results in Fig. 5. In all cases, we follow Alg. 1 and vary the per-iteration method (line 6 of Alg. 1). We compare TOCP solutions with exact TOP solutions and the solutions found by Alg. 2. We limit the per-iteration computation time of TOP and TOCP to 1000 seconds. Therefore, occasionally TOP and TOCP might not find a feasible solution. The first row of Fig. 5 shows the total cumulative cost for each horizon $H$, averaged over the subset of the 120 random graphs where all methods find a solution. The total cumulative cost is essentially the objective in Opt. 1. TOCP solutions outperform the ones found by the greedy method, and both outperform TOP solutions by a large margin. We provide additional pairwise independent-samples t-tests in Table I and highlight the conclusions that are statistically significant ($p \le 0.05$).

Fig. 5: Comparison among TOCP, TOP, and the greedy algorithm on 600 randomly generated graphs.

             TOP-TOCP               Greedy-TOCP
Horizon      t-score    p-value     t-score    p-value
H = 2        3.63       .0004       2.38       .018
H = 4        4.00       .0001       2.04       .043
H = 6        4.36       .00004      1.97       .050
H = 8        4.46       .00003      1.95       .054
H = 10       4.48       .00003      1.96       .052

TABLE I: Independent-samples t-tests for TOP vs. TOCP and the greedy method vs. TOCP.

The second row in Fig. 5 reports the average computation time, again over the subset of graphs where all methods find solutions. As expected, TOP and TOCP take much longer than the greedy method does. The last row of Fig. 5 reports the number of graphs for which each method fails to find a solution. Since TOP does not allow a second visit to any vertex, TOP fails more often than TOCP.

In addition to the quantitative evaluations in Fig. 5, we also provide a qualitative visualization with $H = 1$ and $M = 3$ in Fig. 6 to further showcase the differences among the three methods. Among the three methods, TOCP is the only one that covers all vertices, thus clearing all the costs.

Fig. 6: Solutions from TOP, TOCP and the greedy method on an example graph with $H = 1$ and $M = 3$.

B. Physical Demo

We additionally record a demo of applying Alg. 1 on three physical robots with $H = 1$ in a real-world environment.⁴

V. CONCLUSION

In this work, we formulate a novel variant of the team orienteering problem (TOP) that allows multiple visits to the same vertex and uncertain cumulative costs on each vertex over a horizon. We propose a method to iteratively find the per-iteration optimal plans using a novel mixed integer programming formulation based on the maximum likelihood estimates of each vertex’s costs. The simulated experiments show that the proposed method greatly outperforms the exact TOP solution. We also provide a real-world demo of the proposed method on three physical robots. In this paper, we focus on high-level route planning. An interesting direction for future work is to incorporate obstacle avoidance into TOCPUR.

ACKNOWLEDGMENT

This work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARO (W911NF-19-2-0333), DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research. We thank Harel Yedidsion and Yuqian Jiang for their thoughtful discussion on designing the greedy algorithm.

⁴The video link is at https://drive.google.com/file/d/1pwE-zLbpcYK2DGeWZ2L5ePZsuO78KCpc/view?usp=sharing.

REFERENCES

[1] Y. Sun, L. Guan, Z. Chang, C. Li, and Y. Gao, “Design of a low-cost indoor navigation system for food delivery robot based on multi-sensor information fusion,” Sensors, vol. 19, no. 22, p. 4980, 2019.

[2] I.-M. Chao, B. L. Golden, and E. A. Wasil, “The team orienteering problem,” European Journal of Operational Research, vol. 88, no. 3, pp. 464–474, 1996.

[3] B. L. Golden, L. Levy, and R. Vohra, “The orienteering problem,” Naval Research Logistics (NRL), vol. 34, no. 3, pp. 307–318, 1987.

[4] T. Ilhan, S. M. Iravani, and M. S. Daskin, “The orienteering problem with stochastic profits,” IIE Transactions, vol. 40, no. 4, pp. 406–421, 2008.

[5] N. Labadie, R. Mansini, J. Melechovský, and R. W. Calvo, “The team orienteering problem with time windows: An LP-based granular variable neighborhood search,” European Journal of Operational Research, vol. 220, no. 1, pp. 15–27, 2012.

[6] C. Chen, S.-F. Cheng, and H. C. Lau, “Multi-agent orienteering problem with time-dependent capacity constraints,” Web Intelligence and Agent Systems: An International Journal, vol. 12, no. 4, pp. 347–358, 2014.

[7] S. Hanafi, R. Mansini, and R. Zanotti, “The multi-visit team orienteering problem with precedence constraints,” European Journal of Operational Research, vol. 282, no. 2, pp. 515–529, 2020.

[8] A. Gunawan, H. C. Lau, and P. Vansteenwegen, “Orienteering problem: A survey of recent variants, solution approaches and applications,” European Journal of Operational Research, vol. 255, no. 2, pp. 315–332, 2016.

[9] T. Oksanen and A. Visala, “Coverage path planning algorithms for agricultural field machines,” Journal of Field Robotics, vol. 26, no. 8, pp. 651–668, 2009.