
Capture the flag games: Observations from the 2022 Aquaticus competition ⋆

Philipp Braun ∗, Iman Shames ∗, David Hubczenko ∗∗, Anna Dostovalova ∗∗, Bradley Fraser ∗∗

∗ School of Engineering, Australian National University, Canberra, Australia (e-mail: {philipp.braun,iman.shames}@anu.edu.au).
∗∗ Defence Science and Technology Group, Adelaide, Australia (e-mail: {anna.dostovalova,bradley.fraser1,david.hubczenko}@defence.gov.au).

⋆ This research was funded by the DSTG DWE STaR Shot Program. P. Braun and I. Shames are additionally supported by the CIICADA Lab.

Abstract: Undoubtedly, reinforcement learning has gained significant momentum in recent years through successful applications in various fields and through prominent worldwide exposure via applications such as AlphaZero and AlphaGo. Despite this success, the potential and the limitations of reinforcement learning, particularly in control applications, are only beginning to be rigorously understood. In particular, the curse of dimensionality arising in system identification and controller design for control systems acting on continuous state and input spaces appears to be a crucial barrier for (model-free) reinforcement learning that is yet to be overcome. In this paper we illustrate these observations by comparing classical controller designs with reinforcement learning, based on our experiences from the 2022 Aquaticus competition. The Aquaticus competition is a capture the flag multi-player game in which two teams of (autonomous) robots compete with each other. We present a simple controller that combines ideas from optimal control, model predictive control and path planning using Dubins curves. This controller outperformed the competing reinforcement-learning-based controllers in the Aquaticus competition. While reinforcement learning might be well suited for a competitive controller design in future competitions, a good understanding of the problem seems to be necessary to initiate the learning process.

Keywords: Model predictive control; Dubins path; multi-player games; path planning

1. INTRODUCTION

The recent success of reinforcement learning, which has received worldwide media attention through AlphaZero and AlphaGo, for example, has led to a trend of data-driven decision making and controller design in the control community and in related fields. Excellent references on reinforcement learning and control theory include the recent books Bertsekas (2019) and Bertsekas (2022), for example.

While the success and the impact of reinforcement learning are undeniable, in some areas of application solely relying on new data-driven controller designs might not lead to the promised results when applied to dynamical systems described through differential equations (or difference equations) and defined over a continuous state space, as outlined in detail in Alamir (2022). In particular, while the number of possible moves in chess is very large, it is not comparable to the number of solutions of a dynamical system acting over a continuous state space $x \in \mathbb{R}^n$ ($n \in \mathbb{N}$) and a continuous input space $u \in \mathbb{R}^m$ ($m \in \mathbb{N}$). Said differently, "[...] if the initial ambitious promises of [reinforcement learning] are to be fulfilled, namely computing a stochastically optimal policy over a continuous set of states, the curse of dimensionality


is unavoidable. [Reinforcement learning] inherits the standard limitations of Approximate Dynamic Programming (ADP). The beauty of the gradient theorem and the universality of deep neural networks does not help overcoming this fundamental obstacle" (Alamir, 2022, Fact 1).

The last statement does not claim that reinforcement learning is not useful in control applications. However, it highlights that reinforcement learning is not a universal tool that can be applied to an arbitrary control problem under the assumption that classical controller designs are easily outperformed or that a useful control law is even obtained through the learning process. Similar observations in terms of limitations are also discussed in Sznaier et al. (2022), for example. While this is not a problem by itself, the authors of this paper agree with the statement that "[i]n some recent works, the already available and purely control related solutions are too easily forgotten when it comes to contributing in any possible way to the [reinforcement learning] buzzword induced euphoria" (Alamir, 2022, page 23).

In this paper, we outline our experiences with respect to the last statement in the context of the Aquaticus competition.¹ The Aquaticus competition is a capture the flag game where two teams consisting of autonomous (or human operated) robots compete. In particular, as outlined in detail in the next section, the task of the two teams is to grab the opponent's flag and to return it safely to their own flag position. To prevent such a maneuver, the opponent's robots can be tagged, which removes them temporarily from the game.

¹ Aquaticus Competition: https://oceanai.mit.edu/aquaticus/pmwiki/pmwiki.php?n=Main.HomePage

While we initially intended to use reinforcement learning for the controller design, we quickly realized that we lack the intuition behind reinforcement learning to be able to use it as a black box controller design methodology.² We thus stepped back and focused on purely control related solutions before considering reinforcement learning techniques (as suggested in Alamir (2022)). The controller discussed in this paper, which combines model predictive control (MPC) and Dubins curves (see Dubins (1957)) while additionally relying on heuristic switching rules, clearly outperformed competing controller designs based on reinforcement learning techniques in the 2022 Aquaticus competition. In the same way as the 2022 Aquaticus competition, the following sections focus on a 1vs1 capture the flag game.

² The authors do not claim that experts on reinforcement learning would have run into the same problems.

The paper and its contributions are structured as follows. Section 2 starts with a description of the Aquaticus competition. Section 3 discusses optimal control and MPC in the general setting of capture the flag games. In Section 4, we recall necessary results on Dubins curves for mobile robots described using unicycle dynamics. The main controller design is outlined in Section 5. The paper closes with final remarks in Section 6.

Throughout the paper, the following notation is used. The Euclidean norm of a vector $x \in \mathbb{R}^n$ is denoted by $|x|_2 = \sqrt{x^\top x}$. A ball of radius 10 (meters) centered around $p \in \mathbb{R}^2$ is denoted by $\mathcal{B}_p = \{ q \in \mathbb{R}^2 \mid |p - q|_2 \leq 10 \}$.

2. THE AQUATICUS COMPETITION

As outlined in the introduction, the Aquaticus competition is a capture the flag game where two teams consisting of autonomous and human operated marine surface vehicles compete with each other. While the exact rules are outlined on the homepage of the Aquaticus competition³, here we focus on slightly simplified rules necessary for the understanding of the rest of the paper. In addition, we focus on a purely simulation based analysis of the game and use MATLAB for illustrations instead of MOOS-IvP (see Benjamin et al. (2010)), which is used in the competition.

³ Aquaticus game mechanics: https://oceanai.mit.edu/aquaticus/pmwiki/pmwiki.php?n=Memo.AQGameMechanics

The Aquaticus competition consists of two teams, referred to as the blue team (B) and the red team (R) in the following, consisting of $n_B, n_R \in \mathbb{N}$ independent robots, respectively. Here we focus on the 1vs1 setting and assume that $n_B = n_R = 1$ in the remainder of the paper. We assume that each robot is modeled through the unicycle dynamics

$$\dot{x}(t) = \begin{bmatrix} \dot{p}_1(t) \\ \dot{p}_2(t) \\ \dot{\theta}(t) \end{bmatrix} = \begin{bmatrix} u_1(t)\cos(\theta(t)) \\ u_1(t)\sin(\theta(t)) \\ u_2(t) \end{bmatrix} = f(x(t), u(t)) \qquad (1)$$


where $p = [p_1\ p_2]^\top \in \mathbb{R}^2$ defines the position, $\theta \in \mathbb{R}$ denotes the heading, and $u = [u_1, u_2]^\top \in \mathbb{R}^2$ is the input in terms of the velocity and the angular velocity. Similarly, in the discrete-time setting, the robots are described through
$$x(k+1) = g(x(k), u(k)) = x(k) + \Delta f(x(k), u(k)) \qquad (2)$$
obtained from the continuous-time setting using an explicit Euler discretization with sampling time $\Delta > 0$. In shorthand notation, the discrete dynamics are given by $x^+ = g(x, u)$.
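For illustration, the following minimal Python sketch implements the continuous-time dynamics (1) and the explicit Euler step (2). The function names and the default sampling time are our own choices and not part of the competition setup.

```python
import numpy as np

def f(x, u):
    """Continuous-time unicycle dynamics (1): x = (p1, p2, theta), u = (speed, turn rate)."""
    p1, p2, theta = x
    u1, u2 = u
    return np.array([u1 * np.cos(theta), u1 * np.sin(theta), u2])

def g(x, u, dt=0.1):
    """Explicit Euler discretization (2) with sampling time dt > 0."""
    return x + dt * f(x, u)

# Example: one step starting at the blue flag position, heading east at speed 1.
x0 = np.array([-60.0, 0.0, 0.0])
x1 = g(x0, np.array([1.0, 0.0]))
```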

The playing field $X = [-80, 80] \times [-40, 40]$ is divided into the two halves
$$X_B = [-80, 0] \times [-40, 40] \quad \text{and} \quad X_R = [0, 80] \times [-40, 40].$$
The flags of the two teams are positioned at the points
$$F_B = [-60\ \ 0]^\top \quad \text{and} \quad F_R = [60\ \ 0]^\top.$$
The setting with six randomly positioned robots is shown in Figure 1.

Fig. 1. Playing field of the Aquaticus competition with three blue and three red robots randomly positioned in their own halves. Additionally, the flags $F_R$ and $F_B$ and ten meter radii around the flags are shown.

By taking the perspective of a blue robot B versus a red robot R, the rules of the game can be summarized as follows:

• If B and R are in the blue half, $p_B, p_R \in X_B$, and if $p_R \in \mathcal{B}_{p_B}$, then R is tagged, i.e., R is temporarily taken out of the game and the blue flag $F_B$ is returned to its position if R was carrying it.
• A robot R which is temporarily taken out of the game, by being tagged or by leaving the playing field ($p_R \notin X$), needs to satisfy $p_R \in \mathcal{B}_{F_R}$ to be reactivated.
• After tagging R, B loses its ability to tag another robot for 30 seconds.
• If B is not temporarily taken out of the game, i.e., B is not tagged and did not leave $X$, and if $p_B \in \mathcal{B}_{F_R}$, then B grabs the red flag. (Note that only one robot can carry the flag at a time.)
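These rules reduce to a handful of geometric predicates. The Python sketch below encodes them under our reading of the simplified rules; the field bounds and the 10 m radius follow the definitions above, while the function names and the boolean flags are our own.

```python
import numpy as np

F_B, F_R = np.array([-60.0, 0.0]), np.array([60.0, 0.0])   # flag positions

def in_ball(p, center, radius=10.0):
    """Membership in the ball B_center of radius 10 m around a point."""
    return np.linalg.norm(np.asarray(p) - center) <= radius

def in_field(p):
    """Membership in the playing field X = [-80, 80] x [-40, 40]."""
    return -80.0 <= p[0] <= 80.0 and -40.0 <= p[1] <= 40.0

def in_blue_half(p):
    """Membership in the blue half X_B = [-80, 0] x [-40, 40]."""
    return -80.0 <= p[0] <= 0.0 and -40.0 <= p[1] <= 40.0

def red_is_tagged(p_B, p_R, blue_can_tag=True):
    """First rule: both robots in the blue half and R inside B's tagging ball."""
    return blue_can_tag and in_blue_half(p_B) and in_blue_half(p_R) \
        and in_ball(p_R, p_B)

def blue_grabs_flag(p_B, blue_active=True):
    """Last rule: an active B (not tagged, inside X) within the red flag ball."""
    return blue_active and in_field(p_B) and in_ball(p_B, F_R)
```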

The goal of the blue team is to grab the red flag and to successfully return it to its own flag area $\mathcal{B}_{F_B}$. Nevertheless, in the 2022 Aquaticus competition points are awarded in the following situations:

• B tags R;
• B grabs the red team's flag;
• B successfully returns the red team's flag to $\mathcal{B}_{F_B}$.

With respect to the last item it is worth noting that as soon as $p_B \in X_B$ while B is carrying the flag, the red team cannot prevent B from reaching $\mathcal{B}_{F_B}$. In addition, once the blue robot reaches $\mathcal{B}_{F_B}$ with the red flag, the flag is returned to its original position and the game continues.

Even though the setting and the rules of the game can be explained in a couple of paragraphs, analyzing the game and deriving control strategies even for the simplified 1vs1 setting is surprisingly complex. In the following sections we investigate fundamental properties of the competition and outline control strategies. To simplify the notation in the following, the robot of the blue team and the robot of the red team are referred to as B and R, respectively. Additionally, with a slight abuse of notation, $F_B$ and $F_R$ will be used to refer to the flag positions and to the flags themselves.

3. OPTIMAL AND MODEL PREDICTIVE CONTROL FORMULATIONS

To start the discussion on strategies for the Aquaticus competition, we take the perspective of the blue team and consider the optimal control problem (OCP)
$$u_B^\star(\cdot) = \arg\min \int_0^T \ell(x_B(t), u_B(t); x_R(\cdot), \Omega(\cdot))\, dt \qquad (3a)$$
$$\text{s.t.} \quad x_B(0) = x_{B,0}, \quad \dot{x}_B(t) = f(x_B(t), u_B(t)) \ \ \forall t \in [0, T], \quad u_B(t) \in U \ \ \forall t \in [0, T]. \qquad (3b)$$

The solution of (3)⁴ defines an open-loop input $u_B^\star(\cdot) : [0, T] \to U$, which is optimal with respect to the selected cost function (3a) under the constraints (3b). The optimal open-loop input $u_B^\star(\cdot)$ implicitly defines the corresponding open-loop trajectory, denoted by $x_B^\star(\cdot) : [0, T] \to \mathbb{R}^3$. Here, $T \in \mathbb{R}_{\geq 0} \cup \{\infty\}$ denotes the prediction horizon and $\ell : \mathbb{R}^3 \times \mathbb{R}^2 \to \mathbb{R}$ denotes the stage cost. The stage cost captures the points awarded in the game. In addition to the pair $(x_B, u_B)$, the stage cost $\ell$ additionally depends on the opponent's strategy, i.e., the (unknown) trajectory $x_R(\cdot)$ of R, as well as on discrete states of the game captured in the (unknown) function $\Omega(\cdot) : \mathbb{R}_{\geq 0} \to \{0, 1\}^q$, for $q \in \mathbb{N}$. As an example, a component of the function $\Omega(\cdot)$ could be defined as follows:
$$\Omega(t) = \begin{cases} 1, & F_R \text{ is at its position at time } t \in \mathbb{R}_{\geq 0}, \\ 0, & \text{B is carrying } F_R \text{ at time } t \in \mathbb{R}_{\geq 0}. \end{cases}$$
Similarly, $\Omega(\cdot)$ can capture whether a robot is tagged or whether a robot currently has its tagging capabilities according to the rules (see Section 2).

Note that, due to the fact that $\ell$ depends on the red team's strategy and on the discrete states, $u_B^\star(\cdot)$ also depends on $x_R(\cdot)$ and $\Omega(\cdot)$. This dependency is omitted here for simplicity of notation. Under the natural assumption that $x_R(\cdot)$ is not known to B, (3) constitutes a Stackelberg game (which is discussed in Hansen and Sargent (2011), for example). If $x_R(\cdot)$ is known to B, the strategies of B and R are decoupled.

Instead of maximizing a reward (i.e., maximizing the points awarded to a team), which is common in the reinforcement learning literature, we take the control perspective and minimize costs, i.e., we minimize the negative of the reward function. For clarification, see (Bertsekas, 2019, Chapter 1) on the terminology of reinforcement learning and optimal control, for example.

⁴ For simplicity, we assume that the OCP is defined in such a way that an optimal solution exists. If the optimal solution is not unique, $u_B^\star(\cdot)$ denotes an arbitrary optimal solution.

Since $x_R(\cdot)$ and $\Omega(\cdot)$ are unknown to B, in general at best a suboptimal solution of (3) can be calculated. Note that $\Omega(\cdot)$ is unknown to B since it potentially depends on R's strategy. A strategy to overcome this issue is to make additional assumptions on R's strategy and to use Model Predictive Control (MPC) (see Grüne and Pannek (2017) and Rawlings et al. (2017), for example) to iteratively compute a feedback law defining the strategy for B. The basic MPC algorithm is given in Algorithm 1.

For $k = 0, 1, 2, \ldots$
(1) Measure the current state and define $x_{B,0} = x_B(\Delta k)$.
(2) Assume $x_R(\cdot)$ is known and solve OCP (3) to obtain the open-loop input $u_B^\star(\cdot) : [0, T] \to U$.
(3) For $\Delta \in (0, T]$, define the feedback law
$$\mu_B(t; x_{B,0}) = u_B^\star(t), \quad t \in [0, \Delta]. \qquad (4)$$
(4) Compute $x_B(\Delta(k+1))$ as the solution of $\dot{x}_B(t) = f(x_B(t), \mu_B(t; x_{B,0}))$, $x_{B,0} = x_B(\Delta k)$, increment $k$ to $k+1$ and go to (1).

Algorithm 1. Model Predictive Control (MPC)
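A minimal receding-horizon skeleton of Algorithm 1 might look as follows in Python. The functions `measure_red` and `solve_ocp` are placeholders for the state measurement and for any numerical solution of the (discretized) OCP (3); neither is specified by the competition and both are assumptions of this sketch.

```python
import numpy as np

def mpc_loop(x_B0, measure_red, solve_ocp, f, dt=0.5, steps=100):
    """Receding-horizon loop of Algorithm 1 (a sketch; no solver included).

    measure_red(k): returns the measured red state x_R(dt * k) (step (1)).
    solve_ocp(x_B, x_R): assumed to return the open-loop input u*(t) of a
        discretized version of OCP (3) as a callable on [0, T], computed,
        e.g., under the guess that x_R(t) is constant over the horizon.
    """
    x_B = np.asarray(x_B0, dtype=float)
    trajectory = [x_B.copy()]
    for k in range(steps):
        x_R = measure_red(k)              # step (1): measure current states
        u_star = solve_ocp(x_B, x_R)      # step (2): open-loop solution
        u = u_star(0.0)                   # step (3): feedback mu_B on [0, dt]
        x_B = x_B + dt * f(x_B, u)        # step (4): Euler forward simulation
        trajectory.append(x_B.copy())
    return np.array(trajectory)
```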

A straightforward guess for the unknown trajectory $x_R(\cdot)$ is to assume that it is constant, i.e., $x_R(t) = x_{R,0}$ for all $t \in [0, T]$. Since only a portion $\Delta \in (0, T]$ of the optimal open-loop solution $u_B^\star(\cdot)$ is used to define the feedback law $\mu_B$ before $u_B^\star(\cdot)$ is recomputed, this intuitively compensates, up to some level, for $x_R(\cdot)$ not being known. In the same way as in the definition of $u_B^\star(\cdot)$, the dependence of the feedback law $\mu_B(\cdot)$ on $x_R(\cdot)$ and $\Omega(\cdot)$ is ignored in (4). Instead of the assumption that $x_R(\cdot)$ is constant, more complicated strategies can be taken into account through robust or stochastic MPC (see (Rawlings et al., 2017, Chapter 3) and Kouvaritakis and Cannon (2016), for example).

To be able to solve the OCP (3) numerically, it might be necessary to consider a discrete-time version of the optimization problem, where the integral in the cost function is replaced by a sum, the dynamics are replaced by their discrete-time counterpart (2), and the input $u_B(\cdot)$ is assumed to be piecewise constant.
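As a sketch of this discretization, the running cost of (3) can be evaluated along the Euler dynamics (2) for a given piecewise constant input sequence. The stage cost is passed in as a callable since its concrete form (including the dependence on $x_R(\cdot)$ and $\Omega(\cdot)$) is problem specific; everything here is illustrative, not a prescribed implementation.

```python
import numpy as np

def discrete_cost(x0, u_seq, stage_cost, dt=0.5):
    """Discretized cost of OCP (3): the integral becomes a Riemann sum and
    the dynamics are propagated with the Euler step (2).

    stage_cost(x, u) stands in for l(x, u; x_R, Omega) with the opponent's
    trajectory and the discrete game states fixed by a chosen guess.
    """
    def f(x, u):  # unicycle dynamics (1)
        return np.array([u[0] * np.cos(x[2]), u[0] * np.sin(x[2]), u[1]])

    x, cost = np.asarray(x0, dtype=float), 0.0
    for u in u_seq:
        cost += dt * stage_cost(x, u)   # summand of the discretized integral
        x = x + dt * f(x, u)            # discrete dynamics (2)
    return cost
```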

4. MINIMUM TIME OPTIMAL CONTROL AND DUBINS CURVES

In classical control applications, the stage cost $\ell$ in (3) is usually defined in such a way that the distance to a reference point is penalized. In the Aquaticus competition, the natural selection of $\ell$ would consist of delta distributions awarding points for discrete-time events. To obtain cost functions which are tractable when it comes to the numerical solution of OCP (3), we consider alternative stage costs, approximating the goals of the game and dividing the game into different components. Specifically, we use Dubins curves (see Dubins (1957)), which can be characterized through solutions of minimum time and shortest path OCPs.

Consider the minimum time OCP
$$(u^\star(\cdot), T^\star) = \arg\min \int_0^T 1\, dt \quad \text{s.t.} \quad x(0) = x_0, \quad x(T) = x_G, \quad \dot{x}(t) = f(x(t), u(t)) \ \ \forall t \in [0, T], \quad u(t) \in U \ \ \forall t \in [0, T], \qquad (5)$$

returning the optimal open-loop input and the minimal time $T^\star \in \mathbb{R}_{\geq 0}$ to transition from an initial state $x_0$ to an arbitrary target state $x_G \in \mathbb{R}^3$ while satisfying the input constraints $U$. The corresponding optimal trajectory is denoted by $x^\star(\cdot)$, i.e., $x^\star(0) = x_0$ and $\dot{x}^\star(t) = f(x^\star(t), u^\star(t))$ for almost all $t \in [0, T^\star]$.

Theorem 1. ((Lynch and Park, 2017, Thm. 13.11)). Consider the dynamics (1) together with the input constraints
$$U = \{c_1\} \times [-c_2, c_2] \qquad (6)$$
for $c_1, c_2 \in \mathbb{R}_{>0}$. Additionally, consider the OCP (5) for arbitrary $x_0, x_G \in \mathbb{R}^3$. Then the shortest path $x^\star(\cdot)$ corresponding to the optimal solution of (5) consists only of arcs at the minimum turning radius $r = \frac{c_1}{c_2}$ and straight-line segments. Moreover, denoting a circular arc segment as $C$ and a straight-line segment as $S$, the shortest path between any two configurations follows either

• the sequence $CSC$; or
• the sequence $CC_\alpha C$, where $C_\alpha$ indicates a circular arc of angle $\alpha > \pi$.

Any of the $C$ or $S$ segments can be of length zero. ⌟

The original result was published in Dubins (1957). Based on this result, the optimal open-loop input can be written as
$$u_{B,1}^\star(t) = c_1, \quad t \in [0, T^\star], \qquad u_{B,2}^\star(t) = \begin{cases} \omega_1, & t \in [0, t_1), \\ \omega_2, & t \in [t_1, t_2), \\ \omega_3, & t \in [t_2, T^\star], \end{cases} \qquad (7)$$
for $t_1, t_2 \in [0, T^\star]$ selected appropriately and $\omega_i \in \{-c_2, 0, c_2\}$. For (6) defined in Theorem 1, the minimal turning radius is given by
$$r = \frac{c_1}{c_2}. \qquad (8)$$
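The structure of (7) can be checked numerically: simulating the dynamics (1) under constant speed $c_1$ and a piecewise constant angular velocity $\omega_i \in \{-c_2, 0, c_2\}$ traces out exactly the arc and straight-line segments of Theorem 1. The switching times in the Python sketch below are illustrative and not the solution of any particular OCP.

```python
import numpy as np

c1, c2 = 1.0, 1.0          # speed and angular-velocity bound; r = c1 / c2 = 1

def u_star(t, t1=1.5, t2=3.0, omegas=(c2, 0.0, -c2)):
    """Bang-bang input (7): constant speed c1, piecewise constant turn rate."""
    if t < t1:
        return np.array([c1, omegas[0]])   # first arc C
    if t < t2:
        return np.array([c1, omegas[1]])   # straight segment S (omega = 0)
    return np.array([c1, omegas[2]])       # final arc C

# Euler simulation of (1) under u_star traces out a CSC path.
dt, T_end = 0.01, 4.5
x = np.array([0.0, 0.0, 0.0])
path = [x.copy()]
for k in range(int(T_end / dt)):
    u = u_star(k * dt)
    x = x + dt * np.array([u[0] * np.cos(x[2]), u[0] * np.sin(x[2]), u[1]])
    path.append(x.copy())
```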

The input constraints (6) imply that the robot is driving with a constant speed, but with a degree of freedom in the angular velocity. The constraints (6) are consistent with the setup of the 2022 Aquaticus competition. The same result is true for the slightly more general input constraints
$$U = \left\{ u \in \mathbb{R}^2 \,\middle|\, 0 \leq u_1 \leq c_1, \ |u_2| \leq \tfrac{c_2}{c_1} u_1 \right\}, \qquad (9)$$
i.e., the shortest path can be characterized through Theorem 1 and the shortest path can be followed in minimum time by selecting the input (7). The minimal turning radius $r$ does not change.

Two illustrations of Dubins curves following the sequences $CSC$ and $CC_\alpha C$, respectively, are shown in Figure 2. The shortest path calculation in Theorem 1 can be done analytically or geometrically and is numerically inexpensive (LaValle, 2006, Chapter 15.3.1). Additionally, extensions to compute reachable sets based on Dubins paths exist; see Patsko et al. (2003). Shortest path and minimal time results as in Theorem 1 with a simple geometric interpretation are not restricted to constraints of the form (6), (9). Results for alternative definitions of (6), (9) leading to similar (but more complicated) switching strategies can be found in (Lynch and Park, 2017, Section 13.3.3.1), for example.

Fig. 2. Illustration of the two cases of Dubins paths in Theorem 1 with turning radius $r = 1$.

5. CONTROLLER DESIGNS

In this section we discuss strategies for the 1vs1 Aquaticus competition through controller designs relying on MPC, minimal time (and shortest path) OCPs (5) and variations of Theorem 1. We start with a discussion of defensive strategies before we extend the purely defensive control laws to include maneuvers capturing the opponent's flag.

5.1 Purely defensive strategies

As a first observation, note that under the assumption that the input constraints are given by (9), B can simply stay at the flag position, $p_B(t) = F_B$ for all $t \in \mathbb{R}_{\geq 0}$. In this case it is impossible for R to grab $F_B$ and it is impossible for B to be tagged. Consequently, the blue team cannot lose. Similarly, if the constraints (6) are used but both teams stick to a purely defensive strategy, i.e., both robots never leave their half of the playing field, then the game will naturally end in a draw. Accordingly, we have recommended changing the rules of the game. Note that games like basketball and handball, for example, have rules in place that prevent purely defensive strategies. In particular, the team in possession of the ball only has a limited amount of time to try to score before being penalized.

The first interesting setting occurs when R tries to attack, i.e., $x_R(t) \in X_B$ for some $t \in \mathbb{R}_{\geq 0}$, while B stays in its own half, $x_B(t) \in X_B$ for all $t \in \mathbb{R}_{\geq 0}$. Here, we take the perspective of the defending blue team. For the controller design we consider a slight variation of OCP (5), where only the final position $p_G \in \mathbb{R}^2$, but not the final orientation, is fixed:
$$(u^\star(\cdot), T^\star) = \arg\min \int_0^T 1\, dt \quad \text{s.t.} \quad x(0) = x_0, \quad p(T) = p_G, \quad \dot{x}(t) = f(x(t), u(t)) \ \ \forall t \in [0, T], \quad u(t) \in U \ \ \forall t \in [0, T]. \qquad (10)$$

In this case, Theorem 1 immediately implies the following result.

Corollary 2. Consider the dynamics (1) together with the input constraints (6) for $c_1, c_2 \in \mathbb{R}_{>0}$. Additionally, consider OCP (10) for arbitrary $x_0 \in \mathbb{R}^3$ and $p_G \in \mathbb{R}^2$. Then the shortest path $x^\star(\cdot)$ corresponding to the optimal solution of (10) consists only of the sequence $CS$ or $CC$, where any of the segments can be of length zero. ⌟

Proof. Consider the set of shortest paths corresponding to $x(0) \in \mathbb{R}^3$ and $p_G \in \mathbb{R}^2$ fixed, parameterized by the final orientation $\theta_G \in [0, 2\pi)$. This set is characterized through Theorem 1. It then follows that, among these paths, the shortest are the ones which do not require the final turn $C$ in their maneuver. $\square$

Corollary 2 can be used to define a controller intercepting R on its way to $F_B$. In particular, we consider the controller design in Algorithm 2, relying on OCP (10) with respect to the trajectory of R as well as the trajectory of B.

For $k = 0, 1, 2, \ldots$
(1) Define $x_{B,0} = x_B(\Delta k)$, $x_{R,0} = x_R(\Delta k)$.
(2) Solve (10) with $x_R(0) = x_{R,0}$ and $p_R(T) = F_B$.
(3) Compute $x_R^\star(\delta)$ for $\delta > 0$.
(4) Solve (10) with $x_B(0) = x_{B,0}$ and $p_B(T) = p_R^\star(\delta)$.
(5) Define the feedback law
$$\mu_B(t; x_{B,0}) = u_B^\star(t), \quad t \in [0, \Delta], \quad \Delta > 0. \qquad (11)$$
(6) Compute $x_B(\Delta(k+1))$ as the solution of $\dot{x}_B(t) = f(x_B(t), \mu_B(t; x_{B,0}))$, $x_{B,0} = x_B(\Delta k)$, increment $k$ to $k+1$ and go to (1).

Algorithm 2. Intercepting controller

We assume that R tries to approach $F_B$ on the shortest path characterized through (10). Since the shortest path only relies on knowledge of the initial condition $x_{R,0}$, the shortest path $x_R^\star(\cdot)$ can be computed by the blue team. Based on $x_R^\star(\cdot)$, the position $p_R^\star(\delta)$ for $\delta > 0$ can be estimated. Hence, B can track $p_R^\star(\delta)$ by again using OCP (10). By implementing the control law in a receding horizon fashion, as suggested in Algorithm 2, the reference point $p_R^\star(\delta)$ is updated regularly. Moreover, if the assumption on the trajectory of R is significantly wrong, i.e., R is not approaching $F_B$, then the blue team is not in danger.

Of course, Algorithm 2 is only a valid strategy in the case that the condition $x_R \in X_B$ is satisfied. We thus propose the following switching strategy, changing the goal of the blue team:

• If $x_R(0) \in X_B$ and R is not tagged, then follow the steps in Algorithm 2.
• Otherwise, replace $p_R^\star(\delta)$ in step (4) of Algorithm 2 with $F_B$.

The second item ensures that the blue robot stays in a neighborhood around its flag whenever the red team is not attacking. Note that the reference point $F_B$ can be replaced by an alternative strategy. According to Corollary 2, tracking the reference point $F_B$ leads to circular motions of the robot with radius $r = \frac{c_1}{c_2}$, going through $F_B$ every $\frac{2\pi}{c_2}$ time units.
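A compact sketch of the resulting defensive logic is given below. The helper `dubins_state_at` is hypothetical: it stands for any routine returning the state reached after $t$ seconds along the shortest path of OCP (10), e.g., constructed from the $CS$/$CC$ segments of Corollary 2.

```python
import numpy as np

F_B = np.array([-60.0, 0.0])   # blue flag position

def blue_reference(x_B, x_R, red_active_in_XB, dubins_state_at, delta=2.0):
    """Reference point for step (4) of Algorithm 2 with the switching rule.

    dubins_state_at(x0, p_goal, t): hypothetical oracle returning the state
    x*(t) on the shortest path of OCP (10) from x0 to the position p_goal.
    red_active_in_XB: True if x_R is in X_B and R is not tagged.
    """
    if red_active_in_XB:
        x_R_pred = dubins_state_at(x_R, F_B, delta)   # steps (2)-(3): predict R
        return x_R_pred[:2]                           # track p_R^*(delta)
    return F_B                                        # otherwise guard the flag
```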

Remark 1. Note that the controller design proposed in this section is only a heuristic method, which, for example, does not ensure that B stays inside the playing field. A rigorous strategy switching between defending the flag and tagging the opponent's robot depending on the parameters $c_1$ and $c_2$ will be investigated in future work. ◦

5.2 Attacking controller augmentation

In the preceding section, a defensive controller with the goal to either tag R or to ensure that R does not get close to $F_B$ has been described. Since a tagged robot is temporarily taken out of the game, the red team is temporarily defenseless if B successfully tags R through the strategy derived in Section 5.1. In this section we discuss how this information can be used by B to perform a maneuver capturing $F_R$ and successfully returning it to $X_B$ without the risk of being tagged.

To this end, note that the minimal time $T^\star : \mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}_{\geq 0}$ as a solution of (5) can be written as a mapping $(x_0, x_G) \mapsto T^\star(x_0, x_G)$ depending on the initial condition and the target state. Using this notation, the minimal time to capture $F_R$ and to return it to $X_B$ is upper bounded by
$$T^B_{cF} = \min\{ T^\star(x_{B,0}, v_1) + T^\star(v_1, v_2), \ T^\star(x_{B,0}, w_1) + T^\star(w_1, w_2) \}$$
where $v_1$, $v_2$, $w_1$ and $w_2$ denote the reference points
$$v_1 = \begin{bmatrix} F_{R,1} - 10 \\ F_{R,2} \\ \frac{\pi}{2} \end{bmatrix}, \quad v_2 = \begin{bmatrix} 0 \\ \frac{c_1}{c_2} \\ -\pi \end{bmatrix}, \quad w_1 = \begin{bmatrix} F_{R,1} - 10 \\ F_{R,2} \\ -\frac{\pi}{2} \end{bmatrix}, \quad w_2 = \begin{bmatrix} 0 \\ -\frac{c_1}{c_2} \\ -\pi \end{bmatrix}.$$

Without loss of generality (based on symmetry arguments), assume that $T^B_{cF}$ corresponds to the solution going through the reference points $v_1$ and $v_2$. We denote the corresponding optimal trajectory by $x_B^\star(\cdot)$. Under this assumption, considering the same time instance from the perspective of the red team, when R is tagged, an estimate of the minimal time to regain the tagging property and to catch B is given by
$$T^R_{cB} = T^\star(x_{R,0}, v_1) + T^\star(v_1, v_2),$$
with the corresponding trajectory denoted by $x_R^\star(\cdot)$.

If $T^B_{cF} \geq T^R_{cB}$, the condition
$$|p_R^\star(t) - p_B^\star(t)|_2 \geq 10 \quad \forall t \in [T^\star(x_{R,0}, v_1), T^R_{cB}], \qquad (12)$$
(which is straightforward to verify using a sufficiently fine time discretization of the two trajectories) ensures that R is unable to tag B.
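Given sampled predictions of both trajectories, condition (12) reduces to a minimum-distance test over the relevant time window, as in the following sketch; the trajectories themselves would come from the Dubins constructions above and are assumed precomputed here.

```python
import numpy as np

def maneuver_is_safe(p_B_traj, p_R_traj, t_grid, t_start, t_end, margin=10.0):
    """Check condition (12) on a discrete time grid.

    p_B_traj, p_R_traj: arrays of shape (N, 2) with positions sampled on t_grid.
    t_start, t_end: the window [T*(x_R0, v1), T^R_cB] from the text.
    margin: tagging radius; increasing it gives a more conservative test
            (cf. Remark 2 below).
    """
    mask = (t_grid >= t_start) & (t_grid <= t_end)
    distances = np.linalg.norm(p_B_traj[mask] - p_R_traj[mask], axis=1)
    return bool(np.all(distances >= margin))
```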

The maneuver is illustrated in Figure 3 for two selections of the minimal turning radius $r$, in terms of $c_1 = 15$, $c_2 = 2$ and $c_1 = 10$, $c_2 = 2$, respectively. In Figure 4, we can observe that for $r = 7.5$, B successfully returns with the flag, while in the case $r = 5$ the maneuver of B is unsuccessful and B is tagged by R.

Since all the calculations necessary to verify whether such a maneuver will be successful can be performed online, B can switch from the defensive strategy in Section 5.1 to the capture the flag maneuver if the predictions are promising. Once the maneuver is over, B is able to switch back to a defensive strategy.

Remark 2. The strategies leading to $x_B^\star(\cdot)$ and $x_R^\star(\cdot)$ shown in Figures 3 and 4 are not necessarily the best strategies for both teams. To be more conservative from the perspective of B, the bound 10 in (12), determining whether a maneuver is predicted to be successful or not, can be increased. ◦

Fig. 3. Illustration of a maneuver where B tags R, captures $F_R$, and returns to $X_B$ with the flag. Similarly, R returns to its flag area to regain the tagging property before trying to catch B. On the top, the minimal turning radius satisfies $r = \frac{c_1}{c_2} = \frac{15}{2} = 7.5$ and on the bottom the turning radius satisfies $r = \frac{c_1}{c_2} = \frac{10}{2} = 5$.

Fig. 4. Distances between B and R corresponding to the maneuvers in Figure 3. The black line indicates the time when R regains its tagging property. The red line indicates the critical distance between R and B. For a turning radius of $r = 7.5$, B is able to return with the flag. For $r = 5$, R tags B.

Remark 3. Note that in Figure 4 (right), B still successfully grabs the flag $F_R$ before being tagged. Thus, in the case that grabbing the flag is worth more points than tagging a robot, the maneuver can be considered successful. ◦

5.3 Optimization of the tagging angle

Section 5.1 discusses a defensive strategy focusing on tagging R or staying close to $F_B$. Section 5.2 outlines conditions which encourage B to temporarily drop the defensive strategy and to perform a capture the flag maneuver. The success of the capture the flag maneuver depends on the minimal turning radius $r$ and on the states $x_B$ and $x_R$ at the time B tags R. In particular, the initial angles $\theta_B$ and $\theta_R$ can lead to a significant advantage or disadvantage in the race to the flag $F_R$. This can be seen in Figure 4, for example, where R first needs to complete half a circle before getting closer to $F_R$, while B only needs to adjust its orientation slightly.

This information can be used in step (4) of Algorithm 2 by replacing OCP (10) with OCP (5) with target state
$$x_G = \begin{bmatrix} p_R(\delta)^\top & 0 \end{bmatrix}^\top,$$

for example. While this increases the chances of a successful capture the flag maneuver, it is also possible that B misses its chance to tag R, since B intends to tag R from a specific angle. To overcome this potential issue, one can switch between OCP (5) and (10) in step (4) of Algorithm 2. As a trigger to switch between the objectives, the relation between the times
$$T^R_{gF} = T^\star(x_{R,0}, F_B) - \tfrac{10}{c_1}, \qquad T^B_{tR} = T^\star(x_{B,0}, p_{R,0}) - \tfrac{10}{c_1},$$
representing the minimal time for R to grab $F_B$ and the minimal time for B to tag R, respectively, can be used. Here, with a slight abuse of notation, $T^\star : \mathbb{R}^3 \times \mathbb{R}^2 \to \mathbb{R}_{\geq 0}$ corresponds to (10) instead of (5).

• If $T^R_{gF} - T^B_{tR} > 0$ is large, i.e., B can tag R well before R can reach $F_B$, B can focus on improving the tagging angle (i.e., use OCP (5));
• otherwise, B needs to focus on tagging R as quickly as possible (i.e., use OCP (10)).

The term large, triggering the switching behavior, needs to be analytically investigated in future work.
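The trigger can be prototyped directly from the two times, as in the sketch below; the threshold standing in for "large" is a heuristic tuning parameter of our own choosing, whose principled selection is exactly the future work mentioned above.

```python
def choose_ocp(T_tag_R, T_grab_F, threshold=3.0):
    """Heuristic trigger for step (4) of Algorithm 2.

    T_tag_R:  minimal time for B to tag R      (T^B_tR)
    T_grab_F: minimal time for R to grab F_B   (T^R_gF)
    threshold: slack (in seconds) counted as 'large'; an assumed tuning knob.
    """
    if T_grab_F - T_tag_R > threshold:
        return "OCP (5)"    # enough slack: also prescribe the tagging angle
    return "OCP (10)"       # urgent: tag R as fast as possible, angle free
```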

6. CONCLUDING REMARKS

A variation of the controller design outlined in this paper has outperformed reinforcement-learning-based approaches in the 2022 Aquaticus competition. Instead of trying to learn optimal control strategies relying only on the rules of the game and the robot dynamics, we have defined suboptimal control laws tuned for particular components of the game (i.e., attacking or defending), simply by considering which maneuvers are achievable by a single robot. The control laws are based on classical results in terms of Dubins curves and switching strategies. While the results are currently restricted to the 1vs1 setting, future work will focus on extensions to collaborative controllers for teams consisting of multiple robots. Moreover, even though we have not used reinforcement learning in the controller design discussed here, incorporating Dubins curves and their optimality properties in learning-based controller designs will be a second stream of future work.

REFERENCES

Alamir, M. (2022). Learning against uncertainty in control engineering. Annual Reviews in Control, 53, 19–29.
Benjamin, M.R., Schmidt, H., Newman, P.M., and Leonard, J.J. (2010). Nested autonomy for unmanned marine vehicles with MOOS-IvP. Journal of Field Robotics, 27(6), 834–875.
Bertsekas, D. (2019). Reinforcement Learning and Optimal Control. Athena Scientific.
Bertsekas, D. (2022). Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific.
Dubins, L.E. (1957). On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of Mathematics, 79(3), 497–516.
Grüne, L. and Pannek, J. (2017). Nonlinear Model Predictive Control. Springer, 2nd edition.
Hansen, L.P. and Sargent, T.J. (2011). Robustness. Princeton University Press.
Kouvaritakis, B. and Cannon, M. (2016). Model Predictive Control. Springer International Publishing.
LaValle, S.M. (2006). Planning Algorithms. Cambridge University Press.
Lynch, K.M. and Park, F.C. (2017). Modern Robotics. Cambridge University Press.
Patsko, V.S., Pyatko, S.G., and Fedotov, A.A. (2003). Three-dimensional reachability set for a nonlinear control system. Journal of Computer and Systems Sciences International, 42(3), 320–328.
Rawlings, J.B., Mayne, D.Q., and Diehl, M. (2017). Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, 2nd edition.
Sznaier, M., Olshevsky, A., and Sontag, E.D. (2022). The role of systems theory in control oriented learning. In 25th International Symposium on Mathematical Theory of Networks and Systems. (Extended abstract).