Capture the flag games: Observations from
the 2022 Aquaticus competition
Philipp Braun ∗, Iman Shames ∗, David Hubczenko ∗∗, Anna Dostovalova ∗∗, Bradley Fraser ∗∗
∗ School of Engineering, Australian National University, Canberra, Australia (e-mail: {philipp.braun,iman.shames}@anu.edu.au).
∗∗ Defence Science and Technology Group, Adelaide, Australia (e-mail: {anna.dostovalova,bradley.fraser1,david.hubczenko}@defence.gov.au).
⋆ This research was funded by DSTG DWE STaR Shot Program. P. Braun and I. Shames are additionally supported by the CIICADA Lab.
Abstract: Undoubtedly, reinforcement learning has gained considerable momentum in recent years through successful applications in various fields and by being prominently publicized worldwide through applications such as AlphaZero and AlphaGo. Despite its success, the potential and limitations of reinforcement learning, particularly in control applications, are only beginning to be rigorously understood. In particular, the curse of dimensionality arising in system identification and controller design for control systems acting in continuous state and input spaces in (model-free) reinforcement learning seems to be a crucial barrier yet to be overcome. In this paper we illustrate these observations in terms of classical controller designs versus reinforcement learning, based on our experiences from the 2022 Aquaticus competition. The Aquaticus competition is a capture the flag multi-player game where two teams of (autonomous) robots compete with each other. We present a simple controller that combines ideas from optimal control, model predictive control and path planning using Dubins curves. The corresponding controller outperformed the reinforcement learning based controllers of the other teams in the Aquaticus competition. While reinforcement learning might be well suited for a competitive controller design in future competitions, a good understanding of the problem seems to be necessary to initiate the learning process.
Keywords: Model predictive control; Dubins path; multi-player games; path planning
1. INTRODUCTION
The recent success of reinforcement learning, which has received worldwide media attention through AlphaZero and AlphaGo, for example, has led to a trend of data-driven decision making and controller design in the control community and in related fields. Excellent references on reinforcement learning and control theory include the recent books Bertsekas (2019) and Bertsekas (2022), for example.
While the success and the impact of reinforcement learning are undeniable, in some areas of application, solely relying on new data-driven controller designs might not lead to the promised results when applied to dynamical systems described through differential equations (or difference equations) and defined over a continuous state space, as outlined in detail in Alamir (2022). In particular, while the number of possible moves in chess is very large, it is not comparable to the number of solutions of a dynamical system acting over a continuous state space $x \in \mathbb{R}^n$ ($n \in \mathbb{N}$) and a continuous input space $u \in \mathbb{R}^m$ ($m \in \mathbb{N}$). Said differently, “[...] if the initial ambitious promises of [reinforcement learning] are to be fulfilled, namely computing a stochastically optimal policy over a continuous set of states, the curse of dimensionality is unavoidable. [Reinforcement learning] inherits the standard limitations of Approximate Dynamic Programming (ADP). The beauty of the gradient theorem and the universality of deep neural networks does not help overcoming this fundamental obstacle” (Alamir, 2022, Fact 1).
The last statement does not claim that reinforcement learning is not useful in control applications. However, it
highlights that reinforcement learning is not a universal
tool that can be applied to an arbitrary control problem
with the assumption that classical controller designs are
easily outperformed or that even a useful control law is
obtained through the learning process. Similar observa-
tions in terms of limitations are also discussed in Sznaier
et al. (2022), for example. While this is not a problem by
itself, the authors of this paper agree with the statement
that “[i]n some recent works, the already available and
purely control related solutions are too easily forgotten
when it comes to contributing in any possible way to
the [reinforcement learning] buzzword induced euphoria”,
(Alamir, 2022, page 23).
In this paper, we outline our experiences with respect to the last statement in the context of the Aquaticus competition.¹ The Aquaticus competition is a capture the flag game where two teams consisting of autonomous (or human operated) robots compete. In particular, as outlined in detail in the next section, the task of the two teams is to grab the opponent's flag and to return it safely to their own flag position. To prevent such a maneuver, the opponent's robots can be tagged, which removes them temporarily from the game.
¹ Aquaticus Competition: https://oceanai.mit.edu/aquaticus/pmwiki/pmwiki.php?n=Main.HomePage
While we initially intended to use reinforcement learning for the controller design, we quickly realized that we² lack the intuition behind reinforcement learning to be able to use it as a black box controller design methodology. We thus stepped back and focused on purely control related solutions before considering reinforcement learning techniques (as suggested in Alamir (2022)). The controller discussed in this paper, which combines model predictive control (MPC) and Dubins curves (see Dubins (1957)) while additionally relying on heuristic switching rules, clearly outperformed competing controller designs based on reinforcement learning techniques in the 2022 Aquaticus competition. As in the 2022 Aquaticus competition, the following sections focus on a 1vs1 capture the flag game.
The paper and its contributions are structured as follows.
Section 2 starts with a description of the Aquaticus competition. Section 3 discusses optimal control and MPC in
the general setting of capture the flag games. In Section
4, we recall necessary results on Dubins curves for mobile
robots described using unicycle dynamics. The main con-
troller design is outlined in Section 5. The paper is closed
with final remarks in Section 6.
Throughout the paper, the following notation is used.
The Euclidean norm of a vector $x \in \mathbb{R}^n$ is denoted by $|x|_2 = \sqrt{x^\top x}$. A ball of radius 10 (meters) centered around $p \in \mathbb{R}^2$ is denoted by $\mathcal{B}_p = \{q \in \mathbb{R}^2 \mid |p - q|_2 \le 10\}$.
2. THE AQUATICUS COMPETITION
As outlined in the introduction, the Aquaticus competition
is a capture the flag game where two teams consisting of
autonomous and human operated marine surface vehicles
compete with each other. While the exact rules are out-
lined on the homepage of the Aquaticus competition,³
here we focus on slightly simplified rules necessary for the
understanding of the rest of the paper. In addition, we
focus on a purely simulation based analysis of the game
and use MATLAB for illustrations instead of MOOS-IvP
(see Benjamin et al. (2010)) used in the competition.
The Aquaticus competition consists of two teams, referred to as the blue team (B) and the red team (R) in the following, consisting of $n_B, n_R \in \mathbb{N}$ independent robots, respectively. Here we focus on the 1vs1 setting and assume that $n_B = n_R = 1$ in the remainder of the paper. We assume that each robot is modeled through the unicycle dynamics
$$\dot{x}(t) = \begin{bmatrix} \dot{p}_1(t) \\ \dot{p}_2(t) \\ \dot{\theta}(t) \end{bmatrix} = \begin{bmatrix} u_1(t)\cos(\theta(t)) \\ u_1(t)\sin(\theta(t)) \\ u_2(t) \end{bmatrix} = f(x(t), u(t)) \qquad (1)$$
where $p = [p_1\ p_2]^\top \in \mathbb{R}^2$ defines the position, $\theta \in \mathbb{R}$ denotes the heading and $u = [u_1, u_2]^\top \in \mathbb{R}^2$ is the input in terms of the velocity and the angular velocity. Similarly, in the discrete-time setting, the robots are described through
$$x(k+1) = g(x(k), u(k)) = x(k) + \Delta f(x(k), u(k)) \qquad (2)$$
obtained from the continuous-time setting using an explicit Euler discretization with sampling time $\Delta > 0$. In shorthand notation, the discrete dynamics are given by $x^+ = g(x, u)$.
² The authors do not claim that experts on reinforcement learning would have run into the same problems.
³ Aquaticus game mechanics: https://oceanai.mit.edu/aquaticus/pmwiki/pmwiki.php?n=Memo.AQGameMechanics
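For illustration, a minimal sketch of the continuous-time dynamics (1) and the Euler step (2) is given below; the sampling time and the example input are illustrative assumptions and not values taken from the competition.

```python
# Minimal sketch of the unicycle model (1) and its Euler discretization (2).
# The sampling time dt and the example input are illustrative assumptions.
import numpy as np

def f(x, u):
    """Continuous-time unicycle dynamics (1): x = (p1, p2, theta), u = (v, omega)."""
    p1, p2, theta = x
    v, omega = u
    return np.array([v * np.cos(theta), v * np.sin(theta), omega])

def g(x, u, dt=0.1):
    """Explicit Euler step (2) with sampling time dt > 0."""
    return np.asarray(x, dtype=float) + dt * f(x, u)

# Example: one step driving forward at 2 m/s while turning at 0.5 rad/s.
x_next = g([0.0, 0.0, 0.0], [2.0, 0.5])
```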
The playing field $X = [-80, 80] \times [-40, 40]$ is divided into two halves
$$X_B = [-80, 0] \times [-40, 40] \quad \text{and} \quad X_R = [0, 80] \times [-40, 40].$$
At the points
$$F_B = [-60\ \ 0]^\top \quad \text{and} \quad F_R = [60\ \ 0]^\top$$
the flags of the two teams are positioned. The setting with six randomly positioned robots is shown in Figure 1.
Fig. 1. Playing field of the Aquaticus competition with three blue and three red robots randomly positioned in their own halves. Additionally, the flags $F_R$ and $F_B$ and ten meter radii around the flags are shown.
By taking the perspective of a blue robot B versus a red robot R, the rules of the game can be summarized as follows:
• If B and R are in the blue half, $p_B, p_R \in X_B$, and if $p_R \in \mathcal{B}_{p_B}$, then R is tagged, i.e., R is temporarily taken out of the game and the blue flag $F_B$ is returned to its position if R was carrying it.
• A robot R which is temporarily taken out of the game, by being tagged or by leaving the playing field ($p_R \notin X$), needs to satisfy $p_R \in \mathcal{B}_{F_R}$ to be reactivated.
• After tagging R, B loses its ability to tag another robot for 30 seconds.
• If B is not temporarily taken out of the game, i.e., B is not tagged and did not leave $X$, and if $p_B \in \mathcal{B}_{F_R}$, then B grabs the red flag. (Note that only one robot can carry the flag at a time.)
The goal of the blue team is to grab the red flag and to successfully return it to its own flag area $\mathcal{B}_{F_B}$. Nevertheless, in the 2022 Aquaticus competition points are awarded in the following situations:
• B tags R;
• B grabs the red team's flag;
• B successfully returns the red team's flag to $\mathcal{B}_{F_B}$.
With respect to the last item it is worth noting that as soon as $p_B \in X_B$ while carrying the flag, the red team cannot prevent B from reaching $\mathcal{B}_{F_B}$. In addition, once the blue robot reaches $\mathcal{B}_{F_B}$ with the red flag, the flag is returned to its original position and the game continues.
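The rules above translate directly into simple set-membership tests. The following sketch encodes the tagging and flag-grab conditions using the stated field dimensions and flag positions; the function names and the boolean game-state arguments are our own simplifying assumptions.

```python
# Sketch of the tagging and flag-grab rules of Section 2 as set-membership tests.
# Field dimensions and flag positions follow the paper; function names and the
# boolean game-state arguments are illustrative assumptions.
import numpy as np

F_B, F_R = np.array([-60.0, 0.0]), np.array([60.0, 0.0])
X_B = ((-80.0, 0.0), (-40.0, 40.0))   # blue half: (x-range, y-range)

def in_region(p, region):
    (x_lo, x_hi), (y_lo, y_hi) = region
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi

def in_ball(p, center, radius=10.0):
    """Membership in the ten meter ball B_center used throughout the paper."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(center))) <= radius

def blue_tags_red(p_B, p_R, blue_can_tag=True):
    """R is tagged if both robots are in the blue half and R is within 10 m of B."""
    return blue_can_tag and in_region(p_B, X_B) and in_region(p_R, X_B) and in_ball(p_R, p_B)

def blue_grabs_flag(p_B, blue_active=True, flag_at_home=True):
    """B grabs the red flag if it is active, the flag is home and p_B lies in B_{F_R}."""
    return blue_active and flag_at_home and in_ball(p_B, F_R)
```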
Even though the setting and the rules of the game can be
explained in a couple of paragraphs, analyzing the game
and deriving control strategies even for the simplified 1vs1
setting is surprisingly complex. In the following sections
we investigate fundamental properties of the competition
and outline control strategies. To simplify the notation in
the following, the robot of the blue and the robot of the red
team are referred to as B and R, respectively. Additionally,
with a slight abuse of notation, $F_B$ and $F_R$ will be used to
refer to the flag positions and to the flags themselves.
3. OPTIMAL AND MODEL PREDICTIVE CONTROL
FORMULATIONS
To start the discussion on strategies for the Aquaticus
competition, we take the perspective of the blue team and
consider the optimal control problem (OCP)
$$\begin{aligned} u_B^*(\cdot) = \arg\min\ & \int_0^T \ell(x_B(t), u_B(t); x_R(\cdot), \Omega(\cdot))\, dt \qquad && (3a) \\ \text{s.t.}\ & x_B(0) = x_{B,0}, \\ & \dot{x}_B(t) = f(x_B(t), u_B(t)) \quad \forall\, t \in [0, T], \\ & u_B(t) \in U \quad \forall\, t \in [0, T]. && (3b) \end{aligned}$$
The solution of (3)⁴ defines an open-loop input $u_B^*(\cdot) : [0, T] \to U$, which is optimal with respect to the selected cost function (3a) under the constraints (3b). The optimal open-loop input $u_B^*(\cdot)$ implicitly defines the corresponding open-loop trajectory denoted by $x_B^*(\cdot) : [0, T] \to \mathbb{R}^3$. Here, $T \in \mathbb{R}_{\ge 0} \cup \{\infty\}$ denotes the prediction horizon and $\ell : \mathbb{R}^3 \times \mathbb{R}^2 \to \mathbb{R}$ denotes the stage cost. The stage cost captures the points awarded in the game. In addition to the pair $(x_B, u_B)$, the stage cost $\ell$ additionally depends on the opponent's strategy, i.e., the (unknown) trajectory $x_R(\cdot)$ of R, as well as discrete states of the game captured in the (unknown) function $\Omega(\cdot) : \mathbb{R}_{\ge 0} \to \{0,1\}^q$, for $q \in \mathbb{N}$. As an example, the function $\Omega(\cdot)$ could be defined as follows:
$$\Omega(t) = \begin{cases} 1, & F_R \text{ is at its position at time } t \in \mathbb{R}_{\ge 0}, \\ 0, & \text{B is carrying } F_R \text{ at time } t \in \mathbb{R}_{\ge 0}. \end{cases}$$
Similarly, $\Omega(\cdot)$ can capture whether a robot is tagged or whether a robot currently has its tagging capability according to the rules (see Section 2).
Note that, since $\ell$ depends on the red team's strategy and on the discrete states, $u_B^*(\cdot)$ also depends on $x_R(\cdot)$ and $\Omega(\cdot)$. This dependency is omitted here for simplicity of notation. Under the natural assumption that $x_R(\cdot)$ is not known to B, (3) constitutes a Stackelberg game (which is discussed in Hansen and Sargent (2011), for example). If $x_R(\cdot)$ is known to B, the strategies of B and R are decoupled.
Instead of maximizing a reward (i.e., maximizing the
points awarded to a team), which is common in the rein-
forcement learning literature, we take the control perspec-
tive and minimize costs, i.e., we minimize the opposite of
the reward function. For clarification, see (Bertsekas, 2019,
Chapter 1) on the terminology of reinforcement learning
and optimal control, for example.
⁴ For simplicity, we assume that the OCP is defined in such a way that an optimal solution exists. If the optimal solution is not unique, $u_B^*(\cdot)$ denotes an arbitrary optimal solution.
Since xR(·) and Ω(·) are unknown to B, at best a subop-
timal solution of (3) can be calculated, in general. Note
that Ω(·) is unknown to B, since it potentially depends on
R’s strategy. A strategy to overcome this issue is to make
additional assumptions on R’s strategy and to use Model
Predictive Control (MPC) (see Grüne and Pannek (2017)
and Rawlings et al. (2017), for example) to iteratively
compute a feedback law defining the strategy for B. The
basic MPC algorithm is given in Algorithm 1.
For $k = 0, 1, 2, \ldots$
(1) Measure the current state and define $x_{B,0} = x_B(\Delta k)$.
(2) Assuming $x_R(\cdot)$ is known, solve OCP (3) to obtain the open-loop input $u_B^*(\cdot) : [0, T] \to U$.
(3) For $\Delta \in (0, T]$, define the feedback law
$$\mu_B(t; x_{B,0}) = u_B^*(t), \quad t \in [0, \Delta]. \qquad (4)$$
(4) Compute $x_B(\Delta(k+1))$ as the solution of $\dot{x}_B(t) = f(x_B(t), \mu_B(t; x_{B,0}))$, $x_{B,0} = x_B(\Delta k)$, increment $k$ to $k+1$ and go to (1).
Algorithm 1. Model Predictive Control (MPC)
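A schematic rendering of this receding horizon loop is sketched below; the callables `solve_ocp` and `simulate` are hypothetical placeholders for an OCP solver and a plant simulator and are not specified in the paper.

```python
# Sketch of the receding horizon loop of Algorithm 1. The callables solve_ocp
# (returning a piecewise constant open-loop input sequence) and simulate
# (advancing the plant by one sampling interval) are hypothetical placeholders.
import numpy as np

def mpc_loop(x0, solve_ocp, simulate, n_steps):
    """Apply only the first piece of each open-loop solution, then re-plan."""
    x = np.asarray(x0, dtype=float)
    closed_loop = [x.copy()]
    for _ in range(n_steps):
        u_open_loop = solve_ocp(x)     # step (2): open-loop input on [0, T]
        u_applied = u_open_loop[0]     # step (3): feedback law (4) on [0, Delta]
        x = simulate(x, u_applied)     # step (4): advance the system by Delta
        closed_loop.append(x.copy())
    return np.array(closed_loop)
```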
A straightforward guess for the unknown trajectory $x_R(\cdot)$ is to assume that it is constant, i.e., $x_R(t) = x_{R,0}$ for all $t \in [0, T]$. Since only a portion $\Delta \in (0, T]$ of the optimal open-loop solution $u_B^*(\cdot)$ is used to define the feedback law $\mu_B$ before $u_B^*(\cdot)$ is recomputed, the receding horizon implementation intuitively compensates, to some extent, for $x_R(\cdot)$ not being known. In the same way as in the definition of $u_B^*(\cdot)$, the dependence of the feedback law $\mu_B(\cdot)$ on $x_R(\cdot)$ and $\Omega(\cdot)$ is ignored in (4). Instead of the assumption that $x_R(\cdot)$ is constant, more complicated strategies can be taken into account through robust or stochastic MPC (see (Rawlings et al., 2017, Chapter 3) and Kouvaritakis and Cannon (2016), for example).
To be able to solve the OCP (3) numerically, it might be necessary to consider a discrete-time version of the optimization problem in which the integral in the cost function is replaced by a sum, the dynamics are replaced by their discrete-time counterpart (2) and the input $u_B(\cdot)$ is assumed to be piecewise constant.
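As a concrete (hypothetical) illustration, the following sketch transcribes such a discrete-time OCP into a finite-dimensional optimization problem with a piecewise constant input; the horizon, the input bounds and the distance-to-flag stage cost are assumptions made for the example and are not the cost used in the competition.

```python
# Sketch (not the authors' implementation): direct transcription of a
# discrete-time version of OCP (3) with a hypothetical stage cost that
# penalizes the distance of B to the red flag F_R = (60, 0). Horizon,
# bounds and cost are illustrative assumptions only.
import numpy as np
from scipy.optimize import minimize

DT, N = 0.4, 15                 # sampling time and prediction horizon (assumed)
C1, C2 = 3.0, 1.5               # input bounds: 0 <= u1 <= C1, |u2| <= C2 (assumed)
F_R = np.array([60.0, 0.0])     # red flag position from Section 2

def step(x, u):
    """Euler step of the unicycle dynamics (2)."""
    px, py, th = x
    return np.array([px + DT * u[0] * np.cos(th),
                     py + DT * u[0] * np.sin(th),
                     th + DT * u[1]])

def cost(u_flat, x0):
    """Sum of stage costs along the predicted trajectory (hypothetical cost)."""
    u = u_flat.reshape(N, 2)
    x, J = np.array(x0, dtype=float), 0.0
    for k in range(N):
        x = step(x, u[k])
        J += DT * np.linalg.norm(x[:2] - F_R)   # distance-to-flag stage cost
    return J

def solve_ocp(x0):
    """Numerically solve the discretized OCP for a piecewise constant input."""
    u0 = np.tile([C1, 0.0], N)              # initial guess: drive straight ahead
    bounds = [(0.0, C1), (-C2, C2)] * N     # input constraints U
    res = minimize(cost, u0, args=(x0,), bounds=bounds, method="L-BFGS-B")
    return res.x.reshape(N, 2)

if __name__ == "__main__":
    u_opt = solve_ocp([-60.0, 0.0, 0.0])    # B starts at its own flag
    print("first input of the open-loop solution:", u_opt[0])
```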
4. MINIMUM TIME OPTIMAL CONTROL AND
DUBINS CURVES
In classical control applications the stage cost $\ell$ in (3) is usually defined in such a way that the distance to a reference point is penalized. In the Aquaticus competition the natural selection of $\ell$ would consist of delta distributions, awarding points for discrete-time events. To obtain cost functions which are tractable when it comes to the numerical solution of OCP (3), we consider alternative stage costs, approximating the goals of the game and dividing the game into different components. Specifically, we use Dubins curves (see Dubins (1957)), which can be characterized through solutions of minimum time and shortest path OCPs.
Consider the minimum time OCP
$$\begin{aligned} (u^*(\cdot), T^*) = \arg\min\ & \int_0^T 1\, dt \\ \text{s.t.}\ & x(0) = x_0, \quad x(T) = x_G, \\ & \dot{x}(t) = f(x(t), u(t)) \quad \forall\, t \in [0, T], \\ & u(t) \in U \quad \forall\, t \in [0, T] \end{aligned} \qquad (5)$$
returning the optimal open-loop input and the minimal time $T^* \in \mathbb{R}_{\ge 0}$ to transition from an initial state $x_0$ to an arbitrary target state $x_G \in \mathbb{R}^3$ while satisfying the input constraints $U$. The corresponding optimal trajectory is denoted by $x^*(\cdot)$, i.e., $x^*(0) = x_0$ and $\dot{x}^*(t) = f(x^*(t), u^*(t))$ for almost all $t \in [0, T^*]$.
Theorem 1. ((Lynch and Park, 2017, Thm. 13.11)). Consider the dynamics (1) together with the input constraints
$$U = \{c_1\} \times [-c_2, c_2] \qquad (6)$$
for $c_1, c_2 \in \mathbb{R}_{>0}$. Additionally consider the OCP (5) for arbitrary $x_0, x_G \in \mathbb{R}^3$. Then the shortest path $x^*(\cdot)$ corresponding to the optimal solution of (5) consists only of arcs at the minimum turning radius $r = \tfrac{c_1}{c_2}$ and straight-line segments. Moreover, denoting a circular arc segment as $C$ and a straight-line segment as $S$, the shortest path between any two configurations follows either
• the sequence $CSC$; or
• the sequence $CC_\alpha C$, where $C_\alpha$ indicates a circular arc of angle $\alpha > \pi$.
Any of the $C$ or $S$ segments can be of length zero.
The original result was published in Dubins (1957). Based on this result, the optimal open-loop input can be written as
$$u_{B,1}^*(t) = c_1, \ t \in [0, T^*], \qquad u_{B,2}^*(t) = \begin{cases} \omega_1, & t \in [0, t_1), \\ \omega_2, & t \in [t_1, t_2), \\ \omega_3, & t \in [t_2, T^*], \end{cases} \qquad (7)$$
for $t_1, t_2 \in [0, T^*]$ selected appropriately and $\omega_i \in \{-c_2, 0, c_2\}$. For (6) defined in Theorem 1, the minimal turning radius is given by
$$r = \frac{c_1}{c_2}. \qquad (8)$$
The input constraints (6) imply that the robot is driving
with a constant speed, but with a degree of freedom in the
angular velocity. The constraints (6) are consistent with
the setup of the 2022 Aquaticus competition. The same
result is true for the slightly more general input constraints
$$U = \left\{u \in \mathbb{R}^2 \ \middle|\ 0 \le u_1 \le c_1,\ |u_2| \le \tfrac{c_2}{c_1} u_1\right\}, \qquad (9)$$
i.e., the shortest path can be characterized through Theorem 1 and the shortest path can be followed in minimum time by selecting the input (7). The minimal turning radius $r$ does not change.
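The relation $r = c_1/c_2$ in (8) can also be checked numerically: applying the $C$-segment of the input (7), i.e., constant forward speed $c_1$ and maximal angular rate $c_2$, to the dynamics (1) traces a circle of radius $c_1/c_2$. The sketch below does this with an explicit Euler integration; the chosen values of $c_1$, $c_2$ and the step size are illustrative.

```python
# Sketch verifying the minimal turning radius (8) numerically: the C-segment of
# the optimal input (7) drives the unicycle (1) on a circle of radius c1/c2.
# The parameter values and the integration step are illustrative assumptions.
import numpy as np

c1, c2 = 15.0, 2.0                 # constant speed and maximal angular rate in (6)
dt = 1e-3
n = int(2 * np.pi / (c2 * dt))     # number of steps for one full turn

x = np.array([0.0, 0.0, 0.0])      # start at the origin, heading along +x
points = [x[:2].copy()]
for _ in range(n):
    u = (c1, c2)                                   # C-segment of the optimal input (7)
    x = x + dt * np.array([u[0] * np.cos(x[2]), u[0] * np.sin(x[2]), u[1]])
    points.append(x[:2].copy())

points = np.array(points)
center = np.array([0.0, c1 / c2])                  # center of the corresponding left turn
radii = np.linalg.norm(points - center, axis=1)
print("r = c1/c2 =", c1 / c2, "; numerical radius ~", radii.mean())
```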
Two illustrations of Dubins curves following the sequences $CSC$ and $CC_\alpha C$, respectively, are shown in Figure 2. The shortest path calculation in Theorem 1 can be done analytically or geometrically and is numerically inexpensive (LaValle, 2006, Chapter 15.3.1). Additionally, extensions to compute reachable sets based on Dubins paths exist, see Patsko et al. (2003). Shortest path and minimal time results as in Theorem 1 with a simple geometric interpretation are not restricted to constraints of the form (6), (9). Results for alternative definitions of (6), (9) leading to similar (but more complicated) switching strategies can
be found in (Lynch and Park, 2017, Section 13.3.3.1), for example.

Fig. 2. Illustration of the two cases of Dubins paths in Theorem 1 ($CSC$ on the left, $CC_\alpha C$ on the right) with a turning radius of $r = 1$.
5. CONTROLLER DESIGNS
In this section we discuss strategies for the 1vs1 Aquaticus competition through controller designs relying on MPC, minimal time (and shortest path) OCPs (5) and variations of Theorem 1. We start with a discussion of defensive strategies before we extend the purely defensive control laws to include maneuvers capturing the opponent's flag.
5.1 Purely defensive strategies
As a first observation, note that under the assumption that the input constraints are given by (9), B can simply stay at the flag position, $p_B(t) = F_B$ for all $t \in \mathbb{R}_{\ge 0}$. In this case it is impossible for R to grab $F_B$ and it is impossible for B to be tagged. Consequently, the blue team cannot lose. Similarly, if the constraints (6) are used but both teams stick to a purely defensive strategy, i.e., both robots never leave their half of the playing field, then the game will naturally end in a draw. Accordingly, we have recommended changing the rules of the game. Note that games like basketball and handball, for example, have rules in place that prevent purely defensive strategies. In particular, the team in possession of the ball only has a limited amount of time to try to score before being penalized.
The first interesting setting occurs when R tries to attack, i.e., $x_R(t) \in X_B$ for some $t \in \mathbb{R}_{\ge 0}$, while B stays in its own half, $x_B(t) \in X_B$ for all $t \in \mathbb{R}_{\ge 0}$. Here, we take the perspective of the defending blue team. For the controller design we consider a slight variation of OCP (5), where only the final position $p_G \in \mathbb{R}^2$ but not the final orientation is fixed:
$$\begin{aligned} (u^*(\cdot), T^*) = \arg\min\ & \int_0^T 1\, dt \\ \text{s.t.}\ & x(0) = x_0, \quad p(T) = p_G, \\ & \dot{x}(t) = f(x(t), u(t)) \quad \forall\, t \in [0, T], \\ & u(t) \in U \quad \forall\, t \in [0, T]. \end{aligned} \qquad (10)$$
In this case, Theorem 1 immediately implies the following
result.
Corollary 2. Consider the dynamics (1) together with the input constraints (6) for $c_1, c_2 \in \mathbb{R}_{>0}$. Additionally, consider OCP (10) for arbitrary $x_0 \in \mathbb{R}^3$ and $p_G \in \mathbb{R}^2$. Then the shortest path $x^*(\cdot)$ corresponding to the optimal solution of (10) consists only of the sequence $CS$ or $CC$, where any of the segments can be of length zero.
Proof. Consider the set of shortest paths corresponding to $x(0) \in \mathbb{R}^3$, $p_G \in \mathbb{R}^2$ fixed and $\theta_G \in [0, 2\pi)$. The set is characterized through Theorem 1. It then follows that, among these paths, the shortest are given by the ones which do not require the final turn $C$ in their maneuver. □
Corollary 2 can be used to define a controller intercepting R on its way to $F_B$. In particular, we consider the controller design in Algorithm 2, relying on OCP (10) with respect to the trajectory of R as well as the trajectory of B.

For $k = 0, 1, 2, \ldots$
(1) Define $x_{B,0} = x_B(\Delta k)$ and $x_{R,0} = x_R(\Delta k)$.
(2) Solve (10) with $x_R(0) = x_{R,0}$ and $p_R(T) = F_B$.
(3) Compute $x_R^*(\delta)$ for $\delta > 0$.
(4) Solve (10) with $x_B(0) = x_{B,0}$ and $p_B(T) = p_R^*(\delta)$.
(5) Define the feedback law
$$\mu_B(t; x_{B,0}) = u_B^*(t), \quad t \in [0, \Delta],\ \Delta > 0. \qquad (11)$$
(6) Compute $x_B(\Delta(k+1))$ as the solution of $\dot{x}_B(t) = f(x_B(t), \mu_B(t; x_{B,0}))$, $x_{B,0} = x_B(\Delta k)$, increment $k$ to $k+1$ and go to (1).
Algorithm 2. Intercepting controller

We assume that R tries to approach $F_B$ on the shortest
path characterized through (10). Since the shortest path only relies on the knowledge of the initial condition $x_{R,0}$, the shortest path $x_R^*(\cdot)$ can be computed by the blue team. Based on $x_R^*(\cdot)$, the position $p_R^*(\delta)$ for $\delta > 0$ can be estimated. Hence, B can track $p_R^*(\delta)$ by again using OCP (10). By implementing the control law in a receding horizon fashion, as suggested in Algorithm 2, the reference point $p_R^*(\delta)$ is updated regularly. Moreover, if the assumption on the trajectory of R is significantly wrong, i.e., R is not approaching $F_B$, then the blue team is not in danger.
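One iteration of Algorithm 2 can be sketched as follows. The helper `dubins_shortest_path`, which is assumed to return a time-parameterized solution of the minimum time problem (10) (e.g., from an off-the-shelf Dubins planner), is a hypothetical placeholder; the values of $\delta$ and $\Delta$ are illustrative.

```python
# Sketch of one receding horizon step of Algorithm 2. The callable
# dubins_shortest_path(x_start, p_goal) is a hypothetical placeholder that
# returns a function t -> state along the minimum time path of OCP (10).
import numpy as np

F_B = np.array([-60.0, 0.0])        # blue flag position from Section 2

def intercept_step(x_B, x_R, dubins_shortest_path, delta=2.0, Delta=0.4):
    """Steps (2)-(6): predict where R will be and steer B towards that point."""
    traj_R = dubins_shortest_path(x_R, F_B)          # step (2): R's shortest path to F_B
    p_R_delta = traj_R(delta)[:2]                    # step (3): predicted position p_R*(delta)
    traj_B = dubins_shortest_path(x_B, p_R_delta)    # step (4): B's path to that point
    return traj_B(Delta)                             # steps (5)-(6): state after one sampling interval
```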
Of course, Algorithm 2 is only a valid strategy in the case that the condition $x_R \in X_B$ is satisfied. We thus propose the following switching strategy changing the goal of the blue team:
• If $x_R(0) \in X_B$ and R is not tagged, then follow the steps in Algorithm 2.
• Otherwise, replace $p_R^*(\delta)$ in step (4) of Algorithm 2 with $F_B$.
The second item ensures that the blue robot stays in a neighborhood around its flag whenever the red team is not attacking. Note that the reference point $F_B$ can be replaced by an alternative strategy. According to Corollary 2, tracking the reference point $F_B$ leads to circular motions of the robot with radius $r = \tfrac{c_1}{c_2}$ passing through $F_B$ every $\tfrac{2\pi}{c_2}$ time units.
Remark 1. Note that the controller design proposed in this section is only a heuristic method, which does not ensure that B stays on the playing field, for example. A rigorous strategy switching between defending the flag and tagging the opponent's robot, depending on the parameters $c_1$ and $c_2$, will be investigated in future work.
5.2 Attacking controller augmentation
In the preceding section, a defensive controller was described whose goal is to either tag R or to ensure that R does not get close to $F_B$. Since a tagged robot is temporarily taken out of the game, the red team is temporarily defenseless if B has successfully tagged R through the strategy derived in Section 5.1. In this section we discuss how this information can be used by B to perform a maneuver capturing $F_R$ and successfully returning it to $X_B$ without the risk of being tagged.
To this end, note that the minimal time $T^* : \mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}_{\ge 0}$ obtained as the solution of (5) can be written as a mapping $(x_0, x_G) \mapsto T^*(x_0, x_G)$ depending on the initial condition and the target position. Using this notation, the minimal time to capture $F_R$ and to return it to $X_B$ is upper bounded by
$$T^B_{cF} = \min\{\, T^*(x_{B,0}, v_1) + T^*(v_1, v_2),\ T^*(x_{B,0}, w_1) + T^*(w_1, w_2) \,\}$$
where $v_1$, $v_2$, $w_1$ and $w_2$ denote the reference points
$$v_1 = \begin{bmatrix} F_{R,1} - 10 \\ F_{R,2} \\ \pi/2 \end{bmatrix}, \quad v_2 = \begin{bmatrix} 0 \\ c_1/c_2 \\ \pi \end{bmatrix}, \quad w_1 = \begin{bmatrix} F_{R,1} - 10 \\ F_{R,2} \\ -\pi/2 \end{bmatrix}, \quad w_2 = \begin{bmatrix} 0 \\ -c_1/c_2 \\ -\pi \end{bmatrix}.$$
Without loss of generality (based on symmetry arguments), assume that $T^B_{cF}$ corresponds to the solution going through the reference points $v_1$ and $v_2$. We denote the corresponding optimal trajectory by $x_B^*(\cdot)$. Under this assumption, considering the same time instant from the perspective of the red team, i.e., when R is tagged, an estimate of the minimal time to regain the tagging property and to catch B is given by
$$T^R_{cB} = T^*(x_{R,0}, v_1) + T^*(v_1, v_2),$$
with the corresponding trajectory denoted by $x_R^*(\cdot)$.
If $T^B_{cF} \le T^R_{cB}$, the condition
$$|p_R^*(t) - p_B^*(t)|_2 \ge 10 \quad \forall\, t \in [T^*(x_{R,0}, v_1),\ T^R_{cB}], \qquad (12)$$
(which is straightforward to verify using a sufficiently fine time discretization of the two trajectories) ensures that R is unable to tag B.
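A possible online check of this feasibility condition is sketched below. The helper `plan`, returning a minimum time value together with a time-parameterized trajectory for the corresponding Dubins problem, is a hypothetical placeholder; the grid spacing is an illustrative choice.

```python
# Sketch of the online feasibility check for the capture-the-flag maneuver.
# plan(x_start, x_goal) is a hypothetical helper returning (T*, trajectory),
# where trajectory(t) gives the state at time t along the minimum time path.
import numpy as np

def maneuver_is_safe(plan, x_B0, x_R0, v1, v2, dt=0.05, d_min=10.0):
    """Check T^B_cF <= T^R_cB and condition (12) on a time grid."""
    T_B1, traj_B1 = plan(x_B0, v1)   # B: from the tagging state to the flag grab point v1
    T_B2, traj_B2 = plan(v1, v2)     # B: from v1 back towards X_B (point v2)
    T_R1, traj_R1 = plan(x_R0, v1)   # R: regain the tagging property near F_R
    T_R2, traj_R2 = plan(v1, v2)     # R: chase B towards v2
    T_B_cF, T_R_cB = T_B1 + T_B2, T_R1 + T_R2
    if T_B_cF > T_R_cB:
        return False
    # Condition (12): |p_R*(t) - p_B*(t)|_2 >= d_min for all t in [T_R1, T_R_cB].
    for t in np.arange(T_R1, T_R_cB, dt):
        p_B = traj_B1(t)[:2] if t <= T_B1 else traj_B2(min(t - T_B1, T_B2))[:2]
        p_R = traj_R1(t)[:2] if t <= T_R1 else traj_R2(t - T_R1)[:2]
        if np.linalg.norm(np.asarray(p_R) - np.asarray(p_B)) < d_min:
            return False
    return True
```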
The maneuver is illustrated in Figure 3 for two selections of the minimal turning radius $r$ in terms of $c_1 = 15$, $c_2 = 2$ and $c_1 = 10$, $c_2 = 2$, respectively. In Figure 4, we can observe that for $r = 7.5$, B successfully returns with the flag, while in the case $r = 5$ the maneuver of B is unsuccessful and B is tagged by R.
Since all the necessary calculations verifying if such a
maneuver is successful or not can be performed online, B
can switch from the defensive strategy in Section 5.1 to the
capture the flag maneuver if the predictions are promising.
Once the maneuver is over, B is able to switch back to a
defensive strategy.
Remark 2. The strategies leading to $x_B^*(\cdot)$ and $x_R^*(\cdot)$ shown in Figures 3 and 4 are not necessarily the best
strategies for both teams. To be more conservative from the perspective of B, the bound 10 in (12), which decides whether a maneuver is predicted to be successful or not, can be increased.
Remark 3. Note that in Figure 4 (right), B still successfully grabs the flag $F_R$ before being tagged. Thus, in the case that grabbing the flag is worth more points than tagging a robot, the maneuver can be considered successful.

Fig. 3. Illustration of a maneuver where B tags R, captures $F_R$ and returns to $X_B$ with the flag. Similarly, R returns to its flag area to regain the tagging property before trying to catch B. On the top, the minimal turning radius satisfies $r = c_1/c_2 = 15/2 = 7.5$ and on the bottom the turning radius satisfies $r = c_1/c_2 = 10/2 = 5$.

Fig. 4. Distances between B and R corresponding to the maneuvers in Figure 3. The black line indicates the time when R regains its tagging property. The red line indicates the critical distance between R and B. For a turning radius of $r = 7.5$, B is able to return with the flag. For $r = 5$, R tags B.
5.3 Optimization of the tagging angle
Section 5.1 discusses a defensive strategy focusing on tagging R or staying close to $F_B$. Section 5.2 outlines conditions which encourage B to temporarily drop the defensive strategy and to perform a capture the flag maneuver. The success of the capture the flag maneuver depends on the minimal turning radius $r$ and on the states $x_B$ and $x_R$ at the time B tags R. In particular, the initial angles $\theta_B$ and $\theta_R$ can lead to a significant advantage or disadvantage in the race to the flag $F_R$. This can be seen in Figure 4, for example, where R first needs to complete half a circle before getting closer to $F_R$, while B only needs to adjust its orientation slightly.
This information can be used in step (4) of Algorithm 2 by replacing OCP (10) with OCP (5) with target state
$$x_G = \begin{bmatrix} p_R^*(\delta) \\ 0 \end{bmatrix},$$
for example. While this increases the chances of a successful capture the flag maneuver, it is also possible that B misses its chance to tag R, since B intends to tag R from a specific angle. To overcome this potential issue, one can switch between OCP (5) and (10) in step (4) of Algorithm 2.
As a trigger to switch between the objectives, the relation between the times
$$T^R_{gF} = T^*(x_{R,0}, F_B) - \tfrac{10}{c_1}, \qquad T^B_{tR} = T^*(x_{B,0}, p_{R,0}) - \tfrac{10}{c_1},$$
representing the minimal time for R to grab $F_B$ and the minimal time for B to tag R, respectively, can be used. Here, with a slight abuse of notation, $T^* : \mathbb{R}^3 \times \mathbb{R}^2 \to \mathbb{R}_{\ge 0}$ corresponds to (10) instead of (5).
• If $T^B_{tR} - T^R_{gF} > 0$ is large, B can focus on improving the tagging angle (i.e., use OCP (5));
• otherwise B needs to focus on tagging R (i.e., use OCP (10)).
The term "large", which triggers the switching behavior, needs to be analytically investigated in future work; a sketch of the switching rule is given below.
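A minimal sketch of this switching rule follows, assuming a routine `min_time_to_point` for the minimum time of OCP (10) is available; the helper name and the threshold for "large" are hypothetical, the latter being exactly the tuning parameter left for future work.

```python
# Sketch of the objective switching rule of Section 5.3. min_time_to_point is a
# hypothetical helper for the minimum time of OCP (10); the threshold `large`
# is an assumed tuning parameter.
import numpy as np

def choose_objective(min_time_to_point, x_B0, x_R0, F_B, c1, large=5.0):
    """Decide which OCP to use in step (4) of Algorithm 2."""
    T_R_gF = min_time_to_point(x_R0, F_B) - 10.0 / c1                    # R grabs F_B
    T_B_tR = min_time_to_point(x_B0, np.asarray(x_R0)[:2]) - 10.0 / c1   # B tags R
    if T_B_tR - T_R_gF > large:
        return "OCP (5)"    # improve the tagging angle
    return "OCP (10)"       # focus on tagging R as fast as possible
```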
6. CONCLUDING REMARKS
A variation of the controller design outlined in this pa-
per has outperformed other reinforcement learning based
approaches in the 2022 Aquaticus competition. Instead
of trying to learn optimal control strategies only relying
on the rules of the game and the robot dynamics, we
have defined suboptimal control laws tuned for particular
components (i.e., attacking or defending) of the game,
simply by considering which maneuvers are achievable by a
single robot. The control laws are based on classical results
in terms of Dubins curves and switching strategies. While
the results are currently restricted to the 1vs1 setting,
future work will focus on extensions of collaborative con-
trollers for teams consisting of multiple robots. Moreover,
even though we have not used reinforcement learning in
the controller design discussed here, incorporating Dubins
curves and their optimality properties in learning based
controller designs will be a second stream of future work.
REFERENCES
Alamir, M. (2022). Learning against uncertainty in control
engineering. Annual Reviews in Control, 53, 19–29.
Benjamin, M.R., Schmidt, H., Newman, P.M., and
Leonard, J.J. (2010). Nested autonomy for unmanned
marine vehicles with MOOS-IvP. Journal of Field
Robotics, 27(6), 834–875.
Bertsekas, D. (2019). Reinforcement learning and optimal
control. Athena Scientific.
Bertsekas, D. (2022). Lessons from AlphaZero for Opti-
mal, Model Predictive, and Adaptive Control. Athena
Scientific.
Dubins, L.E. (1957). On curves of minimal length with
a constraint on average curvature, and with prescribed
initial and terminal positions and tangents. American Journal of Mathematics, 79(3), 497–516.
Grüne, L. and Pannek, J. (2017). Nonlinear Model Pre-
dictive Control, volume 2. Springer.
Hansen, L.P. and Sargent, T.J. (2011). Robustness. Princeton University Press.
Kouvaritakis, B. and Cannon, M. (2016). Model predictive
control. Springer International Publishing.
LaValle, S.M. (2006). Planning Algorithms. Cambridge
University Press.
Lynch, K.M. and Park, F.C. (2017). Modern Robotics.
Cambridge University Press.
Patsko, V.S., Pyatko, S.G., and Fedotov, A.A. (2003).
Three-dimensional reachability set for a nonlinear con-
trol system. Journal of Computer and Systems Sciences
International C/c of Tekhnicheskaia Kibernetika, 42(3),
320–328.
Rawlings, J.B., Mayne, D.Q., and Diehl, M. (2017). Model
Predictive Control: Theory, Computation, and Design,
volume 2. Nob Hill Publishing.
Sznaier, M., Olshevsky, A., and Sontag, E.D. (2022). The
role of systems theory in control oriented learning. In
25th International Symposium on Mathematical Theory
of Networks and Systems. (extended abstract).