Online Optimal Control with Linear Dynamics and
Predictions: Algorithms and Regret Analysis
Yingying Li
SEAS
Harvard University
Cambridge, MA, 02138
yingyingli@g.harvard.edu
Xin Chen
SEAS
Harvard University
Cambridge, MA, 02138
chen_xin@g.harvard.edu
Na Li
SEAS
Harvard University
Cambridge, MA, 02138
nali@seas.harvard.edu
Abstract
This paper studies the online optimal control problem with time-varying convex
stage costs for a time-invariant linear dynamical system, where a finite look-ahead
window with accurate predictions of the stage costs is available at each time.
We design online algorithms, Receding Horizon Gradient-based Control (RHGC),
that utilize the predictions through finite steps of gradient computations. We
study the algorithm performance measured by dynamic regret: the online perfor-
mance minus the optimal performance in hindsight. It is shown that the dynamic
regret of RHGC decays exponentially with the size of the look-ahead window. In
addition, we provide a fundamental limit of the dynamic regret for any online
algorithm by considering linear quadratic tracking problems. The regret upper
bound of one RHGC method almost reaches the fundamental limit, demonstrating
the effectiveness of the algorithm. Finally, we numerically test our algorithms for
both linear and nonlinear systems to show the effectiveness and generality of our
RHGC.
1 Introduction
In this paper, we consider an $N$-horizon discrete-time sequential decision-making problem. At each time $t = 0, \ldots, N-1$, the decision maker observes a state $x_t$ of a dynamical system and receives a $W$-step look-ahead window of future cost functions on states and control actions, i.e., $f_t(x) + g_t(u), \ldots, f_{t+W-1}(x) + g_{t+W-1}(u)$; it then decides the control input $u_t$, which drives the system to a new state $x_{t+1}$ following some known dynamics. For simplicity, we consider a linear time-invariant (LTI) system $x_{t+1} = A x_t + B u_t$ with $(A, B)$ known in advance. The goal is to minimize the overall cost over the $N$ time steps. This problem finds many applications in sequential decision making, e.g., data center management [1, 2], robotics [3], autonomous driving [4, 5], energy systems [6], and manufacturing [7, 8]. Therefore, there has been growing interest in this problem from both the control and online optimization communities.
In the control community, studies on the above problem focus on Economic Model Predictive Control (EMPC), a variant of Model Predictive Control (MPC) whose primary goal is optimizing economic costs [9, 10, 11, 12, 13, 14, 15, 16]. Recent years have seen considerable attention on the optimality performance of EMPC, under both time-invariant costs [17, 18, 19] and time-varying costs [20, 12, 14, 21, 22]. However, most studies focus on asymptotic performance, and there is still limited understanding of non-asymptotic performance, especially under time-varying costs. Moreover, for computationally efficient algorithms, e.g., suboptimal MPC and inexact MPC [23, 24, 25, 26], there is limited work on optimality guarantees.
In online optimization, on the contrary, there are many papers on nonasymptotic performance analysis measured by regret, e.g., static regret [27, 28] and dynamic regret [29], but most of this work does not consider predictions and/or dynamical systems. Motivated by applications with predictions, e.g., predictions of electricity prices in data center management [30, 31], there is growing interest in studying the effect of predictions on online problems [32, 33, 30, 34, 31, 35, 36]. However, although some papers consider switching costs, which can be viewed as a simple and special dynamical model [37, 36], there is a lack of study of general dynamical systems and of how predictions affect online problems with dynamical systems.
In this paper, we propose novel gradient-based online algorithms, receding horizon gradient-based control (RHGC), and provide nonasymptotic optimality guarantees in terms of dynamic regret. RHGC can be based on any gradient method, such as vanilla gradient descent, Nesterov's accelerated gradient, or triple momentum [38, 39]. Due to the space limit, this paper only presents the receding horizon triple momentum (RHTM) method. For the theoretical analysis, we assume the cost functions are strongly convex and smooth, although applying RHGC does not require these conditions. Specifically, we show that the regret bound of RHTM decays exponentially fast with the size $W$ of the prediction window, demonstrating that our algorithm efficiently utilizes the predictions. Besides, our regret bound also decreases when the system becomes more controllable in the sense of a controllability index [40]. Moreover, we provide a fundamental limit for any online control algorithm and show that this fundamental lower bound almost matches the regret upper bound of RHTM. This indicates that RHTM achieves near-optimal performance, at least in the worst case. We also discuss linear quadratic tracking, a control problem widely considered in the literature, to provide a more intuitive interpretation of our results. Finally, we numerically test our algorithms. In addition to linear systems, we also apply RHGC to a nonlinear dynamical system, a two-wheeled robot, for path tracking. The results show that our algorithm works effectively for nonlinear systems even though we only present the algorithm and theoretical analysis for LTI systems.
Lastly, we would like to mention that there has been some recent work on online linear quadratic regulator (LQR) problems, but most papers focus on the no-prediction case [41, 42, 37]. As we show later in this paper, these algorithms can be used within our RHGC methods as initialization oracles. Moreover, our regret analysis shows that RHGC can reduce the regret of these no-prediction online algorithms by a factor that decays exponentially with the prediction window size $W$.
Notations. For matrices $A$ and $B$, $A \succeq B$ means $A - B$ is positive semidefinite. The norm $\|\cdot\|$ refers to the $L_2$ norm. Let $x^i$ denote the $i$th entry of the vector $x$. For a set $I = \{k_1, \ldots, k_m\}$, $x^I = (x^{k_1}, \ldots, x^{k_m})^\top$, and $A(I,:)$ denotes the rows of $A$ indexed by $I$ stacked together.
2 Problem formulation and preliminaries
Consider a finite-horizon discrete-time optimal control problem with time-varying cost functions $f_t(x_t) + g_t(u_t)$ and a linear time-invariant (LTI) dynamical system:
$$\min_{\mathbf{x},\mathbf{u}} \; J(\mathbf{x},\mathbf{u}) = \sum_{t=0}^{N-1}\big[f_t(x_t) + g_t(u_t)\big] + f_N(x_N) \quad \text{s.t.} \quad x_{t+1} = A x_t + B u_t, \; t \ge 0 \qquad (1)$$
where $x_t \in \mathbb{R}^n$, $u_t \in \mathbb{R}^m$ for all $t$, $\mathbf{x} = (x_1^\top, \ldots, x_N^\top)^\top$, $\mathbf{u} = (u_0^\top, \ldots, u_{N-1}^\top)^\top$, $x_0$ is given, $N$ is the problem horizon, and $f_N(x_N)$ is the terminal cost. Solving the optimal control problem (1) requires knowledge of all the cost functions from $t = 0$ to $t = N$. However, at each time $t$, usually only a finite look-ahead window of cost functions is available and the decision maker needs to make an online decision $u_t$ using the available information.
In particular, we consider a simplified prediction model: at each time $t$, the decision maker is provided with accurate predictions for the next $W$ time steps, $f_t, g_t, \ldots, f_{t+W-1}, g_{t+W-1}$, but no further predictions beyond these $W$ time steps, which means that $f_{t+W}, g_{t+W}, \ldots$ can even be adversarially generated. Although this prediction model may be too optimistic in the short term and overly pessimistic in the long term, it i) captures a commonly observed phenomenon in predictions, namely that short-term predictions are usually much more accurate than long-term predictions, and ii) allows researchers to derive insights into the role of prediction and possibly extend them to more complicated cases [31, 30, 43, 44].
The online optimal control problem is described as follows: at each time step $t = 0, 1, \ldots$,
- The agent observes state $x_t$ and receives predictions $f_t, g_t, \ldots, f_{t+W-1}, g_{t+W-1}$.
- The agent decides and implements a control $u_t$ and suffers the cost $f_t(x_t) + g_t(u_t)$.
- The system evolves to the next state $x_{t+1} = A x_t + B u_t$.¹
An online control algorithm, denoted as $\mathcal{A}$, can be defined as a mapping from the prediction information and history information to the control action, denoted by $u_t(\mathcal{A})$:
$$u_t(\mathcal{A}) = \mathcal{A}\big(x_t(\mathcal{A}), \ldots, x_0(\mathcal{A}), \{f_s, g_s\}_{s=0}^{t+W-1}\big), \quad t \ge 0 \qquad (2)$$
where $x_t(\mathcal{A})$ is the state generated by implementing $\mathcal{A}$ and $x_0(\mathcal{A}) = x_0$ is given.
This paper evaluates the performance of online control algorithms by comparing against the optimal control cost $J^*$ in hindsight:
$$J^* := \min_{(\mathbf{x},\mathbf{u}):\, x_{t+1} = A x_t + B u_t} J(\mathbf{x},\mathbf{u}). \qquad (3)$$
The performance metric considered in this paper for an online algorithm $\mathcal{A}$ is²
$$\text{Regret}(\mathcal{A}) := J(\mathcal{A}) - J^* = J(\mathbf{x}(\mathcal{A}), \mathbf{u}(\mathcal{A})) - J^*, \qquad (4)$$
which is sometimes called dynamic regret [29, 45] or competitive difference [46]. Another popular regret notion is the static regret, which compares the online performance with the optimal static controller/policy [42, 41]. The benchmark in static regret is weaker than that in dynamic regret because the optimal controller may be far from static, and it has been shown in the literature that $o(N)$ static regret can be achieved even without predictions (i.e., $W = 0$). Thus, we focus on dynamic regret analysis and study how prediction can improve the dynamic regret.
Example 1 (Linear quadratic (LQ) tracking). Consider a discrete-time tracking problem for a system $x_{t+1} = A x_t + B u_t$. The goal is to minimize the quadratic tracking loss with respect to a trajectory $\{\theta_t\}_{t=0}^{N}$:
$$J(\mathbf{x},\mathbf{u}) = \frac{1}{2}\sum_{t=0}^{N-1}\big[(x_t - \theta_t)^\top Q_t (x_t - \theta_t) + u_t^\top R_t u_t\big] + \frac{1}{2}(x_N - \theta_N)^\top Q_N (x_N - \theta_N).$$
In practice, it is usually difficult to know the complete trajectory $\{\theta_t\}_{t=0}^{N}$ a priori; what is revealed is usually only the next few steps, making this an online control problem with predictions.
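To make the offline benchmark $J^*$ in (3) concrete for this example, the following sketch (not from the paper; the problem data and the constant weights $Q_t = Q$, $R_t = R$ are illustrative assumptions) computes the hindsight-optimal LQ tracking cost by eliminating the dynamics and minimizing the resulting unconstrained quadratic in the control sequence. For large $N$ one would use a Riccati recursion instead; the dense formulation here is only meant to make the benchmark explicit.

```python
import numpy as np

def hindsight_optimal_cost(A, B, Q, R, QN, theta, x0, N):
    """Compute J* for Example 1 (constant Q, R): stack x_1..x_N as a linear
    function of u = (u_0, ..., u_{N-1}) and minimize the resulting quadratic.
    theta must be indexable as theta[0..N], each entry an n-vector."""
    n, m = B.shape
    # x_t = A^t x0 + sum_{s<t} A^{t-1-s} B u_s   =>   X = F x0 + G u
    F = np.vstack([np.linalg.matrix_power(A, t) for t in range(1, N + 1)])
    G = np.zeros((N * n, N * m))
    for t in range(1, N + 1):
        for s in range(t):
            G[(t - 1) * n:t * n, s * m:(s + 1) * m] = np.linalg.matrix_power(A, t - 1 - s) @ B
    # Block-diagonal weights for x_1..x_N, with Q_N on the terminal block
    Qblk = np.kron(np.eye(N), Q)
    Qblk[-n:, -n:] = QN
    Rblk = np.kron(np.eye(N), R)
    Theta = np.concatenate([theta[t] for t in range(1, N + 1)])
    # Minimize 0.5*(F x0 + G u - Theta)' Qblk (.) + 0.5 * u' Rblk u
    H = G.T @ Qblk @ G + Rblk
    b = G.T @ Qblk @ (Theta - F @ x0)
    u = np.linalg.solve(H, b)
    resid = F @ x0 + G @ u - Theta
    # Add back the t = 0 tracking term, which does not depend on u
    return (0.5 * resid @ Qblk @ resid + 0.5 * u @ Rblk @ u
            + 0.5 * (x0 - theta[0]) @ Q @ (x0 - theta[0]))

# Illustrative usage
n, m, N = 2, 1, 20
A = np.array([[0.0, 1.0], [0.5, 0.3]]); B = np.array([[0.0], [1.0]])
Q, R, QN = np.eye(n), 0.1 * np.eye(m), np.eye(n)
theta = [np.array([np.sin(0.2 * t), 0.0]) for t in range(N + 1)]
print(hindsight_optimal_cost(A, B, Q, R, QN, theta, np.zeros(n), N))
```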
Assumptions and some useful concepts. Firstly, we introduce a standard assumption in control theory: controllability of the system, which roughly means that the system can be steered to any state by appropriate control inputs [47].
Assumption 1. The LTI system $x_{t+1} = A x_t + B u_t$ is controllable.
It is well known that any controllable LTI system can be linearly transformed into a canonical form [40], and the linear transformation can be computed efficiently a priori using $A$ and $B$; it can further be used to reformulate the cost functions $f_t, g_t$. Thus, without loss of generality, this paper only considers LTI systems in the canonical form, defined as follows.
Definition 1 (Canonical form). A system $x_{t+1} = A x_t + B u_t$ is said to be in the canonical form if there exist indices $0 = k_0 < k_1 < \cdots < k_m = n$ such that every row $i \notin I := \{k_1, \ldots, k_m\}$ of $A$ is the unit row vector $e_{i+1}^\top$ (a single 1 in column $i+1$, zeros elsewhere), the rows $k_1, \ldots, k_m$ of $A$ may contain (possibly) nonzero entries, each represented by $*$, and $B$ is zero everywhere except that its row $k_j$ equals the unit row vector $e_j^\top$ for $j = 1, \ldots, m$. In other words, $A$ consists of $m$ companion-like diagonal blocks with ones on their superdiagonals and $*$ entries in their last rows, and the rows of $B$ containing a 1 are exactly the rows of $A$ containing $*$ entries; the indices of these rows are denoted by $\{k_1, \ldots, k_m\} =: I$. Moreover, let $p_i = k_i - k_{i-1}$ for $1 \le i \le m$, where $k_0 = 0$. The controllability index of a canonical-form $(A, B)$ is defined as $p = \max\{p_1, \ldots, p_m\}$.
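As a quick illustration of Assumption 1 and the index just defined, the sketch below (an assumed helper, not part of the paper; the numerical values of the $*$ entries are arbitrary) computes $p$ as the smallest $k$ with $\mathrm{rank}([B, AB, \ldots, A^{k-1}B]) = n$, which coincides with $\max_i p_i$ for a controllable canonical-form pair.

```python
import numpy as np

def controllability_index(A, B, tol=1e-9):
    """Return the smallest k with rank([B, AB, ..., A^{k-1}B]) = n,
    raising an error if (A, B) is not controllable (Assumption 1)."""
    n = A.shape[0]
    blocks, Ak_B = [], B.copy()
    for k in range(1, n + 1):
        blocks.append(Ak_B)
        if np.linalg.matrix_rank(np.hstack(blocks), tol=tol) == n:
            return k
        Ak_B = A @ Ak_B
    raise ValueError("(A, B) is not controllable")

# The n = 2, m = 1 canonical-form system used later in Example 2 (p = 2)
a1, a2 = 0.5, 0.3                       # illustrative '*' entries
A = np.array([[0.0, 1.0], [a1, a2]])
B = np.array([[0.0], [1.0]])
print(controllability_index(A, B))      # -> 2
```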
¹ Different from many learning-based control papers, we assume $A, B$ are known to the agent. We also assume the full state $x_t$ is observable. Relaxing these information requirements is left as future work.
² The optimality gap depends on the initial state $x_0$, but we omit $x_0$ for simplicity of notation.
Next, we introduce assumptions on the cost functions and their optimal solutions.
Assumption 2. $f_t$ is $\mu_f$-strongly convex and $l_f$-Lipschitz smooth for $0 \le t \le N$, and $g_t$ is $\mu_g$-strongly convex and $l_g$-Lipschitz smooth for $0 \le t \le N-1$, for some $\mu_f, \mu_g, l_f, l_g > 0$.
Assumption 3. The minimizers of $f_t$ and $g_t$, denoted as $\theta_t = \arg\min_x f_t(x)$ and $\xi_t = \arg\min_u g_t(u)$, are uniformly bounded, i.e., there exist $\bar\theta, \bar\xi$ such that $\|\theta_t\| \le \bar\theta$ and $\|\xi_t\| \le \bar\xi$ for all $t$.
These assumptions are commonly adopted in convex analysis. The uniform bounds rule out extreme cases. Notice that the LQ tracking problem in Example 1 satisfies Assumptions 2 and 3 if $Q_t, R_t$ are positive definite with uniformly bounded eigenvalues and $\theta_t$ are uniformly bounded for all $t$.
3 Online control algorithms: Receding horizon gradient-based control
This section introduces our online control algorithms, receding horizon gradient-based control (RHGC). The design proceeds by first converting the online control problem to an equivalent online optimization problem with finite temporal-coupling costs, and then designing gradient-based online optimization algorithms that exploit this finite temporal-coupling property.
3.1 Problem transformation
Firstly, we notice that the offline optimal control problem (1) can be viewed as an optimization with
equality constraints over xand u. The individual stage cost ft(xt) + gt(ut)only depends on the
current xtand utbut the equality constraints couple xt,utwith xt+1 for each t. In the following,
we will rewrite (1) in an equivalent form of an unconstrained optimization problem on some entries
of xt, but the new stage cost at each time twill depend on these new entries across a few nearby
time steps. We will harness this structure to design our online algorithm.
In particular, the entries of $x_t$ adopted in the reformulation are $x_t^{k_1}, \ldots, x_t^{k_m}$, where $I = \{k_1, \ldots, k_m\}$ is defined in Definition 1. For ease of notation, we define
$$z_t := (x_t^{k_1}, \ldots, x_t^{k_m})^\top, \quad t \ge 0, \qquad (5)$$
and write $z_t^j = x_t^{k_j}$ for $j = 1, \ldots, m$. Let $\mathbf{z} := (z_1^\top, \ldots, z_N^\top)^\top$. By the canonical-form equality constraint $x_t = A x_{t-1} + B u_{t-1}$, we have $x_t^i = x_{t-1}^{i+1}$ for $i \notin I$, so $x_t$ can be represented by $z_{t-p+1}, \ldots, z_t$ in the following way:
$$x_t = \big(\underbrace{z^1_{t-p_1+1}, \ldots, z^1_t}_{p_1},\ \underbrace{z^2_{t-p_2+1}, \ldots, z^2_t}_{p_2},\ \ldots,\ \underbrace{z^m_{t-p_m+1}, \ldots, z^m_t}_{p_m}\big)^\top, \quad t \ge 0, \qquad (6)$$
where $z_t$ for $t \le 0$ is determined by $x_0$ in a way that makes (6) hold for $t = 0$. For ease of mathematical exposition and without loss of generality, we consider $x_0 = 0$ in this paper; then $z_t = 0$ for $t \le 0$. Similarly, $u_t$ is determined by $z_{t-p+1}, \ldots, z_t, z_{t+1}$ via
$$u_t = z_{t+1} - A(I,:)\,x_t = z_{t+1} - A(I,:)\big(z^1_{t-p_1+1}, \ldots, z^1_t, \ldots, z^m_{t-p_m+1}, \ldots, z^m_t\big)^\top, \quad t \ge 0, \qquad (7)$$
where $A(I,:)$ consists of rows $k_1, \ldots, k_m$ of $A$.
Notice that equations (5), (6), (7) describe a one-to-one transformation between $(\mathbf{x}, \mathbf{u})$ and $\mathbf{z}$. Therefore, we can transform the constrained optimization problem (1) over $(\mathbf{x}, \mathbf{u})$ into an optimization problem over $\mathbf{z}$. Furthermore, because the LTI constraint $x_{t+1} = A x_t + B u_t$ is naturally embedded in the relations (6) and (7), the resulting optimization problem over $\mathbf{z}$ is unconstrained. Specifically, the new cost functions are obtained by substituting (6), (7) into $f_t(x_t)$ and $g_t(u_t)$. We denote the corresponding cost functions as $\tilde f_t(z_{t-p+1}, \ldots, z_t) := f_t(x_t)$ and $\tilde g_t(z_{t-p+1}, \ldots, z_t, z_{t+1}) := g_t(u_t)$. Then the unconstrained optimization problem's objective function can be written as
$$C(\mathbf{z}) := \sum_{t=0}^{N} \tilde f_t(z_{t-p+1}, \ldots, z_t) + \sum_{t=0}^{N-1} \tilde g_t(z_{t-p+1}, \ldots, z_{t+1}). \qquad (8)$$
$C(\mathbf{z})$ has many nice properties, some of which are formally stated below.
Lemma 1. $C(\mathbf{z})$ has the following properties:
i) $C(\mathbf{z})$ is $\mu_c$-strongly convex with $\mu_c = \mu_f$ and $l_c$-smooth with $l_c = p\,l_f + (p+1)\,l_g\,\|(I_m, -A(I,:))\|^2$;
ii) for any $(\mathbf{x}, \mathbf{u})$ such that $x_{t+1} = A x_t + B u_t$, we have $C(\mathbf{z}) = J(\mathbf{x}, \mathbf{u})$, where $\mathbf{z}$ is defined in (5); conversely, for any $\mathbf{z}$, the corresponding $(\mathbf{x}, \mathbf{u})$ defined in (6), (7) satisfies $x_{t+1} = A x_t + B u_t$ and $J(\mathbf{x}, \mathbf{u}) = C(\mathbf{z})$;
iii) each stage cost $\tilde f_t + \tilde g_t$ in (8) only depends on $z_{t-p+1}, \ldots, z_{t+1}$.
Property ii) implies that any online algorithm for deciding $\mathbf{z}$ can be translated into an online algorithm for $\mathbf{x}$ and $\mathbf{u}$ by (6), (7) with the same costs. Property iii) highlights a key property of $C(\mathbf{z})$, local temporal coupling, which serves as a foundation for our online algorithm design.
Example 2. For illustration, consider the following dynamical system with $n = 2$, $m = 1$:
$$\begin{pmatrix} x^1_{t+1} \\ x^2_{t+1} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ a_1 & a_2 \end{pmatrix}\begin{pmatrix} x^1_t \\ x^2_t \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u_t. \qquad (9)$$
Here, $k_1 = 2$, $I = \{2\}$, $A(I,:) = (a_1, a_2)$, and $z_t = x^2_t$. Equation (9) gives $x^1_t = x^2_{t-1}$ and $x_t = (z_{t-1}, z_t)^\top$. Similarly, $u_t = x^2_{t+1} - A(I,:)x_t = z_{t+1} - A(I,:)(z_{t-1}, z_t)^\top$. Hence, $\tilde f_t(z_{t-1}, z_t) = f_t(x_t) = f_t((z_{t-1}, z_t)^\top)$ and $\tilde g_t(z_{t-1}, z_t, z_{t+1}) = g_t(u_t) = g_t(z_{t+1} - A(I,:)(z_{t-1}, z_t)^\top)$.
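The reformulation in Example 2 can be made concrete with a short sketch (not the authors' code; the quadratic stage costs and the values of $a_1, a_2$ are illustrative assumptions): given a $z$-sequence, it recovers $(x_t, u_t)$ via (6)-(7) and evaluates $C(\mathbf{z}) = J(\mathbf{x}, \mathbf{u})$.

```python
import numpy as np

a1, a2 = 0.5, 0.3                  # illustrative '*' entries of A
A_I = np.array([a1, a2])           # A(I, :) with I = {2}
N = 10
rng = np.random.default_rng(0)
theta = rng.standard_normal(N + 1)             # illustrative targets for x^2
f = lambda t, x: 0.5 * (x[1] - theta[t]) ** 2  # f_t(x_t), acts on the full state
g = lambda t, u: 0.05 * u ** 2                 # g_t(u_t)

def C(z):
    """z = (z_1, ..., z_N); z_0 = z_{-1} = 0 since x_0 = 0 as in the paper."""
    zp = np.concatenate(([0.0, 0.0], z))       # pad so that zp[t + 1] = z_t
    cost = 0.0
    for t in range(N + 1):                     # tilde f_t depends on (z_{t-1}, z_t)
        x_t = np.array([zp[t], zp[t + 1]])     # x_t = (z_{t-1}, z_t)
        cost += f(t, x_t)
    for t in range(N):                         # tilde g_t also needs z_{t+1}
        x_t = np.array([zp[t], zp[t + 1]])
        u_t = zp[t + 2] - A_I @ x_t            # u_t = z_{t+1} - A(I,:) x_t
        cost += g(t, u_t)
    return cost

print(C(np.zeros(N)))              # cost of keeping the state at the origin
```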
3.2 Online algorithm design: RHGC
This section introduces our RHGC algorithm, based on the reformulation (8) and inspired by the online algorithm RHGD in [36]. As mentioned earlier, any online algorithm on $z_t$ can be translated into an online algorithm on $x_t, u_t$, so we focus on designing an online algorithm on $z_t$. By the finite temporal-coupling property of $C(\mathbf{z})$, the partial gradient of the total cost $C(\mathbf{z})$ with respect to $z_t$ only depends on the finite local stage costs $\tilde f_\tau$ ($\tau = t, \ldots, t+p-1$) and $\tilde g_\tau$ ($\tau = t-1, \ldots, t+p-1$) and the finite local stage variables $(z_{t-p}, \ldots, z_{t+p}) =: z_{t-p:t+p}$:
$$\frac{\partial C}{\partial z_t}(\mathbf{z}) = \sum_{s=t}^{t+p-1} \frac{\partial \tilde f_s}{\partial z_t}(z_{s-p+1}, \ldots, z_s) + \sum_{s=t-1}^{t+p-1} \frac{\partial \tilde g_s}{\partial z_t}(z_{s-p+1}, \ldots, z_{s+1}).$$
Without causing any confusion, we write $\frac{\partial C}{\partial z_t}(z_{t-p:t+p})$ for $\frac{\partial C}{\partial z_t}(\mathbf{z})$ to highlight this local dependence. Therefore, even though not all future costs are available, it is still possible to compute the partial gradient of the total cost using only a finite look-ahead window of the cost functions; a finite-difference sketch of this local computation is given below. This observation motivates the design of our receding horizon gradient-based control (RHGC) methods, which are online implementations of gradient methods such as vanilla gradient descent, Nesterov's accelerated gradient, and Triple Momentum [38, 39]. Due to the space limit, we only formally present the Receding Horizon Triple Momentum (RHTM) method in this paper, c.f. Algorithm 1; other RHGC methods can be designed in the same way.
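The sketch below (a hypothetical helper, not from the paper) estimates $\partial C/\partial z_t$ by central finite differences using only the stage costs that actually involve $z_t$, i.e., exactly the costs available inside a look-ahead window; `tilde_f` and `tilde_g` are user-supplied callables implementing the reformulated stage costs of (8).

```python
import numpy as np

def grad_local(t, z, tilde_f, tilde_g, p, N, eps=1e-6):
    """Estimate dC/dz_t from local information only.
    z: array of shape (N + 1, m) holding z_1..z_N (row 0 unused; z_tau = 0 for tau <= 0).
    tilde_f(s, zs): takes the list (z_{s-p+1}, ..., z_s);
    tilde_g(s, zs): takes the list (z_{s-p+1}, ..., z_{s+1})."""
    def zrow(tau):
        return z[tau] if 1 <= tau <= N else np.zeros(z.shape[1])

    def local_cost(zt):
        cost = 0.0
        for s in range(t, min(t + p, N + 1)):          # tilde f_s terms containing z_t
            args = [zt if tau == t else zrow(tau) for tau in range(s - p + 1, s + 1)]
            cost += tilde_f(s, args)
        for s in range(max(t - 1, 0), min(t + p, N)):  # tilde g_s terms containing z_t
            args = [zt if tau == t else zrow(tau) for tau in range(s - p + 1, s + 2)]
            cost += tilde_g(s, args)
        return cost

    g = np.zeros_like(z[t])
    for i in range(len(g)):
        e = np.zeros_like(g); e[i] = eps
        g[i] = (local_cost(z[t] + e) - local_cost(z[t] - e)) / (2 * eps)
    return g
```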
In RHTM, $j$ refers to the iteration number of the corresponding gradient update of $C(\mathbf{z})$.
Algorithm 1: Receding Horizon Triple Momentum (RHTM)
1: inputs: canonical form $(A, B)$, $W \ge 1$, $K = \lfloor\frac{W-1}{p}\rfloor$, step sizes $\gamma_c, \gamma_z, \gamma_w, \gamma_y > 0$, oracle $\varphi$.
2: for $t = 1-W, \ldots, N-1$ do
3: Step 1: initialize $z_{t+W}(0)$ by oracle $\varphi$, then set $\omega_{t+W}(-1) = \omega_{t+W}(0) = y_{t+W}(0) = z_{t+W}(0)$.
4: for $j = 1, \ldots, K$ do
5: Step 2: update $\omega_{t+W-jp}(j)$, $y_{t+W-jp}(j)$, $z_{t+W-jp}(j)$ by Triple Momentum:
$$\omega_{t+W-jp}(j) = (1+\gamma_w)\,\omega_{t+W-jp}(j-1) - \gamma_w\,\omega_{t+W-jp}(j-2) - \gamma_c\,\frac{\partial C}{\partial y_{t+W-jp}}\big(y_{t+W-(j+1)p:\,t+W-(j-1)p}(j-1)\big)$$
$$y_{t+W-jp}(j) = (1+\gamma_y)\,\omega_{t+W-jp}(j) - \gamma_y\,\omega_{t+W-jp}(j-1)$$
$$z_{t+W-jp}(j) = (1+\gamma_z)\,\omega_{t+W-jp}(j) - \gamma_z\,\omega_{t+W-jp}(j-1)$$
6: end for
7: Step 3: compute $u_t$ from $z_{t+1}(K)$ and the observed state $x_t$: $u_t = z_{t+1}(K) - A(I,:)\,x_t$.
8: end for
There are two major steps to decide $z_t$: i) initializing the decision variables $\mathbf{z}(0), \boldsymbol{\omega}(0), \mathbf{y}(0)$, where $\boldsymbol{\omega}(0), \mathbf{y}(0)$ are auxiliary variables used in the triple momentum method to accelerate convergence. We do not restrict the initialization algorithm $\varphi$; it can be any oracle/online algorithm that does not use predictions: $z_{t+W}(0) = \varphi(\{\tilde f_s, \tilde g_s\}_{s=0}^{t+W-1})$. In Section 4, we provide one such initialization $\varphi$. ii) Using the look-ahead window of predicted costs to conduct gradient updates. We note that the gradient update from $(z_\tau(j), \omega_\tau(j), y_\tau(j))$ to $(z_\tau(j+1), \omega_\tau(j+1), y_\tau(j+1))$ is implemented in a backward order, i.e., from $\tau = t+W$ down to $\tau = t$. Moreover, since the partial gradient $\frac{\partial C}{\partial z_t}$ needs the local variables $z_{t-p:t+p-1}$, given $W$-step predictions, RHTM can only conduct $K = \lfloor\frac{W-1}{p}\rfloor$ iterations of TM on the total cost $C(\mathbf{z})$. For a more intuitive introduction to the RHGC methods, we refer readers to [36], which treats the simple case $p = 1$; a simplified sketch of the vanilla-gradient member of the family (RHGD) is also given below.
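For intuition, here is a simplified sketch (assumptions: it sketches RHGD, the vanilla-gradient member of RHGC, not RHTM itself; `oracle` and `grad_local` are user-supplied, e.g., the finite-difference helper above) of how the initialization and the backward gradient updates interleave over the online time steps.

```python
import numpy as np

def rhgd_plan(N, W, p, m, oracle, grad_local, eta):
    """Return the final iterates z_tau(K), tau = 1..N (row 0 unused).
    oracle(tau) -> initialization z_tau(0);
    grad_local(tau, zmat) -> dC/dz_tau from the current look-ahead window only."""
    K = (W - 1) // p
    z = [np.zeros((N + 1, m)) for _ in range(K + 1)]      # z[j][tau] = z_tau(j)
    for t in range(1 - W, N):                             # online time steps
        if 1 <= t + W <= N:
            z[0][t + W] = oracle(t + W)                   # Step 1: initialization
        for j in range(1, K + 1):                         # Step 2: backward updates
            tau = t + W - j * p
            if 1 <= tau <= N:
                z[j][tau] = z[j - 1][tau] - eta * grad_local(tau, z[j - 1])
        # Step 3 (not simulated here): for t >= 0, apply u_t = z_{t+1}(K) - A(I,:) x_t,
        # which only uses z_{t+1}(K), already available by time t.
    return z[K]
```

The invariant is that $z_\tau(j)$ is produced at time $\tau - W + jp$, so every quantity a gradient update needs has already been computed by the time it is used.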
Though it may appear that RHTM does not fully exploit the predictions, since only a few gradient updates are used, we show in Section 5 that RHTM achieves nearly optimal performance with respect to $W$, which means that our algorithm successfully extracts and utilizes the prediction information.
Finally, we briefly introduce MPC [48] and suboptimal MPC [23] and compare them with our algorithm. MPC solves a $W$-stage optimization at each time $t$ and implements the first control input. Suboptimal MPC, a variant of MPC aimed at reducing computation, runs an optimization method for only a few iterations without solving the optimization completely. Our algorithm's computation requirement is similar to suboptimal MPC with a few gradient iterations. Nevertheless, the major difference is that suboptimal MPC conducts gradient updates for a truncated $W$-stage optimal control problem, while our algorithm conducts gradient updates of the total cost using only $W$-step predictions; that is, it solves the complete $N$-stage optimal control problem, in an online fashion, based on the reformulation (8).
4 Regret upper bound
Because RHTM exactly implements the triple momentum method on $C(\mathbf{z})$ for $K$ iterations, it is straightforward to obtain the following regret guarantee, which connects the regret of RHTM with that of the initialization oracle $\varphi$.
Theorem 1. Consider $W \ge 1$ and let $\zeta = l_c/\mu_c$ denote the condition number of $C(\mathbf{z})$. For any initialization oracle $\varphi$, given step sizes $\gamma_c = \frac{1+\phi}{l_c}$, $\gamma_w = \frac{\phi^2}{2-\phi}$, $\gamma_y = \frac{\phi^2}{(1+\phi)(2-\phi)}$, $\gamma_z = \frac{\phi^2}{1-\phi^2}$, and $\phi = 1 - 1/\sqrt{\zeta}$, we have
$$\text{Regret}(RHTM) \le \zeta^2\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}}\Big)^{2K}\,\text{Regret}(\varphi)$$
where $K = \lfloor\frac{W-1}{p}\rfloor$ and $\text{Regret}(\varphi)$ is the regret of the initial controller $u_t(0) = z_{t+1}(0) - A(I,:)\,x_t(0)$.
Theorem 1 states that for any online algorithm $\varphi$ without predictions, RHTM can use the predictions to lower the regret by a factor of $\zeta^2(\frac{\sqrt\zeta-1}{\sqrt\zeta})^{2K}$ through $K = \lfloor\frac{W-1}{p}\rfloor$ additional gradient updates. Moreover, this factor decays exponentially with $K = \lfloor\frac{W-1}{p}\rfloor$, which is almost a linearly increasing function of $W$. This indicates that RHTM can improve the performance exponentially fast as the prediction window $W$ increases, for any initialization method. In addition, $K = \lfloor\frac{W-1}{p}\rfloor$ decreases with $p$, indicating that the regret increases with the controllability index $p$. This is intuitive because $p$ roughly indicates how fast the controller can influence the system state effectively: the larger $p$ is, the longer it takes (c.f. Definition 1). To see this, consider Example 2. Since $u_{t-1}$ does not directly affect $x^1_t$, it takes at least $p = 2$ steps to steer $x^1_t$ to a desired value.
One initialization method: Follow the Optimal Steady State (FOSS). To complete the regret analysis for RHTM, we provide a simple initialization method, FOSS. As mentioned before, any online control algorithm without predictions, e.g., [42, 41], can be applied as an initialization oracle $\varphi$; however, these papers mostly focus on static regret analysis rather than dynamic regret.
Definition 2 (Follow the Optimal Steady State (FOSS)). The optimal steady state for stage cost $f(x) + g(u)$ refers to $(x^e, u^e) := \arg\min_{x = Ax + Bu}(f(x) + g(u))$. The Follow the Optimal Steady State (FOSS) method solves for the optimal steady state $(x^e_t, u^e_t)$ of the cost $f_t(x) + g_t(u)$ and outputs the $z_{t+1}$ that follows the entries of $x^e_t$ in $I$: $z_{t+1}(FOSS) = x^{e,I}_t$, where $I = \{k_1, \ldots, k_m\}$.
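For quadratic stage costs, the optimal steady state in Definition 2 is the solution of an equality-constrained quadratic program and can be read off its KKT system; the sketch below (not from the paper; the matrices are illustrative) computes $(x^e_t, u^e_t)$ this way and extracts the FOSS output $z_{t+1}(0) = x^{e,I}_t$, with the index set given 0-based.

```python
import numpy as np

def optimal_steady_state(A, B, Q, R, theta):
    """Solve (x^e, u^e) = argmin_{x = Ax + Bu} 0.5 (x-theta)'Q(x-theta) + 0.5 u'Ru
    via the KKT conditions of the equality-constrained quadratic program."""
    n, m = B.shape
    KKT = np.block([
        [Q,                np.zeros((n, m)), (np.eye(n) - A).T],
        [np.zeros((m, n)), R,                -B.T             ],
        [np.eye(n) - A,    -B,               np.zeros((n, n)) ],
    ])
    rhs = np.concatenate([Q @ theta, np.zeros(m), np.zeros(n)])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:n + m]          # x^e, u^e

def foss_init(A, B, Q_t, R_t, theta_t, I_rows):
    """FOSS (Definition 2): z_{t+1}(0) follows the entries of x^e_t in I.
    I_rows holds the (0-based) indices k_1-1, ..., k_m-1 from the canonical form."""
    x_e, _ = optimal_steady_state(A, B, Q_t, R_t, theta_t)
    return x_e[I_rows]
```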
FOSS is motivated by the fact that the optimal steady-state cost is the optimal limiting average cost for LTI systems [49]; thus FOSS should give acceptable performance at least for slowly changing $f_t, g_t$. Nevertheless, we admit that FOSS is proposed mainly for analytical purposes, and other online algorithms may outperform it in various respects. Next, we provide a regret bound for FOSS, which relies on the solution to the Bellman equation.
Definition 3 (Solution to the Bellman equation [50]). Let $\lambda^e$ be the optimal steady-state cost, which is also the optimal limiting average cost (c.f. [49]). The Bellman equation for the optimal limiting-average-cost control problem is $h^e(x) + \lambda^e = \min_u\big(f(x) + g(u) + h^e(Ax + Bu)\big)$. The solution of the Bellman equation, denoted by $h^e(x)$, is sometimes called a bias function [50]. To ensure the uniqueness of the solution, some extra conditions, e.g., $h^e(0) = 0$, are usually imposed.
Theorem 2 (Regret bound of FOSS). Let $(x^e_t, u^e_t)$ and $h^e_t(x)$ denote the optimal steady state and the bias function with respect to the cost $f_t(x) + g_t(u)$, respectively, for $0 \le t \le N-1$. Suppose $h^e_t(x)$ exists for $0 \le t \le N-1$; then the regret of FOSS can be bounded by
$$\text{Regret}(FOSS) = O\Big(\sum_{t=0}^{N}\big(\|x^e_{t-1} - x^e_t\| + h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)\Big)$$
where $\{x^*_t\}_{t=0}^{N}$ denotes the optimal state trajectory, $x^e_{-1} = x^*_0 = x_0$, $h^e_{-1}(x) = 0$, $h^e_N(x) = f_N(x)$, and $x^e_N = \theta_N$. Consequently, by Theorem 1, the regret bound of RHTM with initialization FOSS is
$$\text{Regret}(RHTM) = O\Big(\big(\tfrac{\sqrt\zeta-1}{\sqrt\zeta}\big)^{2K}\sum_{t=0}^{N}\big(\|x^e_{t-1} - x^e_t\| + h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)\Big).$$
Theorem 2 bounds the regret by the variation of the optimal steady states $x^e_t$ and the bias functions $h^e_t$. If $f_t, g_t$ do not change, then $x^e_t, h^e_t$ do not change, resulting in zero regret, which matches our intuition. Though Theorem 2 requires the existence of $h^e_t$, the existence is guaranteed for many control problems, e.g., LQ tracking and control problems with turnpike properties [51, 22].
5 Linear quadratic tracking: regret upper bounds and a fundamental limit
To provide more intuitive meaning for the regret analysis in Theorem 1 and Theorem 2, we apply RHTM to the LQ tracking problem in Example 1. Results for time-varying $Q_t, R_t, \theta_t$ are provided in the appendix; here we focus on a special case that gives clean expressions for the regret bounds, both an upper bound for RHTM with initialization FOSS and a lower bound for any online algorithm. These clean expressions make it easy to see that the lower bound and upper bound almost match each other, implying that our online algorithm RHTM uses the prediction in a nearly optimal way even though it only conducts a few gradient updates at each time step.
The special case of the LQ tracking problem takes the following form:
$$\frac{1}{2}\sum_{t=0}^{N-1}\big[(x_t - \theta_t)^\top Q (x_t - \theta_t) + u_t^\top R u_t\big] + \frac{1}{2} x_N^\top P^e x_N \qquad (10)$$
where $Q \succ 0$, $R \succ 0$, and $P^e$ is the solution to the algebraic Riccati equation with respect to $Q, R$ [52]. Basically, in this special case, $Q_t = Q$, $R_t = R$ for $0 \le t \le N-1$, $Q_N = P^e$, $\theta_N = 0$, and only $\theta_t$, $t = 1, \ldots, N-1$, changes. The LQ tracking problem (10) amounts to following a time-varying trajectory $\theta_t$ with constant weights on the tracking cost and control cost.
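The terminal weight $P^e$ in (10) is the DARE solution for $(A, B, Q, R)$; a minimal sketch (assuming SciPy's `solve_discrete_are`; numerical values are illustrative) computes it and checks the fixed-point form of the Riccati equation (c.f. (19) in Appendix E).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.0, 1.0], [0.5, 0.3]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[0.1]])

P_e = solve_discrete_are(A, B, Q, R)
# Sanity check: P = Q + A'(P - P B (B'PB + R)^{-1} B'P) A
resid = Q + A.T @ (P_e - P_e @ B @ np.linalg.solve(B.T @ P_e @ B + R, B.T @ P_e)) @ A - P_e
print(np.max(np.abs(resid)))      # should be ~ 0
```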
Regret upper bound. Firstly, based on Theorem 1 and Theorem 2, we have the following bound.
Corollary 1. The regret of RHTM with FOSS as the initialization rule can be bounded by
$$\text{Regret}(RHTM) = O\Big(\big(\tfrac{\sqrt\zeta - 1}{\sqrt\zeta}\big)^{2K}\sum_{t=0}^{N}\|\theta_t - \theta_{t-1}\|\Big)$$
where $K = \lfloor(W-1)/p\rfloor$, $\zeta$ is the condition number of the corresponding $C(\mathbf{z})$, and $\theta_{-1} = 0$.
This corollary shows that the regret can be bounded by the total variation of $\theta_t$ for constant $Q, R$.
Fundamental limit. For any online algorithm, we have the following lower bound.
Theorem 3 (Lower bound). Consider $1 \le W \le N/3$, any condition number $\zeta > 1$, any variation budget $2\bar\theta \le L_N \le (2N+1)\bar\theta$, and any controllability index $p \ge 1$. For any online algorithm $\mathcal{A}$, there exists an LQ tracking problem of the form (10) such that the canonical-form system $(A, B)$ has controllability index $p$, the sequence $\{\theta_t\}$ satisfies the variation budget $\sum_{t=1}^{N}\|\theta_t - \theta_{t-1}\| \le L_N$, the corresponding $C(\mathbf{z})$ has condition number $\zeta$, and the following lower bound holds:
$$J(\mathcal{A}) - J^* = \Omega\Big(\big(\tfrac{\sqrt\zeta-1}{\sqrt\zeta+1}\big)^{2K} L_N\Big) = \Omega\Big(\big(\tfrac{\sqrt\zeta-1}{\sqrt\zeta+1}\big)^{2K}\sum_{t=0}^{N}\|\theta_t - \theta_{t-1}\|\Big) \qquad (11)$$
where $K = \lfloor(W-1)/p\rfloor$ and $\theta_{-1} = 0$.
Surprisingly, the lower bound in Theorem 3 and the upper bound in Corollary 1 almost match each other, especially when $\zeta$ is large. This demonstrates that RHTM utilizes the prediction information in a near-optimal way. The main conditions in Theorem 3 require that the prediction window be short compared with the horizon, $W \le N/3$, and that the variation of the cost functions not be too small, $L_N \ge 2\bar\theta$; otherwise the online control problem is too easy and the regret can be very small.
6 Numerical experiments

[Figure 1: Regret for LQ tracking. Plot of log(regret) versus prediction window W for RHGD, RHAG, RHTM, and suboptimal MPC with 1, 3, and 5 iterations.]

[Figure 2: Two-wheel robot tracking with nonlinear dynamics. Reference path and robot path for W = 40 and W = 80.]
LQ tracking problem in Example 1. The experiment settings are provided in the appendix. The LTI system order is $n = 2$ and the control input is a scalar ($m = 1$); thus $p = 2$ for this system. We compare our algorithms with one suboptimal MPC algorithm, fast gradient MPC (subMPC) [23]. Roughly speaking, this algorithm forms the $W$-stage truncated optimal control problem from $t$ to $t+W-1$ and solves it approximately by Nesterov's accelerated gradient descent. One gradient update in subMPC requires $W$ partial-gradient computations since there are $W$ stages of variables; in this sense, our RHTM corresponds to subMPC with one Nesterov iteration. Figure 1 also plots subMPC with 3 and 5 Nesterov iterations. Figure 1 shows that all our algorithms, RHGD, RHAG, and RHTM, achieve exponentially decaying regret with respect to $W$, and the decay is piecewise constant, matching Theorem 1. It is observed that RHTM and RHAG perform better than RHGD, which is intuitive because TM and AG are accelerated versions of GD. Moreover, our algorithms perform much better than suboptimal MPC with one iteration. It is also observed that suboptimal MPC achieves better performance as the iteration number increases, but the improvement saturates as $W$ gets large, in contrast to our RHTM.
Path tracking for a two-wheel mobile robot. Though we presented our online algorithms for LTI systems, our RHGC methods are applicable to nonlinear systems. Here we consider a two-wheel mobile robot with nonlinear kinematic dynamics $\dot x = v\cos\delta$, $\dot y = v\sin\delta$, $\dot\delta = \omega$, where $(x, y)$ is the robot location, $v$ and $\omega$ are the tangential and angular velocities respectively, and $\delta$ denotes the angle between $v$ and the X-axis [53]. The control acts directly on $v$ and $\omega$, e.g., through pulse-width modulation (PWM) of the motors [54]. Given a reference path $(x_r(t), y_r(t))$, the objective is to balance tracking performance and control cost, i.e., $\min \sum_{t=0}^{N} c^e_t\big[(x_t - x_r(t))^2 + (y_t - y_r(t))^2\big] + c^v_t v_t^2 + c^w_t \omega_t^2$. We discretize the dynamics with time interval $\Delta t = 0.025$ s, then follow ideas similar to those in this paper to reformulate the optimal path tracking problem as an unconstrained optimization with respect to $(x_t, y_t)$ and apply the RHGC methods; see the appendix for details. Figure 2 plots the tracking results with windows $W = 40$ and $W = 80$, corresponding to look-ahead times of 1 s and 2 s. A video showing the dynamic process for different $W$ is provided at https://youtu.be/fal56LTBD1s. It is observed that the robot follows the reference trajectory well, especially when the path is smooth, but has some deviations at sharp turns, and a longer look-ahead window leads to better tracking performance. These results confirm that our RHGC works effectively on nonlinear systems.
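For completeness, here is a minimal sketch of the discretized robot model and stage cost used above (assuming a forward-Euler discretization with $\Delta t = 0.025$ s; the cost weights are illustrative placeholders).

```python
import numpy as np

DT = 0.025  # seconds

def robot_step(state, control, dt=DT):
    """state = (x, y, delta), control = (v, omega); forward-Euler step of
    xdot = v cos(delta), ydot = v sin(delta), deltadot = omega."""
    x, y, delta = state
    v, omega = control
    return np.array([x + dt * v * np.cos(delta),
                     y + dt * v * np.sin(delta),
                     delta + dt * omega])

def stage_cost(t, state, control, x_ref, y_ref, ce=1.0, cv=0.01, cw=0.01):
    """Tracking-plus-effort stage cost from the experiment (weights illustrative)."""
    x, y, _ = state
    v, omega = control
    return ce * ((x - x_ref[t]) ** 2 + (y - y_ref[t]) ** 2) + cv * v ** 2 + cw * omega ** 2
```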
7 Conclusion
This paper studies the role of prediction in the dynamic regret of online control problems with linear dynamics. We design the RHTM algorithm and provide a regret upper bound. We also provide a fundamental limit and show that it almost matches RHTM's upper bound. Future work includes the study of 1) nonlinear systems, 2) systems with disturbances and noises, 3) systems with state and control constraints, and 4) unknown system dynamics.
References
[1] Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle.
Data center cooling using model-predictive control. In Advances in Neural Information Pro-
cessing Systems, pages 3814–3823, 2018.
[2] Wei Xu, Xiaoyun Zhu, Sharad Singhal, and Zhikui Wang. Predictive control for dynamic
resource allocation in enterprise data centers. In 2006 IEEE/IFIP Network Operations and
Management Symposium NOMS 2006, pages 115–126. IEEE, 2006.
[3] Tomas Baca, Daniel Hert, Giuseppe Loianno, Martin Saska, and Vijay Kumar. Model predic-
tive trajectory tracking and collision avoidance for reliable outdoor deployment of unmanned
aerial vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pages 6753–6760. IEEE, 2018.
[4] Jackeline Rios-Torres and Andreas A Malikopoulos. A survey on the coordination of connected
and automated vehicles at intersections and merging at highway on-ramps. IEEE Transactions
on Intelligent Transportation Systems, 18(5):1066–1077, 2016.
[5] Kyoung-Dae Kim and Panganamala Ramana Kumar. An mpc-based approach to provable
system-wide safety and liveness of autonomous ground traffic. IEEE Transactions on Auto-
matic Control, 59(12):3341–3356, 2014.
[6] Samir Kouro, Patricio Cortés, René Vargas, Ulrich Ammann, and José Rodríguez. Model pre-
dictive control—a simple and powerful method to control power converters. IEEE Transactions
on industrial electronics, 56(6):1826–1838, 2008.
[7] Edgar Perea-Lopez, B Erik Ydstie, and Ignacio E Grossmann. A model predictive control
strategy for supply chain optimization. Computers & Chemical Engineering, 27(8-9):1201–
1218, 2003.
[8] Wenlin Wang, Daniel E Rivera, and Karl G Kempf. Model predictive control strategies for sup-
ply chain management in semiconductor manufacturing. International Journal of Production
Economics, 107(1):56–77, 2007.
[9] Moritz Diehl, Rishi Amrit, and James B Rawlings. A lyapunov function for economic opti-
mizing model predictive control. IEEE Transactions on Automatic Control, 56(3):703–707,
2010.
[10] Matthias A Müller and Frank Allgöwer. Economic and distributed model predictive control:
Recent developments in optimization-based control. SICE Journal of Control, Measurement,
and System Integration, 10(2):39–52, 2017.
[11] Matthew Ellis, Helen Durand, and Panagiotis D Christofides. A tutorial review of economic
model predictive control methods. Journal of Process Control, 24(8):1156–1178, 2014.
[12] Antonio Ferramosca, James B Rawlings, Daniel Limón, and Eduardo F Camacho. Economic
mpc for a changing economic criterion. In 49th IEEE Conference on Decision and Control
(CDC), pages 6131–6136. IEEE, 2010.
[13] Matthew Ellis and Panagiotis D Christofides. Economic model predictive control with time-
varying objective function for nonlinear process systems. AIChE Journal, 60(2):507–519,
2014.
[14] David Angeli, Alessandro Casavola, and Francesco Tedesco. Theoretical advances on eco-
nomic model predictive control with time-varying costs. Annual Reviews in Control, 41:218–
224, 2016.
[15] Rishi Amrit, James B Rawlings, and David Angeli. Economic optimization using model pre-
dictive control with a terminal cost. Annual Reviews in Control, 35(2):178–186, 2011.
[16] Lars Grüne. Economic receding horizon control without terminal constraints. Automatica,
49(3):725–734, 2013.
[17] David Angeli, Rishi Amrit, and James B Rawlings. On average performance and stability of
economic model predictive control. IEEE transactions on automatic control, 57(7):1615–1626,
2012.
[18] Lars Grüne and Marleen Stieler. Asymptotic stability and transient optimality of economic
mpc without terminal conditions. Journal of Process Control, 24(8):1187–1196, 2014.
[19] Lars Grüne and Anastasia Panin. On non-averaged performance of economic mpc with ter-
minal conditions. In 2015 54th IEEE Conference on Decision and Control (CDC), pages
4332–4337. IEEE, 2015.
[20] Antonio Ferramosca, Daniel Limon, and Eduardo F Camacho. Economic mpc for a changing
economic criterion for linear systems. IEEE Transactions on Automatic Control, 59(10):2657–
2667, 2014.
[21] Lars Grüne and Simon Pirkelmann. Closed-loop performance analysis for economic model
predictive control of time-varying systems. In 2017 IEEE 56th Annual Conference on Decision
and Control (CDC), pages 5563–5569. IEEE, 2017.
[22] Lars Grüne and Simon Pirkelmann. Economic model predictive control for time-varying sys-
tem: Performance and stability results. Optimal Control Applications and Methods, 2018.
[23] Melanie Nicole Zeilinger, Colin Neil Jones, and Manfred Morari. Real-time suboptimal model
predictive control using a combination of explicit mpc and online optimization. IEEE Trans-
actions on Automatic Control, 56(7):1524–1534, 2011.
[24] Yang Wang and Stephen Boyd. Fast model predictive control using online optimization. IEEE
Transactions on Control Systems Technology, 18(2):267–278, 2010.
[25] Knut Graichen and Andreas Kugi. Stability and incremental improvement of suboptimal mpc
without terminal constraints. IEEE Transactions on Automatic Control, 55(11):2576–2580,
2010.
[26] Douglas A Allan, Cuyler N Bates, Michael J Risbeck, and James B Rawlings. On the inherent
robustness of optimal and suboptimal nonlinear mpc. Systems & Control Letters, 106:68–78,
2017.
[27] E. Hazan. Introduction to Online Convex Optimization. Foundations and Trends(r) in Opti-
mization Series. Now Publishers, 2016.
[28] S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and
Trends(r) in Machine Learning. Now Publishers, 2012.
[29] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online opti-
mization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages
398–406, 2015.
[30] Minghong Lin, Adam Wierman, Lachlan LH Andrew, and Eno Thereska. Dynamic right-
sizing for power-proportional data centers. IEEE/ACM Transactions on Networking (TON),
21(5):1378–1391, 2013.
[31] Minghong Lin, Zhenhua Liu, Adam Wierman, and Lachlan LH Andrew. Online algorithms
for geographical load balancing. In Green Computing Conference (IGCC), 2012 International,
pages 1–10. IEEE, 2012.
[32] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In
Conference on Learning Theory, pages 993–1019, 2013.
[33] Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan LH Andrew.
Online convex optimization using predictions. In ACM SIGMETRICS Performance Evaluation
Review, volume 43, pages 191–204. ACM, 2015.
[34] Masoud Badiei, Na Li, and Adam Wierman. Online convex optimization with ramp constraints.
In Decision and Control (CDC), 2015 IEEE 54th Annual Conference on, pages 6730–6736.
IEEE, 2015.
[35] Niangjun Chen, Joshua Comden, Zhenhua Liu, Anshul Gandhi, and Adam Wierman. Using
predictions in online optimization: Looking forward with an eye on the past. In Proceedings
of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of
Computer Science, pages 193–206. ACM, 2016.
[36] Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching
costs: Fast algorithms and the fundamental limit. arXiv preprint arXiv:1801.07780, 2018.
[37] Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and lqr con-
trol. In The 22nd International Conference on Artificial Intelligence and Statistics, pages
2504–2513, 2019.
[38] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87.
Springer Science & Business Media, 2013.
[39] Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally con-
vergent first-order method for minimizing strongly convex functions. IEEE Control Systems
Letters, 2(1):49–54, 2017.
[40] David Luenberger. Canonical forms for linear multivariable systems. IEEE Transactions on
Automatic Control, 12(3):290–293, 1967.
[41] Yasin Abbasi-Yadkori, Peter Bartlett, and Varun Kanade. Tracking adversarial targets. In
International Conference on Machine Learning, pages 369–377, 2014.
[42] Alon Cohen, Avinatan Hasidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal
Talwar. Online linear quadratic control. In International Conference on Machine Learning,
pages 1028–1037, 2018.
[43] Lian Lu, Jinlong Tu, Chi-Kin Chau, Minghua Chen, and Xiaojun Lin. Online energy genera-
tion scheduling for microgrids with intermittent energy sources and co-generation, volume 41.
ACM, 2013.
[44] Allan Borodin, Nathan Linial, and Michael E Saks. An optimal on-line algorithm for metrical
task system. Journal of the ACM (JACM), 39(4):745–763, 1992.
[45] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimiza-
tion in dynamic environments: Improved regret rates for strongly convex problems. In 2016
IEEE 55th Conference on Decision and Control (CDC), pages 7195–7201. IEEE, 2016.
[46] Lachlan Andrew, Siddharth Barman, Katrina Ligett, Minghong Lin, Adam Meyerson, Alan
Roytman, and Adam Wierman. A tale of two metrics: Simultaneous bounds on competitive-
ness and regret. In Conference on Learning Theory, pages 741–763, 2013.
[47] Joao P Hespanha. Linear systems theory. Princeton university press, 2018.
[48] JB Rawlings and DQ Mayne. Postface to model predictive control: Theory and design. Nob
Hill Pub, pages 155–158, 2012.
[49] David Angeli, Rishi Amrit, and James B Rawlings. Receding horizon cost optimization for
overly constrained nonlinear plants. In Proceedings of the 48h IEEE Conference on Decision
and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 7972–7977.
IEEE, 2009.
[50] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons, 2014.
[51] Tobias Damm, Lars Grüne, Marleen Stieler, and Karl Worthmann. An exponential turnpike
theorem for dissipative discrete time optimal control problems. SIAM Journal on Control and
Optimization, 52(3):1935–1957, 2014.
[52] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. 2011.
[53] Gregor Klancar, Drago Matko, and Saso Blazic. Mobile robot control on a reference path.
In Proceedings of the 2005 IEEE International Symposium on, Mediterrean Conference on
Control and Automation Intelligent Control, 2005., pages 1343–1348. IEEE, 2005.
[54] Pololu Corporation. Pololu m3pi User’s Guide. Available at
https://www.pololu.com/docs/pdf/0J48/m3pi.pdf.
[55] Paul Concus, Gene H Golub, and Gérard Meurant. Block preconditioning for the conjugate
gradient method. SIAM Journal on Scientific and Statistical Computing, 6(1):220–252, 1985.
[56] Frank L Lewis, Draguna Vrabie, and Vassilis L Syrmos. Optimal control. John Wiley & Sons,
2012.
Appendices
In Appendix A, we discuss the canonical-form transformation. In Appendix B, we introduce Triple Momentum [39] and prove Theorem 1. In Appendix C, we prove Lemma 1. In Appendix D, we prove Theorem 2. In Appendix E, we provide the regret analysis for LQT. In Appendix F, we prove Theorem 3. In Appendix G, we provide technical proofs for LQT. Finally, we provide a more detailed description of the simulations.
A Canonical form
In this section, we introduce the linear transformation from a general LTI system to a canonical-form LTI system, and then discuss how to convert a general online optimal control problem into an online optimal control problem with a canonical-form system.
Firstly, consider a general LTI system $x_{t+1} = A x_t + B u_t$ and two invertible matrices $S_x \in \mathbb{R}^{n\times n}$, $S_u \in \mathbb{R}^{m\times m}$. Under the linear transformation of state and control $\hat x_t = S_x x_t$, $\hat u_t = S_u u_t$, the equivalent LTI system in the new state $\hat x_t$ and new control $\hat u_t$ is
$$\hat x_{t+1} = S_x A S_x^{-1}\,\hat x_t + S_x B S_u^{-1}\,\hat u_t.$$
By Theorem 1 in [40], for any controllable $(A, B)$ there exist $S_x, S_u$ such that $\hat A = S_x A S_x^{-1}$ and $\hat B = S_x B S_u^{-1}$ are in the canonical form defined in Definition 1. A method for computing $S_x, S_u$ is also provided in [40].
In an online optimal control problem, since $A, B$ are known a priori, $S_x, S_u$ can be computed offline. When the stage cost functions $f_t(x_t), g_t(u_t)$ are received online, the new cost functions $\hat f_t(\hat x_t), \hat g_t(\hat u_t)$ for the canonical-form system can be computed online by applying $S_x, S_u$:
$$\hat f_t(\hat x_t) = f_t(x_t) = f_t(S_x^{-1}\hat x_t), \qquad \hat g_t(\hat u_t) = g_t(u_t) = g_t(S_u^{-1}\hat u_t).$$
Therefore, it is without loss of generality to only consider online optimal control with canonical-form systems.
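A small sketch (not from the paper) of how the offline transformation and the online cost re-wrapping described above can be organized:

```python
import numpy as np

def transform_system_and_costs(A, B, Sx, Su, f_t, g_t):
    """Given invertible Sx, Su, return the transformed (A_hat, B_hat) and cost
    functions acting on the transformed state and control."""
    Sx_inv, Su_inv = np.linalg.inv(Sx), np.linalg.inv(Su)
    A_hat = Sx @ A @ Sx_inv
    B_hat = Sx @ B @ Su_inv
    f_hat = lambda x_hat: f_t(Sx_inv @ x_hat)   # f_hat(x_hat) = f(Sx^{-1} x_hat)
    g_hat = lambda u_hat: g_t(Su_inv @ u_hat)   # g_hat(u_hat) = g(Su^{-1} u_hat)
    return A_hat, B_hat, f_hat, g_hat
```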
B Triple Momentum and proof of Theorem 1
Triple Momentum (TM) is an accelerated variant of gradient descent proposed in [39]. When solving an unconstrained problem $\min_{\mathbf{z}} C(\mathbf{z})$, at each iteration $j \ge 0$ TM conducts
$$\boldsymbol{\omega}(j+1) = (1 + \gamma_w)\boldsymbol{\omega}(j) - \gamma_w\boldsymbol{\omega}(j-1) - \gamma_c\nabla C(\mathbf{y}(j))$$
$$\mathbf{y}(j+1) = (1 + \gamma_y)\boldsymbol{\omega}(j+1) - \gamma_y\boldsymbol{\omega}(j)$$
$$\mathbf{z}(j+1) = (1 + \gamma_z)\boldsymbol{\omega}(j+1) - \gamma_z\boldsymbol{\omega}(j)$$
where $\boldsymbol{\omega}(j), \mathbf{y}(j)$ are auxiliary variables used to accelerate convergence, $\mathbf{z}(j)$ is the decision variable, and $\boldsymbol{\omega}(0) = \boldsymbol{\omega}(-1) = \mathbf{z}(0) = \mathbf{y}(0)$ are given initial values.
Suppose $\mathbf{z} = (z_1^\top, \ldots, z_N^\top)^\top$. Zooming in on each coordinate $z_t$, the TM update of $z_t(j)$ is
$$\omega_t(j+1) = (1 + \gamma_w)\omega_t(j) - \gamma_w\omega_t(j-1) - \gamma_c\frac{\partial C}{\partial y_t}(\mathbf{y}(j))$$
$$y_t(j+1) = (1 + \gamma_y)\omega_t(j+1) - \gamma_y\omega_t(j)$$
$$z_t(j+1) = (1 + \gamma_z)\omega_t(j+1) - \gamma_z\omega_t(j)$$
By Section 3, $\frac{\partial C}{\partial y_t}(\mathbf{y}(j))$ only depends on stage cost functions and stage variables across a finite number of neighboring stages, allowing an online implementation based on the finite look-ahead window.
TM enjoys a faster convergence rate than gradient descent for $\mu_c$-strongly convex and $l_c$-smooth functions under proper step sizes. In particular, when $\gamma_c = \frac{1+\phi}{l_c}$, $\gamma_w = \frac{\phi^2}{2-\phi}$, $\gamma_y = \frac{\phi^2}{(1+\phi)(2-\phi)}$, $\gamma_z = \frac{\phi^2}{1-\phi^2}$, and $\phi = 1 - 1/\sqrt\zeta$ with $\zeta = l_c/\mu_c$, by [39] the convergence rate satisfies
$$C(\mathbf{z}(j)) - C(\mathbf{z}^*) \le \zeta^2\Big(\frac{\sqrt\zeta - 1}{\sqrt\zeta}\Big)^{2j}\big(C(\mathbf{z}(0)) - C(\mathbf{z}^*)\big). \qquad (12)$$
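A numerical sketch of the TM iteration and step sizes above, applied to a strongly convex quadratic (all data illustrative); after enough iterations $\mathbf{z}(j)$ should coincide with the minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
H = M @ M.T + np.eye(5)                     # H > 0: strongly convex, smooth quadratic
b = rng.standard_normal(5)
grad = lambda z: H @ z - b                  # C(z) = 0.5 z'Hz - b'z

eigs = np.linalg.eigvalsh(H)
mu_c, l_c = eigs[0], eigs[-1]
zeta = l_c / mu_c
phi = 1.0 - 1.0 / np.sqrt(zeta)
gc, gw = (1 + phi) / l_c, phi**2 / (2 - phi)
gy, gz = phi**2 / ((1 + phi) * (2 - phi)), phi**2 / (1 - phi**2)

z = y = w = w_prev = np.zeros(5)            # omega(0) = omega(-1) = z(0) = y(0)
for _ in range(200):
    w_next = (1 + gw) * w - gw * w_prev - gc * grad(y)
    y = (1 + gy) * w_next - gy * w
    z = (1 + gz) * w_next - gz * w
    w_prev, w = w, w_next

print(np.linalg.norm(z - np.linalg.solve(H, b)))   # should be tiny
```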
In the following, we will apply the convergence rate to the proof of Theorem 1.
B.1 Proof of Theorem 1
By comparing TM with RHTM, it can be verified that $z_{t+1}(K)$ computed by RHTM is the same as $z_{t+1}(K)$ computed by Triple Momentum after $K$ iterations. Moreover, by the equivalence between the optimization $\min_{\mathbf{z}} C(\mathbf{z})$ and the optimal control $J(\mathbf{x},\mathbf{u})$ in Lemma 1, we have $J(RHTM) = C(\mathbf{z}(K))$, $J(\varphi) = C(\mathbf{z}(0))$, and $J^* = C(\mathbf{z}^*)$, which concludes the proof.
C Proof of Lemma 1
Properties ii) and iii) can be directly verified from the definitions. Thus, it suffices to prove i): the strong convexity and smoothness of $C(\mathbf{z})$.
Notice that $x_t, u_t$ are linear in $\mathbf{z}$ by (6), (7). For ease of reference, we define matrices $M_{x_t}, M_{u_t}$ to represent the relation between $x_t, u_t$ and $\mathbf{z}$, i.e., $x_t = M_{x_t}\mathbf{z}$ and $u_t = M_{u_t}\mathbf{z}$. Similarly, we write $\tilde f_t(z_{t-p+1}, \ldots, z_t)$ and $\tilde g_t(z_{t-p+1}, \ldots, z_{t+1})$ in terms of $\mathbf{z}$ for simplicity of notation:
$$\tilde f_t(z_{t-p+1}, \ldots, z_t) = \tilde f_t(\mathbf{z}) = f_t(M_{x_t}\mathbf{z}), \qquad \tilde g_t(z_{t-p+1}, \ldots, z_{t+1}) = \tilde g_t(\mathbf{z}) = g_t(M_{u_t}\mathbf{z}).$$
A direct consequence of these linear relations is that $\tilde f_t(\mathbf{z})$ and $\tilde g_t(\mathbf{z})$ are convex in $\mathbf{z}$, because $f_t(x_t), g_t(u_t)$ are convex and linear transformations preserve convexity.
In the following, we focus on the proof of strong convexity and smoothness. For simplicity, we only consider cost functions $f_t, g_t$ with minimum value zero: $f_t(\theta_t) = 0$ and $g_t(\xi_t) = 0$ for all $t$. This is without loss of generality because, by strong convexity and smoothness, $f_t, g_t$ attain their minimum values, and by subtracting these minimum values we can let $f_t, g_t$ have minimum value 0.
Strong convexity. Since $\tilde g_t$ is convex, we only need to prove that $\sum_t \tilde f_t(\mathbf{z})$ is strongly convex; then $C(\mathbf{z})$ is strongly convex because the sum of convex functions and a strongly convex function is strongly convex.
In particular, by the strong convexity of $f_t(x_t)$, we have the following for any $\mathbf{z}, \mathbf{z}' \in \mathbb{R}^{Nm}$ and $x_t = M_{x_t}\mathbf{z}$, $x'_t = M_{x_t}\mathbf{z}'$:
\begin{align*}
&\tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle \nabla\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&= \tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle (M_{x_t})^\top\nabla f_t(x_t), \mathbf{z}'-\mathbf{z}\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&= \tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle \nabla f_t(x_t), M_{x_t}(\mathbf{z}'-\mathbf{z})\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&= \tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle \nabla f_t(x_t), x'_t - x_t\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&\ge f_t(x'_t) - f_t(x_t) - \langle \nabla f_t(x_t), x'_t - x_t\rangle - \frac{\mu_f}{2}\|x'_t - x_t\|^2 \ge 0,
\end{align*}
where the first equality is by the chain rule, the second equality is by the definition of the inner product, the third equality is by the definitions of $x_t, x'_t$, the first inequality is by $\tilde f_t(\mathbf{z}) = f_t(x_t)$ and $z_t = (x_t^{k_1}, \ldots, x_t^{k_m})^\top$, and the last inequality is by $f_t(x_t)$ being $\mu_f$-strongly convex.
Summing over $t$ on both sides of the inequality yields the strong convexity of $\sum_t \tilde f_t(\mathbf{z})$:
$$\sum_{t=1}^{N}\Big[\tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle\nabla\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\Big] = \sum_{t=1}^{N}\tilde f_t(\mathbf{z}') - \sum_{t=1}^{N}\tilde f_t(\mathbf{z}) - \Big\langle\nabla\sum_{t=1}^{N}\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\Big\rangle - \frac{\mu_f}{2}\|\mathbf{z}'-\mathbf{z}\|^2 \ge 0.$$
Consequently, $C(\mathbf{z})$ is $\mu_c$-strongly convex with parameter at least $\mu_f$, by the convexity of $\tilde g_t$.
Smoothness. We prove the smoothness by considering $\tilde f_t(\mathbf{z})$ and $\tilde g_t(\mathbf{z})$ separately.
Firstly, consider $\tilde f_t(\mathbf{z})$. Similar to the proof of strong convexity, we use the smoothness of $f_t(x_t)$. For any $\mathbf{z}, \mathbf{z}'$ and $x_t = M_{x_t}\mathbf{z}$, $x'_t = M_{x_t}\mathbf{z}'$, we can show that
$$\tilde f_t(\mathbf{z}') = f_t(x'_t) \le f_t(x_t) + \langle\nabla f_t(x_t), x'_t - x_t\rangle + \frac{l_f}{2}\|x'_t - x_t\|^2 \le \tilde f_t(\mathbf{z}) + \langle\nabla\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle + \frac{l_f}{2}\big(\|z'_{t-p+1} - z_{t-p+1}\|^2 + \cdots + \|z'_t - z_t\|^2\big),$$
where the last inequality is by $x_t = M_{x_t}\mathbf{z}$, the chain rule, and (6).
Secondly, we consider $\tilde g_t(\mathbf{z})$ in a similar way. For any $\mathbf{z}, \mathbf{z}'$ and $u_t = M_{u_t}\mathbf{z}$, $u'_t = M_{u_t}\mathbf{z}'$, we have
\begin{align*}
\tilde g_t(\mathbf{z}') = g_t(u'_t) &\le g_t(u_t) + \langle\nabla g_t(u_t), u'_t - u_t\rangle + \frac{l_g}{2}\|u'_t - u_t\|^2 &&\text{(since $g_t$ is $l_g$-smooth)}\\
&= \tilde g_t(\mathbf{z}) + \langle (M_{u_t})^\top\nabla g_t(u_t), \mathbf{z}'-\mathbf{z}\rangle + \frac{l_g}{2}\|u'_t - u_t\|^2 &&\text{(by $\tilde g_t$'s definition)}\\
&= \tilde g_t(\mathbf{z}) + \langle\nabla\tilde g_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle + \frac{l_g}{2}\|u'_t - u_t\|^2. &&\text{(by $\tilde g_t$'s derivative)}
\end{align*}
Since $u_t = z_{t+1} - A(I,:)x_t = (I, -A(I,:))(z_{t+1}^\top, x_t^\top)^\top$, we have
\begin{align*}
\frac{l_g}{2}\|u'_t - u_t\|^2 &\le \frac{l_g}{2}\big\|(I, -A(I,:))\big[(z'^\top_{t+1}, x'^\top_t)^\top - (z_{t+1}^\top, x_t^\top)^\top\big]\big\|^2\\
&\le \frac{l_g}{2}\|(I, -A(I,:))\|^2\big(\|z'_{t+1} - z_{t+1}\|^2 + \|x'_t - x_t\|^2\big)\\
&\le \frac{l_g}{2}\|(I, -A(I,:))\|^2\big(\|z'_{t+1} - z_{t+1}\|^2 + \cdots + \|z'_{t-p+1} - z_{t-p+1}\|^2\big).
\end{align*}
Finally, summing over $t$, we have
$$C(\mathbf{z}') \le C(\mathbf{z}) + \langle\nabla C(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle + \frac{p\,l_f + (p+1)\,l_g\,\kappa}{2}\|\mathbf{z}'-\mathbf{z}\|^2,$$
where $\kappa = \|(I, -A(I,:))\|_2^2$. Thus we have proved the smoothness of $C(\mathbf{z})$.
D Proof of Theorem 2
To prove the bound, we consider the sum of the optimal steady-state costs, $\sum_{t=0}^{N-1}\lambda^e_t$, as a middle ground and bound $J(\varphi) - \sum_{t=0}^{N-1}\lambda^e_t$ and $\sum_{t=0}^{N-1}\lambda^e_t - J^*$ in Lemma 2 and Lemma 3, respectively. The regret bound is then obtained by combining the two bounds.
Lemma 2 (Bound on $J(\varphi) - \sum_{t=0}^{N-1}\lambda^e_t$). Let the initialization $\varphi$ be the follow-the-optimal-steady-state method, and let $x_t(0)$ denote the state determined by the initialization. For any initial state $x_0$,
$$J(\varphi) - \sum_{t=0}^{N-1}\lambda^e_t \le c_1\sum_{t=0}^{N-1}\|x^e_t - x^e_{t-1}\| + f_N(x_N(0)) = O\Big(\sum_{t=0}^{N}\|x^e_t - x^e_{t-1}\|\Big)$$
where $x^e_N := \theta_N$, $x^e_{-1} = x_0 = 0$ for simplicity of notation, and $c_1$ is a constant that does not depend on $N$.
Lemma 3 (Bound on $\sum_{t=0}^{N-1}\lambda^e_t - J^*$). Let $h^e_t(x)$ denote the solution to the average-cost Bellman equation under cost $f_t(x) + g_t(u)$, and let $x^*_t$ denote the optimal state trajectory. Then
$$\sum_{t=0}^{N-1}\lambda^e_t - J^* \le \sum_{t=1}^{N}\big(h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big) - h^e_0(x_0) = \sum_{t=0}^{N}\big(h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)$$
where $h^e_N(x) := f_N(x)$, $h^e_{-1}(x) := 0$, and $x^*_0 = x_0$ for simplicity of notation.
Then, we can complete the proof by Lemmas 2 and 3:
$$J(\varphi) - J^* = J(\varphi) - \sum_{t=0}^{N-1}\lambda^e_t + \sum_{t=0}^{N-1}\lambda^e_t - J^* = O\Big(\sum_{t=0}^{N}\big(\|x^e_{t-1} - x^e_t\| + h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)\Big).$$
In the following, we prove Lemmas 2 and 3 respectively. For simplicity, we only consider cost functions $f_t, g_t$ with minimum value zero: $f_t(\theta_t) = 0$ and $g_t(\xi_t) = 0$ for all $t$. This is without loss of generality because, by strong convexity and smoothness, $f_t, g_t$ attain their minimum values, and by subtracting these minimum values we can let $f_t, g_t$ have minimum value 0.
D.1 Proof of Lemma 2
The proof relies on the convexity of the cost functions and on uniform upper bounds for $x_t(0), u_t(0)$, which follow from the uniform upper bounds on $\theta_t, \xi_t$ in Assumption 3.
Notice that $J(\varphi) = \sum_{t=0}^{N-1}\big(f_t(x_t(0)) + g_t(u_t(0))\big) + f_N(x_N(0))$ and $\sum_{t=0}^{N-1}\lambda^e_t = \sum_{t=0}^{N-1}\big(f_t(x^e_t) + g_t(u^e_t)\big)$. It suffices to bound $f_t(x_t(0)) - f_t(x^e_t)$ and $g_t(u_t(0)) - g_t(u^e_t)$ for $0 \le t \le N-1$. We first focus on $f_t(x_t(0)) - f_t(x^e_t)$ and then bound $g_t(u_t(0)) - g_t(u^e_t)$ in the same way.
For $0 \le t \le N-1$, by convexity of $f_t$ and the property of the $L_2$ norm,
$$f_t(x_t(0)) - f_t(x^e_t) \le \langle\nabla f_t(x_t(0)), x_t(0) - x^e_t\rangle \le \|\nabla f_t(x_t(0))\|\,\|x_t(0) - x^e_t\|. \qquad (13)$$
In the following, we bound $\|\nabla f_t(x_t(0))\|$ and $\|x_t(0) - x^e_t\|$.
Firstly, we provide a bound for $\|\nabla f_t(x_t(0))\|$:
$$\|\nabla f_t(x_t(0))\| = \|\nabla f_t(x_t(0)) - \nabla f_t(\theta_t)\| \le l_f\|x_t(0) - \theta_t\| \le l_f(\sqrt{n}\,\bar x^e + \bar\theta), \qquad (14)$$
where the equality holds because $\theta_t$ is the global minimizer of $f_t$, the first inequality is by Lipschitz smoothness, and the second inequality is by $\|\theta_t\| \le \bar\theta$ (Assumption 3) and the following lemma, which provides a uniform bound on $x_t(0)$. The proof of the lemma is technical and is deferred to the end of this section.
Lemma 4 (Uniform upper bounds on $x^e_t, u^e_t, x_t(0), u_t(0)$). There exist $\bar x^e$ and $\bar u^e$, independent of $N, W$, such that $\|x^e_t\|_2 \le \bar x^e$ and $\|u^e_t\|_2 \le \bar u^e$ for all $0 \le t \le N-1$. Moreover, $\|x_t(0)\|_2 \le \sqrt{n}\,\bar x^e$ for $0 \le t \le N$ and $\|u_t(0)\|_2 \le \sqrt{n}\,\bar u^e$ for $0 \le t \le N-1$, where $x_t(0), u_t(0)$ denote the state and control at time $t$ determined by the initialization and we take $x_0 = 0$ for simplicity.
Secondly, we provide a bound for $\|x_t(0) - x^e_t\|$. The proof relies on a characterization of the steady state and the initialized state based on the canonical form.
Lemma 5 (Steady state and initialized state of canonical-form systems). Consider a canonical-form system $x_{t+1} = A x_t + B u_t$.
(a) Any steady state $(x, u)$ is of the form
$$x = \big(\underbrace{z^1, \ldots, z^1}_{p_1},\ \underbrace{z^2, \ldots, z^2}_{p_2},\ \ldots,\ \underbrace{z^m, \ldots, z^m}_{p_m}\big)^\top, \qquad u = (z^1, \ldots, z^m)^\top - A(I,:)\,x.$$
Let $z = (z^1, \ldots, z^m)^\top$. For the optimal steady state with respect to cost $f_t + g_t$, we denote the corresponding $z$ as $z^e_t$, so the optimal steady state can be represented as $x^e_t = (z^{e,1}_t, \ldots, z^{e,1}_t,\ z^{e,2}_t, \ldots, z^{e,2}_t,\ \ldots,\ z^{e,m}_t, \ldots, z^{e,m}_t)^\top$ and $u^e_t = z^e_t - A(I,:)\,x^e_t$ for $0 \le t \le N-1$.
(b) Under the follow-the-optimal-steady-state initialization, $x_t(0)$ and $u_t(0)$ satisfy
$$x_t(0) = \big(\underbrace{z^{e,1}_{t-p_1}, \ldots, z^{e,1}_{t-1}}_{p_1},\ \underbrace{z^{e,2}_{t-p_2}, \ldots, z^{e,2}_{t-1}}_{p_2},\ \ldots,\ \underbrace{z^{e,m}_{t-p_m}, \ldots, z^{e,m}_{t-1}}_{p_m}\big)^\top, \quad 0 \le t \le N,$$
$$u_t(0) = z^e_t - A(I,:)\,x_t(0), \quad 0 \le t \le N-1,$$
where $z^e_t = 0$ for $t \le -1$.
Proof. (a) This follows from the definition of the canonical form and the definition of a steady state.
(b) By the initialization, $z_t(0) = x^{e,I}_{t-1} = z^e_{t-1}$. By the relation between $z_t(0)$ and $x_t(0), u_t(0)$, we have $x^I_t(0) = z_t(0) = z^e_{t-1}$, $x^{I-1}_t(0) = z_{t-1}(0) = z^e_{t-2}$, and so on. This proves the structure of $x_t(0)$. The structure of $u_t(0)$ follows because $u_t(0) = z_{t+1}(0) - A(I,:)x_t(0) = z^e_t - A(I,:)x_t(0)$.
By Lemma 5, we can bound $\|x_t(0) - x^e_t\|$ for $0 \le t \le N-1$ by
$$\|x_t(0) - x^e_t\| \le \sqrt{\|z^e_{t-1} - z^e_t\|^2 + \cdots + \|z^e_{t-p} - z^e_t\|^2} \le \sqrt{\|x^e_{t-1} - x^e_t\|^2 + \cdots + \|x^e_{t-p} - x^e_t\|^2}$$
$$\le \|x^e_{t-1} - x^e_t\| + \cdots + \|x^e_{t-p} - x^e_t\| \le p\,\big(\|x^e_{t-1} - x^e_t\| + \cdots + \|x^e_{t-p} - x^e_{t-p+1}\|\big). \qquad (15)$$
Combining (13), (14), and (15) yields
$$\sum_{t=0}^{N-1}\big(f_t(x_t(0)) - f_t(x^e_t)\big) \le \sum_{t=0}^{N-1}\|\nabla f_t(x_t(0))\|\,\|x_t(0) - x^e_t\| \le \sum_{t=0}^{N-1} l_f(\sqrt{n}\,\bar x^e + \bar\theta)\,p\,\big(\|x^e_{t-1} - x^e_t\| + \cdots + \|x^e_{t-p} - x^e_{t-p+1}\|\big) \le p^2 l_f(\sqrt{n}\,\bar x^e + \bar\theta)\sum_{t=0}^{N-1}\|x^e_{t-1} - x^e_t\|. \qquad (16)$$
Notice that the constant $p^2 l_f(\sqrt{n}\,\bar x^e + \bar\theta)$ does not depend on $N, W$.
Similarly, we can bound $g_t(u_t(0)) - g_t(u^e_t)$:
$$\sum_{t=0}^{N-1}\big(g_t(u_t(0)) - g_t(u^e_t)\big) \le \sum_{t=0}^{N-1}\|\nabla g_t(u_t(0))\|\,\|u_t(0) - u^e_t\| \le \sum_{t=0}^{N-1} l_g\|u_t(0) - \xi_t\|\,\|u_t(0) - u^e_t\| \le \sum_{t=0}^{N-1} l_g(\sqrt{n}\,\bar u^e + \bar\xi)\,\|A(I,:)x_t(0) - A(I,:)x^e_t\|$$
$$\le \sum_{t=0}^{N-1} l_g(\sqrt{n}\,\bar u^e + \bar\xi)\,\|A(I,:)\|\,\|x_t(0) - x^e_t\| \le p^2 l_g(\sqrt{n}\,\bar u^e + \bar\xi)\,\|A(I,:)\|\sum_{t=0}^{N-1}\|x^e_{t-1} - x^e_t\|, \qquad (17)$$
where the first inequality is by convexity, the second is because $\xi_t$ is the global minimizer of $g_t$ and $g_t$ is $l_g$-smooth, the third is by Assumption 3, Lemma 4, and Lemma 5, the fourth is by the matrix norm property, and the fifth is by (15). Notice that the constant $p^2 l_g(\sqrt{n}\,\bar u^e + \bar\xi)\,\|A(I,:)\|$ does not depend on $N, W$.
By (16) and (17), we obtain the first inequality in the statement of Lemma 2:
$$J(\varphi) - \sum_{t=0}^{N-1}\lambda^e_t \le c_1\sum_{t=0}^{N-1}\|x^e_{t-1} - x^e_t\| + f_N(x_N(0)),$$
where $c_1$ does not depend on $N$.
By defining $x^e_N = \theta_N$, we can bound $f_N(x_N(0))$ by $\|x_N(0) - x^e_N\|$ up to constants because $f_N(x_N(0)) = f_N(x_N(0)) - f_N(\theta_N) \le \frac{l_f}{2}(\sqrt{n}\,\bar x^e + \bar\theta)\,\|x_N(0) - x^e_N\|$. By the same argument as in (15), we have $\|x_N(0) - x^e_N\| = O\big(\sum_{t=0}^{N}\|x^e_{t-1} - x^e_t\|\big)$. Consequently, we have shown that
$$J(\varphi) - \sum_{t=0}^{N-1}\lambda^e_t = O\Big(\sum_{t=0}^{N}\|x^e_{t-1} - x^e_t\|\Big).$$
D.2 Proof of Lemma 3
The proof relies heavily on dynamic programming and the Bellman equation. For simplicity, we introduce a Bellman operator $\mathcal{B}(f+g, h)$: $\mathcal{B}(f+g, h)(x) = \min_u\big(f(x) + g(u) + h(Ax + Bu)\big)$. The Bellman equation can then be written as $\mathcal{B}(f+g, h^e)(x) = h^e(x) + \lambda^e$.
We define a sequence of auxiliary functions $S_k$: for $0 \le k \le N-1$, let $S_k(x) = h^e_k(x) + \sum_{t=k}^{N-1}\lambda^e_t$; for $k = N$, let $S_N(x) = f_N(x)$. For simplicity of notation, let $h^e_N(x) = f_N(x)$.
By the Bellman equation, we have $h^e_k(x) + \lambda^e_k = \mathcal{B}(f_k+g_k, h^e_k)(x)$ for $0 \le k \le N-1$. Let $\pi^e_k$ be the corresponding optimal control policy that solves the Bellman equation. We have the following recursive relation for $S_k$, by the Bellman equation, for $0 \le k \le N-1$:
$$S_k(x) = \mathcal{B}\big(f_k+g_k,\, S_{k+1} - h^e_{k+1} + h^e_k\big)(x) = f_k(x) + g_k(\pi^e_k(x)) + S_{k+1}(Ax + B\pi^e_k(x)) - h^e_{k+1}(Ax + B\pi^e_k(x)) + h^e_k(Ax + B\pi^e_k(x)).$$
Besides, let $V_k(x)$ denote the optimal cost-to-go function from $k$ to $N$, where $V_N(x) = f_N(x)$. Let $\pi^*_k$ denote the optimal control policy; by dynamic programming, for $0 \le k \le N-1$,
$$V_k(x) = \mathcal{B}(f_k+g_k, V_{k+1})(x) = f_k(x) + g_k(\pi^*_k(x)) + V_{k+1}(Ax + B\pi^*_k(x)).$$
Let $x^*_k$ denote the optimal trajectory, so that $x^*_{k+1} = Ax^*_k + B\pi^*_k(x^*_k)$. For any $k = 0, \ldots, N-1$,
\begin{align*}
S_k(x^*_k) - V_k(x^*_k) &= \mathcal{B}\big(f_k+g_k,\, S_{k+1} - h^e_{k+1} + h^e_k\big)(x^*_k) - \mathcal{B}(f_k+g_k, V_{k+1})(x^*_k)\\
&\le f_k(x^*_k) + g_k(\pi^*_k(x^*_k)) + S_{k+1}(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1}) + h^e_k(x^*_{k+1}) - \big(f_k(x^*_k) + g_k(\pi^*_k(x^*_k)) + V_{k+1}(x^*_{k+1})\big)\\
&= S_{k+1}(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1}) + h^e_k(x^*_{k+1}) - V_{k+1}(x^*_{k+1}),
\end{align*}
where the inequality holds because $\pi^*_k$ is not necessarily optimal for the Bellman operator $\mathcal{B}\big(f_k+g_k,\, S_{k+1} - h^e_{k+1} + h^e_k\big)(x^*_k)$.
Summing over $k = 0, \ldots, N-1$ on both sides yields
$$S_0(x_0) - V_0(x_0) \le \sum_{k=0}^{N-1}\big(h^e_k(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1})\big).$$
Since $S_0(x_0) = h^e_0(x_0) + \sum_{t=0}^{N-1}\lambda^e_t$ and $V_0(x_0) = J^*$, subtracting $h^e_0(x_0)$ from both sides gives
$$\sum_{t=0}^{N-1}\lambda^e_t - J^* \le \sum_{k=0}^{N-1}\big(h^e_k(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1})\big) - h^e_0(x_0).$$
For simplicity of notation, we define $h^e_{-1}(x_0) = 0$ and $x^*_0 = x_0$; then the bound can be written as
$$\sum_{t=0}^{N-1}\lambda^e_t - J^* \le \sum_{k=0}^{N}\big(h^e_{k-1}(x^*_k) - h^e_k(x^*_k)\big).$$
D.3 Proof of Lemma 4
The proof relies on the (strong) convexity and smoothness of the cost functions and on the uniform upper bounds on $\theta_t, \xi_t$.
First, suppose $\|x^e_t\|_2 \le \bar x^e$ for all $0 \le t \le N-1$; we bound $u^e_t, x_t(0), u_t(0)$ in terms of $\bar x^e$. Notice that the optimal steady state and the corresponding steady-state control satisfy $u^e_t = x^{e,I}_t - A(I,:)x^e_t$. Hence, if $\|x^e_t\| \le \bar x^e$ for all $t$, then $u^e_t$ can be bounded accordingly:
$$\|u^e_t\| \le \|x^{e,I}_t\|_2 + \|A(I,:)x^e_t\| \le \|x^e_t\|_2 + \|A(I,:)\|_2\|x^e_t\|_2 \le (1 + \|A(I,:)\|)\,\bar x^e =: \bar u^e.$$
Moreover, $x_t(0)$ can also be bounded in terms of $\bar x^e$ because, by Lemma 5, each entry of $x_t(0)$ equals some entry of $x^e_s$ for some $s \le t$. As a result, for $0 \le t \le N$,
$$\|x_t(0)\|_2 \le \sqrt{n}\,\|x_t(0)\|_\infty \le \sqrt{n}\,\max_{s\le t}\|x^e_s\|_\infty \le \sqrt{n}\,\max_{s\le t}\|x^e_s\|_2 \le \sqrt{n}\,\bar x^e.$$
We can bound $u_t(0)$ from the bound on $x_t(0)$, in the same way as for $u^e_t$, by noticing that $u_t(0) = x^I_{t+1}(0) - A(I,:)x_t(0)$ and
$$\|u_t(0)\| \le \|x^I_{t+1}(0)\|_2 + \|A(I,:)x_t(0)\| \le \|x_{t+1}(0)\|_2 + \|A(I,:)\|_2\|x_t(0)\|_2 \le (1 + \|A(I,:)\|)\sqrt{n}\,\bar x^e = \sqrt{n}\,\bar u^e.$$
Next, it suffices to prove $\|x^e_t\|_2 \le \bar x^e$ for all $t$, for some $\bar x^e$. To prove this bound, we construct another (suboptimal) steady state: $\hat x_t = (\theta^1_t, \ldots, \theta^1_t)^\top$ with $\hat u_t = \hat x^I_t - A(I,:)\hat x_t$. It can easily be verified that $(\hat x_t, \hat u_t)$ is indeed a steady state. Moreover, $\hat x_t$ and $\hat u_t$ can be bounded by arguments similar to those above:
$$\|\hat x_t\|_2 \le \sqrt{n}\,|\theta^1_t| \le \sqrt{n}\,\|\theta_t\|_\infty \le \sqrt{n}\,\|\theta_t\|_2 \le \sqrt{n}\,\bar\theta \quad\text{(by the definitions of $\hat x_t$ and $\bar\theta$)},$$
$$\|\hat u_t\|_2 \le (1 + \|A(I,:)\|)\|\hat x_t\|_2 \le (1 + \|A(I,:)\|)\sqrt{n}\,\bar\theta \quad\text{(by the same argument used to bound $u^e_t$)}.$$
By the strong convexity of $f_t$, the smoothness of $f_t, g_t$, and the fact that $\theta_t, \xi_t$ are the global minimizers of $f_t, g_t$ respectively, for $0 \le t \le N-1$ we have
\begin{align*}
\frac{\mu_f}{2}\|x^e_t - \theta_t\|^2 &\le f_t(x^e_t) - f_t(\theta_t) + g_t(u^e_t) - g_t(\xi_t) &&\text{(by strong convexity)}\\
&\le f_t(\hat x_t) - f_t(\theta_t) + g_t(\hat u_t) - g_t(\xi_t) &&\text{(since $(x^e_t, u^e_t)$ is the optimal steady state)}\\
&\le \frac{l_f}{2}\|\hat x_t - \theta_t\|^2 + \frac{l_g}{2}\|\hat u_t - \xi_t\|^2 &&\text{(by smoothness and $f_t(\theta_t) = g_t(\xi_t) = 0$)}\\
&\le l_f\big(\|\hat x_t\|^2 + \|\theta_t\|^2\big) + l_g\big(\|\hat u_t\|^2 + \|\xi_t\|^2\big) &&\text{(by the Cauchy-Schwarz inequality)}\\
&\le l_f\big(n\bar\theta^2 + \bar\theta^2\big) + l_g\Big(\big((1 + \|A(I,:)\|)\sqrt{n}\,\bar\theta\big)^2 + \bar\xi^2\Big) &&\text{(by the bounds on $\|\hat x_t\|_2, \|\hat u_t\|_2$ above)}\\
&=: c_7.
\end{align*}
As a result, $\|x^e_t - \theta_t\| \le \sqrt{2c_7/\mu_f}$. Then we can bound $x^e_t$ by $\|x^e_t\| \le \|\theta_t\| + \sqrt{2c_7/\mu_f} \le \bar\theta + \sqrt{2c_7/\mu_f} =: \bar x^e$ for all $t$. It can be verified that $\bar x^e$ does not depend on $N, W$.
E Linear quadratic tracking

In this section, we provide a regret bound for general LQT, based on which we prove Corollary 1, which considers the special case where $Q, R$ are not changing.

E.1 Regret bound for general online LQT

First, it can be shown that the solution to the Bellman equation associated with a linear quadratic tracking cost has an explicit form.

Lemma 6. One solution to the Bellman equation with stage cost $\frac12(x-\theta)^\top Q(x-\theta)+\frac12 u^\top Ru$ can be represented by
$$h^e(x)=\frac12(x-\beta^e)^\top P^e(x-\beta^e) \qquad (18)$$
where $P^e$ denotes the solution to the discrete-time algebraic Riccati equation (DARE) with respect to $Q, R, A, B$,
$$P^e=Q+A^\top\bigl(P^e-P^eB(B^\top P^eB+R)^{-1}B^\top P^e\bigr)A \qquad (19)$$
and $\beta^e=F\theta$, where $F$ is a matrix determined by $A, B, Q, R$.

For simplicity of notation, we let $P^e(Q,R)$ denote the solution to the DARE with $Q, R, A, B$, and $F(Q,R)$ denote the matrix in $\beta^e=F\theta$ associated with $Q, R$ (and $A, B$). We omit $A, B$ from the arguments because they do not change in this paper.
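As a quick numerical illustration of Lemma 6, the pair $(P^e,\beta^e)$ can be computed from a DARE solver together with the closed form for $F$ given later in Lemma 14. The following is a minimal sketch (assuming scipy is available and $(A,B)$ is stabilizable with $Q,R$ positive definite); it is not the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqt_bias_solution(A, B, Q, R, theta):
    """Sketch of Lemma 6 / Lemma 14: P^e from the DARE (19) and beta^e = F theta."""
    n = A.shape[0]
    P = solve_discrete_are(A, B, Q, R)                 # P^e solving (19)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # K^e = (R + B'P B)^{-1} B'P A
    # alpha^e solves alpha = Q theta + (A - B K)^T alpha, see (29)
    alpha = np.linalg.solve(np.eye(n) - (A - B @ K).T, Q @ theta)
    beta = np.linalg.solve(P, alpha)                   # beta^e = (P^e)^{-1} alpha^e = F theta
    return P, K, beta
```

The map $\theta\mapsto\beta^e$ returned by this sketch is linear, which is the property used repeatedly in the proofs below.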
By applying Theorem 2, a regret bound for the general LQT problem is provided below.

Corollary 2 (Bound for general LQT). Consider the LQT problem in Example 1. Suppose the terminal cost function satisfies $\underline P\preceq Q_N\preceq\bar P$, where $\bar P\succeq P^e(l_fI_n,l_gI_m)$ and $\underline P=P^e(\mu_fI_n,\mu_gI_m)$. (This additional condition is for technical simplicity and can be removed.) Then, the regret of RHTM with initialization FOSS can be bounded by
$$\mathrm{Regret}(RHTM)=O\Bigl(\Bigl(\frac{\sqrt\zeta-1}{\sqrt\zeta}\Bigr)^{2K}\Bigl(\sum_{t=1}^{N}\bigl(\|P^e_t-P^e_{t-1}\|+\|\beta^e_t-\beta^e_{t-1}\|\bigr)+\sum_{t=0}^{N}\|x^e_{t-1}-x^e_t\|\Bigr)\Bigr)$$
where $K=\lfloor (W-1)/p\rfloor$, $x^e_{-1}=x_0$, $x^e_N=\theta_N$, $\zeta$ is the condition number of the corresponding $C(z)$, $(x^e_t,u^e_t)$ is the optimal steady state under cost $Q_t, R_t, \theta_t$, $P^e_t=P^e(Q_t,R_t)$, and $\beta^e_t=F(Q_t,R_t)\theta_t$.

Proof. Before the proof, we introduce some notation and useful lemmas. First, we define the sets of $Q, R, P$ considered in this section:
$$\mathcal Q=\{Q\mid \mu_fI_n\preceq Q\preceq l_fI_n\},\qquad
\mathcal R=\{R\mid \mu_gI_m\preceq R\preceq l_gI_m\},\qquad
\mathcal P=\{P\mid \underline P\preceq P\preceq\bar P\}$$
Moreover, we define $\underline Q=\mu_fI_n$, $\bar Q=l_fI_n$, $\underline R=\mu_gI_m$, $\bar R=l_gI_m$.

Second, we introduce supporting lemmas on the bounds of $P^e_t$, $\beta^e_t$, $x^*_t$ respectively. The intuition for why they can be bounded is that $Q_t, R_t, \theta_t$ are all uniformly bounded by Assumptions 2 and 3. The proofs are technical and deferred to Appendix G.

Lemma 7 (Upper bound of $x^*_t$). For any $N$, any $0\le t\le N$, and any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, there exists $\bar x$ that does not depend on $N, W$ such that $\|x^*_t\|_2\le\bar x$.

Lemma 8 (Upper bound of $\beta^e$). For any $Q\in\mathcal Q$, $R\in\mathcal R$, and any $\theta$ with $\|\theta\|\le\bar\theta$, there exists $\bar\beta\ge\bar\theta$ that does not depend on $N$ and only depends on $A, B, l_f, \mu_f, l_g, \mu_g, \bar\theta$, such that $\|\beta^e\|\le\bar\beta$.

Lemma 9 (Upper bound of $P^e$). For any $Q\in\mathcal Q$, $R\in\mathcal R$, we have $P^e=P^e(Q,R)\in\mathcal P$. Consequently, $\|P^e\|_2\le\upsilon_{\max}(\bar P)$.

Next, we are ready for the proof.
By Theorem 2, we only need to bound $\sum_{t=0}^{N}\bigl(h^e_{t-1}(x^*_t)-h^e_t(x^*_t)\bigr)$. Let $P^e_N=Q_N$ and $\beta^e_N=\theta_N$; then we can write $h^e_t(x)=\frac12(x-\beta^e_t)^\top P^e_t(x-\beta^e_t)$ for $0\le t\le N$.

For $0\le t\le N-1$, we split $h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})$ into two parts:
$$h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})
=\underbrace{\frac12(x^*_{t+1}-\beta^e_t)^\top P^e_t(x^*_{t+1}-\beta^e_t)-\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_t(x^*_{t+1}-\beta^e_{t+1})}_{\text{Part 1}}$$
$$+\underbrace{\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_t(x^*_{t+1}-\beta^e_{t+1})-\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_{t+1}(x^*_{t+1}-\beta^e_{t+1})}_{\text{Part 2}}$$
Part 1 can be bounded as follows for $0\le t\le N-1$:
$$\text{Part 1}=\frac12\bigl(x^*_{t+1}-\beta^e_t+x^*_{t+1}-\beta^e_{t+1}\bigr)^\top P^e_t\bigl((x^*_{t+1}-\beta^e_t)-(x^*_{t+1}-\beta^e_{t+1})\bigr)$$
$$\le\frac12\|x^*_{t+1}-\beta^e_t+x^*_{t+1}-\beta^e_{t+1}\|_2\,\|P^e_t\|_2\,\|\beta^e_{t+1}-\beta^e_t\|_2 \quad\text{(by the definition of the $L_2$ norm)}$$
$$\le(\bar x+\bar\beta)\,\upsilon_{\max}(\bar P)\,\|\beta^e_{t+1}-\beta^e_t\|_2 \quad\text{(by Lemmas 7, 8 and 9)}$$
Part 2 can be bounded as follows for $0\le t\le N-1$:
$$\text{Part 2}=\frac12(x^*_{t+1}-\beta^e_{t+1})^\top(P^e_t-P^e_{t+1})(x^*_{t+1}-\beta^e_{t+1})
\le\frac12\|x^*_{t+1}-\beta^e_{t+1}\|_2^2\,\|P^e_t-P^e_{t+1}\|_2
\le\frac12(\bar x+\bar\beta)^2\|P^e_t-P^e_{t+1}\|_2$$
Therefore, we have
$$\sum_{t=0}^{N}\bigl(h^e_{t-1}(x^*_t)-h^e_t(x^*_t)\bigr)
\le\sum_{t=0}^{N-1}\bigl(h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})\bigr)
=O\Bigl(\sum_{t=0}^{N-1}\bigl(\|\beta^e_{t+1}-\beta^e_t\|_2+\|P^e_t-P^e_{t+1}\|_2\bigr)\Bigr) \qquad (20)$$
where the first inequality is by $h^e_0(x)\ge 0$ and $h^e_{-1}(x)=0$. Consequently, by applying Theorem 2, we obtain the regret bound of RHTM for LQ tracking problems:
$$J(RHTM)-J^*=O\Bigl(\Bigl(\frac{\sqrt\zeta-1}{\sqrt\zeta}\Bigr)^{2K}\Bigl(\sum_{t=1}^{N}\bigl(\|P^e_t-P^e_{t-1}\|+\|\beta^e_t-\beta^e_{t-1}\|\bigr)+\sum_{t=0}^{N}\|x^e_{t-1}-x^e_t\|\Bigr)\Bigr)$$
E.2 Proof of Corollary 1

Proof sketch: Consider the bound in Corollary 2. When $Q, R$ are not changing, $\|P^e_t-P^e_{t-1}\|=0$. Moreover, by (29), $\beta^e_t=F\theta_t$ for the same matrix $F$ for all $t$, so $\|\beta^e_t-\beta^e_{t-1}\|$ can be bounded by $\|\theta_t-\theta_{t-1}\|$. Finally, we can also show that $x^e_t=F_1F_2\theta_t$ for some matrices $F_1, F_2$ with the help of Lemma 5, leading to $\|x^e_t-x^e_{t-1}\|=O(\|\theta_t-\theta_{t-1}\|)$. Combining the discussions above, the regret bound can be proved.

Formal proof: Directly applying the results in Theorem 2 and Corollary 2 would introduce some extra constant terms, because some inequalities used to derive the bounds in Theorem 2 and Corollary 2 are not necessary when $Q, R$ are not changing. Therefore, we apply some intermediate results from the proofs of Theorem 2 and Corollary 2 to prove Corollary 1; the main idea is the same as in the proof sketch.

First, by the first inequalities of Lemma 2 and Lemma 3, we have
$$J(\phi)-J^*=J(\phi)-\sum_{t=0}^{N-1}\lambda^e_t+\sum_{t=0}^{N-1}\lambda^e_t-J^*
\le\underbrace{c_1\sum_{t=0}^{N-1}\|x^e_{t-1}-x^e_t\|}_{\text{Part I}}
+\underbrace{\sum_{t=0}^{N-1}\bigl(h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})\bigr)}_{\text{Part II}}
+\underbrace{f_N(x_N(0))-h^e_0(x_0)}_{\text{Part III}}$$
We bound each part by $\sum_t\|\theta_t-\theta_{t-1}\|$ in the following.
Part I: We bound Part I by $\sum_t\|\theta_t-\theta_{t-1}\|$ by showing that $x^e_t=F_1F_2\theta_t$ for some matrices $F_1, F_2$. The representation of $x^e_t$ relies on Lemma 5 (a numerical sanity check of this linear dependence appears after (23)).

By Lemma 5, we know that any steady state $(x,u)$ can be represented as a matrix multiplied by a vector $z$:
$$x=(\underbrace{z_1,\dots,z_1}_{p_1},\underbrace{z_2,\dots,z_2}_{p_2},\dots,\underbrace{z_m,\dots,z_m}_{p_m})^\top=:F_1z \qquad (21)$$
$$u=(z_1,\dots,z_m)^\top-A(I,:)x=(I_m-A(I,:)F_1)z$$
where $F_1\in\mathbb R^{n\times m}$ is a binary matrix with full column rank.

Consider the cost function $\frac12(x-\theta)^\top Q(x-\theta)+\frac12u^\top Ru$. By the steady-state representation above, the optimal steady state can be obtained by solving the following unconstrained optimization problem:
$$\min_z\ (F_1z-\theta)^\top Q(F_1z-\theta)+z^\top(I-A(I,:)F_1)^\top R(I-A(I,:)F_1)z$$
Since $F_1$ has full column rank, the objective is strongly convex and admits the unique solution
$$z^e=F_2\theta \qquad (22)$$
where $F_2=\bigl(F_1^\top QF_1+(I-A(I,:)F_1)^\top R(I-A(I,:)F_1)\bigr)^{-1}F_1^\top Q$. Accordingly, the optimal steady state can be represented as $x^e=F_1F_2\theta$ and $u^e=(I_m-A(I,:)F_1)F_2\theta$. Consequently,
$$\|x^e_t-x^e_{t-1}\|\le\|F_1F_2\|\,\|\theta_t-\theta_{t-1}\|$$
Now consider $t=0$. Since $x^e_{-1}=x_0=0$, by letting $\theta_{-1}=0$ we have $\|x^e_0-x^e_{-1}\|\le\|F_1F_2\|\,\|\theta_0-\theta_{-1}\|$. Combining the upper bounds above, we have
$$\text{Part I}=O\Bigl(\sum_{t=0}^{N-1}\|x^e_t-x^e_{t-1}\|\Bigr)=O\Bigl(\sum_{t=0}^{N-1}\|\theta_t-\theta_{t-1}\|\Bigr) \qquad (23)$$
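The claim that the optimal steady state depends linearly on $\theta$ (so that $\|x^e_t-x^e_{t-1}\|=O(\|\theta_t-\theta_{t-1}\|)$) can be sanity-checked numerically without constructing $F_1$ explicitly, by solving the steady-state problem $\min_{x,u}\ \frac12(x-\theta)^\top Q(x-\theta)+\frac12u^\top Ru$ subject to $x=Ax+Bu$ through its KKT system. The sketch below is illustrative only; it assumes a generic $(A,B)$ for which the KKT matrix is nonsingular and does not reproduce the paper's $F_1,F_2$ construction.

```python
import numpy as np

def optimal_steady_state(A, B, Q, R, theta):
    """Solve min 0.5(x-theta)'Q(x-theta) + 0.5 u'R u  s.t.  x = A x + B u via its KKT system."""
    n, m = B.shape
    # Variables (x, u, lam); lam are multipliers of the constraint (I - A) x - B u = 0.
    KKT = np.block([
        [Q,                np.zeros((n, m)), (np.eye(n) - A).T],
        [np.zeros((m, n)), R,                -B.T             ],
        [np.eye(n) - A,    -B,               np.zeros((n, n)) ],
    ])
    rhs = np.concatenate([Q @ theta, np.zeros(m), np.zeros(n)])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:n + m]   # (x^e, u^e); both are linear functions of theta
```

Because the right-hand side of the linear solve is linear in $\theta$, the returned $x^e$ is a fixed linear map of $\theta$, which is exactly what Part I uses.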
Part II: By (20) in the proof of Corollary 2, we have
$$\sum_{t=0}^{N-1}\bigl(h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})\bigr)=O\Bigl(\sum_{t=0}^{N-1}\|\beta^e_{t+1}-\beta^e_t\|_2\Bigr) \quad\text{(since $P^e$ is not changing)}$$
By Lemma 14, (29), $\beta^e_t=(P^e)^{-1}\bigl(I-(A-BK^e)^\top\bigr)^{-1}Q\theta_t=:\tilde F_2\theta_t$, so for $1\le t\le N$,
$$\|\beta^e_t-\beta^e_{t-1}\|=\|\tilde F_2\theta_t-\tilde F_2\theta_{t-1}\|\le\|\tilde F_2\|\,\|\theta_t-\theta_{t-1}\|$$
Thus,
$$\text{Part II}=O\Bigl(\sum_{t=0}^{N-1}\|\beta^e_{t+1}-\beta^e_t\|_2\Bigr)\le O\Bigl(\|\tilde F_2\|\sum_{t=0}^{N-1}\|\theta_{t+1}-\theta_t\|\Bigr) \qquad (24)$$
Part III: By our condition on the terminal cost function, we have $f_N(x_N(0))=\frac12(x_N(0)-\beta^e_N)^\top P^e(x_N(0)-\beta^e_N)$. By Lemma 14, we know $h^e_0(x_0)=\frac12(x_0-\beta^e_0)^\top P^e(x_0-\beta^e_0)$. So Part III can be bounded by
$$\text{Part III}=\frac12(x_N(0)-\beta^e_N)^\top P^e(x_N(0)-\beta^e_N)-\frac12(x_0-\beta^e_0)^\top P^e(x_0-\beta^e_0)$$
$$=\frac12\bigl(x_N(0)-\beta^e_N+x_0-\beta^e_0\bigr)^\top P^e\bigl(x_N(0)-\beta^e_N-(x_0-\beta^e_0)\bigr)$$
$$\le\frac12\|x_N(0)-\beta^e_N+x_0-\beta^e_0\|_2\,\|P^e\|_2\,\|x_N(0)-\beta^e_N-(x_0-\beta^e_0)\|_2$$
$$\le\frac12(\sqrt n\,\bar x^e+\bar\beta+\bar\beta)\,\|P^e\|\,\bigl(\|x_N(0)-x_0\|+\|\beta^e_N-\beta^e_0\|\bigr)$$
where the last inequality is by Lemma 4, Lemma 8, Assumption 3 and the triangle inequality.

Next, we bound $\|x_N(0)-x_0\|$ and $\|\beta^e_N-\beta^e_0\|$ respectively. First, $\|\beta^e_N-\beta^e_0\|$ can be bounded by the triangle inequality and (24):
$$\|\beta^e_N-\beta^e_0\|\le\sum_{t=0}^{N-1}\|\beta^e_{t+1}-\beta^e_t\|_2\le\|\tilde F_2\|\sum_{t=0}^{N-1}\|\theta_{t+1}-\theta_t\|_2$$
Second, we bound $\|x_N(0)-x_0\|$. By the triangle inequality, $\|x_N(0)-x_0\|\le\|x_N(0)-x^e_{N-1}\|+\|x^e_{N-1}-x_0\|$, and $\|x^e_{N-1}-x_0\|$ can be bounded by the triangle inequality and (23):
$$\|x^e_{N-1}-x_0\|\le\sum_{t=0}^{N-1}\|x^e_t-x^e_{t-1}\|\le\|F_1F_2\|\sum_{t=0}^{N-1}\|\theta_t-\theta_{t-1}\|$$
Next, we focus on $\|x_N(0)-x^e_{N-1}\|$. By Lemma 5, $x_N(0)$ satisfies
$$x_N(0)=(z^{e,1}_{N-p_1},\dots,z^{e,1}_{N-1},\ z^{e,2}_{N-p_2},\dots,z^{e,2}_{N-1},\ \dots,\ z^{e,m}_{N-p_m},\dots,z^{e,m}_{N-1})^\top$$
As a result,
$$\|x_N(0)-x^e_{N-1}\|^2\le\|z^e_{N-2}-z^e_{N-1}\|^2+\cdots+\|z^e_{N-p}-z^e_{N-1}\|^2
\le\|F_2\|^2\bigl(\|\theta_{N-2}-\theta_{N-1}\|^2+\cdots+\|\theta_{N-p}-\theta_{N-1}\|^2\bigr)$$
where the last inequality is by (22). Taking square roots on both sides yields
$$\|x_N(0)-x^e_{N-1}\|\le\|F_2\|\sqrt{\|\theta_{N-2}-\theta_{N-1}\|^2+\cdots+\|\theta_{N-p}-\theta_{N-1}\|^2}$$
$$\le\|F_2\|\bigl(\|\theta_{N-2}-\theta_{N-1}\|+\cdots+\|\theta_{N-p}-\theta_{N-1}\|\bigr)
\le\|F_2\|(p-1)\sum_{t=N-p}^{N-2}\|\theta_{t+1}-\theta_t\|$$
Combining the bounds above, we have
$$\text{Part III}=O\Bigl(\sum_{t=0}^{N-1}\|\theta_{t+1}-\theta_t\|\Bigr) \qquad (25)$$
The proof is completed by summing up the bounds of Parts I, II and III.
F Proof of Theorem 3

Proof sketch: We focus on explaining the term $\bigl(\frac{\sqrt\zeta-1}{\sqrt\zeta+1}\bigr)^{2K}$. First, the fundamental limit of the online control problem is equivalent to the fundamental limit of the online convex optimization problem with objective $C(z)$, so we focus on $C(z)$. Second, since the lower bound concerns the worst case, we only need to construct some $\{\theta_t\}$ for which Theorem 3 holds. However, it is generally difficult to construct such a tracking trajectory explicitly, so we consider randomly generated $\theta_t$ and show that the regret in expectation can be lower bounded. Then there must exist some realization of the randomly generated $\{\theta_t\}$ for which the regret lower bound holds.

Thanks to the quadratic structure, we have a closed-form solution for $z^*$ that is linear in $\theta_t$, namely $z^*_{t+1}=\sum_{s=1}^{N}v_{t+1,s}\theta_s$. Since any online algorithm only has access to finitely many predictions, the online output $z_{t+1}(\mathcal A)$ only depends on $\theta_1,\dots,\theta_{t+W-1}$. As a result, the difference between the optimal solution and the online solution is roughly captured by $\|\sum_{s=t+W}^{N}v_{t+1,s}\theta_s\|$. With a proper construction of $A, B, Q, R$, we can roughly show that $v^2_{t+1,i}$ decays at most at the rate $\bigl(\frac{\sqrt\zeta-1}{\sqrt\zeta+1}\bigr)^{2K}$. This explains the exponentially decaying term $\bigl(\frac{\sqrt\zeta-1}{\sqrt\zeta+1}\bigr)^{2K}$ in the lower bound of Theorem 3.

Formal proof:
Step 1: construct the LQ tracking instance. For simplicity, we construct a single-input system with $n=p$, $A\in\mathbb R^{n\times n}$ and $B\in\mathbb R^{n\times 1}$ as follows (the construction extends to the multi-input case by building $m$ decoupled subsystems):
$$A=\begin{pmatrix}0&1&&\\&\ddots&\ddots&\\&&0&1\\1&0&\cdots&0\end{pmatrix},\qquad B=\begin{pmatrix}0\\\vdots\\0\\1\end{pmatrix}$$
$(A,B)$ is controllable because $(B,AB,\dots,A^{p-1}B)$ is full rank, and $A$'s controllability index is $p=n$.

Next, we construct $Q, R$. For any $\zeta$ and $p$, define $\delta=\frac{4}{(\zeta-1)p}$. Let $Q_t=\delta I_n$ and $R_t=1$ for $0\le t\le N-1$. Let $P^e=P^e(Q,R)$ be the solution to the DARE. We can show that $P^e$ is diagonal with some additional properties.

Lemma 10 (Form of $P^e$). Let $P^e$ denote the solution to the DARE determined by $A, B, Q, R$ defined above. Then $P^e$ has the form
$$P^e=\mathrm{diag}(q_1,q_2,\dots,q_n)$$
where $q_i=q_1+(i-1)\delta$ for $1\le i\le n$ and $\delta<q_1<\delta+1$.
Proof of Lemma 10. By Proposition 4.4.1 in [52], there exists a unique positive definite solution. So we posit a diagonal solution and substitute it into the DARE; if we can find a positive definite solution of this form, then it must be $P^e$.
$$P^e=Q+A^\top\bigl(P^e-P^eB(B^\top P^eB+R)^{-1}B^\top P^e\bigr)A$$
Substituting $P^e=\mathrm{diag}(q_1,\dots,q_n)$ yields
$$\mathrm{diag}(q_1,q_2,\dots,q_n)=\mathrm{diag}\Bigl(\frac{q_n}{1+q_n}+\delta,\ q_1+\delta,\ \dots,\ q_{n-1}+\delta\Bigr)$$
So we have $q_{i+1}=q_i+\delta$ for $1\le i\le n-1$, and $\frac{q_n}{1+q_n}+\delta=q_1=q_n-(n-1)\delta$. The solution is $q_n=\frac{n\delta+\sqrt{n^2\delta^2+4n\delta}}{2}>n\delta$, so $q_1=q_n-(n-1)\delta>\delta>0$, and the solution is positive definite. Moreover, by $\frac{q_n}{1+q_n}<1$, we have $q_1<\delta+1$.
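This construction can be checked numerically. The sketch below (with illustrative values of $n$ and $\zeta$ that are not from the paper) solves the DARE with scipy and verifies the structure claimed in Lemma 10.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative parameters; the paper fixes delta = 4 / ((zeta - 1) * p) with p = n.
n, zeta = 5, 4.0
delta = 4.0 / ((zeta - 1) * n)
A = np.diag(np.ones(n - 1), k=1)       # superdiagonal ones
A[-1, 0] = 1.0                          # bottom-left entry closes the cycle
B = np.zeros((n, 1)); B[-1, 0] = 1.0
P = solve_discrete_are(A, B, delta * np.eye(n), np.array([[1.0]]))
q = np.diag(P)
print(np.allclose(P, np.diag(q)))       # P^e is diagonal
print(np.allclose(np.diff(q), delta))   # q_{i+1} - q_i = delta
print(delta < q[0] < delta + 1)         # delta < q_1 < delta + 1
```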
Next, we construct $\theta_t$. Let $\theta_0=\theta_N=\beta^e_N=0$ for simplicity. Let $E=L_N/(2\bar\theta)$; for simplicity, we only consider integer $E$ (the proof generalizes to non-integer $L_N/(2\bar\theta)$ by using floor and ceiling operators). Since $2\bar\theta\le L_N\le (2N+1)\bar\theta$ and $E$ is an integer, we have $1\le E\le N$.

We provide two constructions for two different ranges of $E$. When $E=1$, let $\mathcal J=\{W\}$. Let $\theta_1=\cdots=\theta_{W-1}=0$, and let $\theta_W$ follow the distribution below:
$$\theta^i_W=\begin{cases}\sigma & \text{with probability }1/2\\ -\sigma & \text{with probability }1/2\end{cases}\qquad\text{i.i.d. for all }i\in[n] \qquad (26)$$
where $\sigma=\bar\theta/\sqrt n$. It can be easily verified that $\|\theta_W\|=\bar\theta$ for any realization of this distribution. Let the remaining $\theta_t$ equal $\theta_W$, i.e., $\theta_W=\theta_{W+1}=\cdots=\theta_{N-1}$. The total variation of the constructed $\theta_t$ is no more than the variation budget $L_N$:
$$\sum_{t=0}^{N}\|\theta_t-\theta_{t-1}\|=\|\theta_W-\theta_{W-1}\|+\|\theta_{N-1}-\theta_N\|=2\bar\theta=L_N$$
where the last equality is because $E=1$.

When $E\ge 2$, we divide the stages $\{1,\dots,N-1\}$ into $E-1$ epochs, each of size $\Delta=\lfloor\frac{N-1}{E-1}\rfloor$ (the last epoch may contain a different number of stages). Let $\mathcal J$ be the set of first stages of the epochs: $\mathcal J=\{1,\Delta+1,\dots,(E-2)\Delta+1\}$. Let $\theta_t$ for $t\in\mathcal J$ be i.i.d. following the distribution (26), and let the remaining $\theta_t$ equal the value at the start of their corresponding epoch, i.e., $\theta_t=\theta_{k\Delta+1}$, where $k\Delta+1$ is the first stage of the epoch containing $t$. Now we verify that the constructed $\theta_t$ satisfies the variation budget:
$$\sum_{t=0}^{N}\|\theta_t-\theta_{t-1}\|=\|\theta_1-\theta_0\|+\sum_{k=1}^{E-2}\|\theta_{k\Delta+1}-\theta_{k\Delta}\|+\|\theta_{N-1}-\theta_N\|
\le\bar\theta+2(E-2)\bar\theta+\bar\theta\le L_N$$
by $\theta_0=\theta_N=0$ and $\|\theta_t\|=\bar\theta$ for $t\in\mathcal J$.

The tracking loss of our LQ tracking problem is
$$J(x,u)=\sum_{t=0}^{N-1}\Bigl(\frac{\delta}{2}\|x_t-\theta_t\|^2+\frac12 u_t^2\Bigr)+\frac12 x_N^\top P^e x_N$$
We verify that $C(z)$'s condition number is $\zeta$ in Step 2.
Step 2: convert LQ tracking to $\min_z C(z)$ and find $z^*$. The objective function $C(z)$ of the unconstrained optimization problem corresponding to the LQ tracking problem constructed above has the explicit form
$$C(z)=\sum_{t=0}^{N-1}\Bigl(\frac{\delta}{2}\sum_{i=1}^{n}\bigl(z_{t-n+i}-\theta^i_t\bigr)^2+\frac12\bigl(z_{t+1}-z_{t-n+1}\bigr)^2\Bigr)+\frac12\sum_{i=1}^{n}q_iz_{N-n+i}^2$$
with $z_t=0$ and $\theta_t=0$ for $t\le 0$.
Since $C(z)$ is strongly convex, $\min_z C(z)$ admits a unique optimal solution, denoted by $z^*$, which is determined by the first-order optimality condition $\nabla C(z^*)=0$. In addition, our constructed $C(z)$ is a quadratic function, so there exist a matrix $H\in\mathbb R^{N\times N}$ and a vector $\eta\in\mathbb R^{N}$ such that $\nabla C(z)=Hz-\eta=0$. The partial gradients of $C(z)$ are
$$\frac{\partial C}{\partial z_t}=\delta\bigl(z_t-\theta^n_t+z_t-\theta^{n-1}_{t+1}+\cdots+z_t-\theta^1_{t+n-1}\bigr)+(z_t-z_{t+n})+(z_t-z_{t-n}),\qquad 1\le t\le N-n$$
$$\frac{\partial C}{\partial z_t}=\delta\bigl(z_t-\theta^n_t+\cdots+z_t-\theta^{n+t-N+1}_{N-1}\bigr)+q_{n+t-N}z_t+z_t-z_{t-n},\qquad N-n+1\le t\le N$$
For simplicity and without loss of generality, we assume $N/p$ is an integer. Then $H$ can be represented as the block matrix
$$H=\begin{pmatrix}(\delta n+2)I_n & -I_n & & \\ -I_n & (\delta n+2)I_n & \ddots & \\ & \ddots & \ddots & -I_n\\ & & -I_n & (q_n+1)I_n\end{pmatrix}$$
and $\eta$ is a linear combination of $\theta$: $\eta_t=\delta(\theta^n_t+\cdots+\theta^1_{t+n-1})=\delta(e_n^\top\theta_t+\cdots+e_1^\top\theta_{t+n-1})$, where $e_1,\dots,e_n\in\mathbb R^n$ are the standard basis vectors and $\theta_t=0$ for $t\ge N$.

By Gershgorin's disc theorem, $H$'s condition number is $(\delta n+4)/(\delta n)=\zeta$ by our choice of $\delta$ in Step 1 and $p=n$.

Since $H$ is strictly diagonally dominant with positive diagonal entries and nonpositive off-diagonal entries, $H$ is invertible and its inverse, denoted by $Y$, is nonnegative. Consequently, the optimal solution can be represented as $z^*=Y\eta$. We use $Y_{ij}$ to denote the entry of $Y$ in the $i$th row and $j$th column.
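As a small numerical companion (the values of $n$, $N$ and $\zeta$ below are illustrative assumptions, not the paper's), one can build $H$ explicitly and confirm the two facts used next: its condition number is at most $(\delta n+4)/(\delta n)=\zeta$, and its inverse is entrywise nonnegative.

```python
import numpy as np

n, zeta = 4, 9.0
N = 5 * n
delta = 4.0 / ((zeta - 1) * n)
q_n = (n * delta + np.sqrt((n * delta) ** 2 + 4 * n * delta)) / 2   # from Lemma 10
H = (delta * n + 2) * np.eye(N)
H[-n:, -n:] = (q_n + 1) * np.eye(n)          # last diagonal block is (q_n + 1) I_n
for i in range(N - n):
    H[i, i + n] = H[i + n, i] = -1.0         # couplings at distance n give the -I_n blocks
Y = np.linalg.inv(H)
print(np.linalg.cond(H) <= (delta * n + 4) / (delta * n) + 1e-9)    # cond(H) <= zeta
print(np.all(Y >= -1e-12))                                          # Y = H^{-1} is nonnegative
```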
It will be helpful to write $z^*_{t+1}$ in terms of $\theta_t$ directly, since later we analyze the dependence of the optimal solution on the target trajectory. We derive
$$z^*_{t+1}=\sum_{i=1}^{N}Y_{t+1,i}\eta_i
=\delta\sum_{i=1}^{N}Y_{t+1,i}\sum_{j=0}^{n-1}e_{n-j}^\top\theta_{i+j}
=\delta\sum_{k=1}^{N-1}v_{t+1,k}\theta_k \qquad (27)$$
where the second equality is by $\eta_i$'s definition, the last step uses $\theta_t=0$ for $t\ge N$, $v_{t+1,k}=Y_{t+1,k}e_n^\top+\cdots+Y_{t+1,k+1-n}e_1^\top\in\mathbb R^{1\times n}$, and $Y_{t+1,i}=0$ for $i\le 0$.
In addition, we can show in the next lemma that $Y$ has decaying row entries starting from the diagonal entries. The proof is technical and deferred to Appendix F.1.

Lemma 11. When $N/p$ is an integer, the inverse of $H$, denoted by $Y$, can be represented as a block matrix
$$Y=\begin{pmatrix}y_{1,1}I_n & y_{1,2}I_n & \cdots & y_{1,N/p}I_n\\ y_{2,1}I_n & y_{2,2}I_n & \cdots & y_{2,N/p}I_n\\ \vdots & \vdots & \ddots & \vdots\\ y_{N/p,1}I_n & y_{N/p,2}I_n & \cdots & y_{N/p,N/p}I_n\end{pmatrix}$$
where $y_{t,t+\tau}\ge\frac{1-\rho}{\delta n+2}\,\rho^{\tau}>0$ for $\tau\ge 0$ and $\rho=\frac{\sqrt\zeta-1}{\sqrt\zeta+1}$.
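The lower bound in Lemma 11 can also be checked numerically on small instances. The sketch below (sizes are illustrative assumptions) forms the scalar tridiagonal matrix $H_1$ used in the proof of Lemma 11 and verifies the claimed inequality entrywise.

```python
import numpy as np

n, zeta, blocks = 4, 9.0, 12                 # blocks plays the role of N/p
delta = 4.0 / ((zeta - 1) * n)
q_n = (n * delta + np.sqrt((n * delta) ** 2 + 4 * n * delta)) / 2
rho = (np.sqrt(zeta) - 1) / (np.sqrt(zeta) + 1)
H1 = (delta * n + 2) * np.eye(blocks)
H1[-1, -1] = q_n + 1
for i in range(blocks - 1):
    H1[i, i + 1] = H1[i + 1, i] = -1.0
Ybar = np.linalg.inv(H1)                     # Y = (y_{ij} I_n) with (y_{ij}) = H1^{-1}
ok = all(Ybar[t, t + tau] >= (1 - rho) / (delta * n + 2) * rho ** tau - 1e-12
         for t in range(blocks) for tau in range(blocks - t))
print(ok)
```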
Step 3: characterize $z_{t+1}(\mathcal A_z)$. For any online control algorithm $\mathcal A$, we can define an equivalent online algorithm for $z$, denoted by $\mathcal A_z$, which outputs $z_{t+1}(\mathcal A_z)$ at each time step $t$ based on the predictions and the history, i.e.,
$$z_{t+1}(\mathcal A_z)=\mathcal A_z\bigl(\{\theta_s\}_{s=0}^{t+W-1}\bigr),\qquad t\ge 0$$
For simplicity, we consider online deterministic algorithms (the proof can easily be generalized to randomized algorithms). Notice that $z_{t+1}$ is a random variable because $\theta_1,\dots,\theta_{t+W-1}$ are random. Based on this observation and Lemma 11, we are able to provide a regret lower bound.
Step 4: prove the regret lower bound for $\mathcal A$. Roughly speaking, regret is incurred when something unexpected happens beyond the prediction window: at each $t$, the prediction window reaches as far as $t+W-1$, but if $\theta_{t+W}$ changes from $\theta_{t+W-1}$, the online algorithm cannot prepare for it, resulting in suboptimal control and positive regret.

By our construction of $\theta_t$, the changes happen at $t\in\mathcal J$. To study the stages $t$ with unexpected changes at $t+W$, we define the set of all such $t$: $\mathcal J_1=\{0\le t\le N-W-1\mid t+W\in\mathcal J\}$. By our construction, the cardinality of $\mathcal J_1$ can be lower bounded by $L_N$ up to constants:
$$|\mathcal J_1|\ge\frac{1}{12\bar\theta}L_N \qquad (28)$$
The proof of (28) is as follows. When $E=1$, $\mathcal J_1=\{0\}$, so $|\mathcal J_1|=1=\frac{L_N}{2\bar\theta}\ge\frac{1}{12\bar\theta}L_N$. When $E\ge2$, notice that $|\mathcal J_1|=|\mathcal J|-|\{1\le t\le W-1\mid t\in\mathcal J\}|$. Since $|\mathcal J|=E-1$ and $|\{1\le t\le W-1\mid t\in\mathcal J\}|=\lfloor\frac{W-1}{\Delta}\rfloor$, we have
$$|\mathcal J_1|=E-1-\Bigl\lfloor\frac{W-1}{\Delta}\Bigr\rfloor\ge E-1-\frac{N/3-1}{\Delta}\ge E-1-\frac{(N-1)/3}{\Delta}
=E-1-\frac{(N-1)/3}{\lfloor\frac{N-1}{E-1}\rfloor}\ge E-1-\frac{(N-1)/3}{\frac12\cdot\frac{N-1}{E-1}}=\frac13(E-1)\ge\frac16E=\frac{L_N}{12\bar\theta}$$
where the first inequality is by $W\le N/3$, the equality is by substituting the definition of $\Delta$, the next inequality is by $\frac{N-1}{E-1}\ge1$ and $\lfloor\frac{N-1}{E-1}\rfloor\ge\frac12\cdot\frac{N-1}{E-1}$, and the last inequality is by $E\ge2$.

Moreover, we show in Lemma 12 below that for all $t\in\mathcal J_1$, the online decision $z_{t+1}(\mathcal A_z)$ differs from the optimal solution $z^*_{t+1}$, and the difference is lower bounded.

Lemma 12. For $t\in\mathcal J_1$,
$$\mathbb E\,\|z_{t+1}(\mathcal A_z)-z^*_{t+1}\|^2\ge c_{10}\,\sigma^2\rho^{2K}$$
where $c_{10}$ is a constant determined by $A, B, n, Q, R$.

The lower bound on the difference between the online decision and the optimal decision yields a lower bound on the regret. By the $\delta n$-strong convexity of $C(z)$,
$$\mathbb E\bigl(C(z(\mathcal A_z))-C(z^*)\bigr)\ge\frac{\delta n}{2}\sum_{t\in\mathcal J_1}\mathbb E\,\|z_{t+1}(\mathcal A_z)-z^*_{t+1}\|^2
\ge\frac{\delta n}{2}\cdot\frac{L_N}{12\bar\theta}\,c_{10}\,\sigma^2\rho^{2K}=\frac{\delta n}{2}\cdot\frac{L_N}{12\bar\theta}\,c_{10}\,\frac{\bar\theta^2}{n}\,\rho^{2K}=\Omega\bigl(L_N\rho^{2K}\bigr)$$
By the equivalence between $\mathcal A$ and $\mathcal A_z$, we have $\mathbb E\,J(\mathcal A)-\mathbb E\,J^*=\Omega(\rho^{2K}L_N)$. By the property of expectation, there must exist some realization of the random $\{\theta_t\}$ such that $J(\mathcal A)-J^*=\Omega(\rho^{2K}L_N)$, which completes the proof.
Proof of Lemma 12. By our construction, $\theta_t$ is random; $z^{\mathcal A}_{t+1}$ is also random, and its randomness comes from $\theta_1,\dots,\theta_{t+W-1}$, while $z^*_{t+1}$ is determined by all of the $\theta_t$. By the i.i.d. construction of $\theta_t$,
$$\mathbb E\,\|z^{\mathcal A}_{t+1}-z^*_{t+1}\|^2=\mathbb E\,\Bigl\|z^{\mathcal A}_{t+1}-\delta\sum_{i=1}^{N-1}v_{t+1,i}\theta_i\Bigr\|^2 \quad\text{(by (27))}$$
$$=\mathbb E\,\Bigl\|z^{\mathcal A}_{t+1}-\delta\sum_{i=1}^{t+W-1}v_{t+1,i}\theta_i\Bigr\|^2+\delta^2\,\mathbb E\,\Bigl\|\sum_{i=t+W}^{N-1}v_{t+1,i}\theta_i\Bigr\|^2
\ge\delta^2\,\mathbb E\,\Bigl\|\sum_{i=t+W}^{N-1}v_{t+1,i}\theta_i\Bigr\|^2$$
For $t\in\mathcal J_1$, $t+W\le N-1$ and $t+W\in\mathcal J$, so by the construction of $\theta_t$ we have $\theta_{t+W}=\cdots=\theta_{t+W+\Delta-1}$, $\dots$, $\theta_{(E-2)\Delta+1}=\cdots=\theta_{N-1}$, and $\theta_N=0$. In addition, $\theta_{t+W},\theta_{t+W+\Delta},\dots,\theta_{(E-2)\Delta+1}$ are i.i.d. with zero mean and covariance $\sigma^2I_n$. Thus,
$$\mathbb E\,\Bigl\|\sum_{i=t+W}^{N-1}v_{t+1,i}\theta_i\Bigr\|^2=\mathbb E\,\Bigl\|\sum_{i=t+W}^{t+W+\Delta-1}v_{t+1,i}\theta_{t+W}\Bigr\|^2+\cdots+\mathbb E\,\Bigl\|\sum_{i=(E-2)\Delta+1}^{N-1}v_{t+1,i}\theta_{(E-2)\Delta+1}\Bigr\|^2$$
$$\ge\Bigl\|\sum_{i=t+W}^{t+W+\Delta-1}v_{t+1,i}\Bigr\|^2\sigma^2+\cdots+\Bigl\|\sum_{i=(E-2)\Delta+1}^{N-1}v_{t+1,i}\Bigr\|^2\sigma^2
\ge\sigma^2\sum_{i=t+W}^{N-1}\|v_{t+1,i}\|^2=\sigma^2\sum_{i=t+W}^{N-1}\Bigl(\sum_{k=0}^{n-1}Y^2_{t+1,i-k}\Bigr)$$
$$\ge\sigma^2\sum_{i=t+1+W-n}^{N-1}Y^2_{t+1,i}=\sigma^2\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}$$
where the last equality in the second line uses $v_{t+1,i}$'s definition, the second inequality in the second line is by $v_{t+1,i}$ having nonnegative entries, and the last equality is because $Y_{t+1,N}=0$ when $t\in\mathcal J_1$.

When $1\le W\le n$, $\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}\ge Y^2_{t+1,t+1}$. When $W>n$, $\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}\ge Y^2_{t+1,\,t+1+n\lceil\frac{W-n}{n}\rceil}$. Moreover, when $W\ge1$, $\lceil\frac{W-n}{n}\rceil=\lfloor\frac{W-1}{n}\rfloor$. Therefore, for any $W\ge1$,
$$\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}\ge Y^2_{t+1,\,t+1+n\lfloor\frac{W-1}{n}\rfloor}\ge\rho^{2K}\Bigl(\frac{1-\rho}{\delta n+2}\Bigr)^2$$
where the last inequality is by Lemma 11 and $p=n$.
F.1 Proof of Lemma 11

Proof. Since $H$ is the block matrix
$$H=\begin{pmatrix}(\delta n+2)I_n & -I_n & & \\ -I_n & (\delta n+2)I_n & \ddots & \\ & \ddots & \ddots & -I_n\\ & & -I_n & (q_n+1)I_n\end{pmatrix}$$
its inverse can also be represented as a block matrix. Moreover, let
$$H_1=\begin{pmatrix}\delta n+2 & -1 & \cdots & 0\\ -1 & \delta n+2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & -1\\ 0 & \cdots & -1 & q_n+1\end{pmatrix},\qquad \bar Y=(H_1)^{-1}=(y_{ij})_{i,j}\in\mathbb R^{N/p\times N/p}.$$
Then the inverse matrix $Y$ can be represented as $(y_{ij}I_n)$, and it suffices to provide a lower bound on $y_{ij}$.

Since $H_1$ is a symmetric positive definite tridiagonal matrix, by [55] its inverse has the explicit form $(H_1)^{-1}_{ij}=a_ib_j$ for $i\le j$, where
$$a_i=\frac{\rho}{1-\rho^2}\Bigl(\frac{1}{\rho^{i}}-\rho^{i}\Bigr),\qquad b_t=c_3\frac{1}{\rho^{N-t}}+c_4\rho^{N-t},$$
$$c_3=b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2},\qquad c_4=b_N\frac{1-(q_n+1)\rho}{1-\rho^2},\qquad b_N=\frac{1}{a_{N-1}+(q_n+1)a_N}.$$
In the following, we show that $a_tb_{t+\tau}\ge\frac{1-\rho}{\delta n+2}\rho^{\tau}$.

First, it is easy to verify that
$$\rho^{t}a_t=\frac{\rho}{1-\rho^2}\bigl(1-\rho^{2t}\bigr)\ge\rho$$
since $t\ge1$ and $\rho<1$.

Second, we bound $b_N$ as follows:
$$\rho^{-N}b_N=\frac{1}{(q_n+1)(1-\rho^{2N})+(\rho-\rho^{2N-1})}\cdot\frac{1-\rho^2}{\rho}\ge\frac{1}{\delta n+2}\cdot\frac{1-\rho^2}{\rho}$$
because $0<(q_n+1)(1-\rho^{2N})+(\rho-\rho^{2N-1})\le\delta n+2$.

Third, we bound $b_{t+\tau}$. When $1-(q_n+1)\rho\ge0$,
$$\rho^{N-t-\tau}b_{t+\tau}=b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2}+b_N\frac{1-(q_n+1)\rho}{1-\rho^2}\rho^{2(N-t-\tau)}
\ge b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2}
\ge b_N\frac{(\delta n+1)\rho-\rho^2}{1-\rho^2}=\frac{1-\rho}{1-\rho^2}b_N$$
where the first inequality is by $1-(q_n+1)\rho\ge0$, the second by $q_n\ge\delta n$, and the last equality by $\rho^2-(\delta n+2)\rho+1=0$. When $1-(q_n+1)\rho<0$,
$$\rho^{N-t-\tau}b_{t+\tau}=b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2}+b_N\frac{1-(q_n+1)\rho}{1-\rho^2}\rho^{2(N-t-\tau)}
\ge b_N\frac{(q_n+1)\rho-\rho^2+1-(q_n+1)\rho}{1-\rho^2}=b_N\ge\frac{1-\rho}{1-\rho^2}b_N$$
where the first inequality uses $1-(q_n+1)\rho<0$ and $\rho^{2(N-t-\tau)}\le1$.

Combining the three parts,
$$y_{t,t+\tau}=a_tb_{t+\tau}\ge\rho\,b_N\,\frac{1-\rho}{1-\rho^2}\,\rho^{\tau-N}\ge\frac{1-\rho}{\delta n+2}\,\rho^{\tau}.$$
G Proofs of properties of LQT in Appendix E

In this section, we provide proofs for the properties of LQ tracking (LQT) stated in Appendix E.

G.1 Preliminaries: dynamic programming for finite-horizon LQT

In this section, we consider a discrete-time LQ tracking problem with time-varying cost functions and a time-invariant dynamical system:
$$\min_{x_t,u_t}\ \frac12\sum_{t=0}^{N-1}\bigl((x_t-\theta_t)^\top Q_t(x_t-\theta_t)+u_t^\top R_tu_t\bigr)+\frac12(x_N-\theta_N)^\top Q_N(x_N-\theta_N)$$
$$\text{s.t. } x_{t+1}=Ax_t+Bu_t,\qquad t=0,\dots,N-1$$
where $x_0=0$ for simplicity. The problem can be solved by dynamic programming.

Theorem 4 (Dynamic programming for finite-horizon LQT). Consider a finite-horizon time-varying LQ tracking problem. Let $V_t(x_t)$ be the cost to go from $k=t$ to $k=N$. Then
$$V_t(x_t)=\frac12(x_t-\beta_t)^\top P_t(x_t-\beta_t)+\frac12\sum_{k=t}^{N-1}(\theta_k-\beta_{k+1})^\top H_k(\theta_k-\beta_{k+1})$$
for $t=0,\dots,N$. The parameters can be obtained by
$$P_t=Q_t+A^\top M_tA,\quad t=0,\dots,N-1,\qquad P_N=Q_N$$
$$M_t=P_{t+1}-P_{t+1}B(R_t+B^\top P_{t+1}B)^{-1}B^\top P_{t+1},\quad t=0,\dots,N-1$$
$$\beta_t=(Q_t+A^\top M_tA)^{-1}(Q_t\theta_t+A^\top M_t\beta_{t+1}),\quad t=0,\dots,N-1,\qquad \beta_N=\theta_N$$
$$H_t=M_t-M_tA(Q_t+A^\top M_tA)^{-1}A^\top M_t,\quad t=0,\dots,N-1$$
The optimal controller is
$$u^*_t=-K_tx_t+K^\beta_t\beta_{t+1},\quad t=0,\dots,N-1$$
where the parameters are
$$K_t=(R_t+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A,\qquad K^\beta_t=(R_t+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}$$
There is another way to write the optimal controller:
$$u^*_t=-K_tx_t+K^\alpha_t\alpha_{t+1},\quad t=0,\dots,N-1$$
where the parameters are
$$K^\alpha_t=(R_t+B^\top P_{t+1}B)^{-1}B^\top,\qquad \alpha_t=P_t\beta_t$$
$$\alpha_t=Q_t\theta_t+(A-BK_t)^\top\alpha_{t+1},\quad t=0,\dots,N-1,\qquad \alpha_N=P_N\theta_N$$
The proof is by dynamic programming [56].
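To make the recursion in Theorem 4 concrete, here is a minimal numpy sketch of the backward pass (an illustration under the stated assumptions, not the authors' implementation). The forward pass then applies $u^*_t=-K_tx_t+K^\beta_t\beta_{t+1}$.

```python
import numpy as np

def lqt_backward_pass(A, B, Q, R, theta, QN, thetaN):
    """Backward recursion of Theorem 4 for the finite-horizon LQT problem.

    Q, R, theta are length-N lists of Q_t, R_t, theta_t; returns the gains K_t, K^beta_t
    and the offsets beta_t needed for u*_t = -K_t x_t + K^beta_t beta_{t+1}.
    """
    N = len(Q)
    P, beta = QN, thetaN                              # P_N = Q_N, beta_N = theta_N
    Ks, Kbs, betas = [None] * N, [None] * N, [None] * (N + 1)
    betas[N] = beta
    for t in range(N - 1, -1, -1):
        S = R[t] + B.T @ P @ B
        K = np.linalg.solve(S, B.T @ P @ A)           # K_t
        Kb = np.linalg.solve(S, B.T @ P)              # K^beta_t
        M = P - P @ B @ np.linalg.solve(S, B.T @ P)   # M_t
        Pprev = Q[t] + A.T @ M @ A                    # P_t
        beta = np.linalg.solve(Pprev, Q[t] @ theta[t] + A.T @ M @ beta)  # beta_t
        P = Pprev
        Ks[t], Kbs[t], betas[t] = K, Kb, beta
    return Ks, Kbs, betas
```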
G.2 Proof of Lemma 9

We first prove that the recursive solution $P_t$ of the finite-horizon LQT problem is bounded. Then, by taking limits, we prove that $P^e_t$ is bounded.

Lemma 13 (Bounded $P_t$ for finite-horizon LQT). Consider a finite-horizon time-varying LQT problem. For any $N$, any $0\le t\le N$, and any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, we have $P_t\in\mathcal P$, where $P_t$ is defined in Theorem 4.

Proof. Since $P_t$ does not depend on $\theta_t$, we set $\theta_t=0$ and consider the corresponding LQR problem for simplicity. Since $\underline Q\preceq Q_t\preceq\bar Q$ and $\underline R\preceq R_t\preceq\bar R$ for $0\le t\le N-1$, and $\underline P\preceq Q_N\preceq\bar P$, we have for any trajectory $x_t, u_t$, any $k$, and any $Q_t, R_t, Q_N$,
$$\sum_{t=k}^{N-1}\bigl(x_t^\top Q_tx_t+u_t^\top R_tu_t\bigr)+x_N^\top Q_Nx_N\le\sum_{t=k}^{N-1}\bigl(x_t^\top\bar Qx_t+u_t^\top\bar Ru_t\bigr)+x_N^\top\bar Px_N$$
$$\sum_{t=k}^{N-1}\bigl(x_t^\top Q_tx_t+u_t^\top R_tu_t\bigr)+x_N^\top Q_Nx_N\ge\sum_{t=k}^{N-1}\bigl(x_t^\top\underline Qx_t+u_t^\top\underline Ru_t\bigr)+x_N^\top\underline Px_N$$
Taking the minimum over all feasible trajectories on both sides, we have
$$\min_{x_{t+1}=Ax_t+Bu_t}\Bigl(\sum_{t=k}^{N-1}\bigl(x_t^\top Q_tx_t+u_t^\top R_tu_t\bigr)+x_N^\top Q_Nx_N\Bigr)\le\min_{x_{t+1}=Ax_t+Bu_t}\Bigl(\sum_{t=k}^{N-1}\bigl(x_t^\top\bar Qx_t+u_t^\top\bar Ru_t\bigr)+x_N^\top\bar Px_N\Bigr)$$
$$\min_{x_{t+1}=Ax_t+Bu_t}\Bigl(\sum_{t=k}^{N-1}\bigl(x_t^\top Q_tx_t+u_t^\top R_tu_t\bigr)+x_N^\top Q_Nx_N\Bigr)\ge\min_{x_{t+1}=Ax_t+Bu_t}\Bigl(\sum_{t=k}^{N-1}\bigl(x_t^\top\underline Qx_t+u_t^\top\underline Ru_t\bigr)+x_N^\top\underline Px_N\Bigr)$$
Notice that the left-hand side equals $x_k^\top P_kx_k$. Moreover,
$$x_k^\top\bar Px_k=\min_{x_{t+1}=Ax_t+Bu_t}\Bigl(\sum_{t=k}^{N-1}\bigl(x_t^\top\bar Qx_t+u_t^\top\bar Ru_t\bigr)+x_N^\top\bar Px_N\Bigr)$$
because $\bar P=P^e(\bar Q,\bar R)$, and the same holds for $\underline P$. Therefore,
$$x_k^\top\underline Px_k\le x_k^\top P_kx_k\le x_k^\top\bar Px_k$$
for any $x_k$, so $\underline P\preceq P_k\preceq\bar P$, i.e., $P_k\in\mathcal P$.

Proof of Lemma 9. Consider the finite-horizon time-invariant LQR problem with stage cost given by $Q, R$, i.e., the total cost is $\sum_{k=0}^{N-1}\bigl(x_k^\top Qx_k+u_k^\top Ru_k\bigr)$. By Lemma 13, we have $\underline P\preceq P_k\preceq\bar P$. Since $P_k\to P^e$ as $k\to-\infty$, we have $\underline P\preceq P^e\preceq\bar P$; consequently, $\|P^e\|_2\le\upsilon_{\max}(\bar P)$.
G.3 Proof of Lemma 6

Based on the dynamic programming solution in Theorem 4, we can provide a more complete characterization of the solution to the Bellman equation, including formulas for $\lambda^e$, $h^e$ and the optimal controller.

Lemma 14 (Optimal solution of average-cost LQ tracking). Suppose $(A,B)$ is controllable and $Q, R$ are positive definite. The optimal average cost $\lambda^e$ does not depend on the initial state $x_0$ and equals
$$\lambda^e=\frac12(\theta-\beta^e)^\top H^e(\theta-\beta^e),$$
the solution to the Bellman equation $h^e(x)+\lambda^e=\min_u\bigl(f(x)+g(u)+h^e(Ax+Bu)\bigr)$ can be represented by
$$h^e(x)=\frac12(x-\beta^e)^\top P^e(x-\beta^e),$$
and the optimal controller is
$$u=-K^ex+\bar K\beta^e$$
where $P^e=P^e(Q,R)$, $\alpha^e=Q\theta+(A-BK^e)^\top\alpha^e$,
$$\beta^e=F\theta \qquad (29)$$
with $F=(P^e)^{-1}\bigl(I-(A-BK^e)^\top\bigr)^{-1}Q$ depending only on $A, B, Q, R$, and
$$M^e=P^e-P^eB(R+B^\top P^eB)^{-1}B^\top P^e,\qquad H^e=M^e-M^eA(Q+A^\top M^eA)^{-1}A^\top M^e,$$
$$K^e=(R+B^\top P^eB)^{-1}B^\top P^eA,\qquad \bar K=(R+B^\top P^eB)^{-1}B^\top P^e,\qquad \beta^e=(P^e)^{-1}\alpha^e.$$
Proof of Lemma 14. Proof outline: we establish (i) the optimal average cost formula, (ii) the formula for the bias function $h^e(x)$, and (iii) the formula for the optimal controller.

Step 1: optimal average cost formula. Consider a finite-horizon LQT problem:
$$\min_{x_t,u_t}\ \frac12\sum_{t=0}^{N-1}\bigl((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\bigr)\qquad\text{s.t. } x_{t+1}=Ax_t+Bu_t,\ t=0,\dots,N-1$$
Given the initial state $x_0$, by Theorem 4 the optimal total cost over $N$ time steps is
$$J^*_N(x_0)=\frac12(x_0-\beta_0)^\top P_0(x_0-\beta_0)+\frac12\sum_{k=0}^{N-1}(\theta-\beta_{k+1})^\top H_k(\theta-\beta_{k+1})$$
The proof proceeds by first showing that $\beta_k\to\beta^e$, $P_k\to P^e$ and $H_k\to H^e$ as $k\to-\infty$, and consequently $\frac12(\theta-\beta_{k+1})^\top H_k(\theta-\beta_{k+1})\to\frac12(\theta-\beta^e)^\top H^e(\theta-\beta^e)$ as $k\to-\infty$. Then the optimal average cost in the infinite-horizon problem is
$$\lambda^e=\lim_{N\to+\infty}\frac1N\Bigl(\frac12(x_0-\beta_0)^\top P_0(x_0-\beta_0)+\frac12\sum_{k=0}^{N-1}(\theta-\beta_{k+1})^\top H_k(\theta-\beta_{k+1})\Bigr)=\frac12(\theta-\beta^e)^\top H^e(\theta-\beta^e).$$
Now we prove $\beta_k\to\beta^e$, $P_k\to P^e$ and $H_k\to H^e$ as $k\to-\infty$. The convergence of $P_k$ follows from Proposition 4.4.1 of [52]. Since the matrix inverse is continuous at invertible matrices, we have $M_k\to M^e$ and $H_k\to H^e$ as $k\to-\infty$. Similarly, $K_k\to K^e$, $K^\alpha_k\to K^\alpha$ and $K^\beta_k\to\bar K$ as $k\to-\infty$. Notice that $\beta_k=P_k^{-1}\alpha_k$, so we can prove the convergence of $\beta_k$ by proving $\alpha_k\to\alpha^e$ as $k\to-\infty$. The backward recursion for $\alpha_t$ is $\alpha_t=Q\theta+(A-BK_t)^\top\alpha_{t+1}$, and $(A-BK_k)\to(A-BK^e)$ as $k\to-\infty$. Based on the lemma below, we can then show $\alpha_k\to\alpha^e$ as $k\to-\infty$, where $\alpha^e=Q\theta+(A-BK^e)^\top\alpha^e$.

Lemma 15 (Convergence of a time-varying system). If $A_t\to A$ and $A$ is stable, then the system $x_{t+1}=A_tx_t+\eta$ converges to $x_s$ satisfying $x_s=Ax_s+\eta$ for any bounded initial value $x_0$.

The proof of this lemma is provided later in this subsection.
Step 2: formula for $h^e(x)$. The proof is by plugging the formulas for $h^e(x)$ and $\lambda^e$ into both sides of the Bellman equation and showing that equality holds. The right-hand side (RHS) of the Bellman equation is
$$\mathrm{RHS}=\min_u\Bigl(\frac12(x-\theta)^\top Q(x-\theta)+\frac12u^\top Ru+\frac12(Ax+Bu-\beta^e)^\top P^e(Ax+Bu-\beta^e)\Bigr)$$
$$=\frac12(x-\theta)^\top Q(x-\theta)+\frac12(Ax-\beta^e)^\top M^e(Ax-\beta^e)$$
$$=\frac12(\theta-\beta^e)^\top H^e(\theta-\beta^e)+\frac12(x-\beta^e)^\top P^e(x-\beta^e)=\mathrm{LHS}$$
where $M^e=P^e-P^eB(R+B^\top P^eB)^{-1}B^\top P^e$, the optimal control input is $u^e=-K^ex+\bar K\beta^e$, and the last two equalities are based on the following fact.

Fact. Consider the function
$$g(u)=\frac12(u-\xi)^\top R(u-\xi)+\frac12(Cu+\eta)^\top P(Cu+\eta)$$
where $P, R$ are positive definite, $u, \xi, \eta$ are vectors, and $C$ is a matrix. Then
$$g(u)=\frac12(u-u^*)^\top(R+C^\top PC)(u-u^*)+\frac12(C\xi+\eta)^\top M(C\xi+\eta)$$
$$u^*=(R+C^\top PC)^{-1}(R\xi-C^\top P\eta),\qquad M=P-PC(R+C^\top PC)^{-1}C^\top P$$
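The completion-of-squares fact above can be verified numerically; the following is a quick, self-contained check with arbitrary illustrative dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 3, 4                                   # dim of u and of C u + eta
R = np.eye(m)
P = 2.0 * np.eye(k)                           # any positive definite R, P work
C = rng.standard_normal((k, m))
xi, eta = rng.standard_normal(m), rng.standard_normal(k)

def g(u):
    return 0.5 * (u - xi) @ R @ (u - xi) + 0.5 * (C @ u + eta) @ P @ (C @ u + eta)

S = R + C.T @ P @ C
u_star = np.linalg.solve(S, R @ xi - C.T @ P @ eta)
M = P - P @ C @ np.linalg.solve(S, C.T @ P)
w = C @ xi + eta
for _ in range(5):
    u = rng.standard_normal(m)
    rhs = 0.5 * (u - u_star) @ S @ (u - u_star) + 0.5 * w @ M @ w
    assert np.isclose(g(u), rhs)
print("fact verified")
```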
Step 3: optimal controller formula. We prove that $u=-K^ex+\bar K\beta^e$ is an optimal controller by showing that the average cost incurred by implementing this controller is no more than the optimal average cost $\lambda^e$. Let $x_t, u_t$ be the state and control at time $t$ under the controller $u=-K^ex+\bar K\beta^e$. Then
$$\frac1N\cdot\frac12\sum_{t=0}^{N-1}\bigl((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\bigr)
\le\frac1N\Bigl(\frac12\sum_{t=0}^{N-1}\bigl((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\bigr)+\frac12(x_N-\beta^e)^\top P^e(x_N-\beta^e)\Bigr)$$
$$=\frac1N\Bigl(\frac12(x_0-\beta^e)^\top P^e(x_0-\beta^e)+\frac12\sum_{k=0}^{N-1}(\theta-\beta^e)^\top H^e(\theta-\beta^e)\Bigr)$$
where the last equality is by dynamic programming and Step 2. Taking $N\to+\infty$ on both sides,
$$\lim_{N\to+\infty}\frac1N\cdot\frac12\sum_{t=0}^{N-1}\bigl((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\bigr)\le\frac12(\theta-\beta^e)^\top H^e(\theta-\beta^e)$$
Therefore, the average cost incurred by implementing $u=-K^ex+\bar K\beta^e$ is no greater than $\frac12(\theta-\beta^e)^\top H^e(\theta-\beta^e)=\lambda^e$.
G.3.1 Proof of Lemma 15

Since we consider general $A_t$, it is difficult to construct a Lyapunov function directly. Instead, we prove the claim by showing that the error term $d_t=x_t-x_s$ goes to zero. We rewrite the system as
$$d_{t+1}=A_td_t+\eta+A_tx_s-x_s
=Ad_t+(A_t-A)d_t+\eta+(A_t-I)(I-A)^{-1}\eta
=Ad_t+(A_t-A)\bigl(d_t+(I-A)^{-1}\eta\bigr)$$
using $x_s=(I-A)^{-1}\eta$. Define $w_t=(A_t-A)\bigl(d_t+(I-A)^{-1}\eta\bigr)$. Then
$$d_{t+1}=Ad_t+w_t \qquad (30)$$
The proof has two steps. First, we prove that $d_t$ is bounded; then we prove that $d_t\to0$.

Bounding $d_t$. First, we provide a supporting lemma based on the fact that exponential stability implies BIBO stability for LTI systems.

Lemma 16. Let $S_k=\sum_{t=0}^{k-1}A^{k-1-t}u_t$. If $A$ is stable and $\|u_t\|_2\le M$ for all $t$, then there exists a constant $c_3>0$ such that
$$\|S_k\|_2\le c_3M,\qquad k=1,2,\dots$$
Proof. Consider a system $x_{t+1}=Ax_t+u_t$ with $x_0=0$. Since $A$ is stable, the system is exponentially stable. By Theorem 9.4 of [47], exponential stability implies bounded-input bounded-output stability, so $\|x_t\|_2\le c_3M$ for all $t$. Since $x_k=S_k=\sum_{t=0}^{k-1}A^{k-1-t}u_t$, we have $\|S_k\|_2\le c_3M$ for $k=1,2,\dots$.

Next, we prove that $d_t$ is bounded by induction.

Lemma 17. There exists $M>0$ that does not depend on $t$ such that $\|d_t\|_2\le M$ for all $t$.

Proof. Since $A_t\to A$, for any $\epsilon_1$ there exists $N_1$ such that $\|A_t-A\|_2\le\epsilon_1$ whenever $t\ge N_1$; let $\epsilon_1=1/(4c_3)$. Since $A$ is stable, $A^t\to0$, so for any $\epsilon_2$ there exists $N_2$ such that $\|A^t\|_2\le\epsilon_2$ whenever $t>N_2$; let $\epsilon_2=1/2$. Let $M=\max\bigl(\|d_0\|_2,\dots,\|d_{N_1+N_2}\|_2,\|(I-A)^{-1}\eta\|_2\bigr)$. Notice that $\|d_t\|_2\le M$ for $t\le N_1+N_2$. We show that $\|d_{N_1+N_2+1}\|_2\le M$. By (30), with $t=N_1+N_2$,
$$d_{t+1}=A^{N_2+1}d_{N_1}+w_t+Aw_{t-1}+\cdots+A^{N_2}w_{N_1}$$
$$\|d_{t+1}\|_2\le\|A^{N_2+1}\|_2M+\|w_t+Aw_{t-1}+\cdots+A^{N_2}w_{N_1}\|_2$$
$$\le\epsilon_2M+c_3\max_{N_1\le k\le t}\|w_k\|_2 \quad\text{(by Lemma 16)}$$
$$\le\epsilon_2M+2c_3\epsilon_1M \quad\text{(by }w_k=(A_k-A)(d_k+(I-A)^{-1}\eta)\text{ and the definitions of }\epsilon_1, M\text{, and }k\ge N_1\text{)}$$
$$=(1/2+1/2)M=M$$
Next, consider any $t\ge N_1+N_2+1$ with $\|d_k\|_2\le M$ for all $k\le t$. We can show $\|d_{t+1}\|_2\le M$ in the same way. Thus we have proved that $\|d_t\|_2\le M$ for all $t$.

Proving $d_t\to0$. It suffices to show that for any $\epsilon_3$ there exists $N_3$ such that $\|d_t\|_2\le\epsilon_3$ whenever $t>N_3$. Since $A_t\to A$, letting $\epsilon'_1=\epsilon_3/(4c_3M)$, there exists $N'_1$ such that $\|A_t-A\|_2\le\epsilon'_1$ whenever $t\ge N'_1$, where $M$ is defined in Lemma 17. Since $A$ is stable, $A^t\to0$, so letting $\epsilon'_2=\epsilon_3/(2M)$, there exists $N'_2$ such that $\|A^t\|_2\le\epsilon'_2$ whenever $t>N'_2$. Let $N_3=N'_1+N'_2$. By (30),
$$d_{t+1}=A^{N'_2+1}d_{N'_1}+w_t+Aw_{t-1}+\cdots+A^{N'_2}w_{N'_1}$$
$$\|d_{t+1}\|_2\le\|A^{N'_2+1}\|_2M+\|w_t+Aw_{t-1}+\cdots+A^{N'_2}w_{N'_1}\|_2
\le\epsilon'_2M+c_3\max_{N'_1\le k\le t}\|w_k\|_2
\le\epsilon'_2M+2c_3\epsilon'_1M=(1/2+1/2)\epsilon_3=\epsilon_3$$
where the second inequality is by Lemma 16 and the third is by $w_k=(A_k-A)(d_k+(I-A)^{-1}\eta)$ and the definitions of $\epsilon'_1$, $M$, with $k\ge N'_1$.
G.4 Proof of Lemma 7

Let $D_t=A-BK_t$, where $K_t$ is defined in Appendix G.1; then $x^*_t$ follows the system
$$x^*_{t+1}=D_tx^*_t+BK^\alpha_t\alpha_{t+1}$$
We prove that $x^*_t$ is bounded in three steps: 1) show that the system $x_{t+1}=D_tx_t$ is exponentially stable; 2) show that $BK^\alpha_t\alpha_{t+1}$ is bounded; 3) conclude that $x^*_t$ is bounded using the fact that exponentially stable systems are bounded-input bounded-output stable.

Step 1: show that $x_{t+1}=D_tx_t$ is exponentially stable via a Lyapunov function.

Lemma 18 (Lyapunov function). Define $L(t,x_t)=x_t^\top P_tx_t$. For any $N$, any $0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, and any $x_t$, we have
$$\upsilon_{\min}(\underline P)\|x_t\|_2^2\le L(t,x_t)\le\upsilon_{\max}(\bar P)\|x_t\|_2^2$$
$$L(t+1,D_tx_t)-L(t,x_t)\le-\mu_f\|x_t\|_2^2$$
$L(t,x_t)$ is called a Lyapunov function for the system $x_{t+1}=D_tx_t$.

Proof. By Lemma 13,
$$\upsilon_{\min}(\underline P)I_n\preceq\underline P\preceq P_t\preceq\bar P\preceq\upsilon_{\max}(\bar P)I_n$$
so for any $x_t$ we have
$$\upsilon_{\min}(\underline P)\|x_t\|_2^2\le L(t,x_t)=x_t^\top P_tx_t\le\upsilon_{\max}(\bar P)\|x_t\|_2^2$$
Notice that
$$L(t+1,D_tx_t)-L(t,x_t)=x_t^\top D_t^\top P_{t+1}D_tx_t-x_t^\top P_tx_t
=x_t^\top\bigl(D_t^\top P_{t+1}D_t-P_t\bigr)x_t
=-x_t^\top\bigl(Q_t+K_t^\top R_tK_t\bigr)x_t$$
$$\le-x_t^\top\underline Qx_t=-\mu_f\|x_t\|_2^2$$
where the last line uses $Q_t+K_t^\top R_tK_t\succeq Q_t\succeq\underline Q$ and $\underline Q=\mu_fI_n$.
By the Lyapunov function above, we can show that $x_{t+1}=D_tx_t$ is exponentially stable. To give a formula for the exponential decay rate, we first introduce a technical lemma.

Lemma 19. $0\le\mu_f\le l_f\le\upsilon_{\max}(\bar P)$.

Proof. Consider the finite-horizon problem with $Q_N=0$ and $Q_t=\bar Q$, $R_t=\bar R$; then $P_{N-1}=\bar Q$. By the proof of Proposition 4.4.1 in [52], we have $\bar P=P^e(\bar Q,\bar R)\succeq P_{N-1}=\bar Q=l_fI_n$, so $l_f\le\upsilon_{\max}(\bar P)$, which completes the proof.

Next, we prove exponential stability.

Proposition 1 (Exponential stability). Define the state transition matrix
$$\Phi(t,t_0)=D_{t-1}\cdots D_{t_0}$$
for $t\ge t_0$, and $\Phi(t,t)=I$. For any $N$, any $0\le t_0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, and any $x_{t_0}$, we have
$$\|x_t\|_2\le c_1c_2^{t-t_0}\|x_{t_0}\|_2 \qquad (31)$$
$$\|\Phi(t,t_0)\|_2\le c_1c_2^{t-t_0} \qquad (32)$$
where $c_1=\sqrt{\frac{\upsilon_{\max}(\bar P)}{\upsilon_{\min}(\underline P)}}$ and $c_2=\sqrt{1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}}\in(0,1)$.

Proof. For any $x_{t_0}$, let $x_t$ denote the solution of the system $x_{t+1}=D_tx_t$ starting from $x_{t_0}$. By Lemma 18,
$$L(t+1,x_{t+1})-L(t,x_t)\le-\mu_f\|x_t\|_2^2\le-\frac{\mu_f}{\upsilon_{\max}(\bar P)}L(t,x_t)$$
So for any $t\ge t_0$,
$$L(t+1,x_{t+1})\le\Bigl(1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}\Bigr)L(t,x_t)$$
As a result,
$$\upsilon_{\min}(\underline P)\|x_t\|_2^2\le L(t,x_t)\le\Bigl(1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}\Bigr)^{t-t_0}L(t_0,x_{t_0})\le\Bigl(1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}\Bigr)^{t-t_0}\upsilon_{\max}(\bar P)\|x_{t_0}\|_2^2$$
This proves (31). The bound (32) on the state transition matrix follows by noticing that $x_t=\Phi(t,t_0)x_{t_0}$ and $\|\Phi(t,t_0)\|_2=\max_{x_{t_0}\ne0}\frac{\|x_t\|_2}{\|x_{t_0}\|_2}$.
Step 2: show that $BK^\alpha_t\alpha_{t+1}$ is bounded. We first show that $\alpha_t$ is bounded, and then that $BK^\alpha_t\alpha_{t+1}$ is bounded.

Lemma 20 (Bound on $\alpha_t$). For any $N$, any $0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, we have
$$\|\alpha_t\|_2\le\frac{c_1}{1-c_2}\,\upsilon_{\max}(\bar P)\,\bar\theta=:\bar\alpha$$
where $c_1=\sqrt{\frac{\upsilon_{\max}(\bar P)}{\upsilon_{\min}(\underline P)}}$ and $c_2=\sqrt{1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}}\in(0,1)$. Consequently,
$$\|BK^\alpha_t\alpha_t\|_2\le\|B\|_2^2\,\frac{\bar\alpha}{\mu_g}$$

Proof. Consider the system $\alpha_t=D_t^\top\alpha_{t+1}+Q_t\theta_t$. First, we bound the input:
$$\|Q_t\theta_t\|_2\le\|Q_t\|_2\|\theta_t\|_2\le\upsilon_{\max}(Q_t)\|\theta_t\|_2\le l_f\bar\theta$$
using that $Q_t$ is positive definite, $\|\theta_t\|_2\le\bar\theta$, and $Q_t\preceq l_fI$. The initial value satisfies $\|\alpha_N\|=\|Q_N\theta_N\|\le\upsilon_{\max}(\bar P)\bar\theta$, and by Lemma 19, $l_f\le\upsilon_{\max}(\bar P)$.
Next, by $\alpha_t=D_t^\top\alpha_{t+1}+Q_t\theta_t$ and the definition of the transition matrix $\Phi(t,t_0)$, we have
$$\alpha_t=Q_t\theta_t+D_t^\top Q_{t+1}\theta_{t+1}+\cdots+D_t^\top\cdots D_{N-2}^\top Q_{N-1}\theta_{N-1}+D_t^\top\cdots D_{N-1}^\top P_N\theta_N$$
$$=\Phi(t,t)^\top Q_t\theta_t+\Phi(t+1,t)^\top Q_{t+1}\theta_{t+1}+\cdots+\Phi(N-1,t)^\top Q_{N-1}\theta_{N-1}+\Phi(N,t)^\top P_N\theta_N$$
By the exponential decay of $\Phi(t,t_0)$ established in Proposition 1, we have
$$\|\alpha_t\|_2\le\|\Phi(t,t)\|_2\|Q_t\theta_t\|_2+\cdots+\|\Phi(N-1,t)\|_2\|Q_{N-1}\theta_{N-1}\|_2+\|\Phi(N,t)\|_2\|P_N\theta_N\| \quad\text{(by $\|A^\top\|_2=\|A\|_2$)}$$
$$\le c_1c_2^{0}\,l_f\bar\theta+\cdots+c_1c_2^{N-t-1}\,l_f\bar\theta+c_1c_2^{N-t}\,\upsilon_{\max}(\bar P)\bar\theta \quad\text{(by $l_f\le\upsilon_{\max}(\bar P)$)}$$
$$\le c_1\upsilon_{\max}(\bar P)\bar\theta\,\frac{1}{1-c_2}=\bar\alpha$$
Consequently,
$$\|BK^\alpha_t\alpha_t\|_2=\|B(R_t+B^\top P_{t+1}B)^{-1}B^\top\alpha_t\|\le\|B\|_2^2\,\|(R_t+B^\top P_{t+1}B)^{-1}\|\,\|\alpha_t\|\le\|B\|_2^2\,\frac{\bar\alpha}{\mu_g}$$
where the first inequality uses $\|B^\top\|_2=\|B\|_2$ and the second uses $R_t+B^\top P_{t+1}B\succeq\mu_gI_m$.
Step 3: bound $x^*_t$.

Proof of Lemma 7. For simplicity, let $\omega_t=BK^\alpha_t\alpha_{t+1}$ and $\bar\omega=\|B\|_2^2\,\frac{\bar\alpha}{\mu_g}$. By definition, we have
$$x^*_t=\Phi(t,t)\omega_{t-1}+\Phi(t,t-1)\omega_{t-2}+\cdots+\Phi(t,1)\omega_0+\Phi(t,0)x^*_0$$
By Proposition 1,
$$\|x^*_t\|_2\le\|\Phi(t,t)\|_2\|\omega_{t-1}\|+\cdots+\|\Phi(t,1)\|\,\|\omega_0\|+\|\Phi(t,0)\|\,\|x^*_0\|
\le c_1c_2^{0}\bar\omega+\cdots+c_1c_2^{t-1}\bar\omega+c_1c_2^{t}\|x_0\|_2
\le\frac{c_1}{1-c_2}\max(\bar\omega,\|x_0\|_2)=:\bar x$$
G.5 Proof of Lemma 8

Consider the finite-horizon time-invariant LQ tracking problem with stage cost determined by $Q, R$ and a fixed target $\theta$ with $\|\theta\|\le\bar\theta$. By Lemma 20, we have $\|\alpha_k\|\le\bar\alpha$. By Lemma 13, we have $\underline P\preceq P_k\preceq\bar P$. So $\|\beta_k\|=\|P_k^{-1}\alpha_k\|\le\frac{1}{\upsilon_{\min}(\underline P)}\bar\alpha$. By the proof of Lemma 14, we know $\beta_k\to\beta^e$ as $k\to-\infty$, so $\|\beta^e\|\le\frac{1}{\upsilon_{\min}(\underline P)}\bar\alpha$.
H Simulation descriptions

H.1 LQT

The experiment settings are as follows. Let $A=[0,\ 1;\ 1/,\ 5/6]$, $B=[0;\ 1]$, and $N=30$. Consider diagonal $Q_t, R_t$ with diagonal entries i.i.d. from $\mathrm{Unif}[1,2]$, and let the entries of $\theta_t$ be i.i.d. from $\mathrm{Unif}[-10,10]$. We apply RHTM, RHGD (based on gradient descent), and RHAG (based on Nesterov's accelerated gradient descent). The step sizes of RHTM are given in Theorem 1. RHGD can be viewed as RHTM with step sizes $\delta_c=1/l_c$, $\delta_w=\delta_y=\delta_z=0$, and RHAG can be viewed as RHTM with $\delta_c=1/l_c$, $\delta_y=\delta_w=\frac{\sqrt\zeta-1}{\sqrt\zeta+1}$ and $\delta_z=0$.
H.2 Robotics tracking

Consider the following discrete-time counterpart of the kinematic model:
$$x_{t+1}=x_t+\Delta t\cdot\cos\theta_t\cdot v_t \qquad (33a)$$
$$y_{t+1}=y_t+\Delta t\cdot\sin\theta_t\cdot v_t \qquad (33b)$$
$$\theta_{t+1}=\theta_t+\Delta t\cdot\omega_t \qquad (33c)$$
Thus we have
$$\theta_t=\arctan\Bigl(\frac{y_{t+1}-y_t}{x_{t+1}-x_t}\Bigr) \qquad (34a)$$
$$v_t=\frac{1}{\Delta t}\sqrt{(x_{t+1}-x_t)^2+(y_{t+1}-y_t)^2} \qquad (34b)$$
$$\omega_t=\frac{\theta_{t+1}-\theta_t}{\Delta t}=\frac{1}{\Delta t}\Bigl(\arctan\Bigl(\frac{y_{t+2}-y_{t+1}}{x_{t+2}-x_{t+1}}\Bigr)-\arctan\Bigl(\frac{y_{t+1}-y_t}{x_{t+1}-x_t}\Bigr)\Bigr) \qquad (34c)$$
so that $(\theta_t, v_t, \omega_t)$ can be expressed in terms of the state variables $(x_t, y_t)$.

In the simulation, the given reference trajectory is
$$x_r(t)=16\sin^3(t-6) \qquad (35a)$$
$$y_r(t)=13\cos(t-6)-5\cos(2t-12)-2\cos(3t-18)-\cos(4t-24) \qquad (35b)$$
As for the objective function, we set the cost coefficients as
$$c^e_t=\begin{cases}0,&t=0\\ 1,&\text{otherwise}\end{cases}\qquad
c^v_t=\begin{cases}0,&t=N\\ 15\Delta t^2,&\text{otherwise}\end{cases}\qquad
c^w_t=\begin{cases}0,&t=N\\ 15\Delta t^2,&\text{otherwise}\end{cases}$$
The discrete-time resolution for online control is 0.025 seconds, i.e., $\Delta t=0.025\,\mathrm s$. When implementing each control decision, a much smaller time resolution of $0.001\,\mathrm s$ is used to simulate the real motion dynamics of the robot.
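The conversion (34) from waypoints to $(\theta_t, v_t, \omega_t)$ can be written compactly as below. This sketch samples the reference (35) over one period as an illustration (the actual simulation horizon is not specified here) and uses arctan2 in place of arctan to handle quadrants, which is an implementation choice rather than something stated in the paper.

```python
import numpy as np

dt = 0.025                                   # control resolution from H.2
t = np.arange(0.0, 2 * np.pi, dt)            # assumed sampling horizon for illustration
x_r = 16 * np.sin(t - 6) ** 3
y_r = 13 * np.cos(t - 6) - 5 * np.cos(2 * t - 12) - 2 * np.cos(3 * t - 18) - np.cos(4 * t - 24)

dx, dy = np.diff(x_r), np.diff(y_r)
theta = np.arctan2(dy, dx)                   # (34a), quadrant-safe variant of arctan
v = np.hypot(dx, dy) / dt                    # (34b)
w = np.diff(theta) / dt                      # (34c)
```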