Online Optimal Control with Linear Dynamics and
Predictions: Algorithms and Regret Analysis
Yingying Li
SEAS
Harvard University
Cambridge, MA, 02138
yingyingli@g.harvard.edu
Xin Chen
SEAS
Harvard University
Cambridge, MA, 02138
chen_xin@g.harvard.edu
Na Li
SEAS
Harvard University
Cambridge, MA, 02138
nali@seas.harvard.edu
Abstract
This paper studies the online optimal control problem with time-varying convex stage costs for a time-invariant linear dynamical system, where a finite look-ahead window with accurate predictions of the stage costs is available at each time. We design online algorithms, Receding Horizon Gradient-based Control (RHGC), that utilize the predictions through finite steps of gradient computations. We study the algorithm performance measured by dynamic regret: the online performance minus the optimal performance in hindsight. It is shown that the dynamic regret of RHGC decays exponentially with the size of the look-ahead window. In addition, we provide a fundamental limit of the dynamic regret for any online algorithm by considering linear quadratic tracking problems. The regret upper bound of one RHGC method almost reaches the fundamental limit, demonstrating the effectiveness of the algorithm. Finally, we numerically test our algorithms on both linear and nonlinear systems to show the effectiveness and generality of RHGC.
1 Introduction
In this paper, we consider an $N$-horizon discrete-time sequential decision-making problem. At each time $t = 0, \ldots, N-1$, the decision maker observes a state $x_t$ of a dynamical system and receives a $W$-step look-ahead window of future cost functions on states and control actions, i.e., $f_t(x) + g_t(u), \ldots, f_{t+W-1}(x) + g_{t+W-1}(u)$; it then decides the control input $u_t$, which drives the system to a new state $x_{t+1}$ following some known dynamics. For simplicity, we consider a linear time-invariant (LTI) system $x_{t+1} = Ax_t + Bu_t$ with $(A, B)$ known in advance. The goal is to minimize the overall cost over the $N$ time steps. This problem finds many applications in sequential decision making, e.g., data center management [1, 2], robotics [3], autonomous driving [4, 5], energy systems [6], and manufacturing [7, 8]. Therefore, there has been growing interest in this problem from both the control and online optimization communities.
In the control community, studies on the above problem focus on Economic Model Predictive Control (EMPC), a variant of Model Predictive Control (MPC) whose primary goal is optimizing economic costs [9, 10, 11, 12, 13, 14, 15, 16]. Recent years have seen a lot of attention on the optimality analysis of EMPC, under both time-invariant costs [17, 18, 19] and time-varying costs [20, 12, 14, 21, 22]. However, most studies focus on asymptotic performance, and the non-asymptotic performance remains poorly understood, especially under time-varying costs. Moreover, for computationally efficient algorithms, e.g., suboptimal MPC and inexact MPC [23, 24, 25, 26], there is limited work on optimality guarantees.
In online optimization, by contrast, there are many papers on non-asymptotic performance analysis, measured by regret, e.g., static regret [27, 28] and dynamic regret [29], but most work does not consider predictions and/or dynamical systems. Motivated by applications with predictions, e.g., predictions of electricity prices in data center management problems [30, 31], there is growing interest in studying the effect of predictions on online problems [32, 33, 30, 34, 31, 35, 36]. However, though some papers consider switching costs, which can be viewed as a simple and special dynamical model [37, 36], there is a lack of study of general dynamical systems and of how predictions affect online problems with dynamical systems.
In this paper, we propose novel gradient-based online algorithms, receding horizon gradient-based control (RHGC), and provide non-asymptotic optimality guarantees via dynamic regret. RHGC can build on any gradient method, such as vanilla gradient descent, Nesterov's accelerated gradient, triple momentum, etc. [38, 39]. Due to the space limit, this paper only presents receding horizon triple momentum (RHTM). For the theoretical analysis, we assume the cost functions are strongly convex and smooth, though applying RHGC does not require these conditions. Specifically, we show that the regret bound of RHTM decays exponentially fast with the prediction window size $W$, demonstrating that our algorithm efficiently utilizes the predictions. Besides, our regret bound also decreases as the system becomes more controllable, in the sense of a controllability index [40]. Moreover, we provide a fundamental limit for any online control algorithm and show that this lower bound almost matches the regret upper bound of RHTM. This indicates that RHTM achieves near-optimal performance at least in the worst case. We also discuss linear quadratic tracking problems, a widely studied control problem, to give a more intuitive interpretation of our results. Finally, we numerically test our algorithms. In addition to linear systems, we also apply RHGC to a nonlinear dynamical system, a two-wheeled robot, for path tracking. The results show that our algorithm works effectively for nonlinear systems, although we only present the algorithm and theoretical analysis for LTI systems.

Lastly, we mention that there has been some recent work on online linear quadratic regulator (LQR) problems, but most papers focus on the no-prediction case [41, 42, 37]. As we show later in this paper, these algorithms can be used in our RHGC methods as initialization oracles. Moreover, our regret analysis shows that RHGC can reduce the regret of these no-prediction online algorithms by a factor decaying exponentially with the prediction window size $W$.
Notations. For matrices $A$ and $B$, $A \ge B$ means $A - B$ is positive semidefinite. The norm $\|\cdot\|$ refers to the $L_2$ norm. Let $x^i$ denote the $i$th entry of a vector $x$. For a set $I = \{k_1, \ldots, k_m\}$, let $x^I = (x^{k_1}, \ldots, x^{k_m})^\top$, and let $A(I,:)$ denote the rows of $A$ indexed by $I$ stacked together.
2 Problem formulation and preliminaries
Consider a finite-horizon discrete-time optimal control problem with time-varying cost functions $f_t(x_t) + g_t(u_t)$ and a linear time-invariant (LTI) dynamical system:
\[
\min_{\mathbf{x},\mathbf{u}}\; J(\mathbf{x},\mathbf{u}) = \sum_{t=0}^{N-1}\big[f_t(x_t) + g_t(u_t)\big] + f_N(x_N) \quad \text{s.t. } x_{t+1} = Ax_t + Bu_t,\ t \ge 0, \tag{1}
\]
where $x_t \in \mathbb{R}^n$, $u_t \in \mathbb{R}^m$ for all $t$, $\mathbf{x} = (x_1^\top, \ldots, x_N^\top)^\top$, $\mathbf{u} = (u_0^\top, \ldots, u_{N-1}^\top)^\top$, $x_0$ is given, $N$ is the problem horizon, and $f_N(x_N)$ is the terminal cost. Solving the optimal control problem (1) requires information on all the cost functions from $t = 0$ to $t = N$. However, at each time $t$, usually only a finite look-ahead window of cost functions is available, and the decision maker needs to make an online decision $u_t$ using the available information.
In particular, we consider a simplified prediction model: at each time $t$, the decision maker is provided with accurate predictions for the next $W$ time steps, $f_t, g_t, \ldots, f_{t+W-1}, g_{t+W-1}$, but no predictions beyond these $W$ time steps; in fact, $f_{t+W}, g_{t+W}, \ldots$ can even be adversarially generated. Although this prediction model may be too optimistic in the short term and overly pessimistic in the long term, it i) captures a commonly observed phenomenon that short-term predictions are usually much more accurate than long-term ones, and ii) allows researchers to derive insights into the role of predictions and possibly extend them to more complicated settings [31, 30, 43, 44].
The online optimal control problem is described as follows: at each time step $t = 0, 1, \ldots$,
• The agent observes the state $x_t$ and receives the predictions $f_t, g_t, \ldots, f_{t+W-1}, g_{t+W-1}$.
• The agent decides and implements a control $u_t$ and suffers the cost $f_t(x_t) + g_t(u_t)$.
• The system evolves to the next state $x_{t+1} = Ax_t + Bu_t$.¹
An online control algorithm, denoted by $\mathcal{A}$, can be defined as a mapping from the prediction and history information to the control action, denoted by $u_t(\mathcal{A})$:
\[
u_t(\mathcal{A}) = \mathcal{A}\big(x_t(\mathcal{A}), \ldots, x_0(\mathcal{A}), \{f_s, g_s\}_{s=0}^{t+W-1}\big), \quad t \ge 0, \tag{2}
\]
where $x_t(\mathcal{A})$ is the state generated by implementing $\mathcal{A}$ and $x_0(\mathcal{A}) = x_0$ is given.
This paper evaluates the performance of online control algorithms by comparing against the optimal control cost $J^*$ in hindsight:
\[
J^* := \min_{(\mathbf{x},\mathbf{u}):\ x_{t+1} = Ax_t + Bu_t} J(\mathbf{x},\mathbf{u}). \tag{3}
\]
The performance metric considered in this paper for an online algorithm $\mathcal{A}$ is²
\[
\text{Regret}(\mathcal{A}) := J(\mathcal{A}) - J^* = J(\mathbf{x}(\mathcal{A}),\mathbf{u}(\mathcal{A})) - J^*, \tag{4}
\]
which is sometimes called the dynamic regret [29, 45] or the competitive difference [46]. Another popular regret notion is the static regret, which compares the online performance with that of the optimal static controller/policy [42, 41]. The benchmark in static regret is weaker than that in dynamic regret because the optimal controller may be far from static, and it has been shown in the literature that $o(N)$ static regret can be achieved even without predictions (i.e., $W = 0$). Thus, we focus on dynamic regret analysis and study how predictions can improve the dynamic regret.
Example 1 (Linear quadratic (LQ) tracking). Consider a discrete-time tracking problem for a system $x_{t+1} = Ax_t + Bu_t$. The goal is to minimize the quadratic loss for tracking a trajectory $\{\theta_t\}_{t=0}^N$:
\[
J(\mathbf{x},\mathbf{u}) = \frac{1}{2}\sum_{t=0}^{N-1}\big[(x_t - \theta_t)^\top Q_t (x_t - \theta_t) + u_t^\top R_t u_t\big] + \frac{1}{2}(x_N - \theta_N)^\top Q_N (x_N - \theta_N).
\]
In practice, it is usually difficult to know the complete trajectory $\{\theta_t\}_{t=0}^N$ a priori; what is revealed is usually only the next few steps, making this an online control problem with predictions.
Assumptions and some useful concepts. First, we introduce a standard assumption in control theory: controllability of the system, which roughly means that the system can be steered to any state by appropriate control inputs [47].
Assumption 1. The LTI system $x_{t+1} = Ax_t + Bu_t$ is controllable.
It is well known that any controllable LTI system can be linearly transformed into a canonical form [40], and the linear transformation can be computed efficiently a priori using $A$ and $B$; it can further be used to reformulate the cost functions $f_t, g_t$. Thus, without loss of generality, this paper only considers LTI systems in the canonical form, defined as follows.
Definition 1 (Canonical form). A system $x_{t+1} = Ax_t + Bu_t$ is said to be in the canonical form if $A$ is partitioned into $m \times m$ blocks $A_{ij} \in \mathbb{R}^{p_i \times p_j}$, where each diagonal block $A_{ii}$ has ones on its superdiagonal and $*$ entries in its last row, and each off-diagonal block $A_{ij}$ ($i \ne j$) is zero except for (possibly nonzero) $*$ entries in its last row, i.e.,
\[
A_{ii} = \begin{pmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & 0 & 1 \\ * & * & \cdots & * \end{pmatrix}, \qquad A_{ij} = \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \\ * & \cdots & * \end{pmatrix} \ (i \ne j),
\]
and $B$ is zero except that its $k_i$th row is $e_i^\top$ (the $i$th standard basis row vector), where $k_i = p_1 + \cdots + p_i$. Here each $*$ represents a (possibly) nonzero entry; the rows of $B$ containing a 1 are the same rows of $A$ containing $*$ entries, and the indices of these rows are denoted by $\{k_1, \ldots, k_m\} =: I$. Moreover, let $p_i = k_i - k_{i-1}$ for $1 \le i \le m$, where $k_0 = 0$. The controllability index of a canonical-form $(A, B)$ is defined as $p = \max\{p_1, \ldots, p_m\}$.
¹ Unlike many learning-based control papers, we assume $A, B$ are known to the agent. We also assume the full state $x_t$ is observable. Relaxing these information requirements is left as future work.
² The optimality gap depends on the initial state $x_0$, but we omit $x_0$ for simplicity of notation.
Next, we introduce assumptions on the cost functions and their minimizers.
Assumption 2. $f_t$ is $\mu_f$-strongly convex and $l_f$-smooth for $0 \le t \le N$, and $g_t$ is $\mu_g$-strongly convex and $l_g$-smooth for $0 \le t \le N-1$, for some $\mu_f, \mu_g, l_f, l_g > 0$.
Assumption 3. The minimizers of $f_t, g_t$, denoted by $\theta_t = \arg\min_x f_t(x)$ and $\xi_t = \arg\min_u g_t(u)$, are uniformly bounded, i.e., there exist $\bar\theta, \bar\xi$ such that $\|\theta_t\| \le \bar\theta$ and $\|\xi_t\| \le \bar\xi$ for all $t$.
These assumptions are commonly adopted in convex analysis; the uniform bounds rule out extreme cases. Notice that the LQ tracking problem in Example 1 satisfies Assumptions 2 and 3 if $Q_t, R_t$ are positive definite with uniformly bounded eigenvalues and the $\theta_t$ are uniformly bounded for all $t$.
3 Online control algorithms: Receding horizon gradient-based control
This section introduces our online control algorithms, receding horizon gradient-based control (RHGC). The design first converts the online control problem into an equivalent online optimization problem with finitely temporally-coupled costs, and then applies gradient-based online optimization algorithms that exploit this finite temporal-coupling property.
3.1 Problem transformation
First, notice that the offline optimal control problem (1) can be viewed as an optimization with equality constraints over $\mathbf{x}$ and $\mathbf{u}$. The individual stage cost $f_t(x_t) + g_t(u_t)$ only depends on the current $x_t$ and $u_t$, but the equality constraints couple $x_t, u_t$ with $x_{t+1}$ for each $t$. In the following, we rewrite (1) as an equivalent unconstrained optimization problem over some entries of the $x_t$, where the new stage cost at each time $t$ depends on these entries across a few nearby time steps. We will harness this structure to design our online algorithm.
In particular, the entries of $x_t$ adopted in the reformulation are $x_t^{k_1}, \ldots, x_t^{k_m}$, where $I = \{k_1, \ldots, k_m\}$ is defined in Definition 1. For ease of notation, we define
\[
z_t := (x_t^{k_1}, \ldots, x_t^{k_m})^\top, \quad t \ge 0, \tag{5}
\]
and $z_t^j = x_t^{k_j}$ for $j = 1, \ldots, m$. Let $\mathbf{z} := (z_1^\top, \ldots, z_N^\top)^\top$. By the canonical-form equality constraint $x_t = Ax_{t-1} + Bu_{t-1}$, we have $x_t^i = x_{t-1}^{i+1}$ for $i \notin I$, so $x_t$ can be represented by $z_{t-p+1}, \ldots, z_t$ as
\[
x_t = (\underbrace{z^1_{t-p_1+1}, \ldots, z^1_t}_{p_1}, \underbrace{z^2_{t-p_2+1}, \ldots, z^2_t}_{p_2}, \ldots, \underbrace{z^m_{t-p_m+1}, \ldots, z^m_t}_{p_m})^\top, \quad t \ge 0, \tag{6}
\]
where $z_t$ for $t \le 0$ is determined by $x_0$ so that (6) holds at $t = 0$. For ease of exposition and without loss of generality, we consider $x_0 = 0$ in this paper; then $z_t = 0$ for $t \le 0$. Similarly, $u_t$ is determined by $z_{t-p+1}, \ldots, z_t, z_{t+1}$ via
\[
u_t = z_{t+1} - A(I,:)x_t = z_{t+1} - A(I,:)(z^1_{t-p_1+1}, \ldots, z^1_t, \ldots, z^m_{t-p_m+1}, \ldots, z^m_t)^\top, \quad t \ge 0, \tag{7}
\]
where $A(I,:)$ consists of rows $k_1, \ldots, k_m$ of $A$.
Notice that equations (5)-(7) describe a one-to-one transformation between $(\mathbf{x},\mathbf{u})$ and $\mathbf{z}$. Therefore, we can transform the constrained optimization problem (1) over $(\mathbf{x},\mathbf{u})$ into an optimization problem over $\mathbf{z}$. Furthermore, because the LTI constraint $x_{t+1} = Ax_t + Bu_t$ is naturally embedded in the relations (6) and (7), the resulting optimization problem over $\mathbf{z}$ becomes unconstrained. Specifically, the new cost functions are obtained by substituting (6) and (7) into $f_t(x_t)$ and $g_t(u_t)$. We denote the corresponding cost functions by $\tilde f_t(z_{t-p+1}, \ldots, z_t) := f_t(x_t)$ and $\tilde g_t(z_{t-p+1}, \ldots, z_t, z_{t+1}) := g_t(u_t)$. Then the objective of the unconstrained optimization problem can be written as
\[
C(\mathbf{z}) := \sum_{t=0}^{N}\tilde f_t(z_{t-p+1}, \ldots, z_t) + \sum_{t=0}^{N-1}\tilde g_t(z_{t-p+1}, \ldots, z_{t+1}). \tag{8}
\]
$C(\mathbf{z})$ has many nice properties, some of which are formally stated below.
Lemma 1. $C(\mathbf{z})$ has the following properties:
i) $C(\mathbf{z})$ is $\mu_c$-strongly convex with $\mu_c = \mu_f$ and $l_c$-smooth with $l_c = p\,l_f + (p+1)\,l_g\|(I_m, -A(I,:))\|^2$;
ii) for all $(\mathbf{x},\mathbf{u})$ such that $x_{t+1} = Ax_t + Bu_t$, we have $C(\mathbf{z}) = J(\mathbf{x},\mathbf{u})$, where $\mathbf{z}$ is defined in (5); conversely, for any $\mathbf{z}$, the corresponding $(\mathbf{x},\mathbf{u})$ defined in (6) and (7) satisfies $x_{t+1} = Ax_t + Bu_t$ and $J(\mathbf{x},\mathbf{u}) = C(\mathbf{z})$;
iii) each stage cost $\tilde f_t + \tilde g_t$ in (8) only depends on $z_{t-p+1}, \ldots, z_{t+1}$.
Property ii) implies that any online algorithm for deciding $\mathbf{z}$ can be translated into an online algorithm for $\mathbf{x}$ and $\mathbf{u}$ via (6) and (7) with the same costs. Property iii) highlights one nice property of $C(\mathbf{z})$, local temporal coupling, which serves as the foundation of our online algorithm design.
Example 2. For illustration, consider the following dynamical system with $n = 2$, $m = 1$:
\[
\begin{pmatrix} x^1_{t+1} \\ x^2_{t+1} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ a_1 & a_2 \end{pmatrix}\begin{pmatrix} x^1_t \\ x^2_t \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u_t. \tag{9}
\]
Here $k_1 = 2$, $I = \{2\}$, $A(I,:) = (a_1, a_2)$, and $z_t = x^2_t$. Equation (9) gives $x^1_t = x^2_{t-1}$ and $x_t = (z_{t-1}, z_t)^\top$. Similarly, $u_t = x^2_{t+1} - A(I,:)x_t = z_{t+1} - A(I,:)(z_{t-1}, z_t)^\top$. Hence, $\tilde f_t(z_{t-1}, z_t) = f_t(x_t) = f_t((z_{t-1}, z_t)^\top)$ and $\tilde g_t(z_{t-1}, z_t, z_{t+1}) = g_t(u_t) = g_t(z_{t+1} - A(I,:)(z_{t-1}, z_t)^\top)$.
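The following minimal numerical sketch, assuming the hypothetical parameters $a_1 = 0.3$, $a_2 = -0.5$ and illustrative quadratic costs $f_t(x) = \|x\|^2$, $g_t(u) = u^2$, checks that an arbitrary $\mathbf{z}$ recovers an $(\mathbf{x}, \mathbf{u})$ pair satisfying the dynamics and that the costs coincide, as Lemma 1 ii) asserts.

```python
import numpy as np

# Minimal numerical check of the reformulation in Example 2 (n = 2, m = 1, p = 2).
# The parameters a1, a2 and the quadratic costs f_t(x) = ||x||^2, g_t(u) = u^2 are
# illustrative assumptions used only to exercise Lemma 1 ii).
a1, a2 = 0.3, -0.5
A = np.array([[0.0, 1.0], [a1, a2]])
B = np.array([[0.0], [1.0]])
A_I = A[1, :]                                # A(I,:) with I = {2} (zero-based row 1)

N = 6
rng = np.random.default_rng(0)
# z[i] stores z_{i-1}, so z[0] = z_{-1} = 0 and z[1] = z_0 = 0 (since x_0 = 0)
z = np.concatenate(([0.0, 0.0], rng.standard_normal(N)))

# Recover (x, u) from z via (6) and (7): x_t = (z_{t-1}, z_t)^T, u_t = z_{t+1} - A(I,:)x_t
x = [np.array([z[t], z[t + 1]]) for t in range(N + 1)]
u = [z[t + 2] - A_I @ x[t] for t in range(N)]

# The recovered pair satisfies the dynamics x_{t+1} = A x_t + B u_t ...
for t in range(N):
    assert np.allclose(x[t + 1], A @ x[t] + B[:, 0] * u[t])

# ... and the control cost J(x, u) equals C(z) by construction
J = sum(x[t] @ x[t] + u[t] ** 2 for t in range(N)) + x[N] @ x[N]
print("J(x, u) = C(z) =", J)
```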
3.2 Online algorithm design: RHGC
This section introduces our RHGC algorithms, which build on the reformulation (8) and are inspired by the online algorithm RHGD in [36]. As mentioned earlier, any online algorithm for $z_t$ can be translated into an online algorithm for $x_t, u_t$, so we focus on designing an online algorithm for $z_t$. By the finite temporal coupling of $C(\mathbf{z})$, the partial gradient of the total cost $C(\mathbf{z})$ only depends on the finitely many local stage costs $\{\tilde f_\tau, \tilde g_\tau\}_{\tau=t}^{t+p-1}$ and the local stage variables $(z_{t-p}, \ldots, z_{t+p}) =: z_{t-p:t+p}$:
\[
\frac{\partial C}{\partial z_t}(\mathbf{z}) = \sum_{s=t}^{t+p-1}\frac{\partial \tilde f_s}{\partial z_t}(z_{s-p+1}, \ldots, z_s) + \sum_{s=t-1}^{t+p-1}\frac{\partial \tilde g_s}{\partial z_t}(z_{s-p+1}, \ldots, z_{s+1}).
\]
Without causing confusion, we write $\frac{\partial C}{\partial z_t}(z_{t-p:t+p})$ for $\frac{\partial C}{\partial z_t}(\mathbf{z})$ to highlight this local dependence. Therefore, even though not all future costs are available, it is still possible to compute the partial gradient of the total cost using only a finite look-ahead window of cost functions. This observation motivates the design of our receding horizon gradient-based control (RHGC) methods, which are online implementations of gradient methods such as vanilla gradient descent, Nesterov's accelerated gradient, and triple momentum [38, 39]. Due to the space limit, we only formally present the Receding Horizon Triple Momentum (RHTM) method in this paper, cf. Algorithm 1; other RHGC methods can be designed in the same way.
Algorithm 1: Receding Horizon Triple Momentum (RHTM)
1: Inputs: canonical form $(A, B)$, $W \ge 1$, $K = \lfloor\frac{W-1}{p}\rfloor$, step sizes $\gamma_c, \gamma_z, \gamma_w, \gamma_y > 0$, oracle $\phi$.
2: for $t = 1-W, \ldots, N-1$ do
3:   Step 1: initialize $z_{t+W}(0)$ by oracle $\phi$; set $\omega_{t+W}(-1) = \omega_{t+W}(0) = y_{t+W}(0) = z_{t+W}(0)$.
4:   for $j = 1, \ldots, K$ do
5:     Step 2: update $\omega_{t+W-jp}(j), y_{t+W-jp}(j), z_{t+W-jp}(j)$ by triple momentum:
\begin{align*}
\omega_{t+W-jp}(j) &= (1+\gamma_w)\,\omega_{t+W-jp}(j-1) - \gamma_w\,\omega_{t+W-jp}(j-2) - \gamma_c\,\frac{\partial C}{\partial y_{t+W-jp}}\big(y_{t+W-(j+1)p\,:\,t+W-(j-1)p}(j-1)\big)\\
y_{t+W-jp}(j) &= (1+\gamma_y)\,\omega_{t+W-jp}(j) - \gamma_y\,\omega_{t+W-jp}(j-1)\\
z_{t+W-jp}(j) &= (1+\gamma_z)\,\omega_{t+W-jp}(j) - \gamma_z\,\omega_{t+W-jp}(j-1)
\end{align*}
6:   end for
7:   Step 3: compute $u_t$ from $z_{t+1}(K)$ and the observed state $x_t$: $u_t = z_{t+1}(K) - A(I,:)x_t$.
8: end for
In RHTM, $j$ refers to the iteration index of the corresponding gradient update of $C(\mathbf{z})$. There are two major steps to decide $z_t$: i) initializing the decision variables $\mathbf{z}(0), \boldsymbol{\omega}(0), \mathbf{y}(0)$, where $\boldsymbol{\omega}(0), \mathbf{y}(0)$ are auxiliary variables used in the triple momentum method to accelerate convergence. We do not restrict the initialization algorithm $\phi$; it can be any oracle/online algorithm that does not use predictions: $z_{t+W}(0) = \phi(\{\tilde f_s, \tilde g_s\}_{s=0}^{t+W-1})$. In Section 4, we provide one such initialization $\phi$. ii) Using the look-ahead window of predicted costs to conduct gradient updates. We note that the gradient updates from $(z_\tau(j), \omega_\tau(j), y_\tau(j))$ to $(z_\tau(j+1), \omega_\tau(j+1), y_\tau(j+1))$ are implemented in a backward order, i.e., from $\tau = t+W$ down to $\tau = t$. Moreover, since the partial gradient $\frac{\partial C}{\partial z_t}$ needs the local variables $z_{t-p:t+p}$, given $W$-step predictions, RHTM can only conduct $K = \lfloor\frac{W-1}{p}\rfloor$ iterations of TM on the total cost $C(\mathbf{z})$. For a more intuitive introduction to the RHGC methods, we refer readers to [36] for the simple case $p = 1$.
Though it may appear that RHTM does not fully exploit the predictions, since only a few gradient updates are used, in Section 5 we show that RHTM achieves nearly optimal performance with respect to $W$, which means that our algorithm successfully extracts and utilizes the prediction information.
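To make the recursion concrete, here is a runnable sketch of Algorithm 1 in the simplest case $p = 1$: a scalar system $x_{t+1} = ax_t + u_t$, so $z_t = x_t$ and $u_t = z_{t+1} - az_t$. The quadratic costs, the crude curvature estimates, and the oracle $\phi$ that simply outputs the stage minimizer $\theta_{t+W}$ are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

# Runnable sketch of Algorithm 1 for the simplest case p = 1: a scalar system
# x_{t+1} = a x_t + u_t, so z_t = x_t and u_t = z_{t+1} - a z_t. The quadratic costs
# f_t(x) = q/2 (x - theta_t)^2, g_t(u) = r/2 u^2, the crude curvature estimates, and
# the oracle phi (output the stage minimizer theta_{t+W}) are illustrative assumptions.
a, q, r = 0.5, 1.0, 0.1
N, W = 60, 8
K = W - 1                                   # K = floor((W-1)/p) with p = 1
rng = np.random.default_rng(1)
theta = np.cumsum(0.3 * rng.standard_normal(N + 1))     # theta_0, ..., theta_N

def dC(t, ym, y0, yp):
    """Partial gradient dC/dz_t: depends only on z_{t-1}, z_t, z_{t+1} (Section 3.2)."""
    grad = q * (y0 - theta[t]) + r * (y0 - a * ym)      # f_t term and g_{t-1} term
    if t <= N - 1:                                      # g_t term is absent at t = N
        grad -= a * r * (yp - a * y0)
    return grad

lc, mc = q + r * (1 + abs(a)) ** 2, q       # crude smoothness / strong-convexity bounds
ph = 1 - 1 / np.sqrt(lc / mc)               # triple momentum step sizes from Theorem 1
gc, gw = (1 + ph) / lc, ph**2 / (2 - ph)
gy, gz = ph**2 / ((1 + ph) * (2 - ph)), ph**2 / (1 - ph**2)

om = np.zeros((K + 2, N + 2))               # om[j+1, tau] stores omega_tau(j), j >= -1
y = np.zeros((K + 1, N + 2))                # y[j, tau] and z[j, tau]; z_0 = x_0 = 0
z = np.zeros((K + 1, N + 2))

for t in range(1 - W, N):                   # receding-horizon loop of Algorithm 1
    tau0 = t + W
    if 1 <= tau0 <= N:                      # Step 1: oracle initialization of z_{t+W}(0)
        z[0, tau0] = y[0, tau0] = om[0, tau0] = om[1, tau0] = theta[tau0]
    for j in range(1, K + 1):               # Step 2: backward triple momentum updates
        tau = t + W - j
        if not 1 <= tau <= N:
            continue
        grad = dC(tau, y[j - 1, tau - 1], y[j - 1, tau], y[j - 1, tau + 1])
        om[j + 1, tau] = (1 + gw) * om[j, tau] - gw * om[j - 1, tau] - gc * grad
        y[j, tau] = (1 + gy) * om[j + 1, tau] - gy * om[j, tau]
        z[j, tau] = (1 + gz) * om[j + 1, tau] - gz * om[j, tau]

x, J = 0.0, 0.0                             # Step 3: roll out u_t = z_{t+1}(K) - a x_t
for t in range(N):
    u = z[K, t + 1] - a * x
    J += 0.5 * q * (x - theta[t]) ** 2 + 0.5 * r * u**2
    x = a * x + u
J += 0.5 * q * (x - theta[N]) ** 2
print("online cost J =", J)
```

Note how the backward inner loop guarantees that $y_{\tau-1}(j-1), y_\tau(j-1), y_{\tau+1}(j-1)$ are all available when coordinate $\tau$ is updated: $y_{\tau+1}(j-1)$ was updated earlier in the same round, while the other two were updated in previous rounds.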
Finally, we briefly introduce MPC [48] and suboptimal MPC [23] and compare them with our algorithm. MPC solves a $W$-stage optimization at each time $t$ and implements the first control input. Suboptimal MPC, a variant of MPC aiming at reducing computation, runs an optimization method for only a few iterations without solving the optimization completely. Our algorithm's computation requirement is similar to suboptimal MPC with a few gradient iterations. Nevertheless, the major difference is that suboptimal MPC conducts gradient updates on a truncated $W$-stage optimal control problem, while our algorithm conducts gradient updates on the total cost using only $W$-step predictions; it thus solves the complete $N$-stage optimal control problem, in an online fashion, based on the reformulation (8).
4 Regret upper bound
Because RHTM exactly implements $K$ iterations of triple momentum on $C(\mathbf{z})$, it is straightforward to obtain the following regret guarantee, which connects the regret of RHTM with that of the initialization oracle $\phi$.
Theorem 1. Consider $W \ge 1$ and let $\zeta = l_c/\mu_c$ denote the condition number of $C(\mathbf{z})$. For any initialization oracle $\phi$, given step sizes $\gamma_c = \frac{1+\varphi}{l_c}$, $\gamma_w = \frac{\varphi^2}{2-\varphi}$, $\gamma_y = \frac{\varphi^2}{(1+\varphi)(2-\varphi)}$, $\gamma_z = \frac{\varphi^2}{1-\varphi^2}$, with $\varphi = 1 - 1/\sqrt{\zeta}$, we have
\[
\text{Regret}(RHTM) \le \zeta^2\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}}\Big)^{2K}\,\text{Regret}(\phi),
\]
where $K = \lfloor\frac{W-1}{p}\rfloor$ and $\text{Regret}(\phi)$ is the regret of the initial controller $u_t(0) = z_{t+1}(0) - A(I,:)x_t(0)$.
Theorem 1 shows that, for any online algorithm $\phi$ without predictions, RHTM can use predictions to lower the regret by a factor of $\zeta^2(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}})^{2K}$ through only $K = \lfloor\frac{W-1}{p}\rfloor$ additional gradient updates. Moreover, the factor decays exponentially with $K = \lfloor\frac{W-1}{p}\rfloor$, which is almost a linearly increasing function of $W$. This indicates that RHTM improves the performance exponentially fast as the prediction window $W$ increases, for any initialization method. In addition, $K = \lfloor\frac{W-1}{p}\rfloor$ decreases with $p$, indicating that the regret bound increases with the controllability index $p$. This is intuitive because $p$ roughly indicates how quickly the controller can effectively influence the system state: the larger $p$ is, the longer it takes (cf. Definition 1). To see this, consider Example 2: since $u_{t-1}$ does not directly affect $x^1_t$, it takes at least $p = 2$ steps to change $x^1_t$ to a desirable value.
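As a quick illustration of the factor in Theorem 1, the short computation below (for a hypothetical condition number $\zeta = 10$ and $p = 2$) shows the bound decaying exponentially in $W$ while staying piecewise constant between multiples of $p$, since $K = \lfloor(W-1)/p\rfloor$ only increments every $p$ steps.

```python
import numpy as np

# The regret-reduction factor of Theorem 1 versus the window size W, for a hypothetical
# condition number zeta = 10 and controllability index p = 2. The factor is piecewise
# constant in W because K = floor((W-1)/p) only increments every p steps.
zeta, p = 10.0, 2
rho = (np.sqrt(zeta) - 1) / np.sqrt(zeta)
for W in range(1, 11):
    K = (W - 1) // p
    print(f"W = {W:2d}, K = {K}, factor = {zeta**2 * rho**(2 * K):.4f}")
```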
One initialization method: Follow the Optimal Steady State (FOSS). To complete the regret analysis of RHTM, we provide a simple initialization method, FOSS. As mentioned before, any online control algorithm without predictions, e.g., [42, 41], can serve as the initialization oracle $\phi$; however, these papers mostly focus on static rather than dynamic regret analysis.
Definition 2 (Follow the Optimal Steady State (FOSS)). The optimal steady state for stage cost $f(x) + g(u)$ is $(x^e, u^e) := \arg\min_{x = Ax + Bu}\,(f(x) + g(u))$. The Follow the Optimal Steady State (FOSS) method computes the optimal steady state $(x^e_t, u^e_t)$ for the stage cost $f_t(x) + g_t(u)$ and outputs the $z_{t+1}$ that follows the entries of $x^e_t$ in $I$: $z_{t+1}(FOSS) = x^{e,I}_t$, where $I = \{k_1, \ldots, k_m\}$.
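For quadratic stage costs as in Example 1, the FOSS oracle reduces to an equality-constrained quadratic program; the sketch below (with hypothetical system and weight matrices) solves it through its KKT linear system.

```python
import numpy as np

# Sketch of the FOSS oracle for the quadratic stage costs of Example 1:
# f(x) = 1/2 (x - theta)'Q(x - theta), g(u) = 1/2 u'Ru. Since the steady-state
# constraint x = Ax + Bu is linear, the optimal steady state solves a KKT system.
# The system matrices and weights below are illustrative assumptions.
def optimal_steady_state(A, B, Q, R, theta):
    n, m = B.shape
    # KKT of: min 1/2 (x-theta)'Q(x-theta) + 1/2 u'Ru  s.t.  (A - I)x + Bu = 0
    KKT = np.block([
        [Q, np.zeros((n, m)), (A - np.eye(n)).T],
        [np.zeros((m, n)), R, B.T],
        [A - np.eye(n), B, np.zeros((n, n))],
    ])
    rhs = np.concatenate([Q @ theta, np.zeros(m + n)])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:n + m]             # (x^e, u^e)

A = np.array([[0.0, 1.0], [0.3, -0.5]])
B = np.array([[0.0], [1.0]])
xe, ue = optimal_steady_state(A, B, np.eye(2), 0.1 * np.eye(1), np.array([1.0, 1.0]))
print(xe, ue, "FOSS output z_{t+1} = x^{e,I}_t =", xe[1])   # I = {2}
```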
FOSS is motivated by the fact that the optimal steady-state cost is the optimal limiting average cost for LTI systems [49], so FOSS should give acceptable performance at least for slowly changing $f_t, g_t$. Nevertheless, we admit that FOSS is proposed mainly for analytical purposes, and other online algorithms may outperform it in various respects. Next, we provide a regret bound for FOSS, which relies on the solution to the Bellman equation.
Definition 3 (Solution to the Bellman equation [50]). Let $\lambda^e$ be the optimal steady-state cost, which is also the optimal limiting average cost (cf. [49]). The Bellman equation for the optimal limiting-average-cost control problem is $h^e(x) + \lambda^e = \min_u\big(f(x) + g(u) + h^e(Ax + Bu)\big)$. The solution of the Bellman equation, denoted by $h^e(x)$, is sometimes called a bias function [50]. To ensure uniqueness of the solution, some extra conditions, e.g., $h^e(0) = 0$, are usually imposed.
Theorem 2 (Regret Bound of FOSS). Let $(x^e_t, u^e_t)$ and $h^e_t(x)$ denote the optimal steady state and the bias function with respect to the cost $f_t(x) + g_t(u)$, respectively, for $0 \le t \le N-1$. Suppose $h^e_t(x)$ exists for $0 \le t \le N-1$; then the regret of FOSS can be bounded by
\[
\text{Regret}(FOSS) = O\Big(\sum_{t=0}^N\big(\|x^e_{t-1} - x^e_t\| + h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)\Big),
\]
where $\{x^*_t\}_{t=0}^N$ denotes the optimal states, $x^e_{-1} = x^*_0 = x_0$, $h^e_{-1}(x) = 0$, $h^e_N(x) = f_N(x)$, and $x^e_N = \theta_N$. Consequently, by Theorem 1, the regret bound of RHTM with initialization FOSS is
\[
\text{Regret}(RHTM) = O\Big(\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}}\Big)^{2K}\sum_{t=0}^N\big(\|x^e_{t-1} - x^e_t\| + h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)\Big).
\]
Theorem 2 bounds the regret by the variation of the optimal steady states $x^e_t$ and the bias functions $h^e_t$. If $f_t, g_t$ do not change, then $x^e_t, h^e_t$ do not change, resulting in zero regret, which matches intuition. Though Theorem 2 requires the existence of $h^e_t$, this existence is guaranteed for many control problems, e.g., LQ tracking and control problems with turnpike properties [51, 22].
5 Linear quadratic tracking: regret upper bounds and a fundamental limit
To give our regret analysis in Theorems 1 and 2 more intuitive meaning, we apply RHTM to the LQ tracking problem in Example 1. Results for time-varying $Q_t, R_t, \theta_t$ are provided in the appendix; here we focus on a special case that yields clean expressions for the regret bounds, both an upper bound for RHTM with initialization FOSS and a lower bound for any online algorithm. These clean expressions make it easy to see that the lower and upper bounds almost match, implying that our online algorithm RHTM uses the predictions in a nearly optimal way even though it only conducts a few gradient updates at each time step.
The special case of LQ tracking problems takes the following form:
\[
\frac{1}{2}\sum_{t=0}^{N-1}\big[(x_t - \theta_t)^\top Q (x_t - \theta_t) + u_t^\top R u_t\big] + \frac{1}{2}x_N^\top P^e x_N, \tag{10}
\]
where $Q > 0$, $R > 0$, and $P^e$ is the solution to the algebraic Riccati equation with respect to $Q, R$ [52]. Basically, in this special case, $Q_t = Q$, $R_t = R$ for $0 \le t \le N-1$, $Q_N = P^e$, $\theta_N = 0$, and only $\theta_t$, $t = 1, \ldots, N-1$, changes. The LQ tracking problem (10) amounts to following a time-varying trajectory $\{\theta_t\}$ with constant weights on the tracking and control costs.
Regret upper bound. Firstly, based on Theorems 1 and 2, we have the following bound.
Corollary 1. The regret of RHTM with FOSS as the initialization rule can be bounded by
\[
\text{Regret}(RHTM) = O\Big(\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}}\Big)^{2K}\sum_{t=0}^N\|\theta_t - \theta_{t-1}\|\Big),
\]
where $K = \lfloor(W-1)/p\rfloor$, $\zeta$ is the condition number of the corresponding $C(\mathbf{z})$, and $\theta_{-1} = 0$.
This corollary shows that the regret can be bounded by the total variation of $\theta_t$ for constant $Q, R$.
Fundamental limit. For any online algorithm, we have the following lower bound.
Theorem 3 (Lower Bound). Consider $1 \le W \le N/3$, any condition number $\zeta > 1$, any variation budget $2\bar\theta \le L_N \le (2N+1)\bar\theta$, and any controllability index $p \ge 1$. For any online algorithm $\mathcal{A}$, there exists an LQT problem of the form (10) such that the canonical-form system $(A, B)$ has controllability index $p$, the sequence $\{\theta_t\}$ satisfies the variation budget $\sum_{t=1}^N\|\theta_t - \theta_{t-1}\| \le L_N$, the corresponding $C(\mathbf{z})$ has condition number $\zeta$, and the following lower bound holds:
\[
J(\mathcal{A}) - J^* = \Omega\Big(\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}+1}\Big)^{2K}L_N\Big) = \Omega\Big(\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}+1}\Big)^{2K}\sum_{t=0}^N\|\theta_t - \theta_{t-1}\|\Big), \tag{11}
\]
where $K = \lfloor(W-1)/p\rfloor$ and $\theta_{-1} = 0$.
Surprisingly, the lower bound in Theorem 3 and the upper bound in Corollary 1 almost match each other, especially when $\zeta$ is large. This demonstrates that RHTM utilizes the prediction information in a near-optimal way. The major conditions in Theorem 3 require that the prediction window be short compared with the horizon, $W \le N/3$, and that the variation of the cost functions not be too small, $L_N \ge 2\bar\theta$; otherwise the online control problem is too easy and the regret can be very small.
6 Numerical experiments
Figure 1: Regret for LQ tracking, plotted on a log scale against the prediction window $W$, for RHGD, RHAG, RHTM, and subMPC with 1, 3, and 5 iterations.
Figure 2: Two-wheel robot tracking with nonlinear dynamics; the panels compare the reference path and the robot path in the X-Y plane for $W = 40$ and $W = 80$.
LQ tracking problem in Example 1. The experiment settings are provided in the appendix. The LTI system order is $n = 2$ and the control input is a scalar; thus $p = 2$ for this system. We compare our algorithm with one suboptimal MPC algorithm, fast gradient MPC (subMPC) [23]. Roughly speaking, this algorithm forms the $W$-stage truncated optimal control problem from $t$ to $t+W-1$ and solves it by Nesterov's gradient descent. One gradient update in this subMPC requires $W$ partial-gradient computations, since there are $W$ stages of variables; in this sense, our RHTM corresponds to subMPC with one Nesterov iteration. Figure 1 also plots subMPC with 3 and 5 Nesterov iterations. Figure 1 shows that all our algorithms, RHGD, RHAG, and RHTM, achieve exponentially decaying regret with respect to $W$, and the decay is piecewise constant, matching Theorem 1. It is observed that RHTM and RHAG perform better than RHGD, which is intuitive because TM and AG are accelerated versions of GD. Moreover, our algorithms are much better than suboptimal MPC with one iteration. It is also observed that suboptimal MPC achieves better performance as the iteration number increases, but the improvement saturates as $W$ gets large, in contrast to our RHTM.
Path tracking for a two-wheel mobile robot. Though we presented our online algorithms for LTI systems, our RHGC methods are applicable to nonlinear systems. Here we consider a two-wheel mobile robot with the nonlinear kinematic dynamics $\dot x = v\cos\delta$, $\dot y = v\sin\delta$, $\dot\delta = w$, where $(x, y)$ is the robot location, $v$ and $w$ are the tangential and angular velocities respectively, and $\delta$ denotes the angle between $v$ and the X-axis [53]. The control acts directly on $v$ and $w$, e.g., through pulse-width modulation (PWM) of the motor [54]. Given a reference path $(x_r(t), y_r(t))$, the objective is to balance the tracking performance and the control cost, i.e., $\min \sum_{t=0}^N c^e_t\big[(x_t - x_r(t))^2 + (y_t - y_r(t))^2\big] + c^v_t v_t^2 + c^w_t w_t^2$. We discretize the dynamics with time interval $\Delta t = 0.025$ s, then follow ideas similar to those in this paper to reformulate the optimal path-tracking problem as an unconstrained optimization with respect to $(x_t, y_t)$ and apply RHGC methods. See the appendix for details. Figure 2 plots the tracking results with windows $W = 40$ and $W = 80$, corresponding to look-ahead times of 1 s and 2 s. A video showing the dynamic processes with different $W$ is provided at https://youtu.be/fal56LTBD1s. It is observed that the robot follows the reference trajectory well, especially when the path is smooth, but shows some deviations at sharp turns, and a longer look-ahead window leads to better tracking performance. These results confirm that our RHGC works effectively on nonlinear systems.
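As a sketch of this reformulation, the snippet below assumes an Euler discretization of the kinematics, $x_{t+1} = x_t + v_t\cos(\delta_t)\Delta t$, $y_{t+1} = y_t + v_t\sin(\delta_t)\Delta t$, $\delta_{t+1} = \delta_t + w_t\Delta t$ (an assumption, since the text does not specify the discretization scheme), and inverts it so that a planar path $\{(x_t, y_t)\}$ determines the controls $(v_t, w_t)$, playing the role of the variables $z_t$ in the linear case.

```python
import numpy as np

# Sketch of the path-based reformulation for the robot, assuming Euler-discretized
# kinematics x_{t+1} = x_t + v_t cos(d_t) dt, y_{t+1} = y_t + v_t sin(d_t) dt,
# d_{t+1} = d_t + w_t dt. A planar path {(x_t, y_t)} then determines the controls
# (v_t, w_t), playing the role of the variables z_t in the linear case.
dt = 0.025

def controls_from_path(xs, ys):
    """Recover (v_t, d_t, w_t) from a path; xs, ys have length N + 1."""
    dx, dy = np.diff(xs), np.diff(ys)
    v = np.hypot(dx, dy) / dt                # tangential speed, length N
    d = np.unwrap(np.arctan2(dy, dx))        # heading angle delta_t
    w = np.diff(d) / dt                      # angular velocity, length N - 1
    return v, d, w

# Illustrative usage: a circular reference path has constant speed and turn rate
t = np.arange(200) * dt
xs, ys = 10 * np.cos(0.5 * t), 10 * np.sin(0.5 * t)
v, d, w = controls_from_path(xs, ys)
print(v[0], w[0])    # approximately 5 and 0.5, respectively
```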
7 Conclusion
This paper studies the role of predictions in the dynamic regret of online control problems with linear dynamics. We design the RHTM algorithm and provide a regret upper bound. We also provide a fundamental limit and show that it almost matches RHTM's upper bound. Future work includes the study of 1) nonlinear systems, 2) systems with disturbances and noise, 3) systems with state and control constraints, and 4) unknown system dynamics.
References
[1] Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems, pages 3814–3823, 2018.
[2] Wei Xu, Xiaoyun Zhu, Sharad Singhal, and Zhikui Wang. Predictive control for dynamic resource allocation in enterprise data centers. In 2006 IEEE/IFIP Network Operations and Management Symposium NOMS 2006, pages 115–126. IEEE, 2006.
[3] Tomas Baca, Daniel Hert, Giuseppe Loianno, Martin Saska, and Vijay Kumar. Model predictive trajectory tracking and collision avoidance for reliable outdoor deployment of unmanned aerial vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6753–6760. IEEE, 2018.
[4] Jackeline Rios-Torres and Andreas A Malikopoulos. A survey on the coordination of connected and automated vehicles at intersections and merging at highway on-ramps. IEEE Transactions on Intelligent Transportation Systems, 18(5):1066–1077, 2016.
[5] Kyoung-Dae Kim and Panganamala Ramana Kumar. An MPC-based approach to provable system-wide safety and liveness of autonomous ground traffic. IEEE Transactions on Automatic Control, 59(12):3341–3356, 2014.
[6] Samir Kouro, Patricio Cortés, René Vargas, Ulrich Ammann, and José Rodríguez. Model predictive control—a simple and powerful method to control power converters. IEEE Transactions on Industrial Electronics, 56(6):1826–1838, 2008.
[7] Edgar Perea-Lopez, B Erik Ydstie, and Ignacio E Grossmann. A model predictive control strategy for supply chain optimization. Computers & Chemical Engineering, 27(8-9):1201–1218, 2003.
[8] Wenlin Wang, Daniel E Rivera, and Karl G Kempf. Model predictive control strategies for supply chain management in semiconductor manufacturing. International Journal of Production Economics, 107(1):56–77, 2007.
[9] Moritz Diehl, Rishi Amrit, and James B Rawlings. A Lyapunov function for economic optimizing model predictive control. IEEE Transactions on Automatic Control, 56(3):703–707, 2010.
[10] Matthias A Müller and Frank Allgöwer. Economic and distributed model predictive control: Recent developments in optimization-based control. SICE Journal of Control, Measurement, and System Integration, 10(2):39–52, 2017.
[11] Matthew Ellis, Helen Durand, and Panagiotis D Christofides. A tutorial review of economic model predictive control methods. Journal of Process Control, 24(8):1156–1178, 2014.
[12] Antonio Ferramosca, James B Rawlings, Daniel Limón, and Eduardo F Camacho. Economic MPC for a changing economic criterion. In 49th IEEE Conference on Decision and Control (CDC), pages 6131–6136. IEEE, 2010.
[13] Matthew Ellis and Panagiotis D Christofides. Economic model predictive control with time-varying objective function for nonlinear process systems. AIChE Journal, 60(2):507–519, 2014.
[14] David Angeli, Alessandro Casavola, and Francesco Tedesco. Theoretical advances on economic model predictive control with time-varying costs. Annual Reviews in Control, 41:218–224, 2016.
[15] Rishi Amrit, James B Rawlings, and David Angeli. Economic optimization using model predictive control with a terminal cost. Annual Reviews in Control, 35(2):178–186, 2011.
[16] Lars Grüne. Economic receding horizon control without terminal constraints. Automatica, 49(3):725–734, 2013.
[17] David Angeli, Rishi Amrit, and James B Rawlings. On average performance and stability of economic model predictive control. IEEE Transactions on Automatic Control, 57(7):1615–1626, 2012.
[18] Lars Grüne and Marleen Stieler. Asymptotic stability and transient optimality of economic MPC without terminal conditions. Journal of Process Control, 24(8):1187–1196, 2014.
[19] Lars Grüne and Anastasia Panin. On non-averaged performance of economic MPC with terminal conditions. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 4332–4337. IEEE, 2015.
[20] Antonio Ferramosca, Daniel Limon, and Eduardo F Camacho. Economic MPC for a changing economic criterion for linear systems. IEEE Transactions on Automatic Control, 59(10):2657–2667, 2014.
[21] Lars Grüne and Simon Pirkelmann. Closed-loop performance analysis for economic model predictive control of time-varying systems. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 5563–5569. IEEE, 2017.
[22] Lars Grüne and Simon Pirkelmann. Economic model predictive control for time-varying systems: Performance and stability results. Optimal Control Applications and Methods, 2018.
[23] Melanie Nicole Zeilinger, Colin Neil Jones, and Manfred Morari. Real-time suboptimal model predictive control using a combination of explicit MPC and online optimization. IEEE Transactions on Automatic Control, 56(7):1524–1534, 2011.
[24] Yang Wang and Stephen Boyd. Fast model predictive control using online optimization. IEEE Transactions on Control Systems Technology, 18(2):267–278, 2010.
[25] Knut Graichen and Andreas Kugi. Stability and incremental improvement of suboptimal MPC without terminal constraints. IEEE Transactions on Automatic Control, 55(11):2576–2580, 2010.
[26] Douglas A Allan, Cuyler N Bates, Michael J Risbeck, and James B Rawlings. On the inherent robustness of optimal and suboptimal nonlinear MPC. Systems & Control Letters, 106:68–78, 2017.
[27] E. Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization Series. Now Publishers, 2016.
[28] S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning. Now Publishers, 2012.
[29] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398–406, 2015.
[30] Minghong Lin, Adam Wierman, Lachlan LH Andrew, and Eno Thereska. Dynamic right-sizing for power-proportional data centers. IEEE/ACM Transactions on Networking (TON), 21(5):1378–1391, 2013.
[31] Minghong Lin, Zhenhua Liu, Adam Wierman, and Lachlan LH Andrew. Online algorithms for geographical load balancing. In Green Computing Conference (IGCC), 2012 International, pages 1–10. IEEE, 2012.
[32] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019, 2013.
[33] Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan LH Andrew. Online convex optimization using predictions. In ACM SIGMETRICS Performance Evaluation Review, volume 43, pages 191–204. ACM, 2015.
[34] Masoud Badiei, Na Li, and Adam Wierman. Online convex optimization with ramp constraints. In Decision and Control (CDC), 2015 IEEE 54th Annual Conference on, pages 6730–6736. IEEE, 2015.
[35] Niangjun Chen, Joshua Comden, Zhenhua Liu, Anshul Gandhi, and Adam Wierman. Using predictions in online optimization: Looking forward with an eye on the past. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, pages 193–206. ACM, 2016.
[36] Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. arXiv preprint arXiv:1801.07780, 2018.
[37] Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and LQR control. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2504–2513, 2019.
[38] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[39] Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2017.
[40] David Luenberger. Canonical forms for linear multivariable systems. IEEE Transactions on Automatic Control, 12(3):290–293, 1967.
[41] Yasin Abbasi-Yadkori, Peter Bartlett, and Varun Kanade. Tracking adversarial targets. In International Conference on Machine Learning, pages 369–377, 2014.
[42] Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1028–1037, 2018.
[43] Lian Lu, Jinlong Tu, Chi-Kin Chau, Minghua Chen, and Xiaojun Lin. Online energy generation scheduling for microgrids with intermittent energy sources and co-generation, volume 41. ACM, 2013.
[44] Allan Borodin, Nathan Linial, and Michael E Saks. An optimal on-line algorithm for metrical task system. Journal of the ACM (JACM), 39(4):745–763, 1992.
[45] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 7195–7201. IEEE, 2016.
[46] Lachlan Andrew, Siddharth Barman, Katrina Ligett, Minghong Lin, Adam Meyerson, Alan Roytman, and Adam Wierman. A tale of two metrics: Simultaneous bounds on competitiveness and regret. In Conference on Learning Theory, pages 741–763, 2013.
[47] Joao P Hespanha. Linear systems theory. Princeton University Press, 2018.
[48] JB Rawlings and DQ Mayne. Postface to model predictive control: Theory and design. Nob Hill Pub, pages 155–158, 2012.
[49] David Angeli, Rishi Amrit, and James B Rawlings. Receding horizon cost optimization for overly constrained nonlinear plants. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 7972–7977. IEEE, 2009.
[50] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[51] Tobias Damm, Lars Grüne, Marleen Stieler, and Karl Worthmann. An exponential turnpike theorem for dissipative discrete time optimal control problems. SIAM Journal on Control and Optimization, 52(3):1935–1957, 2014.
[52] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. 2011.
[53] Gregor Klancar, Drago Matko, and Saso Blazic. Mobile robot control on a reference path. In Proceedings of the 2005 IEEE International Symposium on Intelligent Control and Mediterranean Conference on Control and Automation, pages 1343–1348. IEEE, 2005.
[54] Pololu Corporation. Pololu m3pi User's Guide. Available at https://www.pololu.com/docs/pdf/0J48/m3pi.pdf.
[55] Paul Concus, Gene H Golub, and Gérard Meurant. Block preconditioning for the conjugate gradient method. SIAM Journal on Scientific and Statistical Computing, 6(1):220–252, 1985.
[56] Frank L Lewis, Draguna Vrabie, and Vassilis L Syrmos. Optimal control. John Wiley & Sons, 2012.
Appendices
In Appendix A, we discuss the canonical-form transformation. In Appendix B, we introduce Triple Momentum [39] and prove Theorem 1. In Appendix C, we prove Lemma 1. In Appendix D, we prove Theorem 2. In Appendix E, we provide the regret analysis for LQT. In Appendix F, we prove Theorem 3. In Appendix G, we provide technical proofs for LQT. In Appendix H, we provide a more detailed description of the simulations.
A Canonical form
In this section, we introduce the linear transformation from a general LTI system to a canonical-form LTI system, and then discuss how to convert a general online optimal control problem into one with a canonical-form system.
Firstly, consider a general LTI system $x_{t+1} = Ax_t + Bu_t$ and two invertible matrices $S_x \in \mathbb{R}^{n\times n}$, $S_u \in \mathbb{R}^{m\times m}$. Under the linear transformation of state and control $\hat x_t = S_x x_t$, $\hat u_t = S_u u_t$, the equivalent LTI system in the new state $\hat x_t$ and new control $\hat u_t$ is
\[
\hat x_{t+1} = S_x A S_x^{-1}\hat x_t + S_x B S_u^{-1}\hat u_t.
\]
By Theorem 1 in [40], for any controllable $(A, B)$ there exist $S_x, S_u$ such that $\hat A = S_x A S_x^{-1}$ and $\hat B = S_x B S_u^{-1}$ are in the canonical form of Definition 1. The computation of $S_x, S_u$ is also provided in [40].
In an online optimal control problem, since $A, B$ are known a priori, $S_x, S_u$ can be computed offline. When the stage cost functions $f_t(x_t), g_t(u_t)$ are received online, the new cost functions $\hat f_t(\hat x_t), \hat g_t(\hat u_t)$ for the canonical-form system can be computed online by applying $S_x, S_u$:
\[
\hat f_t(\hat x_t) = f_t(x_t) = f_t(S_x^{-1}\hat x_t), \qquad \hat g_t(\hat u_t) = g_t(u_t) = g_t(S_u^{-1}\hat u_t).
\]
Therefore, it is without loss of generality to only consider online optimal control with canonical-form systems.
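As a small companion to this appendix, the following sketch checks Assumption 1 numerically via the rank of the controllability matrix $[B, AB, \ldots, A^{n-1}B]$; the transformation $S_x, S_u$ of [40] exists exactly when this matrix has full row rank. The example pair is the hypothetical one used in the earlier sketches.

```python
import numpy as np

# Numerical check of Assumption 1: (A, B) is controllable iff the controllability
# matrix [B, AB, ..., A^{n-1}B] has full row rank; the transformation S_x, S_u of
# [40] exists exactly in that case.
def is_controllable(A, B, tol=1e-9):
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    ctrb = np.hstack(blocks)                 # n x (n m) controllability matrix
    return np.linalg.matrix_rank(ctrb, tol=tol) == n

A = np.array([[0.0, 1.0], [0.3, -0.5]])
B = np.array([[0.0], [1.0]])
print(is_controllable(A, B))                 # True
```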
B Triple Momentum and proof of Theorem 1
Triple Momentum (TM) is an accelerated variant of gradient descent proposed in [39]. For an unconstrained problem $\min_{\mathbf{z}} C(\mathbf{z})$, at each iteration $j \ge 0$, TM conducts
\begin{align*}
\boldsymbol{\omega}(j+1) &= (1+\gamma_w)\boldsymbol{\omega}(j) - \gamma_w\boldsymbol{\omega}(j-1) - \gamma_c\nabla C(\mathbf{y}(j))\\
\mathbf{y}(j+1) &= (1+\gamma_y)\boldsymbol{\omega}(j+1) - \gamma_y\boldsymbol{\omega}(j)\\
\mathbf{z}(j+1) &= (1+\gamma_z)\boldsymbol{\omega}(j+1) - \gamma_z\boldsymbol{\omega}(j)
\end{align*}
where $\boldsymbol{\omega}(j), \mathbf{y}(j)$ are auxiliary variables used to accelerate convergence, $\mathbf{z}(j)$ is the decision variable, and $\boldsymbol{\omega}(0) = \boldsymbol{\omega}(-1) = \mathbf{z}(0) = \mathbf{y}(0)$ are given initial values.
Suppose $\mathbf{z} = (z_1^\top, \ldots, z_N^\top)^\top$. Zooming in on each coordinate $z_t$, the TM update of $z_t(j)$ is
\begin{align*}
\omega_t(j+1) &= (1+\gamma_w)\omega_t(j) - \gamma_w\omega_t(j-1) - \gamma_c\frac{\partial C}{\partial y_t}(\mathbf{y}(j))\\
y_t(j+1) &= (1+\gamma_y)\omega_t(j+1) - \gamma_y\omega_t(j)\\
z_t(j+1) &= (1+\gamma_z)\omega_t(j+1) - \gamma_z\omega_t(j)
\end{align*}
By Section 3, $\frac{\partial C}{\partial y_t}(\mathbf{y}(j))$ only depends on the stage cost functions and stage variables across finitely many neighboring stages, allowing an online implementation based on the finite look-ahead window.
TM enjoys a faster convergence rate than gradient descent for $\mu_c$-strongly convex and $l_c$-smooth functions under proper step sizes. In particular, when $\gamma_c = \frac{1+\varphi}{l_c}$, $\gamma_w = \frac{\varphi^2}{2-\varphi}$, $\gamma_y = \frac{\varphi^2}{(1+\varphi)(2-\varphi)}$, $\gamma_z = \frac{\varphi^2}{1-\varphi^2}$, and $\varphi = 1 - 1/\sqrt{\zeta}$ with $\zeta = l_c/\mu_c$, by [39] the convergence rate satisfies
\[
C(\mathbf{z}(j)) - C(\mathbf{z}^*) \le \zeta^2\Big(\frac{\sqrt{\zeta}-1}{\sqrt{\zeta}}\Big)^{2j}\big(C(\mathbf{z}(0)) - C(\mathbf{z}^*)\big). \tag{12}
\]
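The following sketch verifies the rate (12) on a randomly generated strongly convex quadratic (an illustrative objective, not the control cost $C(\mathbf{z})$ itself).

```python
import numpy as np

# Verification of the triple momentum rate (12) on a random strongly convex quadratic
# C(z) = 1/2 z'Hz - b'z (an illustrative objective, not the control cost itself).
rng = np.random.default_rng(0)
n = 20
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
eig = np.linspace(1.0, 50.0, n)              # mu_c = 1, l_c = 50, zeta = 50
H = U @ np.diag(eig) @ U.T
b = rng.standard_normal(n)
z_star = np.linalg.solve(H, b)
C = lambda v: 0.5 * v @ H @ v - b @ v

mu, L = eig[0], eig[-1]
ph = 1 - 1 / np.sqrt(L / mu)                 # step sizes of Theorem 1 / eq. (12)
gc, gw = (1 + ph) / L, ph**2 / (2 - ph)
gy, gz = ph**2 / ((1 + ph) * (2 - ph)), ph**2 / (1 - ph**2)

w_prev = w = y = z = np.zeros(n)             # omega(-1) = omega(0) = y(0) = z(0)
for j in range(200):
    w_next = (1 + gw) * w - gw * w_prev - gc * (H @ y - b)   # gradient of C at y(j)
    y = (1 + gy) * w_next - gy * w
    z = (1 + gz) * w_next - gz * w
    w_prev, w = w, w_next

print(C(z) - C(z_star))    # tiny: decays like zeta^2 ((sqrt(zeta)-1)/sqrt(zeta))^(2j)
```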
In the following, we will apply the convergence rate to the proof of Theorem 1.
B.1 Proof of Theorem 1
By comparing TM with RHTM, it can be verified that the $z_{t+1}(K)$ computed by RHTM is the same as the $z_{t+1}(K)$ computed by Triple Momentum after $K$ iterations. Moreover, by the equivalence between the optimization $\min_{\mathbf{z}}C(\mathbf{z})$ and the optimal control cost $J(\mathbf{x},\mathbf{u})$ in Lemma 1, we have $J(RHTM) = C(\mathbf{z}(K))$, $J(\phi) = C(\mathbf{z}(0))$, and $J^* = C(\mathbf{z}^*)$, which concludes the proof.
C Proof of Lemma 1
Properties ii) and iii) can be verified directly from the definitions. Thus it suffices to prove i): the strong convexity and smoothness of $C(\mathbf{z})$.
Notice that $x_t, u_t$ are linear in $\mathbf{z}$ by (6) and (7). For ease of reference, we define matrices $M_{x_t}, M_{u_t}$ representing the relations between $x_t, u_t$ and $\mathbf{z}$, i.e., $x_t = M_{x_t}\mathbf{z}$ and $u_t = M_{u_t}\mathbf{z}$. Similarly, we write $\tilde f_t(z_{t-p+1}, \ldots, z_t)$ and $\tilde g_t(z_{t-p+1}, \ldots, z_{t+1})$ in terms of $\mathbf{z}$ for simplicity of notation:
\[
\tilde f_t(z_{t-p+1}, \ldots, z_t) = \tilde f_t(\mathbf{z}) = f_t(M_{x_t}\mathbf{z}), \qquad \tilde g_t(z_{t-p+1}, \ldots, z_{t+1}) = \tilde g_t(\mathbf{z}) = g_t(M_{u_t}\mathbf{z}).
\]
A direct consequence of these linear relations is that $\tilde f_t(\mathbf{z})$ and $\tilde g_t(\mathbf{z})$ are convex in $\mathbf{z}$, because $f_t(x_t), g_t(u_t)$ are convex and linear transformations preserve convexity.
In the following, we focus on the proofs of strong convexity and smoothness. For simplicity, we only consider cost functions $f_t, g_t$ with minimum value zero: $f_t(\theta_t) = 0$ and $g_t(\xi_t) = 0$ for all $t$. This is without loss of generality: by strong convexity and smoothness, $f_t, g_t$ attain their minimum values, and by subtracting the minimum values we can let $f_t, g_t$ have minimum value 0.
Strong convexity. Since $\tilde g_t$ is convex, it suffices to prove that $\sum_t\tilde f_t(\mathbf{z})$ is strongly convex; then the sum $C(\mathbf{z})$ is strongly convex, because the sum of convex functions and a strongly convex function is strongly convex.
In particular, by the strong convexity of $f_t(x_t)$, we have, for any $\mathbf{z}, \mathbf{z}' \in \mathbb{R}^{Nm}$ with $x_t = M_{x_t}\mathbf{z}$ and $x'_t = M_{x_t}\mathbf{z}'$,
\begin{align*}
&\tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle\nabla\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&= \tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle M_{x_t}^\top\nabla f_t(x_t), \mathbf{z}'-\mathbf{z}\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&= \tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle\nabla f_t(x_t), M_{x_t}(\mathbf{z}'-\mathbf{z})\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&= \tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle\nabla f_t(x_t), x'_t - x_t\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\\
&\ge f_t(x'_t) - f_t(x_t) - \langle\nabla f_t(x_t), x'_t - x_t\rangle - \frac{\mu_f}{2}\|x'_t - x_t\|^2 \ge 0,
\end{align*}
where the first equality is by the chain rule, the second equality is by the definition of the inner product, the third equality is by the definitions of $x_t, x'_t$, the first inequality is by $\tilde f_t(\mathbf{z}) = f_t(x_t)$ and the fact that $z_t = (x^{k_1}_t, \ldots, x^{k_m}_t)^\top$ is a subvector of $x_t$ (so $\|z'_t - z_t\| \le \|x'_t - x_t\|$), and the last inequality is because $f_t(x_t)$ is $\mu_f$-strongly convex.
Summing over $t$ on both sides of the inequality yields the strong convexity of $\sum_t\tilde f_t(\mathbf{z})$:
\[
\sum_{t=1}^N\Big[\tilde f_t(\mathbf{z}') - \tilde f_t(\mathbf{z}) - \langle\nabla\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle - \frac{\mu_f}{2}\|z'_t - z_t\|^2\Big] = \sum_{t=1}^N\tilde f_t(\mathbf{z}') - \sum_{t=1}^N\tilde f_t(\mathbf{z}) - \Big\langle\nabla\sum_{t=1}^N\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\Big\rangle - \frac{\mu_f}{2}\|\mathbf{z}'-\mathbf{z}\|^2 \ge 0.
\]
Consequently, $C(\mathbf{z})$ is $\mu_c$-strongly convex with parameter at least $\mu_f$, by the convexity of $\tilde g_t$.
Smoothness. We prove smoothness by considering $\tilde f_t(\mathbf{z})$ and $\tilde g_t(\mathbf{z})$ separately.
Firstly, consider $\tilde f_t(\mathbf{z})$. Similar to the proof of strong convexity, we use the smoothness of $f_t(x_t)$. For any $\mathbf{z}, \mathbf{z}'$ with $x_t = M_{x_t}\mathbf{z}$ and $x'_t = M_{x_t}\mathbf{z}'$, we can show that
\begin{align*}
\tilde f_t(\mathbf{z}') = f_t(x'_t) &\le f_t(x_t) + \langle\nabla f_t(x_t), x'_t - x_t\rangle + \frac{l_f}{2}\|x'_t - x_t\|^2\\
&\le \tilde f_t(\mathbf{z}) + \langle\nabla\tilde f_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle + \frac{l_f}{2}\big(\|z'_{t-p+1} - z_{t-p+1}\|^2 + \cdots + \|z'_t - z_t\|^2\big),
\end{align*}
where the last inequality is by $x_t = M_{x_t}\mathbf{z}$, the chain rule, and (6).
Secondly, we consider $\tilde g_t(\mathbf{z})$ in a similar way. For any $\mathbf{z}, \mathbf{z}'$ with $u_t = M_{u_t}\mathbf{z}$ and $u'_t = M_{u_t}\mathbf{z}'$, we have
\begin{align*}
\tilde g_t(\mathbf{z}') = g_t(u'_t) &\le g_t(u_t) + \langle\nabla g_t(u_t), u'_t - u_t\rangle + \frac{l_g}{2}\|u'_t - u_t\|^2 &&\text{(by $g_t$ being $l_g$-smooth)}\\
&= \tilde g_t(\mathbf{z}) + \langle M_{u_t}^\top\nabla g_t(u_t), \mathbf{z}'-\mathbf{z}\rangle + \frac{l_g}{2}\|u'_t - u_t\|^2 &&\text{(by $\tilde g_t$'s definition)}\\
&= \tilde g_t(\mathbf{z}) + \langle\nabla\tilde g_t(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle + \frac{l_g}{2}\|u'_t - u_t\|^2 &&\text{(by $\tilde g_t$'s derivative)}
\end{align*}
Since $u_t = z_{t+1} - A(I,:)x_t = (I, -A(I,:))(z_{t+1}^\top, x_t^\top)^\top$, we have
\begin{align*}
\frac{l_g}{2}\|u'_t - u_t\|^2 &\le \frac{l_g}{2}\big\|(I, -A(I,:))\big(((z'_{t+1})^\top, (x'_t)^\top)^\top - (z_{t+1}^\top, x_t^\top)^\top\big)\big\|^2\\
&\le \frac{l_g}{2}\|(I, -A(I,:))\|^2\big(\|z_{t+1} - z'_{t+1}\|^2 + \|x_t - x'_t\|^2\big)\\
&\le \frac{l_g}{2}\|(I, -A(I,:))\|^2\big(\|z_{t+1} - z'_{t+1}\|^2 + \cdots + \|z_{t-p+1} - z'_{t-p+1}\|^2\big).
\end{align*}
Finally, summing over $t$ gives
\[
C(\mathbf{z}') \le C(\mathbf{z}) + \langle\nabla C(\mathbf{z}), \mathbf{z}'-\mathbf{z}\rangle + \frac{p\,l_f + (p+1)\,l_g\kappa}{2}\|\mathbf{z}'-\mathbf{z}\|^2,
\]
where $\kappa = \|(I, -A(I,:))\|^2$. Thus we have proved the smoothness of $C(\mathbf{z})$.
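As a numerical sanity check of Lemma 1 i), the sketch below builds the Hessian of $C(\mathbf{z})$ for Example 2 with the illustrative quadratic costs $f_t(x) = \frac12\|x\|^2$, $g_t(u) = \frac12 u^2$ (so $\mu_f = l_f = l_g = 1$) and compares its eigenvalues against $\mu_f$ and $l_c = p\,l_f + (p+1)\,l_g\kappa$.

```python
import numpy as np

# Sanity check of Lemma 1 i) on Example 2 with illustrative quadratic costs
# f_t(x) = 1/2 ||x||^2 and g_t(u) = 1/2 u^2 (so mu_f = l_f = l_g = 1): build the
# Hessian of C(z) explicitly and compare its spectrum with mu_f and l_c.
a1, a2, N, p = 0.3, -0.5, 8, 2

def row(coeffs):
    """Row of a linear map from z = (z_1, ..., z_N); z_tau = 0 for tau <= 0."""
    r = np.zeros(N)
    for tau, wgt in coeffs.items():
        if 1 <= tau <= N:
            r[tau - 1] = wgt
    return r

H = np.zeros((N, N))
for t in range(N + 1):                       # f_t terms: x_t = (z_{t-1}, z_t)
    for m in (row({t - 1: 1.0}), row({t: 1.0})):
        H += np.outer(m, m)
for t in range(N):                           # g_t terms: u_t = z_{t+1} - a1 z_{t-1} - a2 z_t
    m = row({t + 1: 1.0, t - 1: -a1, t: -a2})
    H += np.outer(m, m)

kappa = 1 + a1**2 + a2**2                    # ||(I_m, -A(I,:))||^2 for m = 1
eigs = np.linalg.eigvalsh(H)
print("min eig:", eigs.min(), ">= mu_f = 1")
print("max eig:", eigs.max(), "<= l_c =", p * 1 + (p + 1) * 1 * kappa)
```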
D Proof of Theorem 2
To prove the bound, we use the sum of the optimal steady-state costs, $\sum_{t=0}^{N-1}\lambda^e_t$, as a middle ground and bound $J(\phi) - \sum_{t=0}^{N-1}\lambda^e_t$ and $\sum_{t=0}^{N-1}\lambda^e_t - J^*$ in Lemma 2 and Lemma 3, respectively. The regret bound then follows by combining the two bounds.
Lemma 2 (Bound on $J(\phi) - \sum_{t=0}^{N-1}\lambda^e_t$). Let the initialization $\phi$ be the follow-the-optimal-steady-state method, and let $x_t(0)$ denote the state determined by the initialization. For any initial state $x_0$,
\[
J(\phi) - \sum_{t=0}^{N-1}\lambda^e_t \le c_1\sum_{t=0}^{N-1}\|x^e_t - x^e_{t-1}\| + f_N(x_N(0)) = O\Big(\sum_{t=0}^N\|x^e_t - x^e_{t-1}\|\Big),
\]
where $x^e_N := \theta_N$ and $x^e_{-1} = x_0 = 0$ for simplicity of notation, and $c_1$ is a constant that does not depend on $N$.
Lemma 3 (Bound on $\sum_{t=0}^{N-1}\lambda^e_t - J^*$). Let $h^e_t(x)$ denote the solution to the average-cost Bellman equation under cost $f_t(x) + g_t(u)$, and let $x^*_t$ denote the optimal state trajectory. Then
\[
\sum_{t=0}^{N-1}\lambda^e_t - J^* \le \sum_{t=1}^{N}\big(h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big) - h^e_0(x_0) = \sum_{t=0}^{N}\big(h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big),
\]
where $h^e_N(x) := f_N(x)$, $h^e_{-1}(x) := 0$, and $x^*_0 = x_0$ for simplicity of notation.
Then we can complete the proof by Lemmas 2 and 3:
\[
J(\phi) - J^* = J(\phi) - \sum_{t=0}^{N-1}\lambda^e_t + \sum_{t=0}^{N-1}\lambda^e_t - J^* = O\Big(\sum_{t=0}^N\big(\|x^e_{t-1} - x^e_t\| + h^e_{t-1}(x^*_t) - h^e_t(x^*_t)\big)\Big).
\]
In the following, we prove Lemmas 2 and 3 respectively. For simplicity, we only consider cost functions $f_t, g_t$ with minimum value zero: $f_t(\theta_t) = 0$ and $g_t(\xi_t) = 0$ for all $t$; as argued in Appendix C, this is without loss of generality.
D.1 Proof of Lemma 2.
The proof relies on the convexity of the cost functions and on uniform upper bounds for $x_t(0), u_t(0)$, which follow from the uniform bounds on $\theta_t, \xi_t$ in Assumption 3.
Notice that $J(\phi) = \sum_{t=0}^{N-1}\big(f_t(x_t(0)) + g_t(u_t(0))\big) + f_N(x_N(0))$ and $\sum_{t=0}^{N-1}\lambda^e_t = \sum_{t=0}^{N-1}\big(f_t(x^e_t) + g_t(u^e_t)\big)$. It suffices to bound $f_t(x_t(0)) - f_t(x^e_t)$ and $g_t(u_t(0)) - g_t(u^e_t)$ for $0 \le t \le N-1$. We first focus on $f_t(x_t(0)) - f_t(x^e_t)$, then bound $g_t(u_t(0)) - g_t(u^e_t)$ in the same way.
For $0 \le t \le N-1$, by the convexity of $f_t$ and the property of the $L_2$ norm,
\[
f_t(x_t(0)) - f_t(x^e_t) \le \langle\nabla f_t(x_t(0)), x_t(0) - x^e_t\rangle \le \|\nabla f_t(x_t(0))\|\,\|x_t(0) - x^e_t\|. \tag{13}
\]
In the following, we bound $\|\nabla f_t(x_t(0))\|$ and $\|x_t(0) - x^e_t\|$.
Firstly, we provide a bound for $\|\nabla f_t(x_t(0))\|$:
\[
\|\nabla f_t(x_t(0))\| = \|\nabla f_t(x_t(0)) - \nabla f_t(\theta_t)\| \le l_f\|x_t(0) - \theta_t\| \le l_f(\sqrt n\,\bar x^e + \bar\theta), \tag{14}
\]
where the first equality is because $\theta_t$ is the global minimizer of $f_t$, the first inequality is by Lipschitz smoothness, and the second inequality is by $\|\theta_t\| \le \bar\theta$ (Assumption 3) and the following lemma, which provides a uniform bound on $x_t(0)$. The proof is technical and deferred to the end of this section.
Lemma 4 (Uniform upper bounds on $x^e_t, u^e_t, x_t(0), u_t(0)$). There exist $\bar x^e$ and $\bar u^e$, independent of $N, W$, such that $\|x^e_t\|_2 \le \bar x^e$ and $\|u^e_t\|_2 \le \bar u^e$ for all $0 \le t \le N-1$. Moreover, $\|x_t(0)\|_2 \le \sqrt n\,\bar x^e$ for $0 \le t \le N$ and $\|u_t(0)\|_2 \le \sqrt n\,\bar u^e$ for $0 \le t \le N-1$, where $x_t(0), u_t(0)$ denote the state and control at time $t$ determined by the initialization, taking $x_0 = 0$ for simplicity.
Secondly, we provide a bound for $\|x_t(0) - x^e_t\|$. The proof relies on a characterization of the steady states and the initialized states based on the canonical form.
Lemma 5 (Steady states and initialized states of canonical-form systems). Consider a canonical-form system $x_{t+1} = Ax_t + Bu_t$.
(a) Any steady state $(x, u)$ is of the form
\[
x = (\underbrace{z^1, \ldots, z^1}_{p_1}, \underbrace{z^2, \ldots, z^2}_{p_2}, \ldots, \underbrace{z^m, \ldots, z^m}_{p_m})^\top, \qquad u = (z^1, \ldots, z^m)^\top - A(I,:)x.
\]
Let $z = (z^1, \ldots, z^m)^\top$. For the optimal steady state with respect to the cost $f_t + g_t$, we denote the corresponding $z$ by $z^e_t$; then the optimal steady state can be represented as $x^e_t = (z^{e,1}_t, \ldots, z^{e,1}_t, z^{e,2}_t, \ldots, z^{e,2}_t, \ldots, z^{e,m}_t, \ldots, z^{e,m}_t)^\top$ and $u^e_t = z^e_t - A(I,:)x^e_t$ for $0 \le t \le N-1$.
(b) Under the follow-the-optimal-steady-state initialization, $x_t(0)$ and $u_t(0)$ satisfy
\[
x_t(0) = (\underbrace{z^{e,1}_{t-p_1}, \ldots, z^{e,1}_{t-1}}_{p_1}, \underbrace{z^{e,2}_{t-p_2}, \ldots, z^{e,2}_{t-1}}_{p_2}, \ldots, \underbrace{z^{e,m}_{t-p_m}, \ldots, z^{e,m}_{t-1}}_{p_m})^\top, \quad 0 \le t \le N,
\]
\[
u_t(0) = z^e_t - A(I,:)x_t(0), \quad 0 \le t \le N-1,
\]
where $z^e_t = 0$ for $t \le -1$.
Proof. (a) This follows from the definition of the canonical form and the definition of a steady state.
(b) By the initialization, $z_t(0) = x^{e,I}_{t-1} = z^e_{t-1}$. By the relation between $z_t(0)$ and $x_t(0), u_t(0)$, we have $x^I_t(0) = z_t(0) = z^e_{t-1}$ and $x^{I-1}_t(0) = z_{t-1}(0) = z^e_{t-2}$, and so on, which proves the structure of $x_t(0)$. The structure of $u_t(0)$ follows from $u_t(0) = z_{t+1}(0) - A(I,:)x_t(0) = z^e_t - A(I,:)x_t(0)$.
By Lemma 5, we can bound $\|x_t(0) - x^e_t\|$ for $0 \le t \le N-1$:
\begin{align*}
\|x_t(0) - x^e_t\| &\le \sqrt{\|z^e_{t-1} - z^e_t\|^2 + \cdots + \|z^e_{t-p} - z^e_t\|^2} \le \sqrt{\|x^e_{t-1} - x^e_t\|^2 + \cdots + \|x^e_{t-p} - x^e_t\|^2}\\
&\le \|x^e_{t-1} - x^e_t\| + \cdots + \|x^e_{t-p} - x^e_t\| \le p\big(\|x^e_{t-1} - x^e_t\| + \cdots + \|x^e_{t-p} - x^e_{t-p+1}\|\big). \tag{15}
\end{align*}
Combining (13), (14), and (15) yields
\begin{align*}
\sum_{t=0}^{N-1}f_t(x_t(0)) - f_t(x^e_t) &\le \sum_{t=0}^{N-1}\|\nabla f_t(x_t(0))\|\,\|x_t(0) - x^e_t\|\\
&\le \sum_{t=0}^{N-1} l_f(\sqrt n\,\bar x^e + \bar\theta)\,p\big(\|x^e_{t-1} - x^e_t\| + \cdots + \|x^e_{t-p} - x^e_{t-p+1}\|\big)\\
&\le p^2 l_f(\sqrt n\,\bar x^e + \bar\theta)\sum_{t=0}^{N-1}\|x^e_{t-1} - x^e_t\|. \tag{16}
\end{align*}
Notice that the constant $p^2 l_f(\sqrt n\,\bar x^e + \bar\theta)$ does not depend on $N, W$.
Similarly, we can provide a bound for $g_t(u_t(0)) - g_t(u^e_t)$:
\begin{align*}
\sum_{t=0}^{N-1}g_t(u_t(0)) - g_t(u^e_t) &\le \sum_{t=0}^{N-1}\|\nabla g_t(u_t(0))\|\,\|u_t(0) - u^e_t\|\\
&\le \sum_{t=0}^{N-1} l_g\|u_t(0) - \xi_t\|\,\|u_t(0) - u^e_t\|\\
&\le \sum_{t=0}^{N-1} l_g(\sqrt n\,\bar u^e + \bar\xi)\,\|A(I,:)x_t(0) - A(I,:)x^e_t\|\\
&\le \sum_{t=0}^{N-1} l_g(\sqrt n\,\bar u^e + \bar\xi)\,\|A(I,:)\|\,\|x_t(0) - x^e_t\|\\
&\le p^2 l_g(\sqrt n\,\bar u^e + \bar\xi)\,\|A(I,:)\|\sum_{t=0}^{N-1}\|x^e_{t-1} - x^e_t\|, \tag{17}
\end{align*}
where the first inequality is by convexity, the second is because $\xi_t$ is the global minimizer of $g_t$ and $g_t$ is $l_g$-smooth, the third is by Assumption 3, Lemma 4, and Lemma 5, the fourth is by the matrix norm property, and the fifth is by (15). Notice that the constant $p^2 l_g(\sqrt n\,\bar u^e + \bar\xi)\|A(I,:)\|$ does not depend on $N, W$.
By (16) and (17), we obtain the first inequality in the statement of Lemma 2:
\[
J(\phi) - \sum_{t=0}^{N-1}\lambda^e_t \le c_1\sum_{t=0}^{N-1}\|x^e_{t-1} - x^e_t\| + f_N(x_N(0)),
\]
where $c_1$ does not depend on $N$.
By defining $x^e_N = \theta_N$, we can bound $f_N(x_N(0))$ by $\|x_N(0) - x^e_N\|$ up to constants, because $f_N(x_N(0)) = f_N(x_N(0)) - f_N(\theta_N) \le \frac{l_f}{2}(\sqrt n\,\bar x^e + \bar\theta)\|x_N(0) - x^e_N\|$. By the same argument as in (15), we have $\|x_N(0) - x^e_N\| = O\big(\sum_{t=0}^N\|x^e_{t-1} - x^e_t\|\big)$. Consequently, we have shown that
\[
J(\phi) - \sum_{t=0}^{N-1}\lambda^e_t = O\Big(\sum_{t=0}^N\|x^e_{t-1} - x^e_t\|\Big).
\]
D.2 Proof of Lemma 3.
The proof heavily relies on dynamic programming and the Bellman equation. For brevity, we introduce a Bellman operator $\mathcal{B}(f+g, h)$ defined by $\mathcal{B}(f+g, h)(x) = \min_u\big(f(x) + g(u) + h(Ax + Bu)\big)$. The Bellman equation can then be written as $\mathcal{B}(f+g, h^e)(x) = h^e(x) + \lambda^e$.
We define a sequence of auxiliary functions $S_k$: for $0 \le k \le N-1$, let $S_k(x) = h^e_k(x) + \sum_{t=k}^{N-1}\lambda^e_t$; for $k = N$, let $S_N(x) = f_N(x)$. For simplicity of notation, let $h^e_N(x) = f_N(x)$.
By the Bellman equation, we have $h^e_k(x) + \lambda^e_k = \mathcal{B}(f_k + g_k, h^e_k)(x)$ for $0 \le k \le N-1$. Let $\pi^e_k$ be the corresponding optimal control policy that solves the Bellman equation. We have the following recursive relation for $S_k$ by the Bellman equation, for $0 \le k \le N-1$:
\begin{align*}
S_k(x) &= \mathcal{B}(f_k + g_k, S_{k+1} - h^e_{k+1} + h^e_k)(x)\\
&= f_k(x) + g_k(\pi^e_k(x)) + S_{k+1}(Ax + B\pi^e_k(x)) - h^e_{k+1}(Ax + B\pi^e_k(x)) + h^e_k(Ax + B\pi^e_k(x)).
\end{align*}
Besides, let $V_k(x)$ denote the optimal cost-to-go function from $k$ to $N$, with $V_N(x) = f_N(x)$. Let $\pi^*_k$ denote the optimal control policy; by dynamic programming, for $0 \le k \le N-1$,
\[
V_k(x) = \mathcal{B}(f_k + g_k, V_{k+1})(x) = f_k(x) + g_k(\pi^*_k(x)) + V_{k+1}(Ax + B\pi^*_k(x)).
\]
Let $x^*_k$ denote the optimal trajectory; then $x^*_{k+1} = Ax^*_k + B\pi^*_k(x^*_k)$. For any $k = 0, \ldots, N-1$,
\begin{align*}
S_k(x^*_k) - V_k(x^*_k) &= \mathcal{B}(f_k + g_k, S_{k+1} - h^e_{k+1} + h^e_k)(x^*_k) - \mathcal{B}(f_k + g_k, V_{k+1})(x^*_k)\\
&\le f_k(x^*_k) + g_k(\pi^*_k(x^*_k)) + S_{k+1}(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1}) + h^e_k(x^*_{k+1})\\
&\quad - \big(f_k(x^*_k) + g_k(\pi^*_k(x^*_k)) + V_{k+1}(x^*_{k+1})\big)\\
&= S_{k+1}(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1}) + h^e_k(x^*_{k+1}) - V_{k+1}(x^*_{k+1}),
\end{align*}
where the inequality holds because $\pi^*_k$ is not necessarily optimal for the Bellman operator $\mathcal{B}(f_k + g_k, S_{k+1} - h^e_{k+1} + h^e_k)(x^*_k)$.
Summing over $k = 0, \ldots, N-1$ on both sides yields
\[
S_0(x_0) - V_0(x_0) \le \sum_{k=0}^{N-1}\big(h^e_k(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1})\big).
\]
Subtracting $h^e_0(x_0)$ on both sides gives
\[
\sum_{t=0}^{N-1}\lambda^e_t - J^* \le \sum_{k=0}^{N-1}\big(h^e_k(x^*_{k+1}) - h^e_{k+1}(x^*_{k+1})\big) - h^e_0(x_0).
\]
For simplicity of notation, defining $h^e_{-1}(x_0) = 0$ and $x^*_0 = x_0$, the bound can be written as
\[
\sum_{t=0}^{N-1}\lambda^e_t - J^* \le \sum_{k=0}^{N}\big(h^e_{k-1}(x^*_k) - h^e_k(x^*_k)\big).
\]
D.3 Proof of Lemma 4
The proof relies on the (strong) convexity and smoothness of the cost functions and the uniform upper bounds on $\theta_t, \xi_t$.
First, suppose we have $\|x^e_t\|_2 \le \bar x^e$ for all $0 \le t \le N-1$; we bound $u^e_t, x_t(0), u_t(0)$ in terms of $\bar x^e$. Notice that the optimal steady state and the corresponding steady control satisfy $u^e_t = x^{e,I}_t - A(I,:)x^e_t$. If we can bound $x^e_t$ by $\|x^e_t\| \le \bar x^e$ for all $t$, then $u^e_t$ can be bounded accordingly:
\[
\|u^e_t\| \le \|x^{e,I}_t\|_2 + \|A(I,:)x^e_t\| \le \|x^e_t\|_2 + \|A(I,:)\|_2\|x^e_t\|_2 \le (1 + \|A(I,:)\|)\bar x^e =: \bar u^e.
\]
Moreover, $x_t(0)$ can also be bounded by $\bar x^e$ up to a factor, because by Lemma 5 each entry of $x_t(0)$ equals some entry of $x^e_s$ for $s \le t$. As a result, for $0 \le t \le N$,
\[
\|x_t(0)\|_2 \le \sqrt n\,\|x_t(0)\|_\infty \le \sqrt n\max_{s\le t}\|x^e_s\|_\infty \le \sqrt n\max_{s\le t}\|x^e_s\|_2 \le \sqrt n\,\bar x^e.
\]
We can bound $u_t(0)$ via the bound on $x_t(0)$, similarly to the bound on $u^e_t$, by noticing that $u_t(0) = x^I_{t+1}(0) - A(I,:)x_t(0)$ and
\[
\|u_t(0)\| \le \|x^I_{t+1}(0)\|_2 + \|A(I,:)x_t(0)\| \le \|x_{t+1}(0)\|_2 + \|A(I,:)\|_2\|x_t(0)\|_2 \le (1 + \|A(I,:)\|)\sqrt n\,\bar x^e = \sqrt n\,\bar u^e.
\]
Next, it suffices to prove $\|x^e_t\|_2 \le \bar x^e$ for all $t$, for some $\bar x^e$. To prove this bound, we construct another (suboptimal) steady state: $\hat x_t = (\theta^1_t, \ldots, \theta^1_t)^\top$ and $\hat u_t = \hat x^I_t - A(I,:)\hat x_t$. It can easily be verified that $(\hat x_t, \hat u_t)$ is indeed a steady state. Moreover, $\hat x_t$ and $\hat u_t$ can be bounded by arguments similar to those above:
\begin{align*}
\|\hat x_t\|_2 &\le \sqrt n\,|\theta^1_t| \le \sqrt n\,\|\theta_t\|_\infty \le \sqrt n\,\|\theta_t\|_2 \le \sqrt n\,\bar\theta &&\text{(by the definitions of $\hat x_t$ and $\bar\theta$)}\\
\|\hat u_t\|_2 &\le (1 + \|A(I,:)\|)\|\hat x_t\|_2 \le (1 + \|A(I,:)\|)\sqrt n\,\bar\theta &&\text{(by the same argument as for $u^e_t$)}
\end{align*}
By strong convexity of ftand smoothness of ft, gtand by θt,ξtbeing the global minimizer of ft, gt
respectively, for 0≤t≤N−1, we have
µ
2kxe
t−θtk2≤ft(xe
t)−ft(θt) + gt(ue
t)−gt(ξt)(by strong convexity)
≤ft(ˆxt)−ft(θt) + gt(ˆut)−gt(ξt)(by (xe
t, ue
t)is optimal steady state)
≤lf
2kˆxt−θtk2+lg
2kˆut−ξtk2(by smoothness and ∇ft(θt) = ∇gt(ξt) = 0)
≤lf(kˆxtk2+kθtk2) + lg(kˆutk2+kξtk2)(by Cauchy-Schwarz inequality)
≤lf(n¯
θ2+¯
θ2) + lg(((1 + kA(I,:)k)√n¯
θ)2+k¯
ξk2)
(by kˆxtk2,kˆutk’s bounds above)
:=c7
As a result, we have kxe
t−θtk ≤ p2c7/µ. Then, we can bound xe
tby kxe
tk ≤ kθtk+p2c7/µ ≤
¯
θ+p2c7/µ =:¯xefor all t. It can be verified that ¯xedoes not depend on N , W .
E Linear quadratic tracking

In this section, we provide a regret bound for general LQT, based on which we prove Corollary 1, which considers the special case where $Q, R$ are not changing.

E.1 Regret bound for general online LQT

Firstly, it can be shown that the solution to the Bellman equation associated with a linear quadratic tracking cost has an explicit form.

Lemma 6. One solution to the Bellman equation with stage cost $\frac12(x-\theta)^\top Q(x-\theta)+\frac12 u^\top Ru$ can be represented by
$$h^e(x)=\frac12(x-\beta^e)^\top P^e(x-\beta^e) \qquad (18)$$
where $P^e$ denotes the solution to the discrete-time algebraic Riccati equation (DARE) with respect to $Q, R, A, B$,
$$P^e=Q+A^\top\big(P^e-P^eB(B^\top P^eB+R)^{-1}B^\top P^e\big)A \qquad (19)$$
and $\beta^e=F\theta$, where $F$ is a matrix determined by $A, B, Q, R$.

For simplicity of notation, we let $P^e(Q,R)$ denote the solution to the DARE with $Q, R, A, B$, and $F(Q,R)$ denote the matrix in $\beta^e=F\theta$ associated with $Q, R, A, B$. We omit $A, B$ in the arguments of these functions because they do not change in this paper.
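As a numerical aside, $P^e(Q,R)$ and $F(Q,R)$ are easy to compute with standard tools. The sketch below is ours (not released code): it calls scipy.linalg.solve_discrete_are for (19) and uses the explicit formula for $F$ from Lemma 14 in Appendix G.3.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dare_and_F(A, B, Q, R):
    """Compute P^e = P^e(Q, R) from the DARE (19) and the matrix F
    with beta^e = F @ theta (Lemma 6; explicit formula in Lemma 14)."""
    Pe = solve_discrete_are(A, B, Q, R)                    # solves (19)
    Ke = np.linalg.solve(R + B.T @ Pe @ B, B.T @ Pe @ A)   # K^e in Lemma 14
    n = A.shape[0]
    # F = (P^e)^{-1} (I - (A - B K^e)^T)^{-1} Q, cf. (29)
    F = np.linalg.solve(Pe, np.linalg.solve(np.eye(n) - (A - B @ Ke).T, Q))
    return Pe, Ke, F
```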
By applying Theorem 2, the regret bound of the general LQT problem is provided below.
Corollary 2 (Bound of general LQT). Consider the LQT problem in Example 1. Suppose the terminal cost function satisfies $\underline P\le Q_N\le\bar P$, where $\bar P=P^e(l_fI_n,l_gI_m)$ and $\underline P=P^e(\mu_fI_n,\mu_gI_m)$. (This additional condition is for technical simplicity and can be removed.) Then the regret of RHTM with initialization FOSS can be bounded by
$$\mathrm{Regret}(RHTM)=O\Big(\Big(\tfrac{\sqrt\zeta-1}{\sqrt\zeta}\Big)^{2K}\Big(\sum_{t=1}^{N}\big(\|P^e_t-P^e_{t-1}\|+\|\beta^e_t-\beta^e_{t-1}\|\big)+\sum_{t=0}^{N}\|x^e_{t-1}-x^e_t\|\Big)\Big)$$
where $K=\lfloor (W-1)/p\rfloor$, $x^e_{-1}=x_0$, $x^e_N=\theta_N$, $\zeta$ is the condition number of the corresponding $C(z)$, $(x^e_t,u^e_t)$ is the optimal steady state under cost $Q_t, R_t, \theta_t$, $P^e_t=P^e(Q_t,R_t)$, and $\beta^e_t=F(Q_t,R_t)\theta_t$.
Proof. Before the proof, we introduce some notation and useful lemmas. Firstly, we define the sets of $Q, R, P$ considered in this section:
$$\mathcal Q=\{Q \mid \mu_fI_n\le Q\le l_fI_n\},\qquad \mathcal R=\{R\mid \mu_gI_m\le R\le l_gI_m\},\qquad \mathcal P=\{P\mid \underline P\le P\le\bar P\}$$
Moreover, we define $\underline Q=\mu_fI_n$, $\bar Q=l_fI_n$, $\underline R=\mu_gI_m$, $\bar R=l_gI_m$.

Secondly, we introduce supporting lemmas on the bounds of $P^e_t, \beta^e_t, x^*_t$ respectively. The intuition for why these quantities can be bounded is that $Q_t, R_t, \theta_t$ are all uniformly bounded by Assumptions 2 and 3. The proofs are technical and deferred to Appendix G.

Lemma 7 (Upper bound of $x^*_t$). For any $N$, any $0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, there exists $\bar x$ that does not depend on $N, W$ such that
$$\|x^*_t\|_2\le\bar x$$

Lemma 8 (Upper bound of $\beta^e$). For any $Q\in\mathcal Q$, $R\in\mathcal R$, and any $\|\theta\|\le\bar\theta$, there exists $\bar\beta\ge\bar\theta$ that does not depend on $N$ and only depends on $A, B, l_f,\mu_f,l_g,\mu_g,\bar\theta$, such that $\|\beta^e\|\le\bar\beta$.

Lemma 9 (Upper bound of $P^e$). For any $Q\in\mathcal Q$, $R\in\mathcal R$, we have $P^e=P^e(Q,R)\in\mathcal P$. Consequently, $\|P^e\|_2\le\upsilon_{\max}(\bar P)$.

Next, we are ready for the proof.

By Theorem 2, we only need to bound $\sum_{t=0}^{N}(h^e_{t-1}(x^*_t)-h^e_t(x^*_t))$. Let $P^e_N=Q_N$ and $\beta^e_N=\theta_N$; then we can write $h^e_t(x)=\frac12(x-\beta^e_t)^\top P^e_t(x-\beta^e_t)$ for $0\le t\le N$.
For $0\le t\le N-1$, we split $h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})$ into two parts:
$$\begin{aligned}
h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1}) &= \frac12(x^*_{t+1}-\beta^e_t)^\top P^e_t(x^*_{t+1}-\beta^e_t)-\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_{t+1}(x^*_{t+1}-\beta^e_{t+1})\\
&= \underbrace{\frac12(x^*_{t+1}-\beta^e_t)^\top P^e_t(x^*_{t+1}-\beta^e_t)-\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_t(x^*_{t+1}-\beta^e_{t+1})}_{\text{Part 1}}\\
&\quad+\underbrace{\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_t(x^*_{t+1}-\beta^e_{t+1})-\frac12(x^*_{t+1}-\beta^e_{t+1})^\top P^e_{t+1}(x^*_{t+1}-\beta^e_{t+1})}_{\text{Part 2}}
\end{aligned}$$
Part 1 can be bounded as follows for $0\le t\le N-1$:
$$\begin{aligned}
\text{Part 1} &= \frac12(x^*_{t+1}-\beta^e_t+x^*_{t+1}-\beta^e_{t+1})^\top P^e_t\big(x^*_{t+1}-\beta^e_t-(x^*_{t+1}-\beta^e_{t+1})\big)\\
&\le \frac12\|x^*_{t+1}-\beta^e_t+x^*_{t+1}-\beta^e_{t+1}\|_2\|P^e_t\|_2\|\beta^e_{t+1}-\beta^e_t\|_2 &&(\text{by the definition of the } L_2 \text{ norm})\\
&\le (\bar x+\bar\beta)\,\upsilon_{\max}(\bar P)\|\beta^e_{t+1}-\beta^e_t\|_2 &&(\text{by Lemmas 7, 8, and 9})
\end{aligned}$$

Part 2 can be bounded as follows for $0\le t\le N-1$:
$$\text{Part 2}=\frac12(x^*_{t+1}-\beta^e_{t+1})^\top(P^e_t-P^e_{t+1})(x^*_{t+1}-\beta^e_{t+1})\le \frac12\|x^*_{t+1}-\beta^e_{t+1}\|_2^2\,\|P^e_t-P^e_{t+1}\|_2\le \frac12(\bar x+\bar\beta)^2\|P^e_t-P^e_{t+1}\|_2$$
Therefore, we have
$$\sum_{t=0}^{N}\big(h^e_{t-1}(x^*_t)-h^e_t(x^*_t)\big)\le \sum_{t=0}^{N-1}\big(h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})\big)=O\Big(\sum_{t=0}^{N-1}\big(\|\beta^e_{t+1}-\beta^e_t\|_2+\|P^e_t-P^e_{t+1}\|_2\big)\Big) \qquad (20)$$
where the first inequality is by $h^e_0(x)\ge 0$ and $h^e_{-1}(x)=0$. Consequently, applying Theorem 2 proves the regret bound of RHTM for LQ tracking problems:
$$J(RHTM)-J^*=O\Big(\Big(\tfrac{\sqrt\zeta-1}{\sqrt\zeta}\Big)^{2K}\Big(\sum_{t=1}^{N}\big(\|P^e_t-P^e_{t-1}\|+\|\beta^e_t-\beta^e_{t-1}\|\big)+\sum_{t=0}^{N}\|x^e_{t-1}-x^e_t\|\Big)\Big)$$
E.2 Proof of Corollary 1

Proof sketch: Consider the bound in Corollary 2. When $Q, R$ are not changing, $\|P^e_t-P^e_{t-1}\|=0$. Moreover, by (29), $\beta^e_t=F\theta_t$ for some matrix $F$ for all $t$, so $\|\beta^e_t-\beta^e_{t-1}\|$ can be bounded by $\|\theta_t-\theta_{t-1}\|$. Finally, with the help of Lemma 5, we can also show that $x^e_t=F_1F_2\theta_t$ for some matrices $F_1, F_2$, leading to $\|x^e_t-x^e_{t-1}\|=O(\|\theta_t-\theta_{t-1}\|)$. Combining these observations proves the regret bound.

Formal proof: Directly applying the results of Theorem 2 and Corollary 2 would introduce extra constant terms, because some inequalities used to derive the bounds in Theorem 2 and Corollary 2 are unnecessary when $Q, R$ are not changing. Therefore, we apply intermediate results from the proofs of Theorem 2 and Corollary 2 to prove Corollary 1; the main idea is the same as in the proof sketch.
Firstly, by the first inequalities in Lemma 2 and Lemma 3, we have
$$\begin{aligned}
J(\phi)-J^* &= J(\phi)-\sum_{t=0}^{N-1}\lambda^e_t+\sum_{t=0}^{N-1}\lambda^e_t-J^*\\
&\le \underbrace{c_1\sum_{t=0}^{N-1}\|x^e_{t-1}-x^e_t\|}_{\text{Part I}} + \underbrace{\sum_{t=0}^{N-1}\big(h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})\big)}_{\text{Part II}} + \underbrace{f_N(x_N(0))-h^e_0(x_0)}_{\text{Part III}}
\end{aligned}$$
We bound each part by $\sum_t\|\theta_t-\theta_{t-1}\|$ in the following.
Part I: We bound Part I by $\sum_t\|\theta_t-\theta_{t-1}\|$ by showing that $x^e_t=F_1F_2\theta_t$ for some matrices $F_1, F_2$. The representation of $x^e_t$ relies on Lemma 5.

By Lemma 5, the steady state $(x,u)$ can be represented as a matrix multiplied by $z$:
$$x=(\underbrace{z_1,\dots,z_1}_{p_1},\underbrace{z_2,\dots,z_2}_{p_2},\dots,\underbrace{z_m,\dots,z_m}_{p_m})^\top=:F_1z \qquad (21)$$
$$u=(z_1,\dots,z_m)^\top-A(I,:)x=(I_m-A(I,:)F_1)z$$
where $F_1\in\mathbb R^{n\times m}$ is a binary matrix with full column rank.

Consider the cost function $\frac12(x-\theta)^\top Q(x-\theta)+\frac12u^\top Ru$. By the steady-state representation above, the optimal steady state can be obtained by solving the unconstrained optimization problem
$$\min_z\ (F_1z-\theta)^\top Q(F_1z-\theta)+z^\top(I-A(I,:)F_1)^\top R(I-A(I,:)F_1)z$$
Since $F_1$ has full column rank, the objective is strongly convex and has the unique solution
$$z^e=F_2\theta \qquad (22)$$
where $F_2=\big(F_1^\top QF_1+(I-A(I,:)F_1)^\top R(I-A(I,:)F_1)\big)^{-1}F_1^\top Q$. Accordingly, the optimal steady state can be represented as $x^e=F_1F_2\theta$ and $u^e=(I_m-A(I,:)F_1)F_2\theta$. Consequently,
$$\|x^e_t-x^e_{t-1}\|\le\|F_1F_2\|\,\|\theta_t-\theta_{t-1}\|$$
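For concreteness, (21) and (22) translate directly into a few lines of linear algebra. The sketch below is ours; $F_1$ and the row-index set $I$ from Lemma 5 are taken as inputs rather than derived.

```python
import numpy as np

def optimal_steady_state(A, Q, R, theta, F1, I):
    """Solve (22): z^e = F2 theta with
    F2 = (F1^T Q F1 + (I_m - A[I,:] F1)^T R (I_m - A[I,:] F1))^{-1} F1^T Q,
    then x^e = F1 z^e and u^e = (I_m - A[I,:] F1) z^e."""
    G = np.eye(len(I)) - A[I, :] @ F1        # I_m - A(I,:) F1
    F2 = np.linalg.solve(F1.T @ Q @ F1 + G.T @ R @ G, F1.T @ Q)
    ze = F2 @ theta
    return F1 @ ze, G @ ze                    # (x^e, u^e)
```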
Now consider $t=0$. Since $x^e_{-1}=x_0=0$, by letting $\theta_{-1}=0$ we have $\|x^e_0-x^e_{-1}\|\le\|F_1F_2\|\,\|\theta_0-\theta_{-1}\|$. Combining the upper bounds above, we have
$$\text{Part I}=O\Big(\sum_{t=0}^{N-1}\|x^e_t-x^e_{t-1}\|\Big)=O\Big(\sum_{t=0}^{N-1}\|\theta_t-\theta_{t-1}\|\Big) \qquad (23)$$
Part II: By (20) in the proof of Corollary 2, we have
$$\sum_{t=0}^{N-1}\big(h^e_t(x^*_{t+1})-h^e_{t+1}(x^*_{t+1})\big)=O\Big(\sum_{t=0}^{N-1}\|\beta^e_{t+1}-\beta^e_t\|_2\Big)\quad(\text{since } P^e \text{ is not changing})$$
By Lemma 14 (29), $\beta^e_t=(P^e)^{-1}(I-(A-BK^e)^\top)^{-1}Q\theta_t=:F_3\theta_t$ (we write $F_3$ to avoid confusion with the matrix $F_2$ in (22)), so for $1\le t\le N$,
$$\|\beta^e_t-\beta^e_{t-1}\|=\|F_3\theta_t-F_3\theta_{t-1}\|\le\|F_3\|\,\|\theta_t-\theta_{t-1}\|$$
Thus,
$$\text{Part II}=O\Big(\sum_{t=0}^{N-1}\|\beta^e_{t+1}-\beta^e_t\|_2\Big)=O\Big(\|F_3\|\sum_{t=0}^{N-1}\|\theta_{t+1}-\theta_t\|\Big) \qquad (24)$$
Part III: By our condition on the terminal cost function, we have $f_N(x_N(0))=\frac12(x_N(0)-\beta^e_N)^\top P^e(x_N(0)-\beta^e_N)$. By Lemma 14, we know $h^e_0(x_0)=\frac12(x_0-\beta^e_0)^\top P^e(x_0-\beta^e_0)$. So Part III can be bounded by
$$\begin{aligned}
\text{Part III}&=\frac12(x_N(0)-\beta^e_N)^\top P^e(x_N(0)-\beta^e_N)-\frac12(x_0-\beta^e_0)^\top P^e(x_0-\beta^e_0)\\
&=\frac12\big(x_N(0)-\beta^e_N+x_0-\beta^e_0\big)^\top P^e\big(x_N(0)-\beta^e_N-(x_0-\beta^e_0)\big)\\
&\le \frac12\|x_N(0)-\beta^e_N+x_0-\beta^e_0\|_2\,\|P^e\|_2\,\|x_N(0)-\beta^e_N-(x_0-\beta^e_0)\|_2\\
&\le \frac12(\sqrt n\,\bar x^e+\bar\beta+\bar\beta)\,\|P^e\|\big(\|x_N(0)-x_0\|+\|\beta^e_N-\beta^e_0\|\big)
\end{aligned}$$
where the last inequality is by Lemma 4, Lemma 8, Assumption 3, and the triangle inequality.
Next we bound $\|x_N(0)-x_0\|$ and $\|\beta^e_N-\beta^e_0\|$ respectively. Firstly, $\|\beta^e_N-\beta^e_0\|$ can be bounded by the triangle inequality and (24):
$$\|\beta^e_N-\beta^e_0\|\le\sum_{t=0}^{N-1}\|\beta^e_{t+1}-\beta^e_t\|_2\le\|F_3\|\sum_{t=0}^{N-1}\|\theta_{t+1}-\theta_t\|_2$$
Secondly, we bound $\|x_N(0)-x_0\|$. By the triangle inequality, $\|x_N(0)-x_0\|\le\|x_N(0)-x^e_{N-1}\|+\|x^e_{N-1}-x_0\|$, and $\|x^e_{N-1}-x_0\|$ can be bounded by the triangle inequality and (23):
$$\|x^e_{N-1}-x_0\|\le\sum_{t=0}^{N-1}\|x^e_t-x^e_{t-1}\|\le\|F_1F_2\|\sum_{t=0}^{N-1}\|\theta_t-\theta_{t-1}\|$$
Next, we focus on $\|x_N(0)-x^e_{N-1}\|$. By Lemma 5, $x_N(0)$ satisfies
$$x_N(0)=(z^{e,1}_{N-p_1},\dots,z^{e,1}_{N-1},\ z^{e,2}_{N-p_2},\dots,z^{e,2}_{N-1},\ \dots,\ z^{e,m}_{N-p_m},\dots,z^{e,m}_{N-1})^\top$$
As a result,
$$\|x_N(0)-x^e_{N-1}\|^2\le\|z^e_{N-2}-z^e_{N-1}\|^2+\dots+\|z^e_{N-p}-z^e_{N-1}\|^2\le\|F_2\|^2\big(\|\theta_{N-2}-\theta_{N-1}\|^2+\dots+\|\theta_{N-p}-\theta_{N-1}\|^2\big)$$
where the last inequality is by (22). Taking square roots on both sides yields
$$\begin{aligned}
\|x_N(0)-x^e_{N-1}\|&\le\|F_2\|\sqrt{\|\theta_{N-2}-\theta_{N-1}\|^2+\dots+\|\theta_{N-p}-\theta_{N-1}\|^2}\\
&\le\|F_2\|\big(\|\theta_{N-2}-\theta_{N-1}\|+\dots+\|\theta_{N-p}-\theta_{N-1}\|\big)\\
&\le\|F_2\|(p-1)\sum_{t=N-p}^{N-2}\|\theta_{t+1}-\theta_t\|
\end{aligned}$$
Combining the bounds above, we have
$$\text{Part III}=O\Big(\sum_{t=0}^{N-1}\|\theta_{t+1}-\theta_t\|\Big) \qquad (25)$$
The proof is completed by summing the bounds on Parts I, II, and III.
F Proof of Theorem 3

Proof sketch: We focus on explaining the term $\big(\frac{\sqrt\zeta-1}{\sqrt\zeta+1}\big)^{2K}$. Firstly, the fundamental limit of the online control problem is equivalent to the fundamental limit of the online convex optimization problem with objective $C(z)$, so we focus on $C(z)$. Secondly, since the lower bound concerns the worst case, we only need to construct some $\{\theta_t\}$ for which Theorem 3 holds. However, it is generally difficult to construct the tracking trajectory explicitly, so we consider randomly generated $\theta_t$ and show that the regret in expectation can be lower bounded. Then there must exist some realization of the random $\{\theta_t\}$ for which the regret lower bound holds.

Thanks to the quadratic structure, we have a closed-form solution for $z^*$, which is linear in $\theta_t$: $z^*_{t+1}=\sum_{s=1}^{N}v_{t+1,s}\theta_s$. Since any online algorithm only has access to finitely many predictions, the online output $z_{t+1}(\mathcal A)$ depends only on $\theta_1,\dots,\theta_{t+W-1}$. As a result, the difference between the optimal solution and the online solution can be roughly captured by $\|\sum_{s=t+W}^{N}v_{t+1,s}\theta_s\|$. With a proper construction of $A, B, Q, R$, we can roughly show that $v^2_{t+1,i}$ decays at most at a rate of $\big(\frac{\sqrt\zeta-1}{\sqrt\zeta+1}\big)^{2K}$. This explains the exponentially decaying term $\big(\frac{\sqrt\zeta-1}{\sqrt\zeta+1}\big)^{2K}$ in the lower bound of Theorem 3.
Formal proof:

Step 1: construct the LQ tracking problem. For simplicity, we construct a single-input system with $n=p$, $A\in\mathbb R^{n\times n}$, and $B\in\mathbb R^{n\times 1}$ as follows (it is easy to generalize the construction to the multi-input case by constructing $m$ decoupled subsystems):
$$A=\begin{pmatrix}0 & 1 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & & & 1\\ 1 & 0 & \cdots & 0\end{pmatrix},\qquad B=\begin{pmatrix}0\\ \vdots\\ 0\\ 1\end{pmatrix}$$
$(A,B)$ is controllable because $(B, AB,\dots,A^{p-1}B)$ is full rank. $A$'s controllability index is $p=n$.

Next, we construct $Q, R$. For any $\zeta$ and $p$, define $\delta=\frac{4}{(\zeta-1)p}$. Let $Q=\delta I_n$ and $R=1$ for $0\le t\le N-1$. Let $P^e=P^e(Q,R)$ be the solution to the DARE. We can show that $P^e$ is diagonal with some additional properties.

Lemma 10 (Form of $P^e$). Let $P^e$ denote the solution to the DARE determined by $A, B, Q, R$ defined above. Then $P^e$ has the form
$$P^e=\begin{pmatrix}q_1 & 0 & \cdots & 0\\ 0 & q_2 & \cdots & 0\\ & & \ddots & \\ 0 & \cdots & & q_n\end{pmatrix}$$
where $q_i=q_1+(i-1)\delta$ for $1\le i\le n$ and $\delta<q_1<\delta+1$.
Proof of Lemma 10. By Proposition 4.4.1 in [52], there exists a unique positive definite solution. So we posit a diagonal solution and substitute it into the DARE; if this yields a positive definite solution, then that solution must be $P^e$:
$$P^e=Q+A^\top\big(P^e-P^eB(B^\top P^eB+R)^{-1}B^\top P^e\big)A$$
$$\begin{pmatrix}q_1 & & & \\ & q_2 & & \\ & & \ddots & \\ & & & q_n\end{pmatrix}=\begin{pmatrix}q_n/(1+q_n)+\delta & & & \\ & q_1+\delta & & \\ & & \ddots & \\ & & & q_{n-1}+\delta\end{pmatrix}$$
So we have $q_i=q_{i-1}+\delta$ for $2\le i\le n$, and $q_n/(1+q_n)+\delta=q_1=q_n-(n-1)\delta$. The solution is $q_n=\frac{n\delta+\sqrt{n^2\delta^2+4n\delta}}{2}>n\delta$, so $q_1=q_n-(n-1)\delta>\delta>0$. Hence the solution is positive definite. Moreover, since $q_n/(1+q_n)<1$, we have $q_1<\delta+1$.
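Lemma 10 is easy to check numerically. The following sketch is ours, with arbitrarily chosen $n$ and $\zeta$; it builds the construction of Step 1 and compares the DARE solver's output against the claimed diagonal structure and the closed form for $q_n$.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

n, zeta = 5, 4.0
delta = 4.0 / ((zeta - 1) * n)           # delta = 4 / ((zeta - 1) p), p = n
A = np.roll(np.eye(n), 1, axis=1)        # cyclic shift matrix of Step 1
B = np.zeros((n, 1)); B[-1, 0] = 1.0
Pe = solve_discrete_are(A, B, delta * np.eye(n), np.array([[1.0]]))
q = np.diag(Pe)
qn = (n * delta + np.sqrt((n * delta) ** 2 + 4 * n * delta)) / 2  # closed form
assert np.allclose(Pe, np.diag(q), atol=1e-8)        # P^e is diagonal
assert np.allclose(np.diff(q), delta, atol=1e-8)     # q_i = q_1 + (i-1) delta
assert delta < q[0] < delta + 1 and np.isclose(q[-1], qn)
```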
Next, we construct $\theta_t$. Let $\theta_0=\theta_N=\beta^e_N=0$ for simplicity. Let $E=L_N/(2\bar\theta)$. For simplicity, we only consider integer $E$ (the proof can be generalized to the case where $L_N/(2\bar\theta)$ is not an integer by using floor and ceiling operators). Since $2\bar\theta\le L_N\le(2N+1)\bar\theta$ and $E$ is an integer, we have $1\le E\le N$.

We provide two constructions for two regimes of $E$. When $E=1$, let $\mathcal J=\{W\}$. Let $\theta_1=\dots=\theta_{W-1}=0$. Let $\theta_W$ follow the distribution
$$\theta^i=\begin{cases}\sigma & \text{with probability } 1/2\\ -\sigma & \text{with probability } 1/2\end{cases}\quad\text{i.i.d. for all } i\in[n] \qquad (26)$$
where $\sigma=\bar\theta/\sqrt n$. It is easily verified that $\|\theta\|=\bar\theta$ for any realization of this distribution. Let the remaining $\theta_t$ equal $\theta_W$, i.e., $\theta_W=\theta_{W+1}=\dots=\theta_{N-1}$. The total variation of the constructed $\theta_t$ is no more than the variation budget $L_N$:
$$\sum_{t=0}^{N}\|\theta_t-\theta_{t-1}\|=\|\theta_W-\theta_{W-1}\|+\|\theta_{N-1}-\theta_N\|=2\bar\theta=L_N$$
where the last equality is because $E=1$.

When $E\ge 2$, we divide the stages $\{1,\dots,N-1\}$ into $E-1$ epochs, each of size $\Delta=\lfloor\frac{N-1}{E-1}\rfloor$ (the last epoch may contain fewer than $\Delta$ stages). Let $\mathcal J$ consist of the first stage of each epoch: $\mathcal J=\{1,\Delta+1,\dots,(E-2)\Delta+1\}$. Let $\theta_t$ for $t\in\mathcal J$ be i.i.d. draws from the distribution (26). Let the rest of the $\theta_t$ equal the value at the start of their corresponding epoch, i.e., $\theta_t=\theta_{k\Delta+1}$ for $k=\lfloor t/\Delta\rfloor$. Now we verify that the constructed $\theta_t$ satisfies the variation budget:
$$\sum_{t=0}^{N}\|\theta_t-\theta_{t-1}\|=\|\theta_1-\theta_0\|+\sum_{k=1}^{E-2}\|\theta_{k\Delta+1}-\theta_{k\Delta}\|+\|\theta_{N-1}-\theta_N\|\le\bar\theta+2(E-2)\bar\theta+\bar\theta\le L_N$$
by $\theta_0=\theta_{-1}=\theta_N=0$.

The tracking cost of our LQ tracking problem is
$$J(x,u)=\sum_{t=0}^{N-1}\Big(\frac{\delta}{2}\|x_t-\theta_t\|^2+\frac12u_t^2\Big)+\frac12x_N^\top P^ex_N$$
We verify that $C(z)$'s condition number is $\zeta$ in Step 2.
Step 2: convert LQ tracking to $\min C(z)$ and find $z^*$. The corresponding unconstrained objective $C(z)$ of the LQ tracking problem constructed above has the explicit form
$$C(z)=\sum_{t=0}^{N-1}\Big(\frac{\delta}{2}\sum_{i=1}^{n}(z_{t-n+i}-\theta^i_t)^2+\frac12(z_{t+1}-z_{t-n+1})^2\Big)+\frac12\sum_{i=1}^{n}q_iz^2_{N-n+i}$$
with $z_t=0$ and $\theta_t=0$ for $t\le 0$.

Since $C(z)$ is strongly convex, $\min C(z)$ admits a unique optimal solution, denoted $z^*$, which is determined by the first-order optimality condition $\nabla C(z^*)=0$. In addition, our constructed $C(z)$ is a quadratic function, so there exist a matrix $H\in\mathbb R^{N\times N}$ and a vector $\eta\in\mathbb R^{N}$ such that $\nabla C(z^*)=Hz^*-\eta=0$. The partial gradients of $C(z)$ are
$$\frac{\partial C}{\partial z_t}=\delta\big((z_t-\theta^n_t)+(z_t-\theta^{n-1}_{t+1})+\dots+(z_t-\theta^1_{t+n-1})\big)+(z_t-z_{t+n})+(z_t-z_{t-n}),\quad 1\le t\le N-n$$
$$\frac{\partial C}{\partial z_t}=\delta\big((z_t-\theta^n_t)+\dots+(z_t-\theta^{n+t-N+1}_{N-1})\big)+q_{n+t-N}z_t+z_t-z_{t-n},\quad N-n+1\le t\le N$$
For simplicity and without loss of generality, we assume $N/p$ is an integer. Then $H$ can be represented as the block matrix
$$H=\begin{pmatrix}(\delta n+2)I_n & -I_n & & \\ -I_n & (\delta n+2)I_n & \ddots & \\ & \ddots & \ddots & -I_n\\ & & -I_n & (q_n+1)I_n\end{pmatrix}$$
and $\eta$ is a linear combination of $\theta$: $\eta_t=\delta(\theta^n_t+\dots+\theta^1_{t+n-1})=\delta(e_n^\top\theta_t+\dots+e_1^\top\theta_{t+n-1})$, where $e_1,\dots,e_n\in\mathbb R^n$ are standard basis vectors and $\theta_t=0$ for $t\ge N$.

By Gershgorin's disc theorem, $H$'s condition number is $(\delta n+4)/(\delta n)=\zeta$ by our choice of $\delta$ in Step 1 and $p=n$.

Since $H$ is strictly diagonally dominant with positive diagonal entries and nonpositive off-diagonal entries, $H$ is invertible and its inverse, denoted $Y$, is nonnegative. Consequently, the optimal solution can be represented as $z^*=Y\eta$. We use $Y_{ij}$ to denote the entry of $Y$ in the $i$-th row and $j$-th column.
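This structure is simple to reproduce numerically. The sketch below is ours ($n$, $\zeta$, and the number of blocks are arbitrary); it assembles the block-tridiagonal $H$, then checks that the condition number is close to $(\delta n+4)/(\delta n)=\zeta$ and that $Y=H^{-1}$ is entrywise nonnegative.

```python
import numpy as np

n, zeta, nblk = 3, 4.0, 8                          # nblk = N / p blocks
delta = 4.0 / ((zeta - 1) * n)
qn = (n * delta + np.sqrt((n * delta) ** 2 + 4 * n * delta)) / 2
H = np.kron(np.eye(nblk) * (delta * n + 2), np.eye(n))
H[-n:, -n:] = (qn + 1) * np.eye(n)                 # last diagonal block
off = np.kron(np.diag(np.ones(nblk - 1), 1), -np.eye(n))
H += off + off.T                                   # -I_n off-diagonal blocks
Y = np.linalg.inv(H)
print(np.linalg.cond(H), (delta * n + 4) / (delta * n))  # both close to zeta
print(Y.min())                                     # inverse is nonnegative
```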
It will be helpful to write $z^*_{t+1}$ directly in terms of $\theta_t$, since we will later analyze the dependence of the optimal solution on the target trajectory. We derive
$$z^*_{t+1}=\sum_{i=1}^{N}Y_{t+1,i}\,\eta_i=\delta\sum_{i=1}^{N}Y_{t+1,i}\sum_{j=0}^{n-1}e_{n-j}^\top\theta_{i+j}\ (\text{by } \eta_i\text{'s definition})=\delta\sum_{k=1}^{N-1}v_{t+1,k}\theta_k \qquad (27)$$
by $\theta_t=0$ for $t\ge N$, where $v_{t+1,k}=Y_{t+1,k}e_n^\top+\dots+Y_{t+1,k+1-n}e_1^\top\in\mathbb R^{1\times n}$ and $Y_{t+1,i}=0$ for $i\le 0$.

In addition, we show in the next lemma that $Y$ has row entries that decay starting from the diagonal. The proof is technical and deferred to Appendix F.1.

Lemma 11. When $N/p$ is an integer, the inverse of $H$, denoted $Y$, can be represented as a block matrix
$$Y=\begin{pmatrix}y_{1,1}I_n & y_{1,2}I_n & \cdots & y_{1,N/p}I_n\\ y_{2,1}I_n & y_{2,2}I_n & \cdots & y_{2,N/p}I_n\\ \vdots & \vdots & \ddots & \vdots\\ y_{N/p,1}I_n & y_{N/p,2}I_n & \cdots & y_{N/p,N/p}I_n\end{pmatrix}$$
where $y_{t,t+\tau}\ge\frac{1-\rho}{\delta n+2}\rho^\tau>0$ for $\tau\ge 0$ and $\rho=\frac{\sqrt\zeta-1}{\sqrt\zeta+1}$.
Step 3: characterize $z_{t+1}(\mathcal A_z)$. For any online control algorithm $\mathcal A$, we can define an equivalent online algorithm for $z$, denoted $\mathcal A_z$, which outputs $z_{t+1}(\mathcal A_z)$ at each time step $t$ based on the predictions and the history, i.e.,
$$z_{t+1}(\mathcal A_z)=\mathcal A_z(\{\theta_s\}_{s=0}^{t+W-1}),\quad t\ge 0$$
For simplicity, we consider deterministic online algorithms (the proof can easily be generalized to randomized algorithms). Notice that $z_{t+1}$ is a random variable because $\theta_1,\dots,\theta_{t+W-1}$ are random. Based on this observation and Lemma 11, we can provide a regret lower bound.
Step 4: prove the regret lower bound for $\mathcal A$. Roughly speaking, regret occurs when something unexpected happens beyond the prediction window: at each $t$, the prediction window extends to $t+W-1$, but if $\theta_{t+W}$ changes from $\theta_{t+W-1}$, the online algorithm cannot prepare for it, resulting in poor control and positive regret.

By our construction of $\theta_t$, the changes happen at $t\in\mathcal J$. To study the stages $t$ with unexpected changes at $t+W$, we define the set of all such $t$: $\mathcal J_1=\{0\le t\le N-W-1\mid t+W\in\mathcal J\}$. By our construction, the cardinality of $\mathcal J_1$ can be lower bounded by $L_N$ up to constants:
$$|\mathcal J_1|\ge\frac{1}{12\bar\theta}L_N \qquad (28)$$
The proof of (28) is as follows. When $E=1$, $\mathcal J_1=\{0\}$, so $|\mathcal J_1|=1=\frac{L_N}{2\bar\theta}\ge\frac{1}{12\bar\theta}L_N$.

When $E\ge 2$, notice that $|\mathcal J_1|=|\mathcal J|-|\{1\le t\le W-1\mid t\in\mathcal J\}|$. Since $|\mathcal J|=E-1$ and $|\{1\le t\le W-1\mid t\in\mathcal J\}|=\lfloor\frac{W-1}{\Delta}\rfloor$, we have
$$|\mathcal J_1|=E-1-\Big\lfloor\frac{W-1}{\Delta}\Big\rfloor\ge E-1-\frac{N/3-1}{\Delta}\ge E-1-\frac{(N-1)/3}{\Delta}=E-1-\frac{(N-1)/3}{\lfloor\frac{N-1}{E-1}\rfloor}\ge E-1-\frac{(N-1)/3}{\frac{N-1}{2(E-1)}}=\frac13(E-1)\ge\frac16E=\frac{L_N}{12\bar\theta}$$
where the first inequality is by $W\le N/3$, the second equality is by substituting the definition of $\Delta$, the third inequality is by $\frac{N-1}{E-1}\ge 1$ and $\lfloor\frac{N-1}{E-1}\rfloor\ge\frac12\cdot\frac{N-1}{E-1}$, and the last inequality is by $E\ge 2$.

Moreover, we show in Lemma 12 that, for all $t\in\mathcal J_1$, the online decision $z_{t+1}(\mathcal A_z)$ differs from the optimal solution $z^*_{t+1}$, and the difference is lower bounded.

Lemma 12. For $t\in\mathcal J_1$,
$$\mathbb E\|z_{t+1}(\mathcal A_z)-z^*_{t+1}\|^2\ge c_{10}\sigma^2\rho^{2K}$$
where $c_{10}$ is a constant determined by $A, B, n, Q, R$.

The lower bound on the difference between the online decision and the optimal decision yields a lower bound on the regret. By the $n\delta$-strong convexity of $C(z)$,
$$\mathbb E\big(C(z(\mathcal A_z))-C(z^*)\big)\ge\frac{\delta n}{2}\sum_{t\in\mathcal J_1}\mathbb E\|z_{t+1}(\mathcal A_z)-z^*_{t+1}\|^2\ge\frac{L_N}{12\bar\theta}c_{10}\sigma^2\rho^{2K}=\frac{L_N}{12\bar\theta}c_{10}\frac{\bar\theta^2}{n}\rho^{2K}=\Omega(L_N\rho^{2K})$$
where the factor $\delta n/2$ is absorbed into the constants. By the equivalence between $\mathcal A$ and $\mathcal A_z$, we have $\mathbb EJ(\mathcal A)-\mathbb EJ^*=\Omega(\rho^{2K}L_N)$. By the property of expectation, there must exist some realization of the random $\{\theta_t\}$ such that $J(\mathcal A)-J^*=\Omega(\rho^{2K}L_N)$, which completes the proof.
Proof of Lemma 12. By our construction, $\theta_t$ is random; $z^{\mathcal A}_{t+1}$ is also random, with its randomness supplied by $\theta_1,\dots,\theta_{t+W-1}$, while $z^*_{t+1}$ is determined by all of the $\theta_t$. By the i.i.d. construction of $\theta_t$,
$$\begin{aligned}
\mathbb E\|z^{\mathcal A}_{t+1}-z^*_{t+1}\|^2 &= \mathbb E\Big\|z^{\mathcal A}_{t+1}-\delta\sum_{i=1}^{N-1}v_{t+1,i}\theta_i\Big\|^2 &&(\text{by } (27))\\
&= \mathbb E\Big\|z^{\mathcal A}_{t+1}-\delta\sum_{i=1}^{t+W-1}v_{t+1,i}\theta_i\Big\|^2+\delta^2\,\mathbb E\Big\|\sum_{i=t+W}^{N-1}v_{t+1,i}\theta_i\Big\|^2\\
&\ge \delta^2\,\mathbb E\Big\|\sum_{i=t+W}^{N-1}v_{t+1,i}\theta_i\Big\|^2
\end{aligned}$$
For $t\in\mathcal J_1$, we have $t+W\le N-1$ and $t+W\in\mathcal J$, so by the construction of $\theta_t$ we have $\theta_{t+W}=\dots=\theta_{t+W+\Delta-1},\ \dots,\ \theta_{(E-2)\Delta+1}=\dots=\theta_{N-1}$, and $\theta_N=0$. In addition, $\theta_{t+W},\theta_{t+W+\Delta},\dots,\theta_{(E-2)\Delta+1}$ are i.i.d. with zero mean and covariance $\sigma^2I_n$. Thus,
$$\begin{aligned}
\mathbb E\Big\|\sum_{i=t+W}^{N-1}v_{t+1,i}\theta_i\Big\|^2 &= \mathbb E\Big\|\sum_{i=t+W}^{t+W+\Delta-1}v_{t+1,i}\theta_{t+W}\Big\|^2+\dots+\mathbb E\Big\|\sum_{i=(E-2)\Delta+1}^{N-1}v_{t+1,i}\theta_{(E-2)\Delta+1}\Big\|^2\\
&\ge \Big\|\sum_{i=t+W}^{t+W+\Delta-1}v_{t+1,i}\Big\|^2\sigma^2+\dots+\Big\|\sum_{i=(E-2)\Delta+1}^{N-1}v_{t+1,i}\Big\|^2\sigma^2\\
&\ge \sigma^2\sum_{i=t+W}^{N-1}\|v_{t+1,i}\|^2=\sigma^2\sum_{i=t+W}^{N-1}\Big(\sum_{k=0}^{n-1}Y^2_{t+1,i-k}\Big) &&(\text{by } v_{t+1,i}\text{'s definition})\\
&\ge \sigma^2\sum_{i=t+1+W-n}^{N-1}Y^2_{t+1,i}=\sigma^2\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}
\end{aligned}$$
where the second inequality is by $v_{t+1,i}$ having nonnegative entries, and the last equality is because $Y_{t+1,N}=0$ when $t\in\mathcal J_1$.

When $1\le W\le n$, $\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}\ge Y^2_{t+1,t+1}$. When $W>n$, $\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}\ge Y^2_{t+1,\,t+1+n\lceil\frac{W-n}{n}\rceil}$. Moreover, when $W\ge 1$, $\lceil\frac{W-n}{n}\rceil=\lfloor\frac{W-1}{n}\rfloor$. Therefore, for any $W\ge 1$,
$$\sum_{i=t+1+W-n}^{N}Y^2_{t+1,i}\ge Y^2_{t+1,\,t+1+n\lfloor\frac{W-1}{n}\rfloor}\ge\rho^{2K}\Big(\frac{1-\rho}{\delta n+2}\Big)^2$$
where the last inequality is by Lemma 11 and $p=n$.
F.1 Proof of Lemma 11

Proof. Since $H$ is the block matrix
$$H=\begin{pmatrix}(\delta n+2)I_n & -I_n & & \\ -I_n & (\delta n+2)I_n & \ddots & \\ & \ddots & \ddots & -I_n\\ & & -I_n & (q_n+1)I_n\end{pmatrix}$$
its inverse $Y$ can also be represented as a block matrix. Moreover, let
$$H_1=\begin{pmatrix}\delta n+2 & -1 & \cdots & 0\\ -1 & \delta n+2 & \ddots & 0\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & -1 & q_n+1\end{pmatrix}\in\mathbb R^{(N/p)\times(N/p)}$$
and $\bar Y=(H_1)^{-1}=(y_{ij})_{i,j}$. Then the inverse matrix $Y$ can be represented as $(y_{ij}I_n)$.

Now it suffices to provide a lower bound on $y_{ij}$.

Since $H_1$ is a symmetric positive definite tridiagonal matrix, by [55] its inverse has an explicit formula given by $(H_1)^{-1}_{ij}=a_ib_j$ for $i\le j$, where
$$a_i=\frac{\rho}{1-\rho^2}\Big(\frac{1}{\rho^i}-\rho^i\Big),\qquad b_t=c_3\frac{1}{\rho^{N-t}}+c_4\,\rho^{N-t}$$
$$c_3=b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2},\qquad c_4=b_N\frac{1-(q_n+1)\rho}{1-\rho^2},\qquad b_N=\frac{1}{-a_{N-1}+(q_n+1)a_N}$$
(with a slight abuse of notation, $N$ here denotes the dimension $N/p$ of $H_1$). In the following, we show that $a_tb_{t+\tau}\ge\frac{1-\rho}{\delta n+2}\rho^\tau$.

Firstly, it is easy to verify that
$$\rho^ta_t=\frac{\rho}{1-\rho^2}(1-\rho^{2t})\ge\rho$$
since $t\ge 1$ and $\rho<1$.

Secondly, we bound $b_N$ as follows:
$$\rho^{-N}b_N=\frac{1}{(q_n+1)(1-\rho^{2N})-(\rho-\rho^{2N-1})}\cdot\frac{1-\rho^2}{\rho}\ge\frac{1}{\delta n+2}\cdot\frac{1-\rho^2}{\rho}$$
because $0<(q_n+1)(1-\rho^{2N})-(\rho-\rho^{2N-1})\le\delta n+2$.

Thirdly, we bound $b_{t+\tau}$. When $1-(q_n+1)\rho\ge 0$,
$$\begin{aligned}
\rho^{N-t-\tau}b_{t+\tau}&=b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2}+b_N\frac{1-(q_n+1)\rho}{1-\rho^2}\rho^{2(N-t-\tau)}\\
&\ge b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2} &&(\text{by } 1-(q_n+1)\rho\ge 0)\\
&\ge b_N\frac{(\delta n+1)\rho-\rho^2}{1-\rho^2} &&(\text{by } q_n>n\delta)\\
&=\frac{1-\rho}{1-\rho^2}b_N
\end{aligned}$$
where the last equality is by $\rho^2-(\delta n+2)\rho+1=0$.

When $1-(q_n+1)\rho<0$,
$$\begin{aligned}
\rho^{N-t-\tau}b_{t+\tau}&=b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2}+b_N\frac{1-(q_n+1)\rho}{1-\rho^2}\rho^{2(N-t-\tau)}\\
&\ge b_N\frac{(q_n+1)\rho-\rho^2}{1-\rho^2}+b_N\frac{1-(q_n+1)\rho}{1-\rho^2} &&(\text{by } 1-(q_n+1)\rho<0 \text{ and } \rho^{2(N-t-\tau)}\le 1)\\
&= b_N
\end{aligned}$$
Since $\frac{1-\rho}{1-\rho^2}\le 1$, both cases give $\rho^{N-t-\tau}b_{t+\tau}\ge\frac{1-\rho}{1-\rho^2}b_N$. Combining the three parts,
$$y_{t,t+\tau}=a_tb_{t+\tau}\ge\rho\, b_N\frac{1-\rho}{1-\rho^2}\rho^{\tau-N}\ge\frac{1-\rho}{\delta n+2}\rho^\tau$$
G Proofs of properties of LQT in Appendix E
In this section, we provide proofs for the properties of LQ tracking (LQT) stated in Appendix E.
G.1 Preliminaries: dynamic programming for finite-horizon LQT

In this section, we consider a discrete-time LQ tracking problem with time-varying cost functions and a time-invariant dynamical system:
$$\begin{aligned}
\min_{x_t,u_t}\quad &\frac12\sum_{t=0}^{N-1}\big((x_t-\theta_t)^\top Q_t(x_t-\theta_t)+u_t^\top R_tu_t\big)+\frac12(x_N-\theta_N)^\top Q_N(x_N-\theta_N)\\
\text{s.t.}\quad & x_{t+1}=Ax_t+Bu_t,\quad t=0,\dots,N-1
\end{aligned}$$
where $x_0=0$ for simplicity.

The problem can be solved by dynamic programming.

Theorem 4 (Dynamic programming for finite-horizon LQT). Consider a finite-horizon time-varying LQ tracking problem. Let $V_t(x_t)$ be the cost to go from $k=t$ to $k=N$; then
$$V_t(x_t)=\frac12(x_t-\beta_t)^\top P_t(x_t-\beta_t)+\frac12\sum_{k=t}^{N-1}(A\theta_k-\beta_{k+1})^\top H_k(A\theta_k-\beta_{k+1})$$
for $t=0,\dots,N$. The parameters can be obtained by
$$\begin{aligned}
P_t&=Q_t+A^\top M_tA, && t=0,\dots,N-1,\qquad P_N=Q_N\\
M_t&=P_{t+1}-P_{t+1}B(R_t+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}, && t=0,\dots,N-1\\
\beta_t&=(Q_t+A^\top M_tA)^{-1}(Q_t\theta_t+A^\top M_t\beta_{t+1}), && t=0,\dots,N-1,\qquad \beta_N=\theta_N\\
H_t&=M_t-M_tA(Q_t+A^\top M_tA)^{-1}A^\top M_t, && t=0,\dots,N-1
\end{aligned}$$
The optimal controller is
$$u^*_t=-K_tx_t+K'_t\beta_{t+1},\quad t=0,\dots,N-1$$
where the parameters are
$$K_t=(R_t+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A,\qquad K'_t=(R_t+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}$$
There is another way to write the optimal controller:
$$u^*_t=-K_tx_t+K^\alpha_t\alpha_{t+1},\quad t=0,\dots,N-1$$
where the parameters are
$$K^\alpha_t=(R_t+B^\top P_{t+1}B)^{-1}B^\top,\qquad \alpha_t=P_t\beta_t,\qquad \alpha_t=Q_t\theta_t+(A-BK_t)^\top\alpha_{t+1},\ t=0,\dots,N-1,\qquad \alpha_N=P_N\theta_N$$
The proof is by dynamic programming [56].
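The recursions of Theorem 4 map directly to code. The following is a minimal sketch of the backward pass (ours, not the authors' implementation); Qs holds $Q_0,\dots,Q_N$ (with $Q_N$ the terminal weight), Rs holds $R_0,\dots,R_{N-1}$, and thetas holds $\theta_0,\dots,\theta_N$.

```python
import numpy as np

def lqt_backward_pass(A, B, Qs, Rs, thetas):
    """Backward recursion of Theorem 4. Returns the gains K_t, K'_t and the
    targets beta_t; the optimal control is u*_t = -K_t x_t + K'_t beta_{t+1}."""
    N = len(thetas) - 1
    P, betas = Qs[N], [None] * (N + 1)
    betas[N] = thetas[N]                                  # beta_N = theta_N
    Ks, Kps = [None] * N, [None] * N
    for t in range(N - 1, -1, -1):
        S = Rs[t] + B.T @ P @ B
        M = P - P @ B @ np.linalg.solve(S, B.T @ P)       # M_t (uses P_{t+1})
        Ks[t] = np.linalg.solve(S, B.T @ P @ A)           # K_t
        Kps[t] = np.linalg.solve(S, B.T @ P)              # K'_t
        P = Qs[t] + A.T @ M @ A                           # P_t
        betas[t] = np.linalg.solve(P, Qs[t] @ thetas[t] + A.T @ M @ betas[t + 1])
    return Ks, Kps, betas
```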
G.2 Proof of Lemma 9

In the following, we first prove that the recursive solution $P_t$ of the finite-horizon LQR is bounded. Then, by taking limits, we prove that $P^e$ is bounded.

Lemma 13 (Bounded $P_t$ for finite-horizon LQT). Consider a finite-horizon time-varying LQT problem. For any $N$, any $0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, we have $P_t\in\mathcal P$, where $P_t$ is defined in Theorem 4.
Proof. Since $P_t$ does not depend on $\theta_t$, we let $\theta_t=0$ when proving this lemma and consider the LQR problem for simplicity. Since $\underline Q\le Q_t\le\bar Q$ and $\underline R\le R_t\le\bar R$ for $0\le t\le N-1$, and $\underline P\le Q_N\le\bar P$, we have for any $x_t, u_t, k, Q_t, R_t, Q_N$,
$$\sum_{t=k}^{N-1}(x_t^\top Q_tx_t+u_t^\top R_tu_t)+x_N^\top Q_Nx_N\le\sum_{t=k}^{N-1}(x_t^\top\bar Qx_t+u_t^\top\bar Ru_t)+x_N^\top\bar Px_N$$
$$\sum_{t=k}^{N-1}(x_t^\top Q_tx_t+u_t^\top R_tu_t)+x_N^\top Q_Nx_N\ge\sum_{t=k}^{N-1}(x_t^\top\underline Qx_t+u_t^\top\underline Ru_t)+x_N^\top\underline Px_N$$
Taking the minimum over all feasible trajectories on both sides, we have
$$\min_{x_{t+1}=Ax_t+Bu_t}\ \sum_{t=k}^{N-1}(x_t^\top Q_tx_t+u_t^\top R_tu_t)+x_N^\top Q_Nx_N\le\min_{x_{t+1}=Ax_t+Bu_t}\ \sum_{t=k}^{N-1}(x_t^\top\bar Qx_t+u_t^\top\bar Ru_t)+x_N^\top\bar Px_N$$
$$\min_{x_{t+1}=Ax_t+Bu_t}\ \sum_{t=k}^{N-1}(x_t^\top Q_tx_t+u_t^\top R_tu_t)+x_N^\top Q_Nx_N\ge\min_{x_{t+1}=Ax_t+Bu_t}\ \sum_{t=k}^{N-1}(x_t^\top\underline Qx_t+u_t^\top\underline Ru_t)+x_N^\top\underline Px_N$$
Notice that the left-hand side equals $x_k^\top P_kx_k$. Moreover,
$$x_k^\top\bar Px_k=\min_{x_{t+1}=Ax_t+Bu_t}\ \sum_{t=k}^{N-1}(x_t^\top\bar Qx_t+u_t^\top\bar Ru_t)+x_N^\top\bar Px_N$$
because $\bar P=P^e(\bar Q,\bar R)$. The same holds for $\underline P$. Therefore,
$$x_k^\top\underline Px_k\le x_k^\top P_kx_k\le x_k^\top\bar Px_k$$
for any $x_k$, so $\underline P\le P_k\le\bar P$, i.e., $P_k\in\mathcal P$.

Proof of Lemma 9. Consider the finite-horizon time-invariant LQR problem with stage cost $Q, R$, i.e., with total cost $\sum_{k=0}^{N-1}(x_k^\top Qx_k+u_k^\top Ru_k)$. By Lemma 13, we have $\underline P\le P_k\le\bar P$. Since $P_k\to P^e$ as $k\to-\infty$, we have $\underline P\le P^e\le\bar P$; consequently, $\|P^e\|_2\le\upsilon_{\max}(\bar P)$.
G.3 Proof of Lemma 6

Based on the dynamic programming solution in Theorem 4, we can provide a more complete characterization of the solution to the Bellman equation, including formulas for $\lambda^e$, $h^e$, and the optimal controller.

Lemma 14 (Optimal solution to average-cost LQ tracking). Suppose $(A,B)$ is controllable and $Q, R$ are positive definite. The optimal average cost $\lambda^e$ does not depend on the initial state $x_0$ and equals
$$\lambda^e=\frac12(A\theta-\beta^e)^\top H^e(A\theta-\beta^e),$$
the solution to the Bellman equation $h^e(x)+\lambda^e=\min_u\big(f(x)+g(u)+h^e(Ax+Bu)\big)$ can be represented by
$$h^e(x)=\frac12(x-\beta^e)^\top P^e(x-\beta^e),$$
and the optimal controller is
$$u=-K^ex+K'\beta^e$$
where $P^e=P^e(Q,R)$, $\alpha^e=Q\theta+(A-BK^e)^\top\alpha^e$, $\beta^e=(P^e)^{-1}\alpha^e$, and
$$\beta^e=F\theta \qquad (29)$$
with $F=(P^e)^{-1}\big(I-(A-BK^e)^\top\big)^{-1}Q$ depending only on $A, B, Q, R$. Here $M^e=P^e-P^eB(R+B^\top P^eB)^{-1}B^\top P^e$, $H^e=M^e-M^eA(Q+A^\top M^eA)^{-1}A^\top M^e$, $K^e=(R+B^\top P^eB)^{-1}B^\top P^eA$, and $K'=(R+B^\top P^eB)^{-1}B^\top P^e$.
Proof of Lemma 14. Proof outline:
• optimal average cost formula;
• bias function $h^e(x)$'s formula;
• optimal controller formula.

Step 1: optimal average cost formula. Consider a finite-horizon LQT problem:
$$\min_{x_t,u_t}\ \frac12\sum_{t=0}^{N-1}\big((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\big)\quad\text{s.t.}\quad x_{t+1}=Ax_t+Bu_t,\ t=0,\dots,N-1$$
Given initial state $x_0$, by Theorem 4, the total optimal cost over $N$ time steps is
$$J^*_N(x_0)=\frac12(x_0-\beta_0)^\top P_0(x_0-\beta_0)+\frac12\sum_{k=0}^{N-1}(A\theta-\beta_{k+1})^\top H_k(A\theta-\beta_{k+1})$$
The proof first shows that $\beta_k\to\beta^e$, $P_k\to P^e$, and $H_k\to H^e$ as $k\to-\infty$, and consequently $\frac12(A\theta-\beta_{k+1})^\top H_k(A\theta-\beta_{k+1})\to\frac12(A\theta-\beta^e)^\top H^e(A\theta-\beta^e)$ as $k\to-\infty$. Then the optimal average cost over the infinite horizon is
$$\lambda^e=\lim_{N\to+\infty}\frac1N\Big(\frac12(x_0-\beta_0)^\top P_0(x_0-\beta_0)+\frac12\sum_{k=0}^{N-1}(A\theta-\beta_{k+1})^\top H_k(A\theta-\beta_{k+1})\Big)=\frac12(A\theta-\beta^e)^\top H^e(A\theta-\beta^e)$$

Now we prove $\beta_k\to\beta^e$, $P_k\to P^e$, and $H_k\to H^e$ as $k\to-\infty$. The convergence of $P_k$ follows from Proposition 4.4.1 of [52]. Since matrix inversion is continuous at invertible matrices, we have $M_k\to M^e$ and $H_k\to H^e$ as $k\to-\infty$. Similarly, $K_k\to K^e$, $K^\alpha_k\to K^\alpha$, and $K'_k\to K'$ as $k\to-\infty$. Notice that $\beta_k=P_k^{-1}\alpha_k$, so we can prove the convergence of $\beta_k$ by proving $\alpha_k\to\alpha^e$ as $k\to-\infty$. The backward recursion for $\alpha_t$ is $\alpha_t=Q\theta+(A-BK_t)^\top\alpha_{t+1}$, and we have $(A-BK_k)^\top\to(A-BK^e)^\top$ as $k\to-\infty$. Based on the lemma below, $\alpha_k\to\alpha^e$ as $k\to-\infty$, where $\alpha^e=Q\theta+(A-BK^e)^\top\alpha^e$.

Lemma 15 (Convergence of a time-varying system). If $A_t\to A$ and $A$ is stable, then the system $x_{t+1}=A_tx_t+\eta$ converges to $x_s$ satisfying $x_s=Ax_s+\eta$, for any bounded initial value $x_0$.

The proof of this lemma is provided later in this subsection.
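Lemma 15 can be illustrated quickly in simulation. The sketch below is ours, with an arbitrary stable $A$ and a vanishing random perturbation playing the role of $A_t\to A$; it checks that the state converges to the fixed point $x_s=(I-A)^{-1}\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1], [0.0, 0.8]])        # stable: spectral radius < 1
eta = np.array([1.0, -1.0])
xs = np.linalg.solve(np.eye(2) - A, eta)      # fixed point x_s = A x_s + eta
x = rng.normal(size=2)
for t in range(1, 500):
    At = A + np.exp(-0.05 * t) * rng.normal(scale=0.1, size=(2, 2))  # A_t -> A
    x = At @ x + eta
print(np.linalg.norm(x - xs))                 # close to 0
```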
Step 2: $h^e(x)$'s formula. The proof plugs the formulas for $h^e(x)$ and $\lambda^e$ into both sides of the Bellman equation and shows that equality holds. The right-hand side (RHS) of the Bellman equation is
$$\begin{aligned}
\text{RHS}&=\min_u\ \frac12(x-\theta)^\top Q(x-\theta)+\frac12u^\top Ru+\frac12(Ax+Bu-\beta^e)^\top P^e(Ax+Bu-\beta^e)\\
&=\frac12(x-\theta)^\top Q(x-\theta)+\frac12(Ax-\beta^e)^\top M^e(Ax-\beta^e)\\
&=\frac12(A\theta-\beta^e)^\top H^e(A\theta-\beta^e)+\frac12(x-\beta^e)^\top P^e(x-\beta^e)=\text{LHS}
\end{aligned}$$
where $M^e=P^e-P^eB(R+B^\top P^eB)^{-1}B^\top P^e$, the optimal control input is $u^e=-K^ex+K'\beta^e$, and the last two equalities are based on the following fact.

Fact: Consider a function
$$g(u)=\frac12(u-\xi)^\top R(u-\xi)+\frac12(Cu+\eta)^\top P(Cu+\eta)$$
where $P, R$ are positive definite, $u,\xi,\eta$ are vectors, and $C$ is a matrix. Then
$$g(u)=\frac12(u-u^*)^\top(R+C^\top PC)(u-u^*)+\frac12(C\xi+\eta)^\top M(C\xi+\eta)$$
$$u^*=(R+C^\top PC)^{-1}(R\xi-C^\top P\eta),\qquad M=P-PC(R+C^\top PC)^{-1}C^\top P$$
Step 3: optimal controller's formula. We prove that $u=-K^ex+K'\beta^e$ is the optimal controller by showing that the average cost of implementing this controller is no more than the optimal average cost $\lambda^e$. Let $x_t, u_t$ be the state and control at time $t$ when implementing $u=-K^ex+K'\beta^e$:
$$\begin{aligned}
\frac1N\cdot\frac12\sum_{t=0}^{N-1}\big((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\big)
&\le\frac1N\Big(\frac12\sum_{t=0}^{N-1}\big((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\big)+\frac12(x_N-\beta^e)^\top P^e(x_N-\beta^e)\Big)\\
&=\frac1N\Big(\frac12(x_0-\beta^e)^\top P^e(x_0-\beta^e)+\frac12\sum_{k=0}^{N-1}(A\theta-\beta^e)^\top H^e(A\theta-\beta^e)\Big)
\end{aligned}$$
where the last equality is by dynamic programming and Step 2. Taking $N\to+\infty$ on both sides,
$$\lim_{N\to+\infty}\frac1N\cdot\frac12\sum_{t=0}^{N-1}\big((x_t-\theta)^\top Q(x_t-\theta)+u_t^\top Ru_t\big)\le\frac12(A\theta-\beta^e)^\top H^e(A\theta-\beta^e)$$
Therefore, the average cost of implementing $u=-K^ex+K'\beta^e$ is no greater than $\frac12(A\theta-\beta^e)^\top H^e(A\theta-\beta^e)$.
G.3.1 Proof of Lemma 15

Since we consider general $A_t$, it is difficult to construct a Lyapunov function, so we instead prove that the error term $d_t=x_t-x_s$ goes to zero. We rewrite the system as
$$\begin{aligned}
d_{t+1}&=A_td_t+\eta+A_tx_s-x_s\\
&=Ad_t+(A_t-A)d_t+\eta+(A_t-I)(I-A)^{-1}\eta &&(\text{by } x_s=(I-A)^{-1}\eta)\\
&=Ad_t+(A_t-A)\big(d_t+(I-A)^{-1}\eta\big)
\end{aligned}$$
Define $w_t=(A_t-A)(d_t+(I-A)^{-1}\eta)$. Then
$$d_{t+1}=Ad_t+w_t \qquad (30)$$
The proof has two steps: first we prove that $d_t$ is bounded, then we prove that $d_t\to0$.

Bounding $d_t$. First, we provide a supporting lemma based on the fact that exponential stability implies BIBO stability for LTI systems.

Lemma 16. Let $S_k=\sum_{t=0}^{k-1}A^{k-1-t}u_t$. If $A$ is stable and $\|u_t\|_2\le M$ for all $t$, then there exists a constant $c_3>0$ such that
$$\|S_k\|_2\le c_3M,\quad\forall\,k=1,2,\dots$$

Proof. Consider the system $x_{t+1}=Ax_t+u_t$ with $x_0=0$. Since $A$ is stable, the system is exponentially stable. By Theorem 9.4 of [47], exponential stability implies bounded-input bounded-output stability, so $\|x_t\|_2\le c_3M$ for all $t$. Since $x_k=S_k=\sum_{t=0}^{k-1}A^{k-1-t}u_t$, we have $\|S_k\|_2\le c_3M$ for all $k=1,2,\dots$.

Next, we prove by induction that $d_t$ is bounded.

Lemma 17. There exists $M>0$ that does not depend on $t$ such that $\|d_t\|_2\le M$ for all $t$.
Proof. Since $A_t\to A$, for any $\epsilon_1$ there exists $N_1$ such that $\|A_t-A\|_2\le\epsilon_1$ when $t\ge N_1$; let $\epsilon_1=1/(4c_3)$. Since $A$ is stable, $A^t\to0$, so for any $\epsilon_2$ there exists $N_2$ such that $\|A^t\|_2\le\epsilon_2$ when $t>N_2$; let $\epsilon_2=1/2$. Let $M=\max(\|d_0\|_2,\dots,\|d_{N_1+N_2}\|_2,\|(I-A)^{-1}\eta\|_2)$. Notice that $\|d_t\|_2\le M$ for $t\le N_1+N_2$. We show that $\|d_{N_1+N_2+1}\|_2\le M$. By (30), with $t=N_1+N_2$,
$$d_{t+1}=A^{N_2+1}d_{N_1}+w_t+Aw_{t-1}+\dots+A^{N_2}w_{N_1}$$
$$\begin{aligned}
\|d_{t+1}\|_2&\le\|A^{N_2+1}\|_2M+\|w_t+Aw_{t-1}+\dots+A^{N_2}w_{N_1}\|_2\\
&\le\epsilon_2M+c_3\max_{N_1\le k\le t}\|w_k\|_2 &&(\text{by Lemma 16})\\
&\le\epsilon_2M+2c_3\epsilon_1M &&(\text{by } w_k=(A_k-A)(d_k+(I-A)^{-1}\eta),\ \text{the definitions of } \epsilon_1, M,\ \text{and } k\ge N_1)\\
&=(1/2+1/2)M=M
\end{aligned}$$
Next, consider any $t\ge N_1+N_2+1$ with $\|d_k\|_2\le M$ for all $k\le t$; we can show $\|d_{t+1}\|_2\le M$ in the same way. Thus we have proved that $\|d_t\|_2\le M$ for all $t$.
Proving $d_t\to0$. It suffices to show that for any $\epsilon_3$, there exists $N_3$ such that $\|d_t\|_2\le\epsilon_3$ when $t>N_3$. Since $A_t\to A$, letting $\epsilon'_1=\epsilon_3/(4c_3M)$, there exists $N'_1$ such that $\|A_t-A\|_2\le\epsilon'_1$ when $t\ge N'_1$, where $M$ is defined in Lemma 17. Since $A$ is stable, $A^t\to0$, so letting $\epsilon'_2=\epsilon_3/(2M)$, there exists $N'_2$ such that $\|A^t\|_2\le\epsilon'_2$ when $t>N'_2$. Let $N_3=N'_1+N'_2$. By (30),
$$d_{t+1}=A^{N'_2+1}d_{N'_1}+w_t+Aw_{t-1}+\dots+A^{N'_2}w_{N'_1}$$
$$\begin{aligned}
\|d_{t+1}\|_2&\le\|A^{N'_2+1}\|_2M+\|w_t+Aw_{t-1}+\dots+A^{N'_2}w_{N'_1}\|_2\\
&\le\epsilon'_2M+c_3\max_{N'_1\le k\le t}\|w_k\|_2 &&(\text{by Lemma 16})\\
&\le\epsilon'_2M+2c_3\epsilon'_1M &&(\text{by } w_k=(A_k-A)(d_k+(I-A)^{-1}\eta),\ \text{the definitions of } \epsilon'_1, M,\ \text{and } k\ge N'_1)\\
&=(1/2+1/2)\epsilon_3=\epsilon_3
\end{aligned}$$
G.4 Proof of Lemma 7

Let $D_t=A-BK_t$, where $K_t$ is defined in Appendix G.1; then $x^*_t$ follows the system
$$x^*_{t+1}=D_tx^*_t+BK^\alpha_t\alpha_{t+1}$$
We prove that $x^*_t$ is bounded in three steps: 1) show that the system $x_{t+1}=D_tx_t$ is exponentially stable; 2) show that $BK^\alpha_t\alpha_{t+1}$ is bounded; 3) show that $x^*_t$ is bounded, using the fact that exponentially stable systems are bounded-input bounded-output stable.

Step 1: show that $x_{t+1}=D_tx_t$ is exponentially stable via a Lyapunov function.

Lemma 18 (Lyapunov function). Define $L(t,x_t)=x_t^\top P_tx_t$. For any $N$, any $0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, and any $x_t$, we have
$$\upsilon_{\min}(\underline P)\|x_t\|_2^2\le L(t,x_t)\le\upsilon_{\max}(\bar P)\|x_t\|_2^2$$
$$L(t+1,D_tx_t)-L(t,x_t)\le-\mu_f\|x_t\|_2^2$$
$L(t,x_t)$ is called a Lyapunov function for the system $x_{t+1}=D_tx_t$.
Proof. By Lemma 13,
$$\upsilon_{\min}(\underline P)I_n\le\underline P\le P_t\le\bar P\le\upsilon_{\max}(\bar P)I_n$$
so for any $x_t$, we have
$$\upsilon_{\min}(\underline P)\|x_t\|_2^2\le L(t,x_t)=x_t^\top P_tx_t\le\upsilon_{\max}(\bar P)\|x_t\|_2^2$$
Notice that
$$\begin{aligned}
L(t+1,D_tx_t)-L(t,x_t)&=x_t^\top D_t^\top P_{t+1}D_tx_t-x_t^\top P_tx_t\\
&=x_t^\top(D_t^\top P_{t+1}D_t-P_t)x_t\\
&=x_t^\top(-Q_t-K_t^\top R_tK_t)x_t &&(\text{by definition})\\
&\le-x_t^\top\underline Qx_t &&(\text{by } Q_t+K_t^\top R_tK_t\ge Q_t\ge\underline Q)\\
&=-\mu_f\|x_t\|_2^2 &&(\text{by } \underline Q=\mu_fI_n)
\end{aligned}$$
By the Lyapunov function above, we can show that $x_{t+1}=D_tx_t$ is exponentially stable. To provide a formula for the exponential decay rate, we introduce a technical lemma before proving exponential stability.

Lemma 19. $0\le\mu_f\le l_f\le\upsilon_{\max}(\bar P)$.

Proof. Take $Q_N=0$ and $Q_t=\bar Q$, $R_t=\bar R$; then $P_{N-1}=\bar Q=l_fI_n$. By the proof of Proposition 4.4.1 in [52], we have $\bar P=P^e(\bar Q,\bar R)\ge P_{N-1}$, so $l_f\le\upsilon_{\max}(\bar P)$.
Next, we prove exponential stability.

Proposition 1 (Exponential stability). Define the state transition matrix
$$\Phi(t,t_0)=D_{t-1}\cdots D_{t_0}$$
for $t>t_0$, and $\Phi(t,t)=I$. For any $N$, any $0\le t_0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, and any $x_{t_0}$, we have
$$\|x_t\|_2\le c_1c_2^{t-t_0}\|x_{t_0}\|_2 \qquad (31)$$
$$\|\Phi(t,t_0)\|_2\le c_1c_2^{t-t_0} \qquad (32)$$
where $c_1=\sqrt{\frac{\upsilon_{\max}(\bar P)}{\upsilon_{\min}(\underline P)}}$ and $c_2=\sqrt{1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}}\in(0,1)$.

Proof. For any $x_{t_0}$, denote by $x_t$ the solution of the system $x_{t+1}=D_tx_t$ starting at $x_{t_0}$. By Lemma 18,
$$L(t+1,x_{t+1})-L(t,x_t)\le-\mu_f\|x_t\|_2^2\le-\frac{\mu_f}{\upsilon_{\max}(\bar P)}L(t,x_t)$$
So for any $t\ge t_0$,
$$L(t+1,x_{t+1})\le\Big(1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}\Big)L(t,x_t)$$
As a result,
$$\upsilon_{\min}(\underline P)\|x_t\|_2^2\le L(t,x_t)\le\Big(1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}\Big)^{t-t_0}L(t_0,x_{t_0})\le\Big(1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}\Big)^{t-t_0}\upsilon_{\max}(\bar P)\|x_{t_0}\|_2^2$$
This proves (31). As for the state transition matrix, the bound (32) follows by noticing that $x_t=\Phi(t,t_0)x_{t_0}$ and $\|\Phi(t,t_0)\|_2=\max_{x_{t_0}\neq0}\frac{\|x_t\|}{\|x_{t_0}\|}$.
Step 2: show that $BK^\alpha_t\alpha_{t+1}$ is bounded. We first show that $\alpha_t$ is bounded, and then that $BK^\alpha_t\alpha_{t+1}$ is bounded.

Lemma 20 (Bound on $\alpha_t$). For any $N$, any $0\le t\le N$, any $Q_t\in\mathcal Q$, $R_t\in\mathcal R$, $Q_N\in\mathcal P$, we have
$$\|\alpha_t\|_2\le\frac{c_1}{1-c_2}\upsilon_{\max}(\bar P)\bar\theta=:\bar\alpha$$
where $c_1=\sqrt{\frac{\upsilon_{\max}(\bar P)}{\upsilon_{\min}(\underline P)}}$ and $c_2=\sqrt{1-\frac{\mu_f}{\upsilon_{\max}(\bar P)}}\in(0,1)$. Consequently,
$$\|BK^\alpha_t\alpha_t\|_2\le\|B\|_2^2\frac{\bar\alpha}{\mu_g}$$
Proof. Consider the system $\alpha_t=D_t^\top\alpha_{t+1}+Q_t\theta_t$. First of all, we bound the input:
$$\|Q_t\theta_t\|_2\le\|Q_t\|_2\|\theta_t\|_2\le\upsilon_{\max}(Q_t)\|\theta_t\|_2\ (\text{since } Q_t \text{ is positive definite})\le l_f\bar\theta\ (\text{by } \|\theta_t\|_2\le\bar\theta \text{ and } Q_t\le l_fI_n)$$
The initial value satisfies $\|\alpha_N\|=\|Q_N\theta_N\|\le\upsilon_{\max}(\bar P)\bar\theta$. By Lemma 19, $l_f\le\upsilon_{\max}(\bar P)$.

Next, by $\alpha_t=D_t^\top\alpha_{t+1}+Q_t\theta_t$ and the definition of the transition matrix $\Phi(t,t_0)$, we have
$$\begin{aligned}
\alpha_t&=Q_t\theta_t+D_t^\top Q_{t+1}\theta_{t+1}+\dots+D_t^\top\cdots D_{N-2}^\top Q_{N-1}\theta_{N-1}+D_t^\top\cdots D_{N-1}^\top P_N\theta_N\\
&=\Phi(t,t)^\top Q_t\theta_t+\Phi(t+1,t)^\top Q_{t+1}\theta_{t+1}+\dots+\Phi(N-1,t)^\top Q_{N-1}\theta_{N-1}+\Phi(N,t)^\top P_N\theta_N
\end{aligned}$$
By the exponential decay of $\Phi(t,t_0)$ established in Proposition 1, we have
$$\begin{aligned}
\|\alpha_t\|_2&\le\|\Phi(t,t)^\top\|_2\|Q_t\theta_t\|_2+\dots+\|\Phi(N-1,t)^\top\|_2\|Q_{N-1}\theta_{N-1}\|_2+\|\Phi(N,t)^\top\|_2\|P_N\theta_N\|\\
&=\|\Phi(t,t)\|_2\|Q_t\theta_t\|_2+\dots+\|\Phi(N-1,t)\|_2\|Q_{N-1}\theta_{N-1}\|_2+\|\Phi(N,t)\|_2\|P_N\theta_N\| &&(\text{by } \|A\|_2=\|A^\top\|_2)\\
&\le c_1c_2^0\,l_f\bar\theta+\dots+c_1c_2^{N-t-1}l_f\bar\theta+c_1c_2^{N-t}\upsilon_{\max}(\bar P)\bar\theta &&(\text{by } l_f\le\upsilon_{\max}(\bar P))\\
&\le c_1\upsilon_{\max}(\bar P)\bar\theta\frac{1}{1-c_2}=\bar\alpha
\end{aligned}$$
Consequently,
$$\|BK^\alpha_t\alpha_t\|_2=\|B(R_t+B^\top P_{t+1}B)^{-1}B^\top\alpha_t\|\le\|B\|_2^2\|(R_t+B^\top P_{t+1}B)^{-1}\|\|\alpha_t\|\ (\text{by } \|B\|_2=\|B^\top\|_2)\le\|B\|_2^2\frac{\bar\alpha}{\mu_g}\ (\text{by } R_t+B^\top P_{t+1}B\ge\mu_gI_m)$$
Step 3: bound $x^*_t$.

Proof of Lemma 7. For simplicity, let $\omega_t=BK^\alpha_t\alpha_{t+1}$ and $\bar\omega=\|B\|_2^2\bar\alpha/\mu_g$. By definition, we have
$$x^*_t=\Phi(t,t)\omega_{t-1}+\Phi(t,t-1)\omega_{t-2}+\dots+\Phi(t,1)\omega_0+\Phi(t,0)x^*_0$$
By Proposition 1,
$$\begin{aligned}
\|x^*_t\|_2&\le\|\Phi(t,t)\|_2\|\omega_{t-1}\|+\dots+\|\Phi(t,1)\|\|\omega_0\|+\|\Phi(t,0)\|\|x^*_0\|\\
&\le c_1c_2^0\bar\omega+\dots+c_1c_2^{t-1}\bar\omega+c_1c_2^{t}\|x_0\|_2\\
&\le c_1\frac{1}{1-c_2}\max(\bar\omega,\|x_0\|_2)=:\bar x
\end{aligned}$$
G.5 Proof of Lemma 8

Consider the finite-horizon time-invariant LQT problem with stage cost $\frac12(x-\theta)^\top Q(x-\theta)+\frac12u^\top Ru$. By Lemma 20, we have $\|\alpha_k\|\le\bar\alpha$. By Lemma 13, we have $\underline P\le P_k\le\bar P$. So $\|\beta_k\|=\|P_k^{-1}\alpha_k\|\le\frac{1}{\upsilon_{\min}(\underline P)}\bar\alpha$. By the proof of Lemma 14, we know $\beta_k\to\beta^e$ as $k\to-\infty$, so $\|\beta^e\|\le\frac{1}{\upsilon_{\min}(\underline P)}\bar\alpha$.
H Simulation descriptions

H.1 LQT

The experiment settings are as follows. Let $A=[0, 1;\ -1/, 5/6]$, $B=[0;1]$, and $N=30$. Consider diagonal $Q_t, R_t$ with diagonal entries i.i.d. from $\mathrm{Unif}[1,2]$. Let $\theta_t$ be i.i.d. from $\mathrm{Unif}[-10,10]$. We apply RHTM, RHGD (based on gradient descent), and RHAG (based on Nesterov's accelerated gradient descent). The stepsizes of RHTM are provided in Theorem 1. RHGD can be viewed as RHTM with stepsizes $\delta_c=1/l_c$, $\delta_w=\delta_y=\delta_z=0$, and RHAG can be viewed as RHTM with $\delta_c=1/l_c$, $\delta_y=\delta_w=\frac{\sqrt\zeta-1}{\sqrt\zeta+1}$, and $\delta_z=0$.
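For reference, the random problem data above can be generated as follows. This is a sketch with our own random seed; $n=2$ and $m=1$ match the $2\times2$ system and scalar input of this experiment.

```python
import numpy as np

rng = np.random.default_rng(0)                 # seed is our choice
N, n, m = 30, 2, 1
Qs = [np.diag(rng.uniform(1, 2, size=n)) for _ in range(N)]   # Q_t
Rs = [np.diag(rng.uniform(1, 2, size=m)) for _ in range(N)]   # R_t
thetas = [rng.uniform(-10, 10, size=n) for _ in range(N)]     # theta_t
```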
H.2 Robotics tracking

Consider the following discrete-time counterpart of the kinematic model:
$$x_{t+1}=x_t+\Delta t\cdot\cos\theta_t\cdot v_t \qquad (33a)$$
$$y_{t+1}=y_t+\Delta t\cdot\sin\theta_t\cdot v_t \qquad (33b)$$
$$\theta_{t+1}=\theta_t+\Delta t\cdot\omega_t \qquad (33c)$$
Thus we have
$$\theta_t=\arctan\Big(\frac{y_{t+1}-y_t}{x_{t+1}-x_t}\Big) \qquad (34a)$$
$$v_t=\frac{1}{\Delta t}\sqrt{(x_{t+1}-x_t)^2+(y_{t+1}-y_t)^2} \qquad (34b)$$
$$\omega_t=\frac{\theta_{t+1}-\theta_t}{\Delta t}=\frac{1}{\Delta t}\Big(\arctan\Big(\frac{y_{t+2}-y_{t+1}}{x_{t+2}-x_{t+1}}\Big)-\arctan\Big(\frac{y_{t+1}-y_t}{x_{t+1}-x_t}\Big)\Big) \qquad (34c)$$
so that $(\theta_t,v_t,\omega_t)$ can be expressed in terms of the state variables $(x_t,y_t)$.
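The inversion (34) recovers the controls from a planned position sequence; a minimal sketch (ours) is below. We use numpy's arctan2 instead of the plain arctan written in (34) so that the heading is quadrant-correct, which is an implementation choice rather than part of the stated model.

```python
import numpy as np

def controls_from_path(xs, ys, dt):
    """Recover (theta_t, v_t, omega_t) from positions via (34a)-(34c)."""
    dx, dy = np.diff(xs), np.diff(ys)
    theta = np.arctan2(dy, dx)                      # (34a), quadrant-correct
    v = np.sqrt(dx**2 + dy**2) / dt                 # (34b)
    omega = np.diff(theta) / dt                     # (34c)
    return theta, v, omega
```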
In the simulation, the given reference trajectory is
$$x_r(t)=16\sin^3(t-6) \qquad (35a)$$
$$y_r(t)=13\cos(t)-5\cos(2t-12)-2\cos(3t-18)-\cos(4t-24) \qquad (35b)$$
As for the objective function, we set the cost coefficients as
$$c^e_t=\begin{cases}0, & t=0\\ 1, & \text{otherwise}\end{cases}\qquad c^v_t=\begin{cases}0, & t=N\\ 15\Delta t^2, & \text{otherwise}\end{cases}\qquad c^w_t=\begin{cases}0, & t=N\\ 15\Delta t^2, & \text{otherwise}\end{cases}$$
The discrete-time resolution for online control is 0.025 seconds, i.e., $\Delta t=0.025\,$s. When implementing each control decision, a much smaller time resolution of $0.001\,$s is used to simulate the real motion dynamics of the robot.