LINEAR QUADRATIC CONTROL WITH UNTRUSTED AI PREDICTIONS
2.4 Self-Tuning ๐ -Confident Control
where $H := B(R + B^\top P B)^{-1} B^\top$, OPT denotes the optimal cost, $C > 0$ is a constant that depends on $A$, $B$, $Q$, and $R$, and
$$\varepsilon(F, P, \eta_0, \ldots, \eta_{T-1}) := \sum_{t=0}^{T-1} \Bigg\| \sum_{s=t}^{T-1} (F^\top)^{s-t} P (w_s - \hat{w}_s) \Bigg\|^2, \qquad (2.7)$$
$$W(F, P, \hat{w}_0, \ldots, \hat{w}_{T-1}) := \sum_{t=0}^{T-1} \Bigg\| \sum_{s=t}^{T-1} (F^\top)^{s-t} P \hat{w}_s \Bigg\|^2.$$
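A minimal Python sketch of these two quantities may help fix the notation. The function names and the use of NumPy are our own choices for illustration (not from the chapter), and w, w_hat stand for lists of perturbation and prediction vectors:

import numpy as np

def weighted_tail_sums(F, P, seq):
    # Vectors sum_{s=t}^{T-1} (F^T)^{s-t} P seq[s] for each t: the inner sums appearing in (2.7).
    T = len(seq)
    return [sum(np.linalg.matrix_power(F.T, s - t) @ P @ seq[s] for s in range(t, T))
            for t in range(T)]

def prediction_error(F, P, w, w_hat):
    # epsilon(F, P, eta_0, ..., eta_{T-1}) in (2.7), with mismatches eta_s = w_hat_s - w_s.
    return sum(float(v @ v) for v in weighted_tail_sums(F, P, [ws - wh for ws, wh in zip(w, w_hat)]))

def prediction_energy(F, P, w_hat):
    # W(F, P, w_hat_0, ..., w_hat_{T-1}) in (2.7): the same weighted sum applied to the predictions alone.
    return sum(float(v @ v) for v in weighted_tail_sums(F, P, w_hat))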
From this result, we see that $\lambda$-confident control is guaranteed to be $\big(1 + \|H\|\,(1-\lambda)^2/C\big)$-consistent and $\big(1 + \|H\|\,(1/C + \lambda^2 W/\mathrm{OPT})\big)$-robust. This highlights a trade-off between consistency and robustness: if a large $\lambda$ is used (i.e., predictions are trusted), the consistency bound decreases to 1 while the robustness bound grows without bound; in contrast, when a small $\lambda$ is used (i.e., predictions are distrusted), the robustness bound converges to its best value, but the consistency bound does not improve on the robustness bound. Due to the time-coupling structure of the control system, the mismatches $\eta_t = \hat{w}_t - w_t$ at different times contribute unequally to the cost. As a result, the prediction error $\varepsilon$ in (2.2) and (2.7) is defined as a weighted quadratic sum of $(\eta_0, \ldots, \eta_{T-1})$. Moreover, the term OPT in (2.6) is common in the robustness and consistency analysis of online algorithms, such as [3, 5, 7, 9].
Algorithm 3: Self-Tuning $\lambda$-Confident Control

for $t = 0, \ldots, T-1$ do
    if $t = 0$ or $t = 1$ then
        Initialize and choose $\lambda_0$
    else
        Compute a trust parameter
        $$\lambda_t = \frac{\sum_{s=0}^{t-1} \eta(\mathbf{w}; s, t-1)^\top H\, \eta(\hat{\mathbf{w}}; s, t-1)}{\sum_{s=0}^{t-1} \eta(\hat{\mathbf{w}}; s, t-1)^\top H\, \eta(\hat{\mathbf{w}}; s, t-1)}, \quad \text{where } \eta(\mathbf{w}; s, t) := \sum_{\tau=s}^{t} (F^\top)^{\tau-s} P w_\tau$$
    end
    Generate an action $u_t$ using $\lambda_t$-confident control (Algorithm 2)
    Update $x_{t+1} = A x_t + B u_t + w_t$
end
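To make the procedure concrete, the following Python sketch runs Algorithm 3 on a toy two-dimensional system. Everything here is an assumption made for illustration: the system matrices, the perturbation and prediction sequences, the initial trust $\lambda_0 = 0.5$, and the instantiation $F = A - BK$ with the LQR gain $K = (R + B^\top P B)^{-1} B^\top P A$. The trust update below recomputes the $\eta$ sums from scratch; an incremental implementation is discussed later in this section.

import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)
T = 50
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                      # Riccati solution
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)       # LQR gain (assumed convention)
F = A - B @ K                                           # closed-loop matrix (assumed convention)
H = B @ np.linalg.solve(R + B.T @ P @ B, B.T)           # H = B (R + B^T P B)^{-1} B^T

w = [np.array([0.05 * np.sin(0.2 * t), 0.0]) for t in range(T)]   # true perturbations (toy)
w_hat = [wt + 0.02 * rng.standard_normal(2) for wt in w]          # noisy predictions (toy)

def eta(seq, s, t):
    # eta(seq; s, t) = sum_{tau=s}^{t} (F^T)^{tau-s} P seq[tau]
    return sum(np.linalg.matrix_power(F.T, tau - s) @ P @ seq[tau] for tau in range(s, t + 1))

x = np.zeros(2)
lam = 0.5                                                # lambda_0, a free initialization
lams = []
for t in range(T):
    if t >= 2:                                           # self-tuning update of Algorithm 3
        num = sum(eta(w, s, t - 1) @ H @ eta(w_hat, s, t - 1) for s in range(t))
        den = sum(eta(w_hat, s, t - 1) @ H @ eta(w_hat, s, t - 1) for s in range(t))
        lam = num / den if den > 0 else 0.0
    lams.append(lam)
    # lambda_t-confident action (Algorithm 2): -K x_t - lam (R + B^T P B)^{-1} B^T sum_s (F^T)^{s-t} P w_hat_s
    bias = sum(np.linalg.matrix_power(F.T, s - t) @ P @ w_hat[s] for s in range(t, T))
    u = -K @ x - lam * np.linalg.solve(R + B.T @ P @ B, B.T) @ bias
    x = A @ x + B @ u + w[t]                             # state update
print("final trust parameter:", round(lams[-1], 3))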
The key to the algorithm is the update rule for $\lambda_t$. Given the previously observed perturbations and predictions, the goal of the algorithm is to find a greedy $\lambda_t$ that minimizes the gap between the algorithmic and optimal costs, i.e., $\lambda_t := \arg\min_{\lambda} \sum_{s=0}^{t-1} e_s^\top H e_s$ where $e_s := \sum_{\tau=s}^{t-1} (F^\top)^{\tau-s} P (w_\tau - \lambda \hat{w}_\tau)$. This can be equivalently written as
$$\lambda_t = \arg\min_{\lambda} \sum_{s=0}^{t-1} \left[ \left( \sum_{\tau=s}^{t-1} (F^\top)^{\tau-s} P (w_\tau - \lambda \hat{w}_\tau) \right)^{\!\top} H \left( \sum_{\tau=s}^{t-1} (F^\top)^{\tau-s} P (w_\tau - \lambda \hat{w}_\tau) \right) \right], \qquad (2.8)$$
which is a quadratic function of $\lambda$. Rearranging the terms in (2.8) yields the choice of $\lambda_t$ in the self-tuning control scheme.
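To make the rearrangement explicit, write $a_s := \eta(\mathbf{w}; s, t-1)$ and $b_s := \eta(\hat{\mathbf{w}}; s, t-1)$ (shorthand introduced here only for this derivation). Since $H$ is symmetric, the objective in (2.8) expands as
\[
\sum_{s=0}^{t-1} (a_s - \lambda b_s)^\top H (a_s - \lambda b_s)
= \lambda^2 \sum_{s=0}^{t-1} b_s^\top H b_s \;-\; 2\lambda \sum_{s=0}^{t-1} a_s^\top H b_s \;+\; \sum_{s=0}^{t-1} a_s^\top H a_s,
\]
and setting the derivative with respect to $\lambda$ to zero gives
\[
\lambda_t = \frac{\sum_{s=0}^{t-1} a_s^\top H b_s}{\sum_{s=0}^{t-1} b_s^\top H b_s}
= \frac{\sum_{s=0}^{t-1} \eta(\mathbf{w}; s, t-1)^\top H\, \eta(\hat{\mathbf{w}}; s, t-1)}{\sum_{s=0}^{t-1} \eta(\hat{\mathbf{w}}; s, t-1)^\top H\, \eta(\hat{\mathbf{w}}; s, t-1)},
\]
which is exactly the update used in Algorithm 3.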
Algorithm 3 is efficient since, in each time step, updating the $\eta$ values only requires adding one more term. This means that the total computational complexity of computing the trust parameters is $O(T^2 n^{\alpha})$, where $\alpha < 2.373$, which is polynomial in both the time horizon length $T$ and the state dimension $n$. According to the expression of $\lambda_t$ in Algorithm 3, at each time $t$, the terms $\eta(\mathbf{w}; s, t-2)$ and $\eta(\hat{\mathbf{w}}; s, t-2)$ can be pre-computed for all $s = 0, \ldots, t-1$. Therefore, the recursive formula $\eta(\mathbf{w}; s, t) := \sum_{\tau=s}^{t} (F^\top)^{\tau-s} P w_\tau = \eta(\mathbf{w}; s, t-1) + (F^\top)^{t-s} P w_t$ yields an update rule for the terms $\{\eta(\mathbf{w}; s, t-1) : s = 0, \ldots, t-1\}$ appearing in the expression of $\lambda_t$. Consequently, at each time $t$, it takes no more than $O(t\, n^{\alpha})$ steps to compute $\lambda_t$, where $\alpha < 2.373$ and $O(n^{\alpha})$ is the computational complexity of matrix multiplication.
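The incremental bookkeeping can be sketched as follows. This is a minimal Python illustration in which the class name, method names, and the clipping of $\lambda_t$ to $[0,1]$ are our own choices rather than anything specified in the chapter:

import numpy as np

class SelfTuningTrust:
    """Maintains eta(w; s, t-1) and eta(w_hat; s, t-1) incrementally and returns lambda_t."""

    def __init__(self, F, P, H):
        self.F, self.P, self.H = F, P, H
        self.eta_w = []      # eta_w[s] = sum_{tau=s}^{t-1} (F^T)^{tau-s} P w_tau
        self.eta_what = []   # same sums built from the predictions w_hat
        self.powers = []     # powers[s] = (F^T)^{(t-1)-s}, refreshed at every update

    def update(self, w_prev, w_hat_prev):
        """Fold in the perturbation observed at time t-1 and its prediction."""
        n = self.F.shape[0]
        # Every stored power gains one factor of F^T; a fresh identity is appended for s = t-1.
        self.powers = [self.F.T @ M for M in self.powers] + [np.eye(n)]
        Pw, Pwh = self.P @ w_prev, self.P @ w_hat_prev
        # Add the new term (F^T)^{(t-1)-s} P w_{t-1} to every partial sum; start a new sum for s = t-1.
        self.eta_w = [e + M @ Pw for e, M in zip(self.eta_w, self.powers[:-1])] + [Pw]
        self.eta_what = [e + M @ Pwh for e, M in zip(self.eta_what, self.powers[:-1])] + [Pwh]

    def trust(self):
        """lambda_t from Algorithm 3: ratio of the cross term to the prediction-only term."""
        num = sum(a @ self.H @ b for a, b in zip(self.eta_w, self.eta_what))
        den = sum(b @ self.H @ b for b in self.eta_what)
        if den <= 0:
            return 0.0  # no informative predictions yet: fall back to distrust
        return float(np.clip(num / den, 0.0, 1.0))  # the analysis assumes lambda_t in [0,1]

At each time $t \ge 2$ one would call update with the perturbation revealed at time $t-1$ and its prediction, then read trust() before generating $u_t$; each update performs one matrix product per stored term, matching the $O(t\, n^{\alpha})$ per-step bound above.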
Convergence
We now move to the analysis of Algorithm 3. First, we study the convergence of $\lambda_t$, which depends on the variation of the predictions $\hat{\mathbf{w}} := (\hat{w}_0, \ldots, \hat{w}_{T-1})$ and the true perturbations $\mathbf{w} := (w_0, \ldots, w_{T-1})$, where we use a boldface letter to represent a sequence of vectors. Specifically, our results are in terms of the variation of the predictions and perturbations, which we define as follows. The self-variation $V_{\mathrm{VAR}}(\mathbf{y})$ of a sequence $\mathbf{y} := (y_0, \ldots, y_{T-1})$ is defined as
$$V_{\mathrm{VAR}}(\mathbf{y}) := \sum_{s=1}^{T-1} \max_{t=0,\ldots,s-1} \big\| y_t - y_{t+T-s} \big\|.$$
The goal of the self-tuning algorithm is to converge to the optimal trust parameter $\lambda^*$ for the problem instance. To specify this formally, let $\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})$ be the algorithmic cost with adaptively chosen trust parameters $\lambda_0, \ldots, \lambda_{T-1}$ and denote by $\mathrm{ALG}(\lambda)$ the cost with a fixed trust parameter $\lambda$. Then, $\lambda^*$ is defined as $\lambda^* := \arg\min_{\lambda \in \mathbb{R}} \mathrm{ALG}(\lambda)$. Further, let $\psi(t) := \sum_{s=0}^{t} \eta(\hat{\mathbf{w}}; s, t)^\top H\, \eta(\hat{\mathbf{w}}; s, t)$.
We can now state a bound on the convergence rate of $\lambda_t$ to $\lambda^*$ under Algorithm 3. The bound highlights that if the variation of the system perturbations and predictions is small, then the trust parameter $\lambda_t$ converges quickly to $\lambda^*$. A proof can be found in Appendix 2.C.
Lemma 1. Assume $\psi(T) = \Omega(T)$ and $\lambda_t \in [0,1]$ for all $t = 0, \ldots, T-1$. Under our model assumptions, the trust parameters $(\lambda_0, \ldots, \lambda_{T-1})$ chosen adaptively by self-tuning control satisfy, for any $1 < t \le T$,
$$|\lambda_t - \lambda^*| = O\!\left( \big(V_{\mathrm{VAR}}(\mathbf{w}) + V_{\mathrm{VAR}}(\hat{\mathbf{w}})\big) / t \right).$$

Regret and Competitiveness
Building on the convergence analysis, we now prove bounds on the regret and competitive ratio of Algorithm 3. These are the main results presented in this chapter about the performance of an algorithm that adaptively determines the optimal trade-off between robustness and consistency.
Regret. We first study the regret as compared with the best fixed trust parameter in hindsight, i.e., $\lambda^*$, whose corresponding worst-case competitive ratio satisfies the upper bound given in Theorem 2.3.2. Denote by $\mathrm{Regret} := \mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)$ the regret we consider, where $(\lambda_0, \ldots, \lambda_{T-1})$ are the trust parameters selected by the self-tuning control scheme.
Our main result is the following variation-based regret bound, which is proven in Appendix 2.C. Recall that the denominator of $\lambda_t$ in Algorithm 3 is captured by $\psi(t) := \sum_{s=0}^{t} \eta(\hat{\mathbf{w}}; s, t)^\top H\, \eta(\hat{\mathbf{w}}; s, t)$.
Lemma 2. Assume $\psi(T) = \Omega(T)$ and $\lambda_t \in [0,1]$ for all $t = 0, \ldots, T-1$. Under our model assumptions, for any $\mathbf{w}$ and $\hat{\mathbf{w}}$, the regret of Algorithm 3 is bounded by
$$\mathrm{Regret} = O\Big( \big(V_{\mathrm{VAR}}(\mathbf{w}) + V_{\mathrm{VAR}}(\hat{\mathbf{w}})\big)^2 \Big).$$
Note that the baseline we evaluate against in $\mathrm{Regret} = \mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)$ is stronger than the baselines in previous static regret analyses for LQR, such as [52, 54], where online controllers are compared with a linear control policy $u_t = -K x_t$ with a strongly stable $K$. The baseline policy considered in our regret analysis is the $\lambda$-confident scheme (Algorithm 2) with
$$u_t = -(R + B^\top P B)^{-1} B^\top \left( P A x_t + \lambda \sum_{s=t}^{T-1} (F^\top)^{s-t} P \hat{w}_s \right) = -K x_t - \lambda (R + B^\top P B)^{-1} B^\top \sum_{s=t}^{T-1} (F^\top)^{s-t} P \hat{w}_s,$$
which contains the class of strongly stable linear controllers as a special case (a short code sketch of this baseline action appears after this paragraph). Moreover, the regret bound in Lemma 2 holds for any predictions $\hat{w}_0, \ldots, \hat{w}_{T-1}$. Taking $\hat{w}_t = w_t$ for all $t = 0, \ldots, T-1$, our regret directly compares $\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})$ with the optimal cost OPT, and therefore our regret also covers the dynamic regret considered in [39, 40, 43] for LQR as a special case. The regret bound in Lemma 2 depends on the variation of the perturbations and predictions. Such a term commonly appears in regret analyses based on the "follow the leader" approach [37, 38]. For example, the regret analysis of the follow the optimal steady state (FOSS) method in [39] contains a similar "path length" term that captures the state variation, and Theorem 1 of [43] imposes a variation budget on the predictions or prediction errors. In many robotics applications (e.g., the trajectory tracking and EV charging experiments in this chapter, shown in Section 2.5), each $w_t$ comes from a desired smooth trajectory to track.
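As a concrete reading of the second form of the baseline action above, it is the LQR feedback $-K x_t$ plus a $\lambda$-scaled feedforward term built from the predictions. A minimal Python sketch, with all argument names being placeholders of ours, is:

import numpy as np

def lambda_confident_action(x_t, w_hat_future, B, P, R, F, K, lam):
    # u_t = -K x_t - lam * (R + B^T P B)^{-1} B^T * sum_{s >= t} (F^T)^{s-t} P w_hat_s,
    # where w_hat_future = (w_hat_t, ..., w_hat_{T-1}).
    bias = sum(np.linalg.matrix_power(F.T, k) @ P @ w_hat for k, w_hat in enumerate(w_hat_future))
    return -K @ x_t - lam * np.linalg.solve(R + B.T @ P @ B, B.T) @ bias

Setting lam = 0 recovers the linear policy $u_t = -K x_t$, while lam = 1 fully trusts the predictions.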
To interpret this lemma, suppose the sequences of perturbations and predictions satisfy
$$\|\hat{w}_t - \hat{w}_{t+T-s}\| \le o(s), \qquad \|w_t - w_{t+T-s}\| \le o(s), \qquad \text{for any } t \ge 0 \text{ and } 0 \le s \le T.$$
These bounds correspond to an assumption of smooth variation in the disturbances and the predictions. It is natural for the disturbances to vary smoothly in applications such as tracking problems, where the disturbances correspond to the trajectory, and in such settings one would also expect the predictions to vary smoothly. For example, machine learning algorithms are often regularized to produce smooth predictions.
Given these smoothness bounds, we have that
$$V_{\mathrm{VAR}}(\mathbf{w}) + V_{\mathrm{VAR}}(\hat{\mathbf{w}}) \le \sum_{s=0}^{T-1} 2\, o(s).$$
Note that, as long as $\big(\sum_{s=0}^{T-1} o(s)\big)^2 = o(T)$, the regret bound is sub-linear in $T$. To understand how this bound may look in particular applications, suppose we have $o(s) = O(1/s)$. In this case, the regret is poly-logarithmic, i.e., $\mathrm{Regret} = O\big((\log T)^2\big)$. If $o(s)$ decays exponentially, the regret is even smaller: if $o(s) = O(q^s)$ for some $0 < q < 1$, then $\mathrm{Regret} = O(1)$.
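These growth rates are easy to check numerically. The snippet below is purely illustrative and instantiates $o(s)$ as exactly $1/s$ and $q^s$ with $q = 0.9$ (our choices), comparing the resulting variation budgets against $\log T$:

import numpy as np

def variation_budget(o, T):
    # sum_{s=1}^{T-1} o(s); by Lemma 2 the regret is of the order of budget^2.
    return sum(o(s) for s in range(1, T))

for T in (10**3, 10**4, 10**5):
    harmonic = variation_budget(lambda s: 1.0 / s, T)    # grows like log T -> regret ~ (log T)^2
    geometric = variation_budget(lambda s: 0.9 ** s, T)  # bounded by a constant -> regret ~ O(1)
    print(T, round(harmonic, 2), round(float(np.log(T)), 2), round(geometric, 2))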
Competitive Ratio. We are now ready to present our main result, which provides an upper bound on the competitive ratio of self-tuning control (Algorithm 3). Recall that Lemma 2 bounds the regret $\mathrm{Regret} := \mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)$ and that Theorem 2.3.2 provides a competitive ratio bound for the $\lambda$-confident control scheme, including a bound on $\mathrm{ALG}(\lambda^*)/\mathrm{OPT}$. Therefore, combining Lemma 2 and Theorem 2.3.2 leads to a novel competitive ratio bound for the self-tuning scheme (Algorithm 3). Note that, in contrast to Theorem 2.3.2, which bounds the competitive ratio of $\lambda$-confident control with a fixed trust parameter, Theorem 2.4.1 below bounds the competitive ratio of the self-tuning scheme in Algorithm 3, where at each time $t$ the trust parameter $\lambda_t$ is determined by online learning and may be time-varying.
Theorem 2.4.1. Assume $\psi(T) = \Omega(T)$ and $\lambda_t \in [0,1]$ for all $t = 0, \ldots, T-1$. Under our model assumptions, the competitive ratio of Algorithm 3 is bounded by
$$\mathrm{CR}(\varepsilon) \le 1 + 2\|H\| \left( \frac{\varepsilon}{\mathrm{OPT} + C\varepsilon} + O\left( \frac{\big(V_{\mathrm{VAR}}(\mathbf{w}) + V_{\mathrm{VAR}}(\hat{\mathbf{w}})\big)^2}{\mathrm{OPT}} \right) \right),$$
where $H$, $C$, OPT, and $\varepsilon$ are defined in Theorem 2.3.2.
In contrast to the regret bound, Theorem 2.4.1 states an upper bound on the competitive ratio $\mathrm{CR}(\varepsilon)$ defined in Section 2.2, which indicates that $\mathrm{CR}(\varepsilon)$ scales as $1 + O(\varepsilon)/(\Theta(1) + \Theta(\varepsilon))$ as a function of $\varepsilon$. As a comparison, $\lambda$-confident control (Algorithm 2) with a fixed trust parameter has a competitive ratio upper bound that is linear in the prediction error $\varepsilon$ (Theorem 2.3.2). This improved dependency highlights the importance of learning the trust parameter adaptively.

Our experimental results in the next section verify the implications of Theorem 2.3.2 and Theorem 2.4.1. Specifically, the simulated competitive ratio of self-tuning control (Algorithm 3) is a non-linear envelope of the simulated competitive ratios for $\lambda$-confident control with fixed trust parameters and, as a function of the prediction error $\varepsilon$, it matches the implied competitive ratio upper bound $1 + O(\varepsilon)/(\Theta(1) + \Theta(\varepsilon))$.
Theorem 2.4.1 is proven by combining Lemma 2 with Theorem 2.3.2, which bounds the competitive ratios for fixed trust parameters.
Proof of Theorem 2.4.1. Denote by $\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})$ the algorithmic cost of the self-tuning control scheme. We have
$$\frac{\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})}{\mathrm{OPT}} \le \frac{\left|\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)\right|}{\mathrm{OPT}} + \frac{\mathrm{ALG}(\lambda^*)}{\mathrm{OPT}}. \qquad (2.9)$$
Using Theorem 2.3.2,
$$\frac{\mathrm{ALG}(\lambda^*)}{\mathrm{OPT}} \le 1 + 2\|H\| \min\left\{ \min_{\lambda}\left( \frac{\lambda^2 \varepsilon}{\mathrm{OPT}} + \frac{(1-\lambda)^2}{C} \right),\ \min_{\lambda}\left( \frac{1}{C} + \frac{\lambda^2 W}{\mathrm{OPT}} \right) \right\} = 1 + 2\|H\| \min\left\{ \frac{\varepsilon}{\mathrm{OPT} + \varepsilon C},\ \frac{1}{C} \right\} = 1 + 2\|H\| \frac{\varepsilon}{\mathrm{OPT} + \varepsilon C}. \qquad (2.10)$$
Moreover, the regret bound in Lemma 2 implies
$$\frac{\left|\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)\right|}{\mathrm{OPT}} = O\left( \frac{\big(V_{\mathrm{VAR}}(\mathbf{w}) + V_{\mathrm{VAR}}(\hat{\mathbf{w}})\big)^2}{\mathrm{OPT}} \right),$$
and combining this with (2.10) and (2.9) gives the result. □

2.5 Applications
We now illustrate our main results using numerical examples and case studies that highlight the impact of the trust parameter $\lambda$ in $\lambda$-confident control and demonstrate the ability of the self-tuning control algorithm to learn an appropriate trust parameter $\lambda$. We consider three applications. The first is a robot tracking example in which a robot is asked to follow the locations of an unknown trajectory, and each desired location is revealed only immediately before the robot decides how to modify its velocity. Predictions of the trajectory are available; however, the predictions can be untrustworthy and may contain large errors. The second is an adaptive battery-buffered electric vehicle (EV) charging problem, in which a battery-buffered charging station adaptively supplies the energy demands of arriving EVs while keeping the state of charge of the batteries as close to a nominal level as possible.

Our third application considers a non-linear control problem, the Cart-Pole problem. Our $\lambda$-confident and self-tuning control schemes use a linearized model, while the algorithms are tested in the non-linear environment. We use this third application to demonstrate the practicality of our algorithms by showing that they work not only for LQC problems but also for non-linear systems.

To illustrate the impact of randomness in prediction errors in our case studies, the three applications use different forms of random error models. For each selected distribution of $\hat{\mathbf{w}} - \mathbf{w}$, we repeat the experiments multiple times and report the worst case with the highest algorithmic cost; see Appendix 2.E for details.