
2.4 Self-Tuning $\lambda$-Confident Control

where $H := B(R+B^\top P B)^{-1}B^\top$, $\mathrm{OPT}$ denotes the optimal cost, $C > 0$ is a constant that depends on $A$, $B$, $Q$, $R$, and

$$\varepsilon(F, P, e_0, \ldots, e_{T-1}) := \sum_{t=0}^{T-1} \Bigg\| \sum_{\tau=t}^{T-1} \left(F^\top\right)^{\tau-t} P \left(w_\tau - \widehat{w}_\tau\right) \Bigg\|^2, \tag{2.7}$$

$$W(F, P, \widehat{w}_0, \ldots, \widehat{w}_{T-1}) := \sum_{t=0}^{T-1} \Bigg\| \sum_{\tau=t}^{T-1} \left(F^\top\right)^{\tau-t} P \widehat{w}_\tau \Bigg\|^2.$$

From this result, we see that $\lambda$-confident control is guaranteed to be $\left(1 + \|H\| \frac{(1-\lambda)^2}{C}\right)$-consistent and $\left(1 + \|H\| \left(\frac{1}{C} + \frac{\lambda^2}{\mathrm{OPT}} W\right)\right)$-robust. This highlights a trade-off between consistency and robustness: if a large $\lambda$ is used (i.e., predictions are trusted), then consistency decreases to $1$, while the robustness increases unboundedly. In contrast, when a small $\lambda$ is used (i.e., predictions are distrusted), the robustness of the policy converges to the optimal value, but the consistency does not improve on the robustness value. Due to the time-coupling structure of the control system, the mismatches $e_t = \widehat{w}_t - w_t$ at different times contribute unequally to the system. As a result, the prediction error $\varepsilon$ in (2.2) and (2.7) is defined as a weighted quadratic sum of $(e_0, \ldots, e_{T-1})$. Moreover, the term $\mathrm{OPT}$ in (2.6) is common in the robustness and consistency analysis of online algorithms, such as [3, 5, 7, 9].
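
To make the trade-off concrete, the two extreme choices of the trust parameter can be plugged into the bounds above (a worked illustration based on the expressions as reconstructed here):

$$\lambda = 1:\qquad 1 + \|H\|\frac{(1-1)^2}{C} = 1 \ \ \text{(consistency)}, \qquad 1 + \|H\|\left(\frac{1}{C} + \frac{W}{\mathrm{OPT}}\right) \ \ \text{(robustness, unbounded in } W\text{)},$$

$$\lambda = 0:\qquad 1 + \frac{\|H\|}{C} \ \ \text{(consistency and robustness coincide)}.$$

In between, the trust parameter trades one guarantee against the other, which is exactly the gap the self-tuning scheme below aims to close.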

Algorithm 3: Self-Tuning $\lambda$-Confident Control

for $t = 0, \ldots, T-1$ do
    if $t = 0$ or $t = 1$ then
        Initialize and choose $\lambda_0$
    else
        Compute a trust parameter
        $$\lambda_t = \frac{\sum_{s=0}^{t-1} \eta(w; s, t-1)^\top H\, \eta(\widehat{w}; s, t-1)}{\sum_{s=0}^{t-1} \eta(\widehat{w}; s, t-1)^\top H\, \eta(\widehat{w}; s, t-1)}, \qquad \text{where } \eta(w; s, t) := \sum_{\tau=s}^{t} \left(F^\top\right)^{\tau-s} P w_\tau$$
    end
    Generate an action $u_t$ using $\lambda_t$-confident control (Algorithm 2)
    Update $x_{t+1} = A x_t + B u_t + w_t$
end
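
To make the loop above concrete, the following Python sketch gives one possible implementation. It is a minimal illustration rather than the authors' code: the matrices $A$, $B$, $F$, $P$, $H$ are assumed to be the ones defined earlier in the chapter, the helper names (`eta`, `self_tuning_lambda`, `self_tuning_control`, `lambda_confident_action`) are hypothetical, the trust parameter is computed directly from its definition (the incremental version discussed after (2.8) below is more efficient), and the perturbation sequence is passed as a full array for simplicity even though each $w_t$ is only revealed online.

```python
import numpy as np

def eta(seq, s, t, F, P):
    """eta(seq; s, t) = sum_{tau=s}^{t} (F^T)^(tau-s) P seq[tau]."""
    total = np.zeros(P.shape[0])
    for tau in range(s, t + 1):
        total += np.linalg.matrix_power(F.T, tau - s) @ (P @ seq[tau])
    return total

def self_tuning_lambda(w_obs, w_hat, F, P, H, t):
    """Trust parameter lambda_t from Algorithm 3, computed directly from its definition."""
    num, den = 0.0, 0.0
    for s in range(t):
        eta_w = eta(w_obs, s, t - 1, F, P)       # uses only w_0, ..., w_{t-1}, already observed
        eta_wh = eta(w_hat, s, t - 1, F, P)
        num += eta_w @ H @ eta_wh
        den += eta_wh @ H @ eta_wh
    return num / den if den > 0 else 0.0

def self_tuning_control(x0, w_obs, w_hat, A, B, F, P, H, T,
                        lambda_confident_action, lambda0=0.5):
    """Sketch of the Algorithm 3 loop: tune lambda_t online, then act via Algorithm 2."""
    x = x0
    for t in range(T):
        lam = lambda0 if t <= 1 else self_tuning_lambda(w_obs, w_hat, F, P, H, t)
        u = lambda_confident_action(x, t, lam)   # hypothetical helper implementing Algorithm 2
        x = A @ x + B @ u + w_obs[t]             # the true perturbation w_t enters the dynamics
    return x
```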

The key to the algorithm is the update rule for $\lambda_t$. Given previously observed perturbations and predictions, the goal of the algorithm is to find a greedy $\lambda_t$ that minimizes the gap between the algorithmic and optimal costs, i.e., $\lambda_t := \arg\min_{\lambda} \sum_{s=0}^{t-1} \psi_s^\top H \psi_s$, where $\psi_s := \sum_{\tau=s}^{t-1} \left(F^\top\right)^{\tau-s} P (w_\tau - \lambda \widehat{w}_\tau)$. This can be equivalently written as

$$\lambda_t = \arg\min_{\lambda} \sum_{s=0}^{t-1} \left[ \left( \sum_{\tau=s}^{t-1} \left(F^\top\right)^{\tau-s} P (w_\tau - \lambda \widehat{w}_\tau) \right)^{\!\top} H \left( \sum_{\tau=s}^{t-1} \left(F^\top\right)^{\tau-s} P (w_\tau - \lambda \widehat{w}_\tau) \right) \right], \tag{2.8}$$

which is a quadratic function of $\lambda$. Rearranging the terms in (2.8) yields the choice of $\lambda_t$ in the self-tuning control scheme.
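
To see how the rearrangement works, expand the objective in (2.8) in terms of the $\eta$ notation from Algorithm 3 and use the symmetry of $H$:

$$\sum_{s=0}^{t-1} \psi_s^\top H \psi_s = \sum_{s=0}^{t-1} \Big[ \eta(w; s, t-1)^\top H\, \eta(w; s, t-1) - 2\lambda\, \eta(w; s, t-1)^\top H\, \eta(\widehat{w}; s, t-1) + \lambda^2\, \eta(\widehat{w}; s, t-1)^\top H\, \eta(\widehat{w}; s, t-1) \Big].$$

Setting the derivative with respect to $\lambda$ to zero gives the minimizer

$$\lambda_t = \frac{\sum_{s=0}^{t-1} \eta(w; s, t-1)^\top H\, \eta(\widehat{w}; s, t-1)}{\sum_{s=0}^{t-1} \eta(\widehat{w}; s, t-1)^\top H\, \eta(\widehat{w}; s, t-1)},$$

which is exactly the update rule stated in Algorithm 3.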

Algorithm 3 is efficient since, at each time step, updating the $\eta$ values only requires adding one more term. This means that the total computational complexity of computing $\lambda_t$ over the horizon is $O(T^2 n^{\alpha})$, where $\alpha < 2.373$, which is polynomial in both the time horizon length $T$ and the state dimension $n$. According to the expression of $\lambda_t$ in Algorithm 3, at each time $t$, the terms $\eta(w; s, t-2)$ and $\eta(\widehat{w}; s, t-2)$ can be pre-computed for all $s = 0, \ldots, t-1$. Therefore, the recursive formula

$$\eta(w; s, t) := \sum_{\tau=s}^{t} \left(F^\top\right)^{\tau-s} P w_\tau = \eta(w; s, t-1) + \left(F^\top\right)^{t-s} P w_t$$

implies the update rule of the terms $\{\eta(w; s, t-1) : s = 0, \ldots, t-1\}$ in the expression of $\lambda_t$. This gives that, at each time $t$, it takes no more than $O(T n^{\alpha})$ steps to compute $\lambda_t$, where $\alpha < 2.373$ and $O(n^{\alpha})$ is the computational complexity of matrix multiplication.

Convergence

We now move to the analysis of Algorithm 3. First, we study the convergence of $\lambda_t$, which depends on the variation of the predictions $\widehat{\mathbf{w}} := (\widehat{w}_0, \ldots, \widehat{w}_{T-1})$ and the true perturbations $\mathbf{w} := (w_0, \ldots, w_{T-1})$, where we use a boldface letter to represent a sequence of vectors. Specifically, our results are in terms of the variation of the predictions and perturbations, which we define as follows. The self-variation $\mu_{\mathrm{VAR}}(\mathbf{y})$ of a sequence $\mathbf{y} := (y_0, \ldots, y_{T-1})$ is defined as

$$\mu_{\mathrm{VAR}}(\mathbf{y}) := \sum_{s=1}^{T-1} \max_{\tau=0, \ldots, s-1} \|y_\tau - y_{\tau+T-s}\|.$$

The goal of the self-tuning algorithm is to converge to the optimal trust parameter $\lambda^*$ for the problem instance. To specify this formally, let $\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})$ be the algorithmic cost with adaptively chosen trust parameters $\lambda_0, \ldots, \lambda_{T-1}$ and denote by $\mathrm{ALG}(\lambda)$ the cost with a fixed trust parameter $\lambda$. Then, $\lambda^*$ is defined as $\lambda^* := \arg\min_{\lambda \in \mathbb{R}} \mathrm{ALG}(\lambda)$. Further, let $W(t) := \sum_{s=0}^{t} \eta(\widehat{w}; s, t)^\top H\, \eta(\widehat{w}; s, t)$.

We can now state a bound on the convergence rate of $\lambda_t$ to $\lambda^*$ under Algorithm 3. The bound highlights that if the variation of the system perturbations and predictions is small, then the trust parameter $\lambda_t$ converges quickly to $\lambda^*$. A proof can be found in Appendix 2.C.

Lemma 1. Assume $W(T) = \Omega(T)$ and $\lambda_t \in [0,1]$ for all $t = 0, \ldots, T-1$. Under our model assumptions, the trust parameters $(\lambda_0, \ldots, \lambda_T)$ chosen adaptively by self-tuning control satisfy, for any $1 < t \leq T$,

$$|\lambda_t - \lambda^*| = O\Big( \big( \mu_{\mathrm{VAR}}(\mathbf{w}) + \mu_{\mathrm{VAR}}(\widehat{\mathbf{w}}) \big) / t \Big).$$

Regret and Competitiveness

Building on the convergence analysis, we now prove bounds on the regret and competitive ratio of Algorithm 3. These are the main results of this chapter on the performance of an algorithm that adaptively determines the optimal trade-off between robustness and consistency.

Regret. We first study the regret as compared with the best fixed trust parameter in hindsight, i.e., $\lambda^*$, whose corresponding worst-case competitive ratio satisfies the upper bound given in Theorem 2.3.2.

Denote by $\mathrm{Regret} := \mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)$ the regret we consider, where $(\lambda_0, \ldots, \lambda_{T-1})$ are the trust parameters selected by the self-tuning control scheme.

Our main result is the following variation-based regret bound, which is proven in Appendix 2.C. Define the denominator of $\lambda_t$ in Algorithm 3 by $W(t) := \sum_{s=0}^{t} \eta(\widehat{w}; s, t)^\top H\, \eta(\widehat{w}; s, t)$.

Lemma 2. Assume $W(t) = \Omega(T)$ and $\lambda_t \in [0,1]$ for all $t = 0, \ldots, T-1$. Under our model assumptions, for any $\mathbf{w}$ and $\widehat{\mathbf{w}}$, the regret of Algorithm 3 is bounded by

$$\mathrm{Regret} = O\Big( \big( \mu_{\mathrm{VAR}}(\mathbf{w}) + \mu_{\mathrm{VAR}}(\widehat{\mathbf{w}}) \big)^2 \Big).$$

Note that the baseline we evaluate against in $\mathrm{Regret} = \mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)$ is stronger than the baselines in previous static regret analyses for LQR, such as [52, 54], where online controllers are compared with a linear control policy $u_t = -K x_t$ with a strongly stable $K$. The baseline policy considered in our regret analysis is the $\lambda$-confident scheme (Algorithm 2) with

$$u_t = -(R + B^\top P B)^{-1} B^\top \left( P A x_t + \lambda \sum_{\tau=t}^{T-1} \left(F^\top\right)^{\tau-t} P \widehat{w}_\tau \right) = -K x_t - \lambda (R + B^\top P B)^{-1} B^\top \sum_{\tau=t}^{T-1} \left(F^\top\right)^{\tau-t} P \widehat{w}_\tau,$$

which contains the class of strongly stable linear controllers as a special case. Moreover, the regret bound in Lemma 2 holds for any predictions $\widehat{w}_0, \ldots, \widehat{w}_{T-1}$. Taking $\widehat{w}_t = w_t$ for all $t = 0, \ldots, T-1$, our regret directly compares $\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})$ with the optimal cost $\mathrm{OPT}$, and therefore our regret also includes the dynamic regret considered in [39, 40, 43] for LQR as a special case. The regret bound in Lemma 2 depends on the variation of perturbations and predictions. Note that such a term commonly appears in regret analyses based on the "follow the leader" approach [37, 38].

For example, the regret analysis of the follow the optimal steady state (FOSS) method in [39] contains a similar "path length" term that captures the state variation, and Theorem 1 of [43] involves a variation budget on the predictions or prediction errors. In many applications (e.g., the trajectory tracking and EV charging experiments shown in Section 2.5 of this chapter), each $w_t$ is derived from a desired smooth trajectory to be tracked.

To interpret this lemma, suppose the sequences of perturbations and predictions satisfy

$$\|\widehat{w}_\tau - \widehat{w}_{\tau+T-s}\| \leq \rho(s), \qquad \|w_\tau - w_{\tau+T-s}\| \leq \rho(s), \qquad \text{for any } s \geq 0,\ 0 \leq \tau \leq s.$$

These bounds correspond to an assumption of smooth variation in the disturbances and the predictions. It is natural for the disturbances to vary smoothly in applications such as tracking problems, where the disturbances correspond to the trajectory, and in such situations one would expect the predictions to also vary smoothly. For example, machine learning algorithms are often regularized to produce smooth predictions.

Given these smoothness bounds, we have that

$$\mu_{\mathrm{VAR}}(\mathbf{w}) + \mu_{\mathrm{VAR}}(\widehat{\mathbf{w}}) \leq \sum_{s=0}^{T-1} 2\rho(s).$$

Note that, as long as $\big( \sum_{s=0}^{T-1} \rho(s) \big)^2 = o(T)$, the regret bound is sub-linear in $T$. To understand how this bound may look in particular applications, suppose we have $\rho(s) = O(1/s)$. In this case, the regret is poly-logarithmic, i.e., $\mathrm{Regret} = O\big((\log T)^2\big)$. If $\rho(s)$ decays exponentially, the regret is even smaller, i.e., if $\rho(s) = O(r^s)$ for some $0 < r < 1$, then $\mathrm{Regret} = O(1)$.

Competitive Ratio. We are now ready to present our main result, which provides an upper bound on the competitive ratio of self-tuning control (Algorithm 3). Recall that Lemma 2 bounds the regret $\mathrm{Regret} := \mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)$ and that Theorem 2.3.2 provides a competitive ratio bound for the $\lambda$-confident control scheme, in particular a bound on $\mathrm{ALG}(\lambda^*)/\mathrm{OPT}$. Therefore, combining Lemma 2 and Theorem 2.3.2 leads to a novel competitive ratio bound for the self-tuning scheme (Algorithm 3). Note that, compared with Theorem 2.3.2, which provides a competitive ratio bound for $\lambda$-confident control with a fixed trust parameter, Theorem 2.4.1 below considers the self-tuning scheme in Algorithm 3, where at each time $t$ a trust parameter $\lambda_t$ is determined by online learning and may be time-varying.

Theorem 2.4.1. Assume $W(T) = \Omega(T)$ and $\lambda_t \in [0,1]$ for all $t = 0, \ldots, T-1$. Under our model assumptions, the competitive ratio of Algorithm 3 is bounded by

$$\mathrm{CR}(\varepsilon) \leq 1 + 2\|H\| \frac{\varepsilon}{\mathrm{OPT} + C \varepsilon} + O\left( \frac{\big( \mu_{\mathrm{VAR}}(\mathbf{w}) + \mu_{\mathrm{VAR}}(\widehat{\mathbf{w}}) \big)^2}{\mathrm{OPT}} \right),$$

where $H$, $C$, $\mathrm{OPT}$, and $\varepsilon$ are defined in Theorem 2.3.2.

In contrast to the regret bound, Theorem 2.4.1 states an upper bound on the competitive ratio $\mathrm{CR}(\varepsilon)$ defined in Section 2.2, which indicates that $\mathrm{CR}(\varepsilon)$ scales as $1 + O(\varepsilon)/(\Theta(1) + \Theta(\varepsilon))$ as a function of $\varepsilon$. As a comparison, the $\lambda$-confident control in Algorithm 2 has a competitive ratio upper bound that is linear in the prediction error $\varepsilon$ (Theorem 2.3.2). This improved dependency highlights the importance of learning the trust parameter adaptively.

Our experimental results in the next section verify the implications of Theorem 2.3.2 and Theorem 2.4.1. Specifically, the simulated competitive ratio of self-tuning control (Algorithm 3) is a non-linear envelope of the simulated competitive ratios for $\lambda$-confident control with fixed trust parameters and, as a function of the prediction error $\varepsilon$, it matches the implied competitive ratio upper bound $1 + O(\varepsilon)/(\Theta(1) + \Theta(\varepsilon))$.

Theorem 2.4.1 is proven by combining Lemma 2 with Theorem 2.3.2, which bounds the competitive ratios for fixed trust parameters.

Proof of Theorem 2.4.1. Denote by $\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})$ the algorithmic cost of the self-tuning control scheme. We have

$$\frac{\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1})}{\mathrm{OPT}} \leq \frac{|\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)|}{\mathrm{OPT}} + \frac{\mathrm{ALG}(\lambda^*)}{\mathrm{OPT}}. \tag{2.9}$$

Using Theorem 2.3.2,

$$\frac{\mathrm{ALG}(\lambda^*)}{\mathrm{OPT}} \leq 1 + 2\|H\| \min\left\{ \min_{\lambda} \left( \frac{\lambda^2}{\mathrm{OPT}} \varepsilon + \frac{(1-\lambda)^2}{C} \right),\ \min_{\lambda} \left( \frac{1}{C} + \frac{\lambda^2}{\mathrm{OPT}} W \right) \right\} = 1 + 2\|H\| \min\left\{ \frac{\varepsilon}{\mathrm{OPT} + \varepsilon C},\ \frac{1}{C} \right\} = 1 + 2\|H\| \frac{\varepsilon}{\mathrm{OPT} + \varepsilon C}. \tag{2.10}$$

Moreover, the regret bound in Lemma 2 implies

$$\frac{|\mathrm{ALG}(\lambda_0, \ldots, \lambda_{T-1}) - \mathrm{ALG}(\lambda^*)|}{\mathrm{OPT}} = O\left( \frac{\big( \mu_{\mathrm{VAR}}(\mathbf{w}) + \mu_{\mathrm{VAR}}(\widehat{\mathbf{w}}) \big)^2}{\mathrm{OPT}} \right);$$

combining this with (2.10) and substituting into (2.9) gives the result. □

2.5 Applications

We now illustrate our main results using numerical examples and case studies that highlight the impact of the trust parameter $\lambda$ in $\lambda$-confident control and demonstrate the ability of the self-tuning control algorithm to learn an appropriate trust parameter $\lambda$. We consider three applications. The first is a robot tracking example in which a robot is asked to follow the locations of an unknown trajectory, and each desired location is revealed only immediately before the robot decides how to modify its velocity. Predictions of the trajectory are available; however, they can be untrustworthy and may contain large errors. The second is an adaptive battery-buffered electric vehicle (EV) charging problem, where a battery-buffered charging station adaptively supplies the energy demands of arriving EVs while keeping the state of charge of the batteries as close to a nominal level as possible.

Our third application considers a non-linear control problem: the Cart-Pole problem. Our $\lambda$-confident and self-tuning control schemes use a linearized model, while the algorithms are tested in the non-linear environment. We use this third application to demonstrate the practicality of our algorithms by showing that they work not only for LQC problems but also for non-linear systems.

To illustrate the impact of randomness in prediction errors in our case studies, the three applications use different forms of random error models. For each selected distribution of $\widehat{\mathbf{w}} - \mathbf{w}$, we repeat the experiments multiple times and report the worst case with the highest algorithmic cost; see Appendix 2.E for details.