
3.5 Practical Implementation and Applications

Learning confidence coefficients online

Our main results in the previous section are stated without specifying a sequence of confidence coefficients $(\lambda_t : t \ge 0)$ for the policy; in the following we introduce an online learning approach that generates confidence coefficients based on observations of actions, states and known system parameters. The negative result in Theorem 3.3.1 highlights that the adaptive nature of the confidence coefficients in Algorithm 4 is crucial to ensuring stability. Naturally, learning the values of the confidence coefficients $(\lambda_t : t \ge 0)$ online can further improve performance.

In this section, we propose an online learning approach based on a linear parameterization of a black-box model-free policy $\hat{\pi}_t(x) = -Kx - H^{-1}B^\top \sum_{\tau=t}^{\infty} (F^\top)^{\tau-t} P \hat{f}_\tau$, where $(\hat{f}_t : t \ge 0)$ are parameters representing estimates of the residual functions for a black-box policy. Note that when $\hat{f}_t = f_t^* \coloneqq f_t(x_t^*, u_t^*)$, where $x_t^*$ and $u_t^*$ are the optimal state and action at time $t$ under an optimal policy, the model-free policy is optimal. In general, a black-box model-free policy $\hat{\pi}$ can be non-linear; the linear parameterization provides an example of how the time-varying confidence coefficients $(\lambda_t : t \ge 0)$ are selected, and the idea can be extended to non-linear parameterizations such as kernel methods.

Under the linear parameterization assumption, for linear dynamics, [23] shows that the optimal choice of $\lambda_{t+1}$ that minimizes the gap between the policy cost and the optimal cost over the first $t$ time steps is
$$\lambda_{t+1} = \frac{\sum_{s=0}^{t} (\eta(f^*; s, t))^\top H \, \eta(\hat{f}; s, t)}{\sum_{s=0}^{t} \eta(\hat{f}; s, t)^\top H \, \eta(\hat{f}; s, t)}, \qquad (3.5)$$
where $\eta(f; s, t) \coloneqq \sum_{\tau=s}^{t} (F^\top)^{\tau-s} P f_\tau$. Compared with a linear quadratic control problem, computing $\lambda_t$ in (3.5) raises two issues. The first is that, unlike in a linear dynamical system where the true perturbations can be observed, the optimal actions and states are unknown, making the computation of the term $\eta(f^*; s, t)$ impossible. The second issue is similar: since the model-free policy is a black box, we do not know the parameters $(\hat{f}_t : t \ge 0)$ exactly. Therefore, we use approximations to compute the terms $\eta(f^*; s, t)$ and $\eta(\hat{f}; s, t)$ in (3.5); the linear dynamics and the linear parameterization assumptions are used to derive the respective approximations, with details provided in Appendix 3.B. Let $(BH^{-1})^\dagger$ denote the Moore–Penrose inverse of $BH^{-1}$. Combining (3.8) and (3.9) with (3.5) yields the following online-learning choice of the confidence coefficient $\lambda_t = \min\{\lambda', \lambda_{t-1} - \alpha\}$, where $\alpha > 0$ is a fixed step size and

$$\lambda' \coloneqq \frac{\sum_{s=1}^{t-1} \left(\sum_{\tau=s}^{t-1} (F^\top)^{\tau-s} P (Ax_\tau + Bu_\tau - x_{\tau+1})\right)^\top B(\hat{u}_s + Kx_s)}{\sum_{s=0}^{t-1} (\hat{u}_s + Kx_s)^\top (BH^{-1})^\dagger B(\hat{u}_s + Kx_s)} \qquad (3.6)$$

Figure 3.5.1: Competitiveness and stability of the adaptive policy. Top (competitiveness): costs of pre-trained RL agents, an LQR and the adaptive policy when the initial pole angle $\theta$ (unit: radians) varies. Bottom (stability): convergence of $\|x_t\|$ in $t$ with $\theta = 0.4$ for pre-trained RL agents, a naive combination (Section 3.3) using a fixed $\lambda = 0.8$ and the adaptive policy.

Figure 3.5.2: Simulation results for real-world adaptive EV charging. Left: total rewards of the adaptive policy and SAC for pre-COVID-19 days and post-COVID-19 days. Right: shift of data distributions due to the work-from-home policy.

based on the crude model information $A, B, Q, R$ and previously observed states, model-free actions and policy actions. This online learning process provides a choice of the confidence coefficient in Algorithm 4. It is worth noting that other approaches for generating $\lambda_t$ exist, and our theoretical guarantees apply to any such approach.
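To illustrate the computation, the following Python sketch estimates $\lambda'$ from an observed trajectory following (3.6), with $\eta(f^*;s,t)$ and $\eta(\hat{f};s,t)$ replaced by the approximations (3.8)–(3.9); the function name, the clipping of the output to $[0,1]$ and the default step size are illustrative assumptions rather than part of Algorithm 4.

```python
import numpy as np

def confidence_coefficient(xs, us, u_hats, A, B, P, K, F, H, lam_prev, alpha=0.05):
    """Estimate lambda' via (3.6) from observed states xs = [x_0, ..., x_t],
    applied actions us = [u_0, ..., u_{t-1}] and black-box actions
    u_hats = [uhat_0, ..., uhat_{t-1}], then return min(lambda', lam_prev - alpha)."""
    t = len(us)
    BH_pinv = np.linalg.pinv(B @ np.linalg.inv(H))  # Moore-Penrose inverse of B H^{-1}
    num, den = 0.0, 0.0
    for s in range(t):
        # One-step model residuals A x_tau + B u_tau - x_{tau+1} approximate eta(f*; s, .), cf. (3.9).
        eta_star = sum(
            np.linalg.matrix_power(F.T, tau - s)
            @ P @ (A @ xs[tau] + B @ us[tau] - xs[tau + 1])
            for tau in range(s, t)
        )
        d = u_hats[s] + K @ xs[s]          # deviation of the black-box action from the LQR action
        num += eta_star @ (B @ d)
        den += d @ (BH_pinv @ (B @ d))
    lam_prime = num / den if den > 0 else 1.0
    # Clip to [0, 1] so the result is a valid confidence coefficient (a practical guard).
    return float(np.clip(min(lam_prime, lam_prev - alpha), 0.0, 1.0))
```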

Applications

To demonstrate the efficacy of the adaptive $\lambda$-confident policy (Algorithm 4), we first apply it to the Cart-Pole OpenAI Gym environment (Cart-Pole-v1, Example 1 in Appendix 3.A) [55].¹ Next, we apply it to an adaptive electric vehicle (EV) charging environment modeled on a real-world dataset [1].

The Cart-Pole problem. We use the Stable-Baselines3 pre-trained agents [61] of A2C [72], ARS [62], PPO [56] and TRPO [60] as four candidate black-box policies.

In Figure 3.5.1, the adaptive policy finds a trade-off between the pre-trained black-box policies and an LQR with crude model information (i.e., about 50% estimation error in the mass and length values). In particular, when $\theta$ increases, it stabilizes the state, while the A2C and PPO policies become unstable when the initial angle $\theta$ is large.

Real-world adaptive EV charging. In the EV charging application, a SAC [73] agent is trained with data collected from a pre-COVID-19 period and tested on days before and after COVID-19. Due to a policy change (the work-from-home policy), the SAC agent becomes biased in the post-COVID-19 period (see the right sub-figure in Figure 3.5.2). With crude model information, the adaptive policy has rewards matching the SAC agent in the pre-COVID-19 period and significantly outperforms the SAC agent in the post-COVID-19 period, with an average total reward of 1951.2 versus 1540.3 for SAC. Further details on the hyper-parameters and reward function are included in Appendix 3.A.

¹The Cart-Pole environment is modified so that quadratic costs are considered rather than discrete rewards.

APPENDIX

3.A Experimental Setup and Supplementary Results

We describe the experimental settings and the choices of hyper-parameters and reward/cost functions in the two applications.

Table 3.A.1: Hyper-parameters used in the Cart-Pole problem.

Parameter | Value
Number of Monte Carlo tests | 10
Initial angle variation (in rad) | $\theta \pm 0.05$
Cost matrix $Q$ | $I$
Cost matrix $R$ | $10^{-4}$
Acceleration of gravity $g$ (in m/s²) | 9.8
Pole mass $m$ (in kg) | 0.2 for LQR; 0.1 for real environment
Cart mass $M$ (in kg) | 2.0 for LQR; 1.0 for real environment
Pole length $l$ (in m) | 2
Duration $\tau$ (in seconds) | 0.02
Force magnitude $F$ | 10
CPU | Intel® i7-8850H

The Cart-Pole Problem

Problem setting. The Cart-Pole problem considered in the experiments is described by the following example.

Example 1 (The Cart-Pole Problem). In the Cart-Pole problem, the goal of a controller is to stabilize the pole in the upright position. Neglecting friction, the dynamical equations of the Cart-Pole problem are

$$\ddot{\theta} = \frac{g\sin\theta + \cos\theta\left(\dfrac{-u - m l \dot{\theta}^2\sin\theta}{m+M}\right)}{l\left(\dfrac{4}{3} - \dfrac{m\cos^2\theta}{m+M}\right)}, \qquad \ddot{y} = \frac{u + m l\left(\dot{\theta}^2\sin\theta - \ddot{\theta}\cos\theta\right)}{m+M},$$

where $u$ is the input force; $\theta$ is the angle between the pole and the vertical line; $y$ is the location of the pole; $g$ is the gravitational acceleration; $l$ is the pole length; $m$ is the pole mass; and $M$ is the cart mass. Taking $\sin\theta \approx \theta$ and $\cos\theta \approx 1$ and ignoring higher-order terms provides a linearized system, and the discretized dynamics of the Cart-Pole problem can be represented, for any $t$, as

$$\underbrace{\begin{bmatrix} y_{t+1} \\ \dot{y}_{t+1} \\ \theta_{t+1} \\ \dot{\theta}_{t+1} \end{bmatrix}}_{x_{t+1}} = \underbrace{\begin{bmatrix} 1 & \tau & 0 & 0 \\ 0 & 1 & -\frac{m l g \tau}{\eta(m+M)} & 0 \\ 0 & 0 & 1 & \tau \\ 0 & 0 & \frac{g\tau}{\eta} & 1 \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} y_t \\ \dot{y}_t \\ \theta_t \\ \dot{\theta}_t \end{bmatrix}}_{x_t} + \underbrace{\begin{bmatrix} 0 \\ \frac{(m+M)\eta + m l}{(m+M)^2\eta}\tau \\ 0 \\ -\frac{\tau}{(m+M)\eta} \end{bmatrix}}_{B} u_t + f_t\left(y_t, \dot{y}_t, \theta_t, \dot{\theta}_t, u_t\right)$$

where $(y_t, \dot{y}_t, \theta_t, \dot{\theta}_t)^\top$ denotes the system state at time $t$; $\tau$ denotes the time interval between state updates; $\eta \coloneqq (4/3)l - m l/(m+M)$; and the function $f_t$ measures the difference between the linearized system and the true system dynamics. Note that $f_t(0) = 0$ for all time steps $t \ge 0$.
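As a concrete illustration, the snippet below builds the matrices $A$ and $B$ of the linearized, discretized Cart-Pole model above, using the crude LQR parameter values from Table 3.A.1; the helper name and the omission of the residual $f_t$ are assumptions made for the example.

```python
import numpy as np

# Crude model parameters used for the LQR (Table 3.A.1).
g, m, M, l, tau = 9.8, 0.2, 2.0, 2.0, 0.02
eta = (4.0 / 3.0) * l - m * l / (m + M)  # eta := (4/3) l - m l / (m + M)

A = np.array([
    [1.0, tau, 0.0, 0.0],
    [0.0, 1.0, -m * l * g * tau / (eta * (m + M)), 0.0],
    [0.0, 0.0, 1.0, tau],
    [0.0, 0.0, g * tau / eta, 1.0],
])
B = np.array([
    [0.0],
    [((m + M) * eta + m * l) / ((m + M) ** 2 * eta) * tau],
    [0.0],
    [-tau / ((m + M) * eta)],
])

def linear_step(x, u):
    """One step of the linearized model x_{t+1} = A x_t + B u_t (residual f_t omitted)."""
    return A @ x + B @ np.atleast_1d(u)
```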

Policy setting. The Stable-Baselines3 pre-trained agents [61] of A2C [72], ARS [62], PPO [56] and TRPO [60] are selected as the four candidate black-box policies. The Cart-Pole environment is modified so that quadratic costs are considered rather than discrete rewards, to match our control problem (3.3). The choices of $Q$ and $R$ in the costs and other parameters are provided in Table 3.A.1. Note that we vary the values of $m$ and $M$ in the LQR implementation to model the case of having only crude estimates of the linear dynamics. The LQR outputs an action 0 if $-Kx_t + F' < 0$ and 1 otherwise. A shifted force $F' = 15$ is used to model inaccurate linear approximations and noise. The pre-trained RL policies output a binary decision in $\{0, 1\}$ representing the force direction. To use our adaptive policy in this setting, given a system state $x_t$ at each time $t$, we implement the following:
$$u_t = \lambda_t\left(2\pi_{\mathrm{RL}}(x_t)F - F\right) + (1 - \lambda_t)\left(2\pi_{\mathrm{LQR}}(x_t)F - F\right),$$
where $F$ is the fixed force magnitude defined in Table 3.A.1; $\pi_{\mathrm{RL}}$ denotes an RL policy; $\pi_{\mathrm{LQR}}$ denotes the LQR policy; and $\lambda_t$ is a confidence coefficient generated based on (3.6). Instead of fixing a step size $\alpha$, we set an upper bound $\delta = 0.2$ on the learned step size $\lambda'$ to avoid converging too fast to a pure model-based policy.

Real-World Adaptive EV Charging in Tackling COVID-19

Problem setting. The second application considered is an EV charging problem modeled by real-world large-scale charging data [1]. The problem is formally described below.

Example 2 (Adaptive EV charging). Consider the problem of managing a fleet of electric vehicle supply equipment (EVSE). Let $n$ be the number of EV charging stations.

Table 3.A.2: Hyper-parameters used in the real-world EV charging problem.

Parameter | Value

Problem setting
Number of chargers $n$ | 5
Line limit $\gamma$ (in kW) | 6.6
Duration $\tau$ (in minutes) | 5
Reward coefficient $\phi_1$ | 50
Reward coefficient $\phi_2$ | 0.01
Reward coefficient $\phi_3$ | 10

Policy setting
Discount $\gamma_{\mathrm{SAC}}$ | 0.9
Target smoothing coefficient $\tau_{\mathrm{SAC}}$ | 0.005
Temperature parameter $\alpha_{\mathrm{SAC}}$ | 0.2
Learning rate | $3 \cdot 10^{-4}$
Maximum number of steps | $10 \cdot 10^{6}$
Replay buffer size | $10 \cdot 10^{6}$
Number of hidden layers (all networks) | 2
Number of hidden units per layer | 256
Number of samples per minibatch | 256
Non-linearity | ReLU

Training and testing data
Pre-COVID-19 (Training) | May 2019 – Aug 2019
Pre-COVID-19 (Testing) | Sep 2019 – Dec 2019
In-COVID-19 | Feb 2020 – May 2020
Post-COVID-19 | May 2021 – Aug 2021
CPU | Intel® i7-8850H

Denote by $x_t \in \mathbb{R}^n_+$ the charging states of the $n$ stations, i.e., $x_t^{(i)} > 0$ if an EV is charging at station $i$ and $x_t^{(i)}$ (kWh) energy remains to be delivered; otherwise $x_t^{(i)} = 0$. Let $u_t \in \mathbb{R}^n_+$ be the allocation of energy to the $n$ stations. There is a line limit $\gamma > 0$ such that $\sum_{i=1}^{n} u_t^{(i)} \le \gamma$ for any $t$. At each time $t$, new EVs may arrive and EVs being charged may depart from previously occupied stations. Each new EV $j$ induces a charging session, represented by $s_j \coloneqq (a_j, d_j, e_j, i)$, where EV $j$ arrives at station $i$ at time $a_j$ with a battery capacity $e_j > 0$ and departs at time $d_j$. Assuming lossless charging, the system dynamics are $x_{t+1} = x_t + u_t + f_t(x_t, u_t)$, $t \ge 0$, where the non-linear residual functions $(f_t : t \ge 0)$ represent uncertainty and

Figure 3.A.1: Illustration of the impact of COVID-19 on charging behaviors in terms of the total number of charging sessions and energy delivered (left) and distribution shifts (right).

constraint violations. Let $\tau$ be the time interval between state updates. Given fixed session information $(s_j : j > 0)$, denote by the following sets the sessions that are assigned to charger $i$ and activated (deactivated) at time $t$:
$$\mathcal{A}_i \coloneqq \left\{(j, t) : a_j \le t \le a_j + \tau,\ s_j^{(4)} = i\right\}, \qquad \mathcal{D}_i \coloneqq \left\{(j, t) : d_j \le t \le d_j + \tau,\ s_j^{(4)} = i\right\}.$$
The charging uncertainty is summarized as, for any $i = 1, \ldots, n$,

๐‘“(๐‘–)

๐‘ก (๐‘ฅ๐‘ก, ๐‘ข๐‘ก) B

๏ฃฑ๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃฒ

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด

๏ฃณ ๐‘ (3)

๐‘— if(๐‘— , ๐‘ก) โˆˆ A๐‘– (New sessions are active)

โˆ’๐‘ฅ(

๐‘–) ๐‘ก โˆ’๐‘ข(

๐‘–)

๐‘ก if(๐‘— , ๐‘ก) โˆˆ D๐‘–or๐‘ฅ(

๐‘–) ๐‘ก +๐‘ข(

๐‘–)

๐‘ก < 0 (Sessions end) (or battery is full) ๐›พ

โˆฅ๐‘ข๐‘กโˆฅ1๐‘ข(๐‘–)

๐‘ก if ร๐‘›

๐‘–=1๐‘ข(๐‘–)

๐‘ก > ๐›พ (Line limit is exceeded)

0 otherwise

.
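For concreteness, a minimal sketch of the residual above for a single time step is given below; the argument names (`arrivals`, `departures`, `demands`) and their types are assumptions made to keep the example self-contained.

```python
import numpy as np

def charging_residual(x, u, arrivals, departures, demands, gamma):
    """Evaluate f_t(x_t, u_t) component-wise per the case definition above.
    arrivals/departures: sets of station indices with a session activated/deactivated
    at time t; demands[i]: energy requested by the EV newly arrived at station i;
    gamma: line limit."""
    f = np.zeros_like(x)
    for i in range(len(x)):
        if i in arrivals:                          # new session becomes active
            f[i] = demands[i]
        elif i in departures or x[i] + u[i] < 0:   # session ends or battery is full
            f[i] = -x[i] - u[i]
        elif u.sum() > gamma:                      # line limit is exceeded
            f[i] = gamma / np.linalg.norm(u, 1) * u[i]
        # otherwise f[i] remains 0
    return f
```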

Note that the non-linear residual functions $(f_t : t \ge 0)$ in Example 2 may not satisfy $f_t(0) = 0$ for all $t \ge 0$, as required by Assumption 1. Our experiments further validate that the adaptive policy works well in practice even if some of the model assumptions are violated. The goal of an EV charging controller is to maximize a system-level reward function that includes maximizing energy delivery, avoiding a penalty due to uncharged

Figure 3.A.2: Bar-plots of rewards/number of sessions corresponding to testing the SAC policy and the adaptive policy on the EV charging environment based on sessions collected from three time periods.

Figure 3.A.3: Supplementary results of Figure 3.5.2 with additional testing rewards for an in-COVID-19 period.

capacities, and minimizing electricity costs. The reward function is
$$r(u_t, x_t) \coloneqq \underbrace{\phi_1 \times \tau\|u_t\|_2}_{\text{Charging rewards}} - \underbrace{\phi_2 \times \|x_t\|_2}_{\text{Unfinished charging}} - \underbrace{\phi_3 \times p_t\|u_t\|_1}_{\text{Electricity cost}} - \underbrace{\phi_4 \times \sum_{i=1}^{n} \mathbb{1}\left((j, t) \in \mathcal{D}_i\right)\frac{x_t^{(i)}}{e_j}}_{\text{Penalty}}$$
with coefficients $\phi_1, \phi_2, \phi_3$ and $\phi_4$ shown in Table 3.A.2. The environment is wrapped as an OpenAI Gym environment [55]. In our implementation, for convenience, the state $x_t$ is in $\mathbb{R}^{2n}_+$, with $n$ additional coordinates representing the remaining charging duration. The electricity prices $(p_t : t \ge 0)$ are average locational marginal prices (LMPs) on the CAISO (California Independent System Operator) day-ahead market in 2016.

Policy setting. We train a SAC [73] policy $\pi_{\mathrm{SAC}}$ for EV charging with 4 months of data collected from a real-world charging garage [1] before the outbreak of COVID-19. The public charging infrastructure has 54 chargers, and we use the charging history to set up our charging environment with 5 chargers. The linear part of the non-linear dynamics $x_{t+1} = x_t + u_t + f_t(x_t, u_t)$, $t \ge 0$, is assumed to be known, based

Table 3.A.3: Average total rewards for the SAC policy and the adaptive $\lambda$-confident policy (Algorithm 4).

Policy | Pre-COVID-19 | In-COVID-19 | Post-COVID-19
SAC [73] | 1601.914 | 765.664 | 1540.315
Adaptive | 1489.338 | 839.651 | 1951.192

on which an LQR controller $\pi_{\mathrm{LQR}}$ is constructed. Our adaptive policy, presented in Algorithm 4, learns a confidence coefficient $\lambda_t$ at each time step $t$ to combine the two policies $\pi_{\mathrm{SAC}}$ and $\pi_{\mathrm{LQR}}$.

Impact of COVID-19. We test the policies on different periods from 2019 to 2021.

The impact of COVID-19 on charging behavior is intuitive. As COVID-19 became an outbreak in early February 2020 and later a pandemic in May 2020, stay-at-home orders and curfews were issued, which significantly reduced the number of active users per day. Figure 3.A.1 illustrates the dramatic fall in the total number of monthly charging sessions and the total monthly energy delivered between February 2020 and September 2020. Moreover, despite the recovery of both factors since January 2021, COVID-19 has had a long-term impact on lifestyle behaviors. For example, the right sub-figure in Figure 3.A.1 shows that the arrival times of EVs (start times of sessions) are flattened in the post-COVID-19 period, compared to a more concentrated arrival peak before COVID-19. This significant distribution shift severely deteriorates the performance of DNN-based model-free policies, e.g., SAC, that are trained on normal charging data collected before COVID-19. In this work, we demonstrate that, by taking advantage of model-based information, the adaptive $\lambda$-confident policy (Algorithm 4) is able to correct the mistakes made by DNN-based model-free policies trained on biased data and achieve more robust charging performance.

Additional experimental results. We provide supplementary experimental results.

Besides comparing the periods of pre-COVID-19 and post-COVID-19, we include the testing rewards for an in-COVID-19 period in Figure 3.A.3, together with the corresponding bar-plots in Figure 3.A.2. In addition, the average total rewards for the SAC policy and the adaptive policy are summarized in Table 3.A.3.

3.B Notation and Supplementary Definitions

Summary of Notation

A summary of notation is provided in Table 3.B.1.

Symbol | Definition

System Model
$A, B, Q, R$ | Linear system parameters
$P$ | Solution of the DARE (3.4)
$H$ | $R + B^\top P B$
$K$ | $H^{-1} B^\top P A$
$F$ | $A - BK$
$\sigma$ | $Q, R \succeq \sigma I$
$\kappa$ | $\max\{2, \|A\|, \|B\|\}$
$C_F$ and $\rho$ | $\|F^t\| \le C_F \rho^t$
$f_t : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ | Non-linear residual functions
$C_\ell$ | Lipschitz constant of $f_t$
$\hat{\pi}$ | Black-box (model-free) policy
$\pi$ | Model-based policy (LQR)
$\varepsilon$ | Consistency error of a black-box policy

Main Results
$C_a^{\mathrm{sys}}, C_b^{\mathrm{sys}}, C_c^{\mathrm{sys}}$ | Constants defined in Section 3.B
$\gamma$ | $\rho + C_F C_\ell (1 + \|K\|)$
$\mu$ | $C_F \varepsilon (C_\ell + \|B\|) + C_a^{\mathrm{sys}} C_\ell$
$\mathrm{ALG}$ | Algorithm cost
$\mathrm{OPT}$ | Optimal cost
$\mathrm{CR}(\varepsilon)$ | $2\kappa\left(\frac{C_F\|P\|}{1-\rho}\right)^2/\sigma$
$\lambda$ | $\lim_{t\to\infty}\lambda_t$
$t_0$ | The smallest time index when $\lambda_t = 0$ or $x_t = 0$

Table 3.B.1: Symbols used in this work.
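The DARE-based quantities in the table can be computed directly from the crude model information; the sketch below does so with SciPy, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_quantities(A, B, Q, R):
    """Return (P, H, K, F): P solves the DARE (3.4), H = R + B^T P B,
    K = H^{-1} B^T P A and F = A - B K."""
    P = solve_discrete_are(A, B, Q, R)
    H = R + B.T @ P @ B
    K = np.linalg.solve(H, B.T @ P @ A)
    F = A - B @ K
    return P, H, K, F
```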

Constants in Theorem 3.4.1 and 3.4.2

Let ๐ป B ๐‘… + ๐ตโŠค๐‘ƒ ๐ต. With ๐œŽ > 0 defined in Assumption 2, the parameters ๐ถsys

๐‘Ž , ๐ถsys

๐‘

, ๐ถsys

๐‘ > 0 in the statements of Theorem 3.4.1 and 3.4.2 (Section 3.4) are the following constants that only depend on the known system parameters in (3.3):

๐ถsys

๐‘Ž B1/

2๐ถ๐นโˆฅ๐‘…+๐ตโŠค๐‘ƒ ๐ตโˆฅโˆ’1

โˆฅ๐‘ƒ ๐นโˆฅ + (1+ โˆฅ๐พโˆฅ) ( โˆฅ๐‘ƒ ๐ตโˆฅ + โˆฅ๐‘ƒโˆฅ) +

๐ถsys

๐‘

2 โˆฅ๐ต+๐ผโˆฅ (1+ โˆฅ๐นโˆฅ + โˆฅ๐พโˆฅ) , (3.7) ๐ถsys

๐‘ B2๐ถ๐น2โˆฅ๐‘ƒโˆฅ (๐œŒ+๐ถ) (๐œŒ+ (1+ โˆฅ๐พโˆฅ)) 1โˆ’ (๐œŒ+๐ถ)2

โˆš๏ธ‚โˆฅ๐‘„+๐พโŠค๐‘… ๐พโˆฅ ๐œŽ

, ๐ถ๐‘sys Bโˆฅ๐ปโˆฅ/(4โˆฅ๐‘ƒ ๐ตโˆฅ +2โˆฅ๐‘ƒโˆฅ +๐ถโˆ‡( โˆฅ๐ตโˆฅ +1) โˆฅ๐ตโˆฅ).

Approximations of Online Learning Steps in Section 3.5

The following approximations of $\eta(\hat{f}; s, t)$ and $\eta(f^*; s, t)$ in (3.5) are used to derive the expression of $\lambda'$ in (3.6) for learning the confidence coefficients online:
$$\eta(\hat{f}; s, t) \approx \sum_{\tau=s}^{\infty} (F^\top)^{\tau-s} P \hat{f}_\tau = -\left(H^{-1}B^\top\right)^{\dagger}(\hat{u}_s + K x_s), \qquad (3.8)$$
$$\eta(f^*; s, t) = \sum_{\tau=s}^{t} (F^\top)^{\tau-s} P f^*_\tau \approx \sum_{\tau=s}^{t} (F^\top)^{\tau-s} P (x_{\tau+1} - A x_\tau - B u_\tau). \qquad (3.9)$$

3.C Useful Lemmas

The following lemma generalizes the results in [41].

Lemma 10 (Generalized cost characterization lemma). Consider the linear quadratic control problem below, where $Q, R \succ 0$ and the pair of matrices $(A, B)$ is stabilizable:
$$\min_{(u_t : t \ge 0)} \sum_{t=0}^{\infty}\left(x_t^\top Q x_t + u_t^\top R u_t\right), \quad \text{subject to } x_{t+1} = A x_t + B u_t + v_t \text{ for any } t \ge 0.$$
If at each time $t \ge 0$, $u_t = -K x_t - H^{-1} B^\top W_t + \eta_t$ where $\eta_t \in \mathbb{R}^m$, then the induced cost is
$$x_0^\top P x_0 + 2 x_0^\top F^\top V_0 + \sum_{t=0}^{\infty} \eta_t^\top H \eta_t + \sum_{t=0}^{\infty}\left(v_t^\top P v_t + 2 v_t^\top F^\top V_{t+1}\right) + \sum_{t=0}^{\infty}\left(W_t^\top B H^{-1} B^\top (W_t - 2 V_t) + 2\eta_t^\top B^\top (V_t - W_t)\right) + O(1),$$
where $P$ is the unique solution of the DARE in (3.4), $H \coloneqq R + B^\top P B$, $F \coloneqq A - B(R + B^\top P B)^{-1} B^\top P A = A - BK$, $W_t \coloneqq \sum_{\tau=t}^{\infty} (F^\top)^{\tau} P w_{t+\tau}$ and $V_t \coloneqq \sum_{\tau=t}^{\infty} (F^\top)^{\tau} P v_{t+\tau}$.

Proof. Denote by $\mathrm{COST}_t(x_t; \eta, w)$ the terminal cost at time $t$ given a state $x_t$ with fixed action perturbations $\eta \coloneqq (\eta_t : t \ge 0)$ and state perturbations $w \coloneqq (w_t : t \ge 0)$. We assume $\mathrm{COST}_t(x_t; \eta, w) \coloneqq x_t^\top P x_t + p_t^\top x_t + q_t$. Similar to the proof of Lemma 13 in [41], using backward induction, the cost can be rewritten as

COST๐‘ก(๐‘ฅ๐‘ก;๐œ‚, ๐‘ค) =COST๐‘ก+1(๐‘ฅ๐‘ก+1;๐œ‚, ๐‘ค) +๐‘ฅโŠค

๐‘ก ๐‘„ ๐‘ฅ๐‘ก+๐‘ขโŠค

๐‘ก ๐‘…๐‘ข๐‘ก

=๐‘ฅโŠค

๐‘ก ๐‘„ ๐‘ฅ๐‘ก+๐‘ขโŠค

๐‘ก ๐‘…๐‘ข๐‘ก+ (๐ด๐‘ฅ๐‘ก+๐ต๐‘ข๐‘ก+๐‘ฃ๐‘ก)โŠค๐‘ƒ(๐ด๐‘ฅ๐‘ก+๐ต๐‘ข๐‘ก+๐‘ฃ๐‘ก) +๐‘โŠค

๐‘ก+1(๐ด๐‘ฅ๐‘ก+๐ต๐‘ข๐‘ก+๐‘ฃ๐‘ก) +๐‘ž๐‘ก+1

=๐‘ฅโŠค

๐‘ก ๐‘„ ๐‘ฅ๐‘ก+ (๐ด๐‘ฅ๐‘ก +๐‘ฃ๐‘ก)โŠค๐‘ƒ(๐ด๐‘ฅ๐‘ก+๐‘ฃ๐‘ก) + (๐ด๐‘ฅ๐‘ก+๐‘ฃ๐‘ก)โŠค๐‘๐‘ก+1+๐‘ž๐‘ก+1 +๐‘ขโŠค

๐‘ก (๐‘…+๐ตโŠค๐‘ƒ ๐ต)๐‘ข๐‘ก

| {z }

(๐‘Ž)

+2๐‘ขโŠค

๐‘ก ๐ตโŠค(๐‘ƒ ๐ด๐‘ฅ๐‘ก+๐‘ƒ๐‘ฃ๐‘ก+ ๐‘๐‘ก+1/2)

| {z }

(๐‘)

.

Denote by๐ป B ๐‘…+๐ตโŠค๐‘ƒ ๐ต. Noting that๐‘ข๐‘ก =โˆ’๐พ ๐‘ฅ๐‘ก+๐บ๐‘ก +๐œ‚๐‘ก where we denote ๐บ๐‘ก B ๐ปโˆ’1๐ตโŠค๐‘Š๐‘ก, ๐‘Š๐‘ก B

โˆž

โˆ‘๏ธ

๐œ=๐‘ก

๐นโŠค๐œ

๐‘ƒ๐‘ค๐‘ก+๐œ and๐‘‰๐‘ก B

โˆž

โˆ‘๏ธ

๐œ=๐‘ก

๐นโŠค๐œ ๐‘ƒ๐‘ฃ๐‘ก+๐œ, it follows that

(๐‘Ž) =(๐พ ๐‘ฅ๐‘ก+๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ป(๐พ ๐‘ฅ๐‘ก+๐บ๐‘ก โˆ’๐œ‚๐‘ก) โˆ’2(๐พ ๐‘ฅ๐‘ก) +2(๐พ ๐‘ฅ๐‘ก)โŠค๐ป(๐บ๐‘กโˆ’๐œ‚๐‘ก) + (๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค(๐‘…+๐ตโŠค๐‘ƒ ๐ต) (๐บ๐‘ก โˆ’๐œ‚๐‘ก)

(๐‘) =โˆ’2(๐พ ๐‘ฅ๐‘ก)โŠค๐ป(๐พ ๐‘ฅ๐‘ก) โˆ’2๐‘ฅโŠค

๐‘ก ๐พโŠค๐ตโŠค๐‘ƒ๐‘ฃ๐‘ก โˆ’๐‘ฅโŠค

๐‘ก ๐พโŠค๐ตโŠค๐‘๐‘ก+1โˆ’2(๐พ ๐‘ฅ๐‘ก)โŠค๐ป(๐บ๐‘กโˆ’๐œ‚๐‘ก)

โˆ’2(๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ตโŠค(๐‘ƒ๐‘ฃ๐‘ก+๐‘๐‘ก+1/2), implying

COST๐‘ก(๐‘ฅ๐‘ก;๐œ‚, ๐‘ค)=๐‘ฅโŠค

๐‘ก (๐‘„+ ๐ดโŠค๐‘ƒ ๐ดโˆ’๐พโŠค๐ป ๐พ)๐‘ฅ๐‘ก+๐‘ฅโŠค

๐‘ก ๐นโŠค(2๐‘ƒ๐‘ฃ๐‘ก+ ๐‘๐‘ก+1)

+ (๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ป(๐บ๐‘กโˆ’๐œ‚๐‘ก) โˆ’2(๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ตโŠค(๐‘ƒ๐‘ฃ๐‘ก+๐‘๐‘ก+1/2) +๐‘ฃโŠค

๐‘ก ๐‘ƒ๐‘ฃ๐‘ก+๐‘ฃโŠค

๐‘ก ๐‘๐‘ก+1+๐‘ž๐‘ก+1.

According to the DARE in (3.4), since๐พโŠค๐ป ๐พ = ๐ดโŠค๐‘ƒ ๐ต(๐‘…+๐ตโŠค๐‘ƒ ๐ต)โˆ’1๐ตโŠค๐‘ƒ ๐ด, we get COST๐‘ก(๐‘ฅ๐‘ก;๐œ‚, ๐‘ค) =๐‘ฅโŠค

๐‘ก ๐‘ƒ๐‘ฅ๐‘ก+๐‘ฅโŠค

๐‘ก ๐นโŠค(2๐‘ƒ๐‘ฃ๐‘ก+ ๐‘๐‘ก+1) + (๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ป(๐บ๐‘กโˆ’๐œ‚๐‘ก)

โˆ’2(๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ตโŠค(๐‘ƒ๐‘ฃ๐‘ก+ ๐‘๐‘ก+1/2) +๐‘ฃโŠค

๐‘ก ๐‘ƒ๐‘ฃ๐‘ก +๐‘ฃโŠค

๐‘ก ๐‘๐‘ก+1+๐‘ž๐‘ก+1, which implies

๐‘๐‘ก =2๐นโŠค(๐‘ƒ๐‘ฃ๐‘ก+ ๐‘๐‘ก+1) =2

โˆž

โˆ‘๏ธ

๐œ=๐‘ก

๐นโŠค๐œ+1

๐‘ƒ๐‘ฃ๐‘ก+๐œ =2๐นโŠค๐‘‰๐‘ก, (3.10) ๐‘ž๐‘ก =๐‘ž๐‘ก+1+๐‘ฃโŠค

๐‘ก ๐‘ƒ๐‘ฃ๐‘ก+2๐‘ฃโŠค

๐‘ก ๐นโŠค๐‘‰๐‘ก+1+ (๐บ๐‘ก โˆ’๐œ‚๐‘ก)โŠค๐ป(๐บ๐‘ก โˆ’๐œ‚๐‘ก)

โˆ’2(๐บ๐‘กโˆ’๐œ‚๐‘ก)โŠค๐ตโŠค(๐‘ƒ๐‘ฃ๐‘ก+ ๐‘๐‘ก+1/2)

=๐‘ž๐‘ก+1+๐‘ฃโŠค

๐‘ก ๐‘ƒ๐‘ฃ๐‘ก+2๐‘ฃโŠค

๐‘ก ๐นโŠค๐‘‰๐‘ก+1+ (๐บ๐‘ก โˆ’๐œ‚๐‘ก)โŠค๐ป(๐บ๐‘ก โˆ’๐œ‚๐‘ก)

โˆ’2๐บโŠค

๐‘ก ๐ตโŠค๐‘‰๐‘ก+2๐œ‚โŠค

๐‘ก ๐ตโŠค๐‘‰๐‘ก

=๐‘ž๐‘ก+1+๐‘ฃโŠค

๐‘ก ๐‘ƒ๐‘ฃ๐‘ก+2๐‘ฃโŠค

๐‘ก ๐นโŠค๐‘‰๐‘ก+1+๐บโŠค

๐‘ก ๐ตโŠค(๐‘Š๐‘กโˆ’2๐‘‰๐‘ก) +2๐œ‚โŠค

๐‘ก ๐ตโŠค(๐‘‰๐‘กโˆ’๐‘Š๐‘ก) +๐œ‚โŠค

๐‘ก ๐ป ๐œ‚๐‘ก. (3.11) Therefore, (3.10) and (3.11) together imply the following general cost characteriza- tion:

COST๐‘ก(๐‘ฅ๐‘ก;๐œ‚, ๐‘ค)=๐‘ฅโŠค

0๐‘ƒ๐‘ฅ0+๐‘ฅโŠค

0๐‘0+๐‘ž0

=๐‘ฅโŠค

0๐‘ƒ๐‘ฅ0+2๐‘ฅโŠค

0๐นโŠค๐‘‰0+

โˆž

โˆ‘๏ธ

๐‘ก=0

๐œ‚โŠค

๐‘ก ๐ป ๐œ‚๐‘ก +

โˆž

โˆ‘๏ธ

๐‘ก=0

๐‘ฃโŠค

๐‘ก ๐‘ƒ๐‘ฃ๐‘ก+2๐‘ฃโŠค

๐‘ก ๐นโŠค๐‘‰๐‘ก+1 +

โˆž

โˆ‘๏ธ

๐‘ก=0

๐บโŠค

๐‘ก ๐ตโŠค(๐‘Š๐‘กโˆ’2๐‘‰๐‘ก) +2๐œ‚โŠค

๐‘ก ๐ตโŠค(๐‘‰๐‘กโˆ’๐‘Š๐‘ก) .

Rearranging the terms above completes the proof. □

To deal with the non-linear dynamics in (3.1), we consider an auxiliary linear system with a fixed perturbation $w_t = f_t(x_t^*, u_t^*)$ for all $t \ge 0$, where each $x_t^*$ denotes an optimal state and $u_t^*$ an optimal action, generated by an optimal policy $\pi^*$. We define a sequence of linear policies $(\pi'_t : t \ge 0)$, where $\pi'_t : \mathbb{R}^n \to \mathbb{R}^m$ generates an action
$$u'_t = \pi'_t(x_t) \coloneqq -(R + B^\top P B)^{-1} B^\top\left(P A x_t + \sum_{\tau=t}^{\infty} (F^\top)^{\tau-t} P f_\tau(x_\tau^*, u_\tau^*)\right), \qquad (3.12)$$
which is an optimal policy for the auxiliary linear system. Utilizing Lemma 10, the gap between the optimal cost and the algorithm cost for the system in (3.3) can be characterized as follows.

Lemma 11. For any $\eta_t \in \mathbb{R}^m$, if at each time $t \ge 0$ a policy $\pi : \mathbb{R}^n \to \mathbb{R}^m$ takes an action $u_t = \pi(x_t) = \pi'_t(x_t) + \eta_t$, then the gap between the optimal cost $\mathrm{OPT}$ of the non-linear system (3.3) and the algorithm cost $\mathrm{ALG}$ induced by selecting control actions $(u_t : t \ge 0)$ satisfies
$$\begin{aligned}
\mathrm{ALG} - \mathrm{OPT} \le{}& \sum_{t=0}^{\infty} \eta_t^\top H \eta_t + O(1) + 2 \sum_{t=0}^{\infty} \eta_t^\top B^\top\left(\sum_{\tau=t}^{\infty} (F^\top)^{\tau} P \left(f_{t+\tau} - f^*_{t+\tau}\right)\right) \\
&+ \sum_{t=0}^{\infty}\left(f_t^\top P f_t - (f^*_t)^\top P f^*_t\right) + 2 x_0^\top\left(\sum_{t=0}^{\infty} (F^\top)^{t+1} P \left(f_t - f^*_t\right)\right) \\
&+ 2 \sum_{t=0}^{\infty}\left(f_t^\top \sum_{\tau=0}^{\infty} (F^\top)^{\tau+1} P f_{t+\tau+1} - (f^*_t)^\top \sum_{\tau=0}^{\infty} (F^\top)^{\tau+1} P f^*_{t+\tau+1}\right) \\
&+ 2 \sum_{t=0}^{\infty}\left(\sum_{\tau=t}^{\infty} (F^\top)^{\tau} P f^*_{t+\tau}\right)^\top B H^{-1} B^\top\left(\sum_{\tau=t}^{\infty} (F^\top)^{\tau} P \left(f^*_{t+\tau} - f_{t+\tau}\right)\right) \qquad (3.13)
\end{aligned}$$
where $H \coloneqq R + B^\top P B$ and $F \coloneqq A - BK$. For any $t \ge 0$, we write $f_t \coloneqq f_t(x_t, u_t)$ and $f^*_t \coloneqq f_t(x_t^*, u_t^*)$, where $(x_t : t \ge 0)$ denotes the trajectory of states generated by the policy $\pi$ with actions $(u_t : t \ge 0)$ and $(x_t^* : t \ge 0)$ denotes the optimal trajectory of states generated by the optimal actions $(u_t^* : t \ge 0)$.

Proof. Note that with the optimal trajectories of states and actions fixed, the optimal controller $\pi^*$ induces the same cost $\mathrm{OPT}$ for both the non-linear system in (3.3) and the auxiliary linear system. Moreover, the linear controller defined in (3.12) induces a cost $\mathrm{OPT}'$ that is smaller than $\mathrm{OPT}$ when both are run in the auxiliary linear system, since the constructed linear policy is optimal for that system. Therefore, $\mathrm{ALG} - \mathrm{OPT} \le \mathrm{ALG} - \mathrm{OPT}'$, and applying Lemma 10 with $v_t = f_t(x_t, u_t)$ and $w_t = f_t(x_t^*, u_t^*)$ for all $t \ge 0$ yields (3.13). □

3.D Proof of Theorem 3.3.1

Proof. Fix an arbitrary controller $K_1$ with a closed-loop system matrix $F_1 \coloneqq A - BK_1 \ne 0$. We first consider the case when $F_1$ is not a diagonal matrix, i.e., $A - BK_1$ has at least one non-zero off-diagonal entry. Consider the following closed-loop system matrix for the second controller $K_2$:

$$F_2 \coloneqq \begin{pmatrix} \beta & & 0 \\ & \ddots & \\ & & \beta \end{pmatrix} - \frac{1-\lambda}{\lambda} L + S(F_1) \qquad (3.14)$$

where 0 < ๐›ฝ < 1 is a value of the diagonal entry; ๐ฟ is the lower triangular part of the closed-loop system matrix ๐น1. The matrix ๐‘†(๐น1) is a singleton matrix that depends on๐น1, whose only non-zero entry๐‘†๐‘–+๐‘˜ ,๐‘– corresponding to the first non-zero off-diagonal entry (๐‘–, ๐‘–+๐‘˜) in the upper triangular part of๐น1 searched according to the order๐‘– =1, . . . , ๐‘›and๐‘˜ =1, . . . , ๐‘›with๐‘–increases first and then ๐‘˜. Such a non-zero entry always exists because in this case๐น1is not a diagonal matrix. If the non-zero off-diagonal entry appears to be in the lower triangular part, we can simply transpose ๐น2so without loss of generality we assume it is in the upper triangular part of ๐น1. Since ๐น1 is a lower triangular matrix, all of its eigenvalues equal to 0< ๐›ฝ < 1, implying that the linear controller๐พ2is stabilizing. Then, based on the construction of๐น1in (3.14), the linearly combined controller๐พ B๐œ†๐พ2+ (1โˆ’๐œ†)๐พ1 has a closed-loop system matrix๐น which is upper-triangular, whose determinant satisfies

det(๐น) =det(๐œ† ๐น2+ (1โˆ’๐œ†)๐น1)

=(โˆ’1)2๐‘–+๐‘˜det(๐นโ€ฒ) (3.15)

=(โˆ’1)2๐‘–+๐‘˜ ร— (โˆ’1)๐‘˜+1(1โˆ’๐œ†)๐œ†๐‘›โˆ’1๐‘†๐‘–+๐‘˜ ,๐‘–(๐น1)๐‘–,๐‘–+๐‘˜๐›ฝ๐‘›โˆ’2 (3.16)

=โˆ’๐‘†๐‘–+๐‘˜ ,๐‘–(1โˆ’๐œ†)๐œ†๐‘›โˆ’1๐›ฝ๐‘›โˆ’2 (3.17)

where the term(โˆ’1)2๐‘–+๐‘˜ in (3.15) comes from the computation of the determinant of ๐นand๐นโ€ฒis a sub-matrix of ๐นby eliminating the๐‘–+๐‘˜-th row and๐‘–-th column. The term(โˆ’1)๐‘˜+1in (3.16) appears because ๐นโ€ฒis a permutation of an upper triangular matrix with๐‘›โˆ’2 diagonal entries being ๐›ฝโ€™s and one remaining entry being(๐น1)๐‘–,๐‘–+๐‘˜ since the entries๐‘†๐‘— , ๐‘—+๐‘˜ are zeros all ๐‘— < ๐‘–; otherwise another non-zero entry๐‘†๐‘–+๐‘˜ ,๐‘– would be chosen according to our search order.

Continuing from (3.17), since $S_{i+k,i}$ can be selected arbitrarily, setting $S_{i+k,i} = -\frac{2^n \beta (\lambda\beta)^{-n+1}}{1-\lambda}$ gives
$$2^n \le \det(F) \le |\rho(F)|^n,$$
implying that the spectral radius satisfies $\rho(F) \ge 2$. Therefore $K = \lambda K_2 + (1-\lambda) K_1$ is an unstable controller. It remains to prove the theorem in the case when $F_1$ is a diagonal matrix. In the following lemma, we first consider the case $n = 2$, and then extend it to the general case.

Lemma 12. For any $\lambda \in (0, 1)$ and any diagonal matrix $F_1 \in \mathbb{R}^{2\times 2}$ with spectral radius $\rho(F_1) < 1$ and $F_1 \ne \gamma I$ for any $\gamma \in \mathbb{R}$, there exists a matrix $F_2 \in \mathbb{R}^{2\times 2}$ such that $\rho(F_2) < 1$ and $\rho(\lambda F_1 + (1-\lambda) F_2) > 1$.

Proof of Lemma 12. Suppose $n = 2$ and $F_1 = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$ is a diagonal matrix; without loss of generality, assume that $a > 0$ and $a > b$. Let
$$F_2 = \frac{4}{\lambda(1-\lambda)(a-b)}\begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}. \qquad (3.18)$$
Notice that $\rho(F_2) = 0 < 1$ satisfies the constraint. Then
$$\rho(\lambda F_1 + (1-\lambda) F_2) = \rho\left(\lambda b I + \lambda\,\mathrm{Diag}(a-b, 0) + (1-\lambda) F_2\right) = \lambda b + \rho\left(\lambda\,\mathrm{Diag}(a-b, 0) + (1-\lambda) F_2\right), \qquad (3.19)$$
where we have used the assumption that $a > 0$ and $a > b$ to derive the equality in (3.19), and $\mathrm{Diag}(a-b, 0)$ denotes a diagonal matrix whose diagonal entries are $a-b$ and $0$, respectively. The eigenvalues of $\lambda\,\mathrm{Diag}(a-b, 0) + (1-\lambda) F_2$ are
$$\frac{\lambda(a-b) \pm \sqrt{\left(\lambda(a-b) + \frac{8}{\lambda(a-b)}\right)^2 - 4\left(\frac{4}{\lambda(a-b)}\right)^2}}{2} = \frac{\lambda(a-b) \pm \sqrt{(\lambda(a-b))^2 + 16}}{2}.$$
Since $a - b > 0$, the spectral radius of $\lambda F_1 + (1-\lambda) F_2$ satisfies
$$\rho(\lambda F_1 + (1-\lambda) F_2) = \lambda b + \frac{\lambda(a-b) + \sqrt{(\lambda(a-b))^2 + 16}}{2} > -1 + \frac{\sqrt{16}}{2} = 1. \qquad \square$$

Applying Lemma 12, when $F_1$ is an $n \times n$ matrix with $n > 2$, we can always construct an $n \times n$ matrix $F_2$ whose first two rows and columns form a sub-matrix identical to (3.18) and whose remaining entries are zero. Therefore the spectral radius of the convex combination $\lambda F_1 + (1-\lambda) F_2$ is greater than one. This completes the proof. □
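A quick numerical check of the construction in (3.18), with illustrative values, confirms that two individually stable closed-loop matrices can have an unstable convex combination:

```python
import numpy as np

lam, a, b = 0.5, 0.9, -0.5
F1 = np.diag([a, b])
F2 = 4.0 / (lam * (1 - lam) * (a - b)) * np.array([[1.0, 1.0], [-1.0, -1.0]])

spectral_radius = lambda M: max(abs(np.linalg.eigvals(M)))
print(spectral_radius(F1))                         # 0.9  (stable)
print(spectral_radius(F2))                         # 0.0  (nilpotent, stable)
print(spectral_radius(lam * F1 + (1 - lam) * F2))  # > 1  (unstable combination)
```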

3.E Stability Analysis

Proof Outline

Figure 3.E.1: Outline of the proofs of Theorems 3.4.1 and 3.4.2, with the stability analysis presented in Appendix 3.E and the competitive ratio analysis presented in Appendix 3.F. Arrows denote implications.

In the sequel, we present the proofs of Theorems 3.4.1 and 3.4.2. An outline of the proof structure is provided in Figure 3.E.1. The proof of our main results contains two parts: the stability analysis and the competitive ratio analysis. First, we prove that Algorithm 4 guarantees a stabilizing policy, regardless of the prediction error $\varepsilon$ (Theorem 3.4.1). Second, in our competitive ratio analysis, we provide a competitive ratio bound in terms of $\varepsilon$ and $\lambda$. We show in Lemma 14 that the competitive ratio bound is bounded if the adaptive policy is exponentially stabilizing and has