
LEARNING-BASED PREDICTIVE CONTROL: REGRET ANALYSIS

5.5 Results

Causally Invariant Safety Constraints

Motivated by the lower bound in the previous section, we now introduce a particular class of safety constraints where it is possible to have better performance. The class is defined by a form of causal invariance that is intuitive and general. Specifically, we state a condition under which the sets of subsequent feasible actions do not change too much if the measures of the sets are close.

We define the following specific sequences of actions. Let $\overline{\mathbf{u}}_{\le t} = (\overline{u}_1, \ldots, \overline{u}_t)$ be a subsequence of actions that maximizes the volume of the set of feasible actions, defined as $\overline{\mathbf{u}}_{\le t} := \arg\sup_{\mathbf{u} \in \mathcal{U}^t} \mu(S(\mathbf{u}))$. With a slight abuse of notation, given $\mathbf{u}_{\le t}$, define the length-$k$ maximizing subsequence of actions as $\overline{\mathbf{u}}_{t+1:t+k} := \arg\sup_{\mathbf{u} \in \mathcal{U}^k} \mu(S_k(\mathbf{u}_{\le t}, \mathbf{u}))$.

Definition 5.5.1 ($(k, \delta, \lambda)$-causal invariance). The safety sets are $(k, \delta, \lambda)$-causally invariant if there exist constants $\delta, \lambda > 0$ such that the following holds:

1. For all $t \in [T]$ and sequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$,
$$d_H(S_k(\mathbf{u}_{\le t}), S_k(\mathbf{v}_{\le t})) \le \delta \left( \frac{|\mu(S(\mathbf{u}_{\le t})) - \mu(S(\mathbf{v}_{\le t}))|}{\mu(B)} \right)^{\frac{1}{(T-t)m}} \tag{5.10}$$
where $B$ denotes the unit ball in $\mathbb{R}^{m \times (T-t)}$.

2. For all $t \in [T]$ and sequences of actions $\mathbf{u}_{\le t}$,
$$\left( \frac{\mu(S(\mathbf{u}_{\le t}))}{\mu(S(\overline{\mathbf{u}}_{\le t}))} \right)^{\frac{1}{T-t}} \le \lambda \left( \frac{\mu(S((\mathbf{u}_{\le t}, \overline{\mathbf{u}}_{t+1:t+k})))}{\mu(S(\overline{\mathbf{u}}_{\le t+k}))} \right)^{\frac{1}{T-t-k}}. \tag{5.11}$$

Note that the definition of causal invariance is independent of the costs. Condition (1) is an inverse of the isodiametric inequality. It says that if the two subsequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$ do not affect the measure of the space of feasible sequences of actions too much, then the Hausdorff distance between the two sets $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ is also small. Condition (2) states that, given any previous actions $\mathbf{u}_{\le t}$, the normalized measure of the space of feasible sequences of actions does not differ too much from the corresponding normalized measure obtained by fixing the next $k$ actions to be those preserving the most future flexibility.

Definition 5.5.1 is general and practically applicable, as the following examples show.

Example 4 (Inventory constraints). Consider the following set of inventory constraints, which is a common form of constraints in energy storage and demand response problems [118, 127, 142]. The summation of the squares of the $\ell_2$ norms of the actions is bounded from above by $\gamma > 0$, representing the limited resources of the system: $\sum_{t=1}^T \|u_t\|_2^2 \le \gamma$ where $u_t \in \mathbb{R}^m$. These inventory constraints are $(k, 1, 1)$-causally invariant for all $1 \le k \le T$.
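The $\lambda = 1$ claim can be checked numerically: the remaining feasible continuations form a Euclidean ball, whose measure is available in closed form. The sketch below (with hypothetical parameter values) verifies the equality behind (5.11):

```python
import math

def ball_volume(n, r):
    # Volume of the Euclidean n-ball of radius r.
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

def remaining_measure(T, m, gamma, spent, t):
    # mu(S(u_{<=t})): after the prefix has spent `spent` units of budget,
    # the feasible continuations form a ((T - t) * m)-dim ball of radius
    # sqrt(gamma - spent).
    return ball_volume((T - t) * m, math.sqrt(gamma - spent))

# Hypothetical instance: T = 10 steps, m = 2, budget gamma = 5.0,
# an arbitrary prefix of length t = 3 that spent 1.2 units, and k = 4.
T, m, gamma, t, k, spent = 10, 2, 5.0, 3, 4, 1.2

# LHS of (5.11): ratio against the all-zero (flexibility-maximizing) prefix.
lhs = (remaining_measure(T, m, gamma, spent, t)
       / remaining_measure(T, m, gamma, 0.0, t)) ** (1 / (T - t))
# RHS with lambda = 1: append k flexibility-maximizing (zero) actions.
rhs = (remaining_measure(T, m, gamma, spent, t + k)
       / remaining_measure(T, m, gamma, 0.0, t + k)) ** (1 / (T - t - k))

assert math.isclose(lhs, rhs)  # equality, so (5.11) holds with lambda = 1
assert math.isclose(lhs, (1 - spent / gamma) ** (m / 2))
```

Both sides collapse to $(1 - \sum_\tau \|u_\tau\|_2^2/\gamma)^{m/2}$, independent of $k$, which is why $\lambda = 1$ suffices.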

Example 5 (Tracking constraints). The following form of tracking constraints is common in situations where the system has a target "optimal" resource configuration that varies depending on the environment [116, 137, 143]. The agent seeks to track a fixed sequence of nominal actions $\mathbf{y} := (y_1, \ldots, y_T)$. The constraints are $\sum_{t=1}^T |u_t - y_t|^p \le \sigma$ with $p \ge 2$, where $u_t, y_t \in \mathbb{R}$ and $\sigma > 0$ represents the system's adjusting ability. These tracking constraints are $(k, 2/\sqrt{\pi}, 1)$-causally invariant for all $1 \le k \le T$.

Additionally, to highlight structures that are not causally invariant, we return to the construction of constraints used in the proof of the lower bound in Theorem 5.5.1.

Example 6 (Constraints in the lower bound (Theorem 5.5.1)). Consider the following set of constraints defined in the proof of Theorem 5.5.1 for obtaining a lower bound $\Omega(d(T-w))$ on the dynamic regret: $S := \{\mathbf{u} \in \mathcal{U}^T : u_t \in \mathcal{U}_t, \forall t \in [T]\}$ where
$$\mathcal{U}_t := \begin{cases} B(a/2), & \text{if } \|u_{t-1} - v_{t-1}\|_2 \le a \\ \mathcal{U} \setminus B(a), & \text{if } \|u_{t-1} - v_{t-1}\|_2 > a \end{cases} \tag{5.12}$$
for some $a > 0$. Suppose the action space $\mathcal{U} \subseteq \mathbb{R}^m$ is closed and bounded. These constraints are not causally invariant for $k = \Omega(T)$. The causal invariance criterion is violated for the safety constraints in (5.12) because, in order to satisfy Condition (1) in (5.10), it is necessary to have $\delta = \Omega(k) = \Omega(T)$, which is not a constant.

Motivated by the example above, and mimicking the adversarial construction in Theorem 5.5.1, we can derive the following theorem and corollary, whose proofs are in Appendix 5.B.

Theorem 5.5.2 (Lower bound on Regret). If there exists a sequence of actions $\mathbf{v}_{\le s} \in \mathcal{U}^s$, $1 \le s \le T - k$, $k \ge w$ with $k = \Omega(T)$, and some constant $\alpha > 0$ such that $d_H(S_k(\mathbf{u}_{\le s}), S_k(\mathbf{v}_{\le s})) \ge \alpha d k$ for any $\mathbf{u}_{\le s} \in \mathcal{U}^s$, then for any sequence of actions $\mathbf{u} \in S$ generated by a deterministic online policy that can access the safety sets $\{\mathcal{U}_t : t \in [T]\}$ and $\{\mathcal{X}_t : t \in [T]\}$, $\mathrm{Regret}(\mathbf{u}) = \Omega(d(T-w))$, where $d$ is the diameter of the action space $\mathcal{U}$, $w$ is the prediction window size and $T$ is the total number of time slots.

The theorem above states that, for any online policy knowing the safety sets $\{\mathcal{U}_t : t \in [T]\}$ and $\{\mathcal{X}_t : t \in [T]\}$ in advance, if there are two sets of length-$k$ subsequences of actions $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ that are far from each other in terms of the Hausdorff distance, then a sub-linear regret is impossible. This highlights the necessity of the causal invariance condition. We make this explicit by further restricting the power of the online policy, assuming that it can only access the MEF $p_t, \ldots, p_{t+w-1}$ at time $t$, which yields the following impossibility result as a corollary of Theorem 5.5.2.

Corollary 5.5.2 (Lower bound on Regret). If there exist a sequence of actions $\mathbf{v}_{\le s} \in \mathcal{U}^s$, $1 \le s \le T - k$, $k \ge w$ with $k = \Omega(T)$, a constant $\xi > 0$ and $\alpha = \Omega(T)$ such that
$$\mu(S(\mathbf{u}_{\le s})) > \xi\, \mu(S(\overline{\mathbf{u}}_{\le s})), \qquad d_H(S_k(\mathbf{u}_{\le s}), S_k(\mathbf{v}_{\le s})) \ge \alpha \left( \frac{|\mu(S(\mathbf{u}_{\le s})) - \xi\, \mu(S(\mathbf{v}_{\le s}))|}{\mu(B)} \right)^{\frac{1}{(T-s)m}}$$
for any $\mathbf{u}_{\le s} \in \mathcal{U}^s$, then for any sequence of actions $\mathbf{u} \in S$ generated by a deterministic online policy that can access the MEF $p_t, \ldots, p_{t+w-1}$ at each time $t \in [T]$, $\mathrm{Regret}(\mathbf{u}) = \Omega(d(T-w))$, where $d$ is the diameter of the action space $\mathcal{U}$, $w$ is the prediction window size and $T$ is the total number of time slots.

The corollary above indicates that an upper bound on the Hausdorff distance between $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ (such as Conditions (1) and (2) in the causal invariance criterion) is necessary to make the dynamic regret sub-linear. In the next section, however, we show that if the causal invariance criterion holds with both $\delta$ and $\lambda$ constants, then the dynamic regret can be made sub-linear with sufficiently many predictions, even for arbitrary Lipschitz continuous cost functions.

Bounding the Dynamic Regret of PPC

We are now ready to present our main result, which bounds the dynamic regret by a decreasing function of the prediction window size under the assumption that the safety sets are causally invariant.

Theorem 5.5.3 (Upper bound on Regret). Suppose the safety sets are $(w, \delta, \lambda)$-causally invariant. The dynamic regret for the sequence of actions $\mathbf{u}$ given by PPC is bounded from above by
$$\mathrm{Regret}(\mathbf{u}) = O\left( dT \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{\sqrt{\delta}}{w^{1/4}} \right) \right)$$
where $d$ denotes the diameter of the action space $\mathcal{U}$, $w$ is the prediction window size and $T$ is the total number of time slots.

This theorem implies that, with additional assumptions on the safety constraints, a sub-linear (in $T$) dynamic regret is achievable, given a sufficiently large prediction window size $w = \omega(1)$ (in $T$). Notably, Theorem 5.5.1 implies that for the worst-case costs and constraints, a deterministic online controller that has full information of the constraints suffers from a linear regret. As a comparison, Theorem 5.5.3 shows that, under additional assumptions, even if only aggregated information is available, PPC achieves a sub-linear regret. This does not contradict the lower bound on the dynamic regret in Theorem 5.5.1, since without regularity assumptions on the safety sets, in the worst case the Hausdorff distance satisfies $d_H(S_w(\mathbf{u}_{\le t}), S_w(\mathbf{v}_{\le t})) = O(dw)$.

This implies that a trivial upper bound of $O(dT)$ holds. Additionally, note that there is a trade-off between flexibility and optimality when selecting the tuning parameter $\beta > 0$. On the one hand, if the tuning parameter is too small, the algorithm is greedy and may suffer losses in the future; on the other hand, if the tuning parameter is too large, the algorithm is dominated by the feedback penalty and is therefore far from optimal.
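This trade-off can be illustrated with a deliberately simple one-step toy problem (not the PPC algorithm itself, and all quantities here are hypothetical): suppose committing to a larger action $u$ lowers the immediate cost $1 - u$ but also shrinks the remaining feasible volume $1 - u$. Minimizing cost minus $\beta$ times the log-volume penalty over a grid shows how $\beta$ controls greediness:

```python
import math

def toy_ppc_choice(beta, grid_n=1000):
    """Minimize (1 - u) - beta * log(1 - u) over a grid of u in [0, 0.99)."""
    best_u, best_val = 0.0, float("inf")
    for i in range(grid_n):
        u = 0.99 * i / grid_n
        immediate_cost = 1.0 - u       # greedy: larger u is cheaper now
        flexibility = 1.0 - u          # volume of remaining feasible actions
        val = immediate_cost - beta * math.log(flexibility)
        if val < best_val:
            best_u, best_val = u, val
    return best_u

# Small beta -> greedy (commits aggressively); large beta -> conservative.
assert toy_ppc_choice(0.05) > toy_ppc_choice(0.5)
```

Here the unconstrained optimizer is $u = 1 - \beta$, matching the qualitative behavior described above.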

A proof is presented in Appendix 5.B. Briefly, Theorem 5.5.3 is proven by optimizing $\beta$ in an upper bound $\mathrm{Regret}(\mathbf{u}) = O\left( Td \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{d\delta}{\sqrt{w}\,\beta} + \frac{\beta}{dw} \right) \right)$, which holds for any sequence of actions $\mathbf{u}$ given by the PPC with any tuning parameter $\beta > 0$.

APPENDIX

5.A Explanation of Definition 5.5.1 and Two Examples

Figure 5.A.1: Graphical illustration of (5.10) in Definition 5.5.1.

Example 1: Inventory constraints.

Recall that the feasible set of actions for the inventory constraints is
$$S = \left\{ \mathbf{u} \in \mathbb{R}^{T \times m} : \sum_{t=1}^T \|u_t\|_2^2 \le \gamma \right\}.$$

The sequence of actions $\overline{\mathbf{u}}_{\le t}$ maximizing the size of the set of admissible actions is the all-zero vector. Hence,

πœ‡(S(u≀𝑑)) πœ‡(S(u≀𝑑))

π‘‡βˆ’π‘‘1

= 1βˆ’ Í𝑑

𝜏=1||π‘’πœ||2

2

𝛾

!π‘š2

=

πœ‡(S( (u≀𝑑,u𝑑+1:𝑑+π‘˜))) πœ‡(S(u<𝑑+π‘˜))

π‘‡βˆ’π‘‘βˆ’π‘˜1 . Therefore settingπœ†=1, (5.11) is satisfied.

It remains to check the bound on the Hausdorff distance. Figure 5.A.1 shows the idea behind the definition. If the two sets $S(\mathbf{v}_{\le t})$ and $S(\mathbf{u}_{\le t})$ are close to each other, the Hausdorff distance of the projected sub-spaces $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ can also be bounded. For inventory constraints, this is indeed the case. For all $1 \le k \le T$ and sequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$,

𝑑H(Sπ‘˜(u≀𝑑),Sπ‘˜(v≀𝑑)) =

𝑑

βˆ‘οΈ

𝜏=1

||π‘’πœ||22βˆ’

𝑑

βˆ‘οΈ

𝜏=1

||π‘£πœ||22

!1/2

(5.13)

and

πœ‡(B)(π‘‡βˆ’2𝑑)π‘š

𝑑

βˆ‘οΈ

𝜏=1

||π‘’πœ||22βˆ’

𝑑

βˆ‘οΈ

𝜏=1

||π‘£πœ||22

=

πœ‡(S(v≀𝑑))(π‘‡βˆ’2𝑑)π‘š βˆ’ πœ‡(S(u≀𝑑))(π‘‡βˆ’2𝑑)π‘š

≀ |πœ‡(S(v≀𝑑)) βˆ’πœ‡(S(u≀𝑑)) |(π‘‡βˆ’π‘‘)π‘š2 . (5.14) Therefore setting𝛿 =1, (5.13) and (5.14) imply (5.10). We validate that the inventory constraints in Example 4 are(π‘˜ ,1,1)-causally invariantfor all 1 ≀ π‘˜ ≀𝑇.

Example 2: Tracking constraints.

Recall that the feasible set of actions for the tracking constraints is (with $p \ge 2$):
$$S = \left\{ \mathbf{u} \in \mathbb{R}^T : \|\mathbf{u} - \mathbf{y}\|_p^p \le \sigma \right\}.$$

The sequence of actions maximizing the size of the set of admissible actions is $\overline{\mathbf{u}}_{\le t} = \mathbf{y}_{\le t}$. Similar to Example 4, according to the formula for the volume of an $\ell_p$-ball [144], we have
$$\left( \frac{\mu(S(\mathbf{u}_{\le t}))}{\mu(S(\overline{\mathbf{u}}_{\le t}))} \right)^{\frac{1}{T-t}} = \left( 1 - \frac{\|\mathbf{u}_{\le t} - \mathbf{y}_{\le t}\|_p^p}{\sigma} \right)^{\frac{1}{p}} = \left( \frac{\mu(S((\mathbf{u}_{\le t}, \overline{\mathbf{u}}_{t+1:t+k})))}{\mu(S(\overline{\mathbf{u}}_{\le t+k}))} \right)^{\frac{1}{T-t-k}}.$$
Therefore, setting $\lambda = 1$, (5.11) is satisfied. Next, we give a bound on the Hausdorff distance. For all $1 \le k \le T$ and sequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$,

$$\left( \frac{(2\Gamma(1+1/p))^{T-t}}{\Gamma((T-t)/p + 1)} \right)^{\frac{1}{T-t}} d_H(S_k(\mathbf{u}_{\le t}), S_k(\mathbf{v}_{\le t})) \le \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \left( \frac{\pi^{(T-t)/2}}{\Gamma((T-t)/2 + 1)} \right)^{\frac{1}{T-t}} \left| \sum_{\tau=1}^t |u_\tau - y_\tau|^p - \sum_{\tau=1}^t |v_\tau - y_\tau|^p \right|$$
$$= \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \mu(B)^{\frac{1}{T-t}} \left| \sum_{\tau=1}^t |u_\tau - y_\tau|^p - \sum_{\tau=1}^t |v_\tau - y_\tau|^p \right| \le \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \left| \mu(S(\mathbf{v}_{\le t})) - \mu(S(\mathbf{u}_{\le t})) \right|^{\frac{1}{T-t}}$$
where $\Gamma(\cdot)$ is Euler's gamma function. Therefore, setting $\delta = \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \le \frac{2}{\sqrt{\pi}}$ for all $p \ge 2$, (5.10) holds. The tracking constraints in Example 5 are $(k, 2/\sqrt{\pi}, 1)$-causally invariant for all $1 \le k \le T$.

Example 3: Constraints in the Proof of Theorem 5.5.1.

Denote by $A_1 := B(a/2)$ and $A_2 := \mathcal{U} \setminus B(a)$. The feasible length-$k$ subsequences of actions lie either in the Cartesian product $A_1^k := A_1 \times \cdots \times A_1$ or in $A_2^k := A_2 \times \cdots \times A_2$. If the two sequences $\mathbf{u}_{<t}$ and $\mathbf{v}_{<t}$ are in the same set, then $\delta = \lambda = 1$; otherwise, the RHS of (5.10) becomes a constant term. Assuming $t + k \le T$, the Hausdorff distance between $S_k(\mathbf{u}_{<t}) = A_1^k$ and $S_k(\mathbf{v}_{<t}) = A_2^k$ is $\Omega(k)$. Therefore, a non-constant parameter $\delta = \Omega(k)$ is necessary for (5.10) to hold.

5.B Proofs

Proofs of Corollary 5.5.1

Proof of Corollary 5.5.1. The explicit expression in Lemma 17 ensures that whenever $p^*_t(u_t | \mathbf{u}_{<t}) > 0$, there is always a feasible sequence of actions in $S(\mathbf{u}_{<t})$. Now, if the tuning parameter $\beta > 0$, then the optimization (5.8)-(5.9) guarantees that $p^*_t(u_t | \mathbf{u}_{<t}) > 0$ for all $t \in [T]$; otherwise, the objective value in (5.8)-(5.9) is unbounded. Corollary 4.4.1 guarantees that for any sequence of actions $\mathbf{u} = (u_1, \ldots, u_T)$, if $p^*_t(u_t | \mathbf{u}_{<t}) > 0$ for all $t \in [T]$, then $\mathbf{u} \in S$. Therefore, the sequence of actions $\mathbf{u}$ given by the PPC is always feasible. β–‘

Proof of Theorem 5.5.1

The action space $\mathcal{U}$ is closed and bounded by Assumption 5. Therefore, there exists a closed ball $B(a)$ of radius $a > 0$ centered at $\mathbf{v} = (v_1, \ldots, v_T) \in \mathcal{U}$ such that $B(a) \subseteq \mathcal{U}$. Consider the following safety constraints for all $t \in [T] \setminus \{1\}$ with memory size one:
$$\mathcal{U}_t := \begin{cases} B(a/2), & \text{if } \|u_{t-1} - v_{t-1}\|_2 \le a \\ \mathcal{U} \setminus B(a), & \text{if } \|u_{t-1} - v_{t-1}\|_2 > a \end{cases} \tag{5.15}$$
where $B(a/2)$ is a closed ball with the same center as $B(a)$ and radius $a/2$. Let the state safety set $\mathcal{X}_t = \mathbb{R}^n$ for all $t \in [T]$ (i.e., no constraints on states). Once the first action $u_1$ at time $t = 1$ is taken, the future actions have to stay in the ball $B(a/2)$ or stay outside $B(a)$. Any deterministic policy at time $t \in [T] \setminus \{1\}$ has to take either $\|u_t - v_t\|_2 \le a/2$ or $\|u_t - v_t\|_2 > a$.

Consider the following Lipschitz continuous cost functions that can be chosen adversarially:
$$c_t(u_t) = \begin{cases} 0, & \text{if } t \le w \\ \|u_t - v_t\|_2, & \text{if } \|u_{t-1} - v_{t-1}\|_2 > a,\ t > w \\ M - \|u_t - v_t\|_2, & \text{if } \|u_{t-1} - v_{t-1}\|_2 \le a/2,\ t > w \end{cases}$$
where $M := \sup_{u \in \mathcal{U}} \|u\|_2$. The construction of $\mathbf{c} = (c_1, \ldots, c_T)$ guarantees that the dynamic regret for any sequence of actions $\mathbf{u}$ given by a deterministic online policy is bounded from below by
$$\mathrm{Regret}(\mathbf{u}) \ge (T - w) \min\left\{ a, M - \frac{a}{2} \right\}.$$

Take π‘Ž = 𝑀/2 and note that 𝑀 = Ξ©(𝑑) where 𝑑 := diam(U) := sup{||π‘’βˆ’π‘£||2 : 𝑒, 𝑣 ∈ U} is the diameter of the action spaceU. Therefore the dynamic regret is bounded from below as

Regret(u)= Ξ©(𝑑(𝑇 βˆ’π‘€))

for any sequence of actionsugiven by a deterministic online policy.
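As a sanity check on the choice $a = M/2$, the value of the lower bound $(T - w)\min\{a, M - a/2\}$ can be computed directly (hypothetical $T$, $w$, $M$):

```python
def regret_lower_bound(T, w, M):
    # Adversarial construction: zero cost for t <= w, then at least
    # min{a, M - a/2} per step; the choice a = M/2 gives (T - w) * M / 2.
    a = M / 2
    return (T - w) * min(a, M - a / 2)

assert regret_lower_bound(100, 10, 2.0) == 90.0          # linear in T for fixed w
assert regret_lower_bound(100, 10, 2.0) == (100 - 10) * 2.0 / 2
```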

Proofs of Theorem 5.5.2 and Corollary 5.5.2

Proof of Theorem 5.5.2. Suppose there exists a sequence of actions $\mathbf{v}_{\le s} \in \mathcal{U}^s$, $1 \le s \le T - k$, $k \ge w$ with $k = \Omega(T)$ and some constant $\alpha > 0$ such that
$$d_H(S_k(\mathbf{u}_{\le s}), S_k(\mathbf{v}_{\le s})) \ge \alpha d k \tag{5.16}$$
for any $\mathbf{u}_{\le s} \in \mathcal{U}^s$. For any safety sets satisfying (5.16), whenever the first $s$ actions $\mathbf{u}_{\le s}$ are taken, the future actions have to stay in $S_k(\mathbf{u}_{\le s})$. The same argument also holds for choosing $\mathbf{v}_{\le s}$. This means that there exist two subsequences $\mathbf{u}_{s+1:s+k}$ and $\mathbf{v}_{s+1:s+k}$ such that
$$\|\mathbf{u}_{s+w:s+k} - \mathbf{v}_{s+w:s+k}\|_2 \ge \|\mathbf{u}_{s+1:s+k} - \mathbf{v}_{s+1:s+k}\|_2 - \|\mathbf{u}_{s+1:s+w-1} - \mathbf{v}_{s+1:s+w-1}\|_2$$
$$\ge \|\mathbf{u}_{s+1:s+k} - \mathbf{v}_{s+1:s+k}\|_2 - \sum_{t=s+1}^{s+w-1} \|u_t - v_t\|_2 \ge \alpha d k - d(w-1) = d(\alpha k - w + 1).$$
Then the adversary can construct Lipschitz continuous cost functions, similar to those used in the proof of Theorem 5.5.1, so that the costs $c_{s+w}, \ldots, c_{s+k}$ satisfy $\sum_{t=s+w}^{s+k} c_t(a_t) = \Omega((k - w + 1)d) = \Omega(d(T-w))$ (since $k = \Omega(T)$) for the case when $a_t = v_t$ for $s < t \le s+k$ and $a_t = v_t$ for $t \le s$; but $c_{s+w} = \cdots = c_{s+k} = 0$ for the case when $a_t = u_t$ for $s < t \le s+k$ and $a_t = v_t$ for $t \le s$, and vice versa. This "switching pattern" attack is possible because the online agent knows nothing about the costs $c_{s+w}, \ldots, c_{s+k}$ at time $s$, and the adversary can design costs freely as long as they satisfy Assumption 5. The bound on the dynamic regret follows since for the optimal actions $\mathbf{u}^* = (u^*_1, \ldots, u^*_T)$, $\sum_{t=s+w}^{s+k} c_t(u^*_t) = 0$ but $\sum_{t=s+w}^{s+k} c_t(a_t) = \Omega(d(T-w))$ for any actions $\mathbf{a}$ generated by a deterministic online policy. β–‘

Proof of Corollary 5.5.2. Apply Theorem 5.5.2, noting that $\alpha = \Omega(T)$ and that at each time $t$ the received MEF functions $p^*_t, \ldots, p^*_{t+w-1}$ equivalently form a joint density function on $\mathbf{u}_{t+1:t+w}$, which provides less information than knowing the offline safety sets. Since $\overline{\mathbf{u}}$ maximizes $\mu(S(\cdot))$,
$$\frac{\mu(S(\mathbf{u}_{\le s}))}{\mu(S(\mathbf{v}_{\le s}))} \ge \frac{\mu(S(\mathbf{u}_{\le s}))}{\mu(S(\overline{\mathbf{u}}_{\le s}))} > \xi,$$
it follows that
$$\left( \frac{|\mu(S(\mathbf{u}_{\le s})) - \xi\, \mu(S(\mathbf{v}_{\le s}))|}{\mu(B)} \right)^{\frac{1}{(T-s)m}} = \Omega(1).$$
Hence, the same argument as in the proof of Theorem 5.5.2 applies and we obtain the desired regret lower bound. β–‘

Proof of Theorem 5.5.3

In this appendix, we prove Theorem 5.5.3 in four steps. First, in Lemma 20, we bound the deviation of the density function (feedback) evaluated at the actions selected by the PPC from the largest density function value, given the previous actions selected by the PPC; the deviation decreases with the tuning parameter $\beta > 0$. Next, using the bound obtained from the first step, in Lemma 21 we show that the ratio of the volume of the set of feasible actions, given the previous actions selected by the PPC, to the volume of the largest feasible set is bounded from below by an exponential factor whose deviation from one also decreases with $\beta$. Next, we bound the difference between the partial cost induced by a subsequence of feasible actions and the corresponding offline optimal cost. Finally, combining the partial costs, we bound the total cost, which leads to an upper bound on the dynamic regret.

Step 1. Bound the feedback deviation

Recall that $t' := \min\{t + w - 1, T\}$. For every $t \in \mathcal{I}$, let $\overline{\mathbf{u}}_{t:t'} = (\overline{u}_t, \ldots, \overline{u}_{t'})$ be a subsequence of actions that maximizes the penalty term:
$$\overline{\mathbf{u}}_{t:t'} := \arg\sup_{\mathbf{u} \in \mathcal{U}^{t'-t+1}} p^*_{t:t'}(\mathbf{u} | \mathbf{u}_{<t}) = \arg\sup_{\mathbf{u} \in \mathcal{U}^{t'-t+1}} \mu(S((\mathbf{u}_{<t}, \mathbf{u}))).$$

Before proceeding to the proof of Theorem 5.5.3, we first show the following lemma, which gives a lower bound on the feedback evaluated at the sequence of actions selected by the PPC. Note that when $t - 1 < 1$, the density functions are unconditional.

Lemma 20. For any $t \in \mathcal{I}$, the sequence of actions $\mathbf{u} = (\mathbf{u}_{\le t'})$ selected by the PPC satisfies
$$\frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})} \ge \exp(-L_c w d / \beta)$$
where $L_c$ is the Lipschitz constant and $d := \mathrm{diam}(\mathcal{U}) = \sup\{\|u - v\|_2 : u, v \in \mathcal{U}\}$ is the diameter of the action space $\mathcal{U}$.

Proof of Lemma 20. First, we note that the actions $\mathbf{u}_{t:t'}$ chosen by the PPC must satisfy
$$\sum_{\tau=t}^{t'} \beta \left( \log p^*_\tau(\overline{u}_\tau | \overline{\mathbf{u}}_{<\tau}) - \log p^*_\tau(u_\tau | \mathbf{u}_{<\tau}) \right) \le \sum_{\tau=t}^{t'} \left( c_\tau(\overline{u}_\tau) - c_\tau(u_\tau) \right) \tag{5.17}$$
where $\overline{\mathbf{u}}_{<\tau} := (\mathbf{u}_{<t}, \overline{\mathbf{u}}_{t:\tau-1})$. To see this, suppose (5.17) does not hold. Then choosing $\overline{\mathbf{u}}_{t:t'}$ gives a smaller objective value in (5.8)-(5.9), which contradicts the definition of the PPC-generated actions $\mathbf{u}_{t:t'}$. Using the chain rule, (5.17) becomes
$$\log p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t}) \le \log p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t}) + \frac{1}{\beta} \sum_{\tau=t}^{t'} \left( c_\tau(\overline{u}_\tau) - c_\tau(u_\tau) \right).$$
Therefore, since the cost functions are Lipschitz continuous,
$$p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t}) \le \exp\left( \frac{1}{\beta} \sum_{\tau=t}^{t'} \left( c_\tau(\overline{u}_\tau) - c_\tau(u_\tau) \right) \right) p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t}) \le \exp\left( \frac{L_c w d}{\beta} \right) p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t}).$$
β–‘

Step 2. Bound the Lebesgue measure deviation

Based on Lemma 20, the following lemma holds. For every $t \in \mathcal{I}$, let $\overline{\mathbf{u}}_{<t'} = (\overline{u}_1, \ldots, \overline{u}_{t'-1})$ be a subsequence of actions maximizing the volume of the set of feasible actions:
$$\overline{\mathbf{u}}_{<t'} := \arg\sup_{\mathbf{u} \in \mathcal{U}^{t'-1}} \mu(S(\mathbf{u})).$$

Lemma 21. Suppose the safety sets are $(w, \delta, \lambda)$-causally invariant. For any $t \in \mathcal{I}$, the actions selected by the PPC satisfy
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S(\overline{\mathbf{u}}_{<t'}))} \ge \exp(-L_c t d / \beta)\, \lambda^{-\lceil t/w \rceil}$$
where $L_c > 0$ is the Lipschitz constant and $d := \sup\{\|u - v\|_2 : u, v \in \mathcal{U}\}$ is the diameter of the action space $\mathcal{U}$.

Proof. For any $t \in \mathcal{I}$, since the safety sets are $(w, \delta, \lambda)$-causally invariant and $\mu(S(\overline{\mathbf{u}}_{<t'})) = \mu(S(\overline{\mathbf{u}}_{\le t'}))$,
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S(\overline{\mathbf{u}}_{<t'}))} \ge \frac{1}{\lambda} \left( \frac{\mu(S(\mathbf{u}_{<t}))}{\mu(S(\overline{\mathbf{u}}_{<t}))} \right)^{\frac{T-t'+1}{T-t+1}} \cdot \frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S((\mathbf{u}_{<t}, \overline{\mathbf{u}}_{t:t'})))}. \tag{5.18}$$
Applying the explicit expression in Lemma 17, Lemma 20 implies that
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S((\mathbf{u}_{<t}, \overline{\mathbf{u}}_{t:t'})))} = \frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})} \ge \exp(-L_c w d / \beta).$$
Therefore, applying (5.18) recursively leads to, for any $t \in \mathcal{I}$,
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S(\overline{\mathbf{u}}_{<t'}))} \ge \exp(-L_c t d / \beta)\, \lambda^{-\lceil t/w \rceil}.$$
β–‘
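The recursion compounds one factor $\exp(-L_c w d/\beta)/\lambda$ per window of length $w$; when $w$ divides $t$, the $\lceil t/w \rceil$ compounded factors reproduce the stated bound exactly. A quick numeric check with hypothetical constants:

```python
import math

L_c, d, beta, lam, w, t = 2.0, 1.5, 10.0, 1.3, 5, 20   # here w divides t

blocks = math.ceil(t / w)                        # number of length-w windows
per_block = math.exp(-L_c * w * d / beta) / lam  # factor gained per window
compound = per_block ** blocks

stated = math.exp(-L_c * t * d / beta) * lam ** (-math.ceil(t / w))
assert math.isclose(compound, stated)
```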

Step 3. Bound the partial cost deviation

Lemma 21 implies that for the offline optimal solution $\mathbf{u}^*_{t':t'+w} := (u^*_{t'}, \ldots, u^*_{t'+w})$,
$$\mu(S(\mathbf{u}_{\le t'})) \ge \exp(-L_c t d/\beta)\, \lambda^{-2t/w}\, \mu(S(\overline{\mathbf{u}}_{<t'})) \ge \exp(-L_c t d/\beta)\, \lambda^{-2t/w}\, \mu(S(\mathbf{u}^*_{<t'})),$$
which leads to
$$\mu(S(\mathbf{u}^*_{<t'})) - \mu(S(\mathbf{u}_{\le t'})) \le \left( 1 - \exp\left( -\frac{2t}{w} \log \lambda - \frac{L_c t d}{\beta} \right) \right) \mu(S(\mathbf{u}^*_{<t'})) \le \left( \frac{L_c t d}{\beta} + \frac{2t}{w} \log \lambda \right) \mu(S(\mathbf{u}^*_{<t'})) \le \left( \frac{L_c t d}{\beta} + \frac{2t}{w} \log \lambda \right) \mu(B) \left( \frac{d}{2} \right)^{(T-t'+1)m}$$
where we have used the inequality $e^x \ge 1 + x$ for all $x \in \mathbb{R}$. The last inequality follows from the isodiametric inequality, with $B$ denoting the unit ball in $\mathbb{R}^{(T-t'+1)m}$. Since the safety sets are $(w, \delta, \lambda)$-causally invariant, it follows that for any $t \in \mathcal{I}$,
$$\|\mathbf{u}^*_{t:t'} - \widehat{\mathbf{u}}_{t:t'}\|_2 \le d_H\left( S_w(\mathbf{u}^*_{<t}), S_w(\mathbf{u}_{<t}) \right) \le \delta d \left( \frac{L_c t d}{\beta} + \frac{2t}{w} \log \lambda \right)^{\frac{1}{(T-t+1)m}}$$
for some $\widehat{\mathbf{u}}_{t:t'} \in S_w(\mathbf{u}_{<t})$ satisfying $p^*_{t:t'}(\widehat{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t}) > 0$.

Next, we consider the objective for the PPC. For any $t \in \mathcal{I}$, denote by $C(\mathbf{u}_{t:t'}) := \sum_{\tau=t}^{t'} c_\tau(u_\tau)$ the cost for the time slots between $t$ and $t'$. Since the costs are Lipschitz continuous,
$$C(\widehat{\mathbf{u}}_{t:t'}) - C(\mathbf{u}^*_{t:t'}) \le \sum_{\tau=t}^{t'} \left| c_\tau(\widehat{u}_\tau) - c_\tau(u^*_\tau) \right| \le \sum_{\tau=t}^{t'} L_c \|u^*_\tau - \widehat{u}_\tau\|_2 \le \sqrt{w}\, L_c \|\mathbf{u}^*_{t:t'} - \widehat{\mathbf{u}}_{t:t'}\|_2 \le d \delta L_c \left( \frac{L_c t d}{\sqrt{w}\, \beta} + \frac{2t}{\sqrt{w}} \log \lambda \right)^{\frac{1}{(T-t+1)m}}. \tag{5.19}$$
Since $\mathbf{u}_{t:t'}$ is a minimizer of (5.8)-(5.9), we obtain
$$C(\mathbf{u}_{t:t'}) - \sum_{\tau=t}^{t'} \beta \log p_\tau(u_\tau | \mathbf{u}_{<\tau}) \le C(\widehat{\mathbf{u}}_{t:t'}) - \sum_{\tau=t}^{t'} \beta \log p_\tau(\widehat{u}_\tau | \widehat{\mathbf{u}}_{<\tau}),$$
which implies
$$C(\mathbf{u}_{t:t'}) \le C(\widehat{\mathbf{u}}_{t:t'}) + \beta \log \frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\widehat{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})}. \tag{5.20}$$

Step 4. Bound the dynamic regret $\mathrm{Regret}(\mathbf{u})$

Proof of Theorem 5.5.3. For the total cost, it follows that
$$C_T(\mathbf{u}) = \sum_{t \in \mathcal{I}} C(\mathbf{u}_{t:t'}) \le \sum_{t \in \mathcal{I}} \left( C(\widehat{\mathbf{u}}_{t:t'}) + \beta \log \frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\widehat{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})} \right) \le \underbrace{\sum_{t \in \mathcal{I}} C(\widehat{\mathbf{u}}_{t:t'})}_{:=(a)} + \beta \underbrace{\sum_{t \in \mathcal{I}} \log \frac{\mu(S((\mathbf{u}_{<t}, \mathbf{u}_{t:t'})))}{\mu(S((\mathbf{u}_{<t}, \widehat{\mathbf{u}}_{t:t'})))}}_{:=(b)}.$$

For (a), plugging in (5.19), the total cost for the sequence $\widehat{\mathbf{u}}$ is bounded from above by
$$\sum_{t \in \mathcal{I}} C(\widehat{\mathbf{u}}_{t:t'}) \le C^*_T + \sum_{t \in \mathcal{I}} \left( C(\widehat{\mathbf{u}}_{t:t'}) - C(\mathbf{u}^*_{t:t'}) \right) \le C^*_T + \sum_{k=1}^{2T/w} d \delta L_c \left( \frac{L_c k d}{\sqrt{w}\, \beta} + \frac{k}{\sqrt{w}} \log \lambda \right)^{\frac{1}{(T-kw+1)m}} = C^*_T + O\left( \frac{T}{\sqrt{w}}\, d \delta \log \lambda + \frac{d^2 \delta}{\beta} \frac{T}{\sqrt{w}} \right)$$
where $C^*_T := \sum_{t=1}^T c_t(u^*_t)$ is the optimal total cost. Finally, (b) becomes $O(T\beta/w)$ since the safety sets are atomic and $\mathcal{U}$ is bounded by Assumptions 6 and 7. Rearranging the terms,
$$\mathrm{Regret}(\mathbf{u}) = O\left( T d \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{d \delta}{\sqrt{w}\, \beta} + \frac{\beta}{d w} \right) \right).$$
Setting $\beta = d \sqrt{\delta}\, w^{3/4}$ immediately implies
$$\mathrm{Regret}(\mathbf{u}) = O\left( T d \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{\sqrt{\delta}}{w^{1/4}} \right) \right).$$
β–‘
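To visualize how the bound in Theorem 5.5.3 improves with the prediction window, the sketch below evaluates $B(w) = Td(\delta \log\lambda/\sqrt{w} + \sqrt{\delta}/w^{1/4})$ for hypothetical constants and checks that it decreases in $w$ and is sub-linear in $T$ once $w$ grows with $T$:

```python
import math

def regret_upper_bound(T, d, delta, lam, w):
    # Bound from Theorem 5.5.3 (constants suppressed inside O(.)).
    return T * d * (delta * math.log(lam) / math.sqrt(w)
                    + math.sqrt(delta) / w ** 0.25)

T, d, delta, lam = 10_000, 1.0, 1.0, math.e       # hypothetical constants
vals = [regret_upper_bound(T, d, delta, lam, w) for w in (4, 16, 64, 256)]
assert all(v2 < v1 for v1, v2 in zip(vals, vals[1:]))   # decreasing in w

# With w growing polynomially in T, the bound drops below the trivial O(dT).
assert regret_upper_bound(T, d, delta, lam, int(math.sqrt(T))) < T * d
```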


In the document Learning-Augmented Control and Decision-Making (pages 165-179)