
LEARNING-BASED PREDICTIVE CONTROL: REGRET ANALYSIS

5.5 Results

Causally Invariant Safety Constraints

Motivated by the lower bound in the previous section, we now introduce a particular class of safety constraints where it is possible to have better performance. The class is defined by a form of causal invariance that is intuitive and general. Specifically, we state a condition under which the sets of subsequent feasible actions do not change too much if the measures of the sets are close.

We define the following specific sequences of actions. Let $\overline{\mathbf{u}}_{\le t} = (\overline{u}_1, \ldots, \overline{u}_t)$ be a subsequence of actions that maximizes the volume of the set of feasible actions, defined as $\overline{\mathbf{u}}_{\le t} := \arg\sup_{\mathbf{u} \in \mathcal{U}^t} \mu(S(\mathbf{u}))$. With a slight abuse of notation, given $\mathbf{u}_{\le t}$, define the length-$k$ maximizing subsequence of actions as $\overline{\mathbf{u}}_{t+1:t+k} := \arg\sup_{\mathbf{u} \in \mathcal{U}^k} \mu(S_k(\mathbf{u}_{\le t}, \mathbf{u}))$.

Definition 5.5.1 ($(k, \delta, \lambda)$-causal invariance). The safety sets are $(k, \delta, \lambda)$-causally invariant if there exist constants $\delta, \lambda > 0$ such that the following holds:

1. For all $t \in [T]$ and sequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$,
$$d_H(S_k(\mathbf{u}_{\le t}), S_k(\mathbf{v}_{\le t})) \le \delta \left( \frac{|\mu(S(\mathbf{u}_{\le t})) - \mu(S(\mathbf{v}_{\le t}))|}{\mu(B)} \right)^{\frac{1}{(T-t)m}} \tag{5.10}$$
where $B$ denotes the unit ball in $\mathbb{R}^{m \times (T-t)}$.

2. For all $t \in [T]$ and sequences of actions $\mathbf{u}_{\le t}$,
$$\left( \frac{\mu(S(\mathbf{u}_{\le t}))}{\mu(S(\overline{\mathbf{u}}_{\le t}))} \right)^{\frac{1}{T-t}} \le \lambda \left( \frac{\mu(S((\mathbf{u}_{\le t}, \overline{\mathbf{u}}_{t+1:t+k})))}{\mu(S(\overline{\mathbf{u}}_{\le t+k}))} \right)^{\frac{1}{T-t-k}}. \tag{5.11}$$

Note that the definition of causal invariance is independent of the costs. Condition (1) is an inverse of the isodiametric inequality. It says that if the two subsequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$ do not affect the measure of the space of feasible sequences of actions too much, then the Hausdorff distance between the two sets $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ is also small. Condition (2) states that, given any previous actions $\mathbf{u}_{\le t}$, the normalized measure of the space of feasible sequences of actions does not differ too much from the corresponding normalized measure obtained by fixing the next $k$ actions to be those preserving the most future flexibility.

Definition 5.5.1 is general and practically applicable, as the following examples show.

Example 4 (Inventory constraints). Consider the following set of inventory constraints, which is a common form of constraints in energy storage and demand response problems [118, 127, 142]. The summation of the squares of the $\ell_2$ norms of the actions is bounded from above by $\gamma > 0$, representing the limited resources of the system: $\sum_{t=1}^T \|u_t\|_2^2 \le \gamma$ where $u_t \in \mathbb{R}^m$. These inventory constraints are $(k, 1, 1)$-causally invariant for all $1 \le k \le T$.
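The $\lambda = 1$ claim can be checked numerically: the remaining feasible continuations form a Euclidean ball, whose measure is available in closed form. The sketch below (with hypothetical parameter values) verifies the equality behind (5.11):

```python
import math

def ball_volume(n, r):
    # Volume of the Euclidean n-ball of radius r.
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

def remaining_measure(T, m, gamma, spent, t):
    # mu(S(u_{<=t})): after the prefix has spent `spent` units of budget,
    # the feasible continuations form a ((T - t) * m)-dim ball of radius
    # sqrt(gamma - spent).
    return ball_volume((T - t) * m, math.sqrt(gamma - spent))

# Hypothetical instance: T = 10 steps, m = 2, budget gamma = 5.0,
# an arbitrary prefix of length t = 3 that spent 1.2 units, and k = 4.
T, m, gamma, t, k, spent = 10, 2, 5.0, 3, 4, 1.2

# LHS of (5.11): ratio against the all-zero (flexibility-maximizing) prefix.
lhs = (remaining_measure(T, m, gamma, spent, t)
       / remaining_measure(T, m, gamma, 0.0, t)) ** (1 / (T - t))
# RHS with lambda = 1: append k flexibility-maximizing (zero) actions.
rhs = (remaining_measure(T, m, gamma, spent, t + k)
       / remaining_measure(T, m, gamma, 0.0, t + k)) ** (1 / (T - t - k))

assert math.isclose(lhs, rhs)  # equality, so (5.11) holds with lambda = 1
assert math.isclose(lhs, (1 - spent / gamma) ** (m / 2))
```

Both sides collapse to $(1 - \sum_\tau \|u_\tau\|_2^2/\gamma)^{m/2}$, independent of $k$, which is why $\lambda = 1$ suffices.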

Example 5 (Tracking constraints). The following form of tracking constraints is common in situations where the system has a target "optimal" resource configuration that varies depending on the environment [116, 137, 143]. The agent seeks to track a fixed sequence of nominal actions $\mathbf{y} := (y_1, \ldots, y_T)$. The constraints are $\sum_{t=1}^T |u_t - y_t|^p \le \sigma$ with $p \ge 2$, where $u_t, y_t \in \mathbb{R}$ and $\sigma > 0$ represents the system's adjusting ability. These tracking constraints are $(k, 2/\sqrt{\pi}, 1)$-causally invariant for all $1 \le k \le T$.

Additionally, to highlight structures that are not causally invariant, we return to the construction of constraints used in the proof of the lower bound in Theorem 5.5.1.

Example 6 (Constraints in the lower bound (Theorem 5.5.1)). Consider the following set of constraints defined in the proof of Theorem 5.5.1 for obtaining a lower bound $\Omega(d(T-w))$ on the dynamic regret: $S := \{\mathbf{u} \in \mathcal{U}^T : u_t \in \mathcal{U}_t, \forall t \in [T]\}$ where
$$\mathcal{U}_t := \begin{cases} B(a/2), & \text{if } \|u_{t-1} - v_{t-1}\|_2 \le a \\ \mathcal{U} \setminus B(a), & \text{if } \|u_{t-1} - v_{t-1}\|_2 > a \end{cases} \tag{5.12}$$
for some $a > 0$. Suppose the action space $\mathcal{U} \subseteq \mathbb{R}^m$ is closed and bounded. These constraints are not causally invariant for $k = \Omega(T)$. The causal invariance criterion is violated for the safety constraints in (5.12) because, in order to satisfy Condition (1) in (5.10), it is necessary to have $\delta = \Omega(k) = \Omega(T)$, which is not a constant.

Motivated by the example above, and mimicking the adversarial construction in Theorem 5.5.1, we can derive the following theorem and corollary, whose proofs are in Appendix 5.B.

Theorem 5.5.2 (Lower bound on Regret). If there exists a sequence of actions $\mathbf{v}_{\le s} \in \mathcal{U}^s$, $1 \le s \le T - k$, $k \ge w$ with $k = \Omega(T)$, and some constant $\alpha > 0$ such that $d_H(S_k(\mathbf{u}_{\le s}), S_k(\mathbf{v}_{\le s})) \ge \alpha d k$ for any $\mathbf{u}_{\le s} \in \mathcal{U}^s$, then for any sequence of actions $\mathbf{u} \in S$ generated by a deterministic online policy that can access the safety sets $\{\mathcal{U}_t : t \in [T]\}$ and $\{\mathcal{X}_t : t \in [T]\}$, $\mathrm{Regret}(\mathbf{u}) = \Omega(d(T-w))$, where $d$ is the diameter of the action space $\mathcal{U}$, $w$ is the prediction window size and $T$ is the total number of time slots.

The theorem above states that, for any online policy knowing the safety sets $\{\mathcal{U}_t : t \in [T]\}$ and $\{\mathcal{X}_t : t \in [T]\}$ in advance, if there are two sets of length-$k$ subsequences of actions $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ that are far from each other in terms of the Hausdorff distance, then a sub-linear regret is impossible. This highlights the necessity of the causal invariance condition. We make this explicit by further restricting the power of the online policy, assuming that it can only access the MEF $p_t, \ldots, p_{t+w-1}$ at time $t$, which yields the following impossibility result as a corollary of Theorem 5.5.2.

Corollary 5.5.2 (Lower bound on Regret). If there exist a sequence of actions $\mathbf{v}_{\le s} \in \mathcal{U}^s$, $1 \le s \le T - k$, $k \ge w$ with $k = \Omega(T)$, a constant $\xi > 0$ and $\alpha = \Omega(T)$ such that
$$\mu(S(\mathbf{u}_{\le s})) > \xi\, \mu(S(\overline{\mathbf{u}}_{\le s})), \qquad d_H(S_k(\mathbf{u}_{\le s}), S_k(\mathbf{v}_{\le s})) \ge \alpha \left( \frac{|\mu(S(\mathbf{u}_{\le s})) - \xi\, \mu(S(\mathbf{v}_{\le s}))|}{\mu(B)} \right)^{\frac{1}{(T-s)m}}$$
for any $\mathbf{u}_{\le s} \in \mathcal{U}^s$, then for any sequence of actions $\mathbf{u} \in S$ generated by a deterministic online policy that can access the MEF $p_t, \ldots, p_{t+w-1}$ at each time $t \in [T]$, $\mathrm{Regret}(\mathbf{u}) = \Omega(d(T-w))$, where $d$ is the diameter of the action space $\mathcal{U}$, $w$ is the prediction window size and $T$ is the total number of time slots.

The corollary above indicates that an upper bound on the Hausdorff distance between $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ (such as Conditions (1) and (2) in the causal invariance criterion) is necessary to make the dynamic regret sub-linear. In the next section, however, we show that if the causal invariance criterion holds with both $\delta$ and $\lambda$ constants, then the dynamic regret can be made sub-linear with sufficiently many predictions, even for arbitrary Lipschitz continuous cost functions.

Bounding the Dynamic Regret of PPC

We are now ready to present our main result, which bounds the dynamic regret by a decreasing function of the prediction window size under the assumption that the safety sets are causally invariant.

Theorem 5.5.3 (Upper bound on Regret). Suppose the safety sets are $(w, \delta, \lambda)$-causally invariant. The dynamic regret for the sequence of actions $\mathbf{u}$ given by PPC is bounded from above by
$$\mathrm{Regret}(\mathbf{u}) = O\left( dT \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{\sqrt{\delta}}{w^{1/4}} \right) \right)$$
where $d$ denotes the diameter of the action space $\mathcal{U}$, $w$ is the prediction window size and $T$ is the total number of time slots.

This theorem implies that, with additional assumptions on the safety constraints, a sub-linear (in $T$) dynamic regret is achievable, given a sufficiently large prediction window size $w = \omega(1)$ (in $T$). Notably, Theorem 5.5.1 implies that for the worst-case costs and constraints, a deterministic online controller that has full information of the constraints suffers from a linear regret. As a comparison, Theorem 5.5.3 shows that, under additional assumptions, even if only aggregated information is available, PPC achieves a sub-linear regret. This does not contradict the lower bound on the dynamic regret in Theorem 5.5.1, since without regularity assumptions on the safety sets, in the worst case the Hausdorff distance satisfies $d_H(S_w(\mathbf{u}_{\le t}), S_w(\mathbf{v}_{\le t})) = O(dw)$.

This implies that a trivial upper bound of $O(dT)$ holds. Additionally, note that there is a trade-off between flexibility and optimality when selecting the tuning parameter $\beta > 0$. On the one hand, if the tuning parameter is too small, the algorithm is greedy and may suffer losses in the future; on the other hand, if the tuning parameter is too large, the algorithm is dominated by the feedback penalty and is therefore far from optimal.
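This trade-off can be illustrated with a deliberately simple one-step toy problem (not the PPC algorithm itself, and all quantities here are hypothetical): suppose committing to a larger action $u$ lowers the immediate cost $1 - u$ but also shrinks the remaining feasible volume $1 - u$. Minimizing cost minus $\beta$ times the log-volume penalty over a grid shows how $\beta$ controls greediness:

```python
import math

def toy_ppc_choice(beta, grid_n=1000):
    """Minimize (1 - u) - beta * log(1 - u) over a grid of u in [0, 0.99)."""
    best_u, best_val = 0.0, float("inf")
    for i in range(grid_n):
        u = 0.99 * i / grid_n
        immediate_cost = 1.0 - u       # greedy: larger u is cheaper now
        flexibility = 1.0 - u          # volume of remaining feasible actions
        val = immediate_cost - beta * math.log(flexibility)
        if val < best_val:
            best_u, best_val = u, val
    return best_u

# Small beta -> greedy (commits aggressively); large beta -> conservative.
assert toy_ppc_choice(0.05) > toy_ppc_choice(0.5)
```

Here the unconstrained optimizer is $u = 1 - \beta$, matching the qualitative behavior described above.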

A proof is presented in Appendix 5.B. Briefly, Theorem 5.5.3 is proven by optimizing $\beta$ in an upper bound $\mathrm{Regret}(\mathbf{u}) = O\left( Td \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{d\delta}{\sqrt{w}\,\beta} + \frac{\beta}{dw} \right) \right)$, which holds for any sequence of actions $\mathbf{u}$ given by the PPC with any tuning parameter $\beta > 0$.

APPENDIX

5.A Explanation of Definition 5.5.1 and Two Examples

Figure 5.A.1: Graphical illustration of (5.10) in Definition 5.5.1.

Example 1: Inventory constraints.

Recall that the feasible set of actions for the inventory constraints is
$$S = \left\{ \mathbf{u} \in \mathbb{R}^{T \times m} : \sum_{t=1}^T \|u_t\|_2^2 \le \gamma \right\}.$$

The sequence of actions $\overline{\mathbf{u}}_{\le t}$ maximizing the size of the set of admissible actions is the all-zero vector. Hence,

πœ‡(S(u≀𝑑)) πœ‡(S(u≀𝑑))

π‘‡βˆ’π‘‘1

= 1βˆ’ Í𝑑

𝜏=1||π‘’πœ||2

2

𝛾

!π‘š2

=

πœ‡(S( (u≀𝑑,u𝑑+1:𝑑+π‘˜))) πœ‡(S(u<𝑑+π‘˜))

π‘‡βˆ’π‘‘βˆ’π‘˜1 . Therefore settingπœ†=1, (5.11) is satisfied.

It remains to check the bound on the Hausdorff distance. Figure 5.A.1 shows the idea behind the definition. If the two sets $S(\mathbf{v}_{\le t})$ and $S(\mathbf{u}_{\le t})$ are close to each other, the Hausdorff distance of the projected sub-spaces $S_k(\mathbf{u}_{\le t})$ and $S_k(\mathbf{v}_{\le t})$ can also be bounded. For inventory constraints, this is indeed the case. For all $1 \le k \le T$ and sequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$,

𝑑H(Sπ‘˜(u≀𝑑),Sπ‘˜(v≀𝑑)) =

𝑑

βˆ‘οΈ

𝜏=1

||π‘’πœ||22βˆ’

𝑑

βˆ‘οΈ

𝜏=1

||π‘£πœ||22

!1/2

(5.13)

and

πœ‡(B)(π‘‡βˆ’2𝑑)π‘š

𝑑

βˆ‘οΈ

𝜏=1

||π‘’πœ||22βˆ’

𝑑

βˆ‘οΈ

𝜏=1

||π‘£πœ||22

=

πœ‡(S(v≀𝑑))(π‘‡βˆ’2𝑑)π‘š βˆ’ πœ‡(S(u≀𝑑))(π‘‡βˆ’2𝑑)π‘š

≀ |πœ‡(S(v≀𝑑)) βˆ’πœ‡(S(u≀𝑑)) |(π‘‡βˆ’π‘‘)π‘š2 . (5.14) Therefore setting𝛿 =1, (5.13) and (5.14) imply (5.10). We validate that the inventory constraints in Example 4 are(π‘˜ ,1,1)-causally invariantfor all 1 ≀ π‘˜ ≀𝑇.

Example 2: Tracking constraints.

Recall that the feasible set of actions for the tracking constraints is (with $p \ge 2$):
$$S = \left\{ \mathbf{u} \in \mathbb{R}^T : \|\mathbf{u} - \mathbf{y}\|_p^p \le \sigma \right\}.$$

The sequence of actions maximizing the size of the set of admissible actions is $\overline{\mathbf{u}}_{\le t} = \mathbf{y}_{\le t}$. Similar to Example 4, according to the formula for the volume of an $\ell_p$-ball [144], we have
$$\left( \frac{\mu(S(\mathbf{u}_{\le t}))}{\mu(S(\overline{\mathbf{u}}_{\le t}))} \right)^{\frac{1}{T-t}} = \left( 1 - \frac{\|\mathbf{u}_{\le t} - \mathbf{y}_{\le t}\|_p^p}{\sigma} \right)^{\frac{1}{p}} = \left( \frac{\mu(S((\mathbf{u}_{\le t}, \overline{\mathbf{u}}_{t+1:t+k})))}{\mu(S(\overline{\mathbf{u}}_{\le t+k}))} \right)^{\frac{1}{T-t-k}}.$$
Therefore, setting $\lambda = 1$, (5.11) is satisfied. Next, we give a bound on the Hausdorff distance. For all $1 \le k \le T$ and sequences of actions $\mathbf{u}_{\le t}$ and $\mathbf{v}_{\le t}$,

$$\left( \frac{(2\Gamma(1+1/p))^{T-t}}{\Gamma((T-t)/p + 1)} \right)^{\frac{1}{T-t}} d_H(S_k(\mathbf{u}_{\le t}), S_k(\mathbf{v}_{\le t})) \le \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \left( \frac{\pi^{(T-t)/2}}{\Gamma((T-t)/2 + 1)} \right)^{\frac{1}{T-t}} \left| \sum_{\tau=1}^t |u_\tau - y_\tau|^p - \sum_{\tau=1}^t |v_\tau - y_\tau|^p \right|$$
$$= \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \mu(B)^{\frac{1}{T-t}} \left| \sum_{\tau=1}^t |u_\tau - y_\tau|^p - \sum_{\tau=1}^t |v_\tau - y_\tau|^p \right| \le \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \left| \mu(S(\mathbf{v}_{\le t})) - \mu(S(\mathbf{u}_{\le t})) \right|^{\frac{1}{T-t}}$$
where $\Gamma(\cdot)$ is Euler's gamma function. Therefore, setting $\delta = \frac{2\Gamma(1+1/p)}{\sqrt{\pi}} \le \frac{2}{\sqrt{\pi}}$ for all $p \ge 2$, (5.10) holds. The tracking constraints in Example 5 are $(k, 2/\sqrt{\pi}, 1)$-causally invariant for all $1 \le k \le T$.

Example 3: Constraints in the Proof of Theorem 5.5.1.

Denote by $A_1 := B(a/2)$ and $A_2 := \mathcal{U} \setminus B(a)$. The feasible length-$k$ subsequences of actions lie either in the Cartesian product $A_1^k := A_1 \times \cdots \times A_1$ or in $A_2^k := A_2 \times \cdots \times A_2$. If the two sequences $\mathbf{u}_{<t}$ and $\mathbf{v}_{<t}$ are in the same set, then $\delta = \lambda = 1$; otherwise, the RHS of (5.10) becomes a constant term. Assuming $t + k \le T$, the Hausdorff distance between $S_k(\mathbf{u}_{<t}) = A_1^k$ and $S_k(\mathbf{v}_{<t}) = A_2^k$ is $\Omega(k)$. Therefore, a non-constant parameter $\delta = \Omega(k)$ is necessary for (5.10) to hold.

5.B Proofs

Proofs of Corollary 5.5.1

Proof of Corollary 5.5.1. The explicit expression in Lemma 17 ensures that whenever $p^*_t(u_t | \mathbf{u}_{<t}) > 0$, there is always a feasible sequence of actions in $S(\mathbf{u}_{<t})$. Now, if the tuning parameter $\beta > 0$, then the optimization (5.8)-(5.9) guarantees that $p^*_t(u_t | \mathbf{u}_{<t}) > 0$ for all $t \in [T]$; otherwise, the objective value in (5.8)-(5.9) is unbounded. Corollary 4.4.1 guarantees that for any sequence of actions $\mathbf{u} = (u_1, \ldots, u_T)$, if $p^*_t(u_t | \mathbf{u}_{<t}) > 0$ for all $t \in [T]$, then $\mathbf{u} \in S$. Therefore, the sequence of actions $\mathbf{u}$ given by the PPC is always feasible. β–‘

Proof of Theorem 5.5.1

The action space $\mathcal{U}$ is closed and bounded by Assumption 5. Therefore, there exists a closed ball $B(a)$ of radius $a > 0$ centered at $\mathbf{v} = (v_1, \ldots, v_T) \in \mathcal{U}$ such that $B(a) \subseteq \mathcal{U}$. Consider the following safety constraints for all $t \in [T] \setminus \{1\}$ with memory size one:
$$\mathcal{U}_t := \begin{cases} B(a/2), & \text{if } \|u_{t-1} - v_{t-1}\|_2 \le a \\ \mathcal{U} \setminus B(a), & \text{if } \|u_{t-1} - v_{t-1}\|_2 > a \end{cases} \tag{5.15}$$
where $B(a/2)$ is a closed ball with the same center as $B(a)$ and radius $a/2$. Let the state safety set $\mathcal{X}_t = \mathbb{R}^n$ for all $t \in [T]$ (i.e., no constraints on states). Once the first action $u_1$ at time $t = 1$ is taken, the future actions have to stay in the ball $B(a/2)$ or stay outside $B(a)$. Any deterministic policy at time $t \in [T] \setminus \{1\}$ has to take either $\|u_t - v_t\|_2 \le a/2$ or $\|u_t - v_t\|_2 > a$.

Consider the following Lipschitz continuous cost functions that can be chosen adversarially:
$$c_t(u_t) = \begin{cases} 0, & \text{if } t \le w \\ \|u_t - v_t\|_2, & \text{if } \|u_{t-1} - v_{t-1}\|_2 > a,\ t > w \\ M - \|u_t - v_t\|_2, & \text{if } \|u_{t-1} - v_{t-1}\|_2 \le a/2,\ t > w \end{cases}$$
where $M := \sup_{u \in \mathcal{U}} \|u\|_2$. The construction of $\mathbf{c} = (c_1, \ldots, c_T)$ guarantees that the dynamic regret for any sequence of actions $\mathbf{u}$ given by a deterministic online policy is bounded from below by
$$\mathrm{Regret}(\mathbf{u}) \ge (T - w) \min\left\{ a, M - \frac{a}{2} \right\}.$$

Take π‘Ž = 𝑀/2 and note that 𝑀 = Ξ©(𝑑) where 𝑑 := diam(U) := sup{||π‘’βˆ’π‘£||2 : 𝑒, 𝑣 ∈ U} is the diameter of the action spaceU. Therefore the dynamic regret is bounded from below as

Regret(u)= Ξ©(𝑑(𝑇 βˆ’π‘€))

for any sequence of actionsugiven by a deterministic online policy.
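As a sanity check on the choice $a = M/2$, the value of the lower bound $(T - w)\min\{a, M - a/2\}$ can be computed directly (hypothetical $T$, $w$, $M$):

```python
def regret_lower_bound(T, w, M):
    # Adversarial construction: zero cost for t <= w, then at least
    # min{a, M - a/2} per step; the choice a = M/2 gives (T - w) * M / 2.
    a = M / 2
    return (T - w) * min(a, M - a / 2)

assert regret_lower_bound(100, 10, 2.0) == 90.0          # linear in T for fixed w
assert regret_lower_bound(100, 10, 2.0) == (100 - 10) * 2.0 / 2
```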

Proofs of Theorem 5.5.2 and Corollary 5.5.2

Proof of Theorem 5.5.2. Suppose there exists a sequence of actions $\mathbf{v}_{\le s} \in \mathcal{U}^s$, $1 \le s \le T - k$, $k \ge w$ with $k = \Omega(T)$ and some constant $\alpha > 0$ such that
$$d_H(S_k(\mathbf{u}_{\le s}), S_k(\mathbf{v}_{\le s})) \ge \alpha d k \tag{5.16}$$
for any $\mathbf{u}_{\le s} \in \mathcal{U}^s$. For any safety sets satisfying (5.16), whenever the first $s$ actions $\mathbf{u}_{\le s}$ are taken, the future actions have to stay in $S_k(\mathbf{u}_{\le s})$. The same argument also holds for choosing $\mathbf{v}_{\le s}$. This means that there exist two subsequences $\mathbf{u}_{s+1:s+k}$ and $\mathbf{v}_{s+1:s+k}$ such that
$$\|\mathbf{u}_{s+w:s+k} - \mathbf{v}_{s+w:s+k}\|_2 \ge \|\mathbf{u}_{s+1:s+k} - \mathbf{v}_{s+1:s+k}\|_2 - \|\mathbf{u}_{s+1:s+w-1} - \mathbf{v}_{s+1:s+w-1}\|_2$$
$$\ge \|\mathbf{u}_{s+1:s+k} - \mathbf{v}_{s+1:s+k}\|_2 - \sum_{t=s+1}^{s+w-1} \|u_t - v_t\|_2 \ge \alpha d k - d(w-1) = d(\alpha k - w + 1).$$
Then the adversary can construct Lipschitz continuous cost functions, similar to those used in the proof of Theorem 5.5.1, so that the costs $c_{s+w}, \ldots, c_{s+k}$ satisfy $\sum_{t=s+w}^{s+k} c_t(a_t) = \Omega((k - w + 1)d) = \Omega(d(T-w))$ (since $k = \Omega(T)$) for the case when $a_t = v_t$ for $s < t \le s+k$ and $a_t = v_t$ for $t \le s$; but $c_{s+w} = \cdots = c_{s+k} = 0$ for the case when $a_t = u_t$ for $s < t \le s+k$ and $a_t = v_t$ for $t \le s$, and vice versa. This "switching pattern" attack is possible because the online agent knows nothing about the costs $c_{s+w}, \ldots, c_{s+k}$ at time $s$, and the adversary can design costs freely as long as they satisfy Assumption 5. The bound on the dynamic regret follows since for the optimal actions $\mathbf{u}^* = (u^*_1, \ldots, u^*_T)$, $\sum_{t=s+w}^{s+k} c_t(u^*_t) = 0$ but $\sum_{t=s+w}^{s+k} c_t(a_t) = \Omega(d(T-w))$ for any actions $\mathbf{a}$ generated by a deterministic online policy. β–‘

Proof of Corollary 5.5.2. Apply Theorem 5.5.2, noting that $\alpha = \Omega(T)$ and that at each time $t$ the received MEF functions $p^*_t, \ldots, p^*_{t+w-1}$ equivalently form a joint density function on $\mathbf{u}_{t+1:t+w}$, which provides less information than knowing the offline safety sets. Since $\overline{\mathbf{u}}$ maximizes $\mu(S(\cdot))$,
$$\frac{\mu(S(\mathbf{u}_{\le s}))}{\mu(S(\mathbf{v}_{\le s}))} \ge \frac{\mu(S(\mathbf{u}_{\le s}))}{\mu(S(\overline{\mathbf{u}}_{\le s}))} > \xi,$$
it follows that
$$\left( \frac{|\mu(S(\mathbf{u}_{\le s})) - \xi\, \mu(S(\mathbf{v}_{\le s}))|}{\mu(B)} \right)^{\frac{1}{(T-s)m}} = \Omega(1).$$
Hence, the same argument as in the proof of Theorem 5.5.2 applies and we obtain the desired regret lower bound. β–‘

Proof of Theorem 5.5.3

In this appendix, we prove Theorem 5.5.3 in four steps. First, in Lemma 20, we bound the deviation of the density function (feedback) evaluated at the actions selected by the PPC from the largest density function value, given the previous actions selected by the PPC; the deviation decreases with the tuning parameter $\beta > 0$. Next, using the bound obtained from the first step, in Lemma 21 we show that the ratio of the volume of the set of feasible actions, given the previous actions selected by the PPC, to the volume of the largest feasible set is bounded from below by an exponential factor whose deviation from one also decreases with $\beta$. Next, we bound the difference between the partial cost induced by a subsequence of feasible actions and the corresponding offline optimal cost. Finally, combining the partial costs, we bound the total cost, which leads to an upper bound on the dynamic regret.

Step 1. Bound the feedback deviation

Recall that $t' := \min\{t + w - 1, T\}$. For every $t \in \mathcal{I}$, let $\overline{\mathbf{u}}_{t:t'} = (\overline{u}_t, \ldots, \overline{u}_{t'})$ be a subsequence of actions that maximizes the penalty term:
$$\overline{\mathbf{u}}_{t:t'} := \arg\sup_{\mathbf{u} \in \mathcal{U}^{t'-t+1}} p^*_{t:t'}(\mathbf{u} | \mathbf{u}_{<t}) = \arg\sup_{\mathbf{u} \in \mathcal{U}^{t'-t+1}} \mu(S((\mathbf{u}_{<t}, \mathbf{u}))).$$

Before proceeding to the proof of Theorem 5.5.3, we first show the following lemma, which gives a lower bound on the feedback evaluated at the sequence of actions selected by the PPC. Note that when $t - 1 < 1$, the density functions are unconditional.

Lemma 20. For any $t \in \mathcal{I}$, the sequence of actions $\mathbf{u} = (\mathbf{u}_{\le t'})$ selected by the PPC satisfies
$$\frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})} \ge \exp(-L_c w d / \beta)$$
where $L_c$ is the Lipschitz constant and $d := \mathrm{diam}(\mathcal{U}) = \sup\{\|u - v\|_2 : u, v \in \mathcal{U}\}$ is the diameter of the action space $\mathcal{U}$.

Proof of Lemma 20. First, we note that the actions $\mathbf{u}_{t:t'}$ chosen by the PPC must satisfy
$$\sum_{\tau=t}^{t'} \beta \left( \log p^*_\tau(\overline{u}_\tau | \overline{\mathbf{u}}_{<\tau}) - \log p^*_\tau(u_\tau | \mathbf{u}_{<\tau}) \right) \le \sum_{\tau=t}^{t'} \left( c_\tau(\overline{u}_\tau) - c_\tau(u_\tau) \right) \tag{5.17}$$
where $\overline{\mathbf{u}}_{<\tau} := (\mathbf{u}_{<t}, \overline{\mathbf{u}}_{t:\tau-1})$. To see this, suppose (5.17) does not hold. Then choosing $\overline{\mathbf{u}}_{t:t'}$ gives a smaller objective value in (5.8)-(5.9), which contradicts the definition of the PPC-generated actions $\mathbf{u}_{t:t'}$. Using the chain rule, (5.17) becomes
$$\log p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t}) \le \log p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t}) + \frac{1}{\beta} \sum_{\tau=t}^{t'} \left( c_\tau(\overline{u}_\tau) - c_\tau(u_\tau) \right).$$
Therefore, since the cost functions are Lipschitz continuous,
$$p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t}) \le \exp\left( \frac{1}{\beta} \sum_{\tau=t}^{t'} \left( c_\tau(\overline{u}_\tau) - c_\tau(u_\tau) \right) \right) p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t}) \le \exp\left( \frac{L_c w d}{\beta} \right) p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t}).$$
β–‘

Step 2. Bound the Lebesgue measure deviation

Based on Lemma 20, the following lemma holds. For every $t \in \mathcal{I}$, let $\overline{\mathbf{u}}_{<t'} = (\overline{u}_1, \ldots, \overline{u}_{t'-1})$ be a subsequence of actions maximizing the volume of the set of feasible actions:
$$\overline{\mathbf{u}}_{<t'} := \arg\sup_{\mathbf{u} \in \mathcal{U}^{t'-1}} \mu(S(\mathbf{u})).$$

Lemma 21. Suppose the safety sets are $(w, \delta, \lambda)$-causally invariant. For any $t \in \mathcal{I}$, the actions selected by the PPC satisfy
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S(\overline{\mathbf{u}}_{<t'}))} \ge \exp(-L_c t d / \beta)\, \lambda^{-\lceil t/w \rceil}$$
where $L_c > 0$ is the Lipschitz constant and $d := \sup\{\|u - v\|_2 : u, v \in \mathcal{U}\}$ is the diameter of the action space $\mathcal{U}$.

Proof. For any $t \in \mathcal{I}$, since the safety sets are $(w, \delta, \lambda)$-causally invariant and $\mu(S(\overline{\mathbf{u}}_{<t'})) = \mu(S(\overline{\mathbf{u}}_{\le t'}))$,
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S(\overline{\mathbf{u}}_{<t'}))} \ge \frac{1}{\lambda} \left( \frac{\mu(S(\mathbf{u}_{<t}))}{\mu(S(\overline{\mathbf{u}}_{<t}))} \right)^{\frac{T-t'+1}{T-t+1}} \cdot \frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S((\mathbf{u}_{<t}, \overline{\mathbf{u}}_{t:t'})))}. \tag{5.18}$$
Applying the explicit expression in Lemma 17, Lemma 20 implies that
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S((\mathbf{u}_{<t}, \overline{\mathbf{u}}_{t:t'})))} = \frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\overline{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})} \ge \exp(-L_c w d / \beta).$$
Therefore, applying (5.18) recursively leads to, for any $t \in \mathcal{I}$,
$$\frac{\mu(S(\mathbf{u}_{\le t'}))}{\mu(S(\overline{\mathbf{u}}_{<t'}))} \ge \exp(-L_c t d / \beta)\, \lambda^{-\lceil t/w \rceil}.$$
β–‘
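The recursion compounds one factor $\exp(-L_c w d/\beta)/\lambda$ per window of length $w$; when $w$ divides $t$, the $\lceil t/w \rceil$ compounded factors reproduce the stated bound exactly. A quick numeric check with hypothetical constants:

```python
import math

L_c, d, beta, lam, w, t = 2.0, 1.5, 10.0, 1.3, 5, 20   # here w divides t

blocks = math.ceil(t / w)                        # number of length-w windows
per_block = math.exp(-L_c * w * d / beta) / lam  # factor gained per window
compound = per_block ** blocks

stated = math.exp(-L_c * t * d / beta) * lam ** (-math.ceil(t / w))
assert math.isclose(compound, stated)
```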

Step 3. Bound the partial cost deviation

Lemma 21 implies that for the offline optimal solution $\mathbf{u}^*_{t':t'+w} := (u^*_{t'}, \ldots, u^*_{t'+w})$,
$$\mu(S(\mathbf{u}_{\le t'})) \ge \exp(-L_c t d/\beta)\, \lambda^{-2t/w}\, \mu(S(\overline{\mathbf{u}}_{<t'})) \ge \exp(-L_c t d/\beta)\, \lambda^{-2t/w}\, \mu(S(\mathbf{u}^*_{<t'})),$$
which leads to
$$\mu(S(\mathbf{u}^*_{<t'})) - \mu(S(\mathbf{u}_{\le t'})) \le \left( 1 - \exp\left( -\frac{2t}{w} \log \lambda - \frac{L_c t d}{\beta} \right) \right) \mu(S(\mathbf{u}^*_{<t'})) \le \left( \frac{L_c t d}{\beta} + \frac{2t}{w} \log \lambda \right) \mu(S(\mathbf{u}^*_{<t'})) \le \left( \frac{L_c t d}{\beta} + \frac{2t}{w} \log \lambda \right) \mu(B) \left( \frac{d}{2} \right)^{(T-t'+1)m}$$
where we have used the inequality $e^x \ge 1 + x$ for all $x \in \mathbb{R}$. The last inequality follows from the isodiametric inequality, with $B$ denoting the unit ball in $\mathbb{R}^{(T-t'+1)m}$. Since the safety sets are $(w, \delta, \lambda)$-causally invariant, it follows that for any $t \in \mathcal{I}$,
$$\|\mathbf{u}^*_{t:t'} - \widehat{\mathbf{u}}_{t:t'}\|_2 \le d_H\left( S_w(\mathbf{u}^*_{<t}), S_w(\mathbf{u}_{<t}) \right) \le \delta d \left( \frac{L_c t d}{\beta} + \frac{2t}{w} \log \lambda \right)^{\frac{1}{(T-t+1)m}}$$
for some $\widehat{\mathbf{u}}_{t:t'} \in S_w(\mathbf{u}_{<t})$ satisfying $p^*_{t:t'}(\widehat{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t}) > 0$.

Next, we consider the objective for the PPC. For any $t \in \mathcal{I}$, denote by $C(\mathbf{u}_{t:t'}) := \sum_{\tau=t}^{t'} c_\tau(u_\tau)$ the cost for the time slots between $t$ and $t'$. Since the costs are Lipschitz continuous,
$$C(\widehat{\mathbf{u}}_{t:t'}) - C(\mathbf{u}^*_{t:t'}) \le \sum_{\tau=t}^{t'} \left| c_\tau(\widehat{u}_\tau) - c_\tau(u^*_\tau) \right| \le \sum_{\tau=t}^{t'} L_c \|u^*_\tau - \widehat{u}_\tau\|_2 \le \sqrt{w}\, L_c \|\mathbf{u}^*_{t:t'} - \widehat{\mathbf{u}}_{t:t'}\|_2 \le d \delta L_c \left( \frac{L_c t d}{\sqrt{w}\, \beta} + \frac{2t}{\sqrt{w}} \log \lambda \right)^{\frac{1}{(T-t+1)m}}. \tag{5.19}$$
Since $\mathbf{u}_{t:t'}$ is a minimizer of (5.8)-(5.9), we obtain
$$C(\mathbf{u}_{t:t'}) - \sum_{\tau=t}^{t'} \beta \log p_\tau(u_\tau | \mathbf{u}_{<\tau}) \le C(\widehat{\mathbf{u}}_{t:t'}) - \sum_{\tau=t}^{t'} \beta \log p_\tau(\widehat{u}_\tau | \widehat{\mathbf{u}}_{<\tau}),$$
which implies
$$C(\mathbf{u}_{t:t'}) \le C(\widehat{\mathbf{u}}_{t:t'}) + \beta \log \frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\widehat{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})}. \tag{5.20}$$

Step 4. Bound the dynamic regret $\mathrm{Regret}(\mathbf{u})$

Proof of Theorem 5.5.3. For the total cost, it follows that
$$C_T(\mathbf{u}) = \sum_{t \in \mathcal{I}} C(\mathbf{u}_{t:t'}) \le \sum_{t \in \mathcal{I}} \left( C(\widehat{\mathbf{u}}_{t:t'}) + \beta \log \frac{p^*_{t:t'}(\mathbf{u}_{t:t'} | \mathbf{u}_{<t})}{p^*_{t:t'}(\widehat{\mathbf{u}}_{t:t'} | \mathbf{u}_{<t})} \right) \le \underbrace{\sum_{t \in \mathcal{I}} C(\widehat{\mathbf{u}}_{t:t'})}_{:=(a)} + \beta \underbrace{\sum_{t \in \mathcal{I}} \log \frac{\mu(S((\mathbf{u}_{<t}, \mathbf{u}_{t:t'})))}{\mu(S((\mathbf{u}_{<t}, \widehat{\mathbf{u}}_{t:t'})))}}_{:=(b)}.$$

For (a), plugging in (5.19), the total cost for the sequence $\widehat{\mathbf{u}}$ is bounded from above by
$$\sum_{t \in \mathcal{I}} C(\widehat{\mathbf{u}}_{t:t'}) \le C^*_T + \sum_{t \in \mathcal{I}} \left( C(\widehat{\mathbf{u}}_{t:t'}) - C(\mathbf{u}^*_{t:t'}) \right) \le C^*_T + \sum_{k=1}^{2T/w} d \delta L_c \left( \frac{L_c k d}{\sqrt{w}\, \beta} + \frac{k}{\sqrt{w}} \log \lambda \right)^{\frac{1}{(T-kw+1)m}} = C^*_T + O\left( \frac{T}{\sqrt{w}}\, d \delta \log \lambda + \frac{d^2 \delta}{\beta} \frac{T}{\sqrt{w}} \right)$$
where $C^*_T := \sum_{t=1}^T c_t(u^*_t)$ is the optimal total cost. Finally, (b) becomes $O(T\beta/w)$ since the safety sets are atomic and $\mathcal{U}$ is bounded by Assumptions 6 and 7. Rearranging the terms,
$$\mathrm{Regret}(\mathbf{u}) = O\left( T d \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{d \delta}{\sqrt{w}\, \beta} + \frac{\beta}{d w} \right) \right).$$
Setting $\beta = d \sqrt{\delta}\, w^{3/4}$ immediately implies
$$\mathrm{Regret}(\mathbf{u}) = O\left( T d \left( \frac{\delta \log \lambda}{\sqrt{w}} + \frac{\sqrt{\delta}}{w^{1/4}} \right) \right).$$
β–‘
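To visualize how the bound in Theorem 5.5.3 improves with the prediction window, the sketch below evaluates $B(w) = Td(\delta \log\lambda/\sqrt{w} + \sqrt{\delta}/w^{1/4})$ for hypothetical constants and checks that it decreases in $w$ and is sub-linear in $T$ once $w$ grows with $T$:

```python
import math

def regret_upper_bound(T, d, delta, lam, w):
    # Bound from Theorem 5.5.3 (constants suppressed inside O(.)).
    return T * d * (delta * math.log(lam) / math.sqrt(w)
                    + math.sqrt(delta) / w ** 0.25)

T, d, delta, lam = 10_000, 1.0, 1.0, math.e       # hypothetical constants
vals = [regret_upper_bound(T, d, delta, lam, w) for w in (4, 16, 64, 256)]
assert all(v2 < v1 for v1, v2 in zip(vals, vals[1:]))   # decreasing in w

# With w growing polynomially in T, the bound drops below the trivial O(dT).
assert regret_upper_bound(T, d, delta, lam, int(math.sqrt(T))) < T * d
```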


In the document Learning-Augmented Control and Decision-Making (pages 165-179)