

2.3.3 Theoretical guarantees for Safe-OFUL

Before presenting the theoretical guarantees, we place the following technical assumption on the safety feedback set that the optimal safe action belongs to, denoted as $\Gamma_{i_*}$.

Assumption 2.3.5 (Star convex optimal constraint sets). $\Gamma_{i_*}$ is star convex around the known safe action $x^s_{i_*}$, such that the convex combination $\alpha x_* + (1-\alpha)x^s_{i_*} \in \Gamma_{i_*}$, $\forall \alpha \in [0,1]$.

Note that since the constraint sets are localized around a particular safe action $x^s_i$, this assumption is reasonable in the safe stochastic linear bandit framework, and it is weaker than the prior work, e.g., [13], where the entire space of actions is considered to be star convex. In the regret analysis of Safe-OFUL, we follow the standard analysis of UCB and decompose the regret $R_T$ into two terms: (i) $\sum_{t=1}^{T}\left(\mu^\top x_* - \mathrm{ucb}(x_t, i_t, t)\right)$ and (ii) $\sum_{t=1}^{T}\left(\mathrm{ucb}(x_t, i_t, t) - \mu^\top x_t\right)$.

In the unconstrained setting, the optimism principle is satisfied by construction, since the optimal action belongs to the decision set $D_0$, which makes (i) non-positive. However, in the safe stochastic linear bandit framework, the optimal safe action $x_*$ may not belong to the constructed safe set $D^{\text{safe}}_t$ where optimistic action selection happens. Thus, we first show that the new construction of ucb in (2.32) still provides optimistic actions.

Theorem 2.3.6 (Optimism). For all $i \in \mathcal{M}$, setting $k_i \geq 2LS/(\tau - \tau^s_i)$ guarantees optimism with high probability: $\max_{i \in \mathcal{M},\, x \in \hat{\Gamma}_{i,t}} \mathrm{ucb}(x, i, t) \geq \mu^\top x_*, \;\forall t$.

The proof is given in Appendix A.2.1. To sketch the proof idea, we consider two cases of whether $x_* \in D^{\text{safe}}_t$ or not. If yes, via standard UCB arguments, we guarantee that Safe-OFUL selects optimistic actions. If not, we show that the additional exploration bonus $k_i \beta^i_t \|x - x^s_i\|_{A_{i,t}^{-1}}$ ensures optimistic action selection for the given choice of $k_i$. This shows that adjusting the additional exploration bonus around the known safe actions ensures that the relevant constraint regions are well-explored, i.e., $x_*$ eventually belongs to $D^{\text{safe}}_t$.
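To make the index concrete, the following minimal Python sketch computes a UCB score of the form suggested by this discussion: an OFUL-style reward term plus the safety exploration bonus $k_i \beta^i_t \|x - x^s_i\|_{A_{i,t}^{-1}}$. The exact expression (2.32) is not reproduced in this section, so the reward part of the index and all function and argument names below are assumptions for illustration only.

```python
import numpy as np

def safe_oful_ucb(x, x_safe_i, mu_hat, beta_t, V_t, beta_i_t, A_i_t, k_i):
    """Hypothetical sketch of a Safe-OFUL index in the spirit of (2.32):
    an OFUL-style reward term plus the bonus k_i * beta_i_t * ||x - x_i^s||_{A_{i,t}^{-1}}."""
    # reward optimism: mu_hat^T x + beta_t * ||x||_{V_t^{-1}}  (assumed form)
    reward_term = mu_hat @ x + beta_t * np.sqrt(x @ np.linalg.solve(V_t, x))
    # extra exploration bonus centered at the known safe action x_i^s
    diff = x - x_safe_i
    safety_bonus = k_i * beta_i_t * np.sqrt(diff @ np.linalg.solve(A_i_t, diff))
    return reward_term + safety_bonus
```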

The choice ofπ‘˜π‘–highlights the key challenge in our proposed stochastic linear bandit framework. In contrast to prior works, the agent gets feedback from a constraint only if it plays an action within the associated feedback set. Therefore, while aiming to

learn the underlying reward function, Safe-OFUL needs to cautiously choose actions from the constraint sets where it wants to learn the constraints at the cost of not receiving any feedback from the non-active constraints. The newucbterm in (2.32) captures this trade-off and selecting π‘˜π‘– as in Theorem 2.3.6 balances it effectively.

In particular, we see that this exploration bonus is inversely proportional to the gap between the safety threshold and the value of the known safe action. Intuitively, this means that if the known safe action is close to violation, Safe-OFUL needs to explore more/act more optimistically to learn the optimal safe action. We pay an extra price in regret due to this additional effort.
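As an illustrative example with hypothetical values not taken from the text: if $L = S = 1$ and $\tau = 1$, a known safe action with $\tau^s_i = 0.9$ requires $k_i = 2LS/(\tau - \tau^s_i) = 20$, whereas one with $\tau^s_i = 0.5$ requires only $k_i = 4$.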

Theorem 2.3.7 (Regret Bound). Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then for any $\delta \in (0,1)$ and $k_i = 2LS/(\tau - \tau^s_i)$, with probability at least $1-2\delta$, the regret of Safe-OFUL is $R_T \leq R_\mu + R_\gamma$, where $R_\mu = 2\beta_T\sqrt{2Td\log\left((1+TL^2/(d\lambda))/\delta\right)}$ and $R_\gamma = \left(k_{i_{\max}}\beta^{i_{\max}}_T + 2\right)\sqrt{2|\mathcal{M}|Td\log\left((1+TL^2/(d\lambda))/\delta\right)}$, for $\beta^{i_{\max}}_T = \max_{j\in\mathcal{M}}\beta^j_T$ and $k_{i_{\max}} = \max_{j\in\mathcal{M}}k_j$.

The proof is given in Appendix A.2.1. In the proof, since (i) is non-positive via Theorem 2.3.6, we study (ii) and decompose it into two terms: $R_\mu$ results from learning the unknown reward parameter, and $R_\gamma$ is due to learning the $|\mathcal{M}|$ different constraints.

Notice that $R_\gamma$ scales with the hardest constraint, i.e., the one that requires the most exploration, through $\beta^{i_{\max}}_T$ and $k_{i_{\max}}$. Moreover, the regret rate of Safe-OFUL matches the prior unconstrained UCB results [3] and single linear constraint UCB results [12, 216], where the additional price of learning under $|\mathcal{M}|$ distinct constraints with local safety feedback, which generalizes the prior work, appears as $\sqrt{|\mathcal{M}|}$.

2.3.4 Safe-LinTS

In many scenarios, solving the bilinear optimization problem of UCB-type algorithms, i.e., Line 6 of Algorithm 2, can be computationally challenging. To this end, Thompson Sampling (TS)-based methods have been proposed, e.g., LinTS [5, 10].

These approaches sample a model within the constructed confidence set of plausible models and find the optimal action with respect to this sampled model. Therefore, they consider a linear optimization problem for decision-making, which can be solved efficiently. Because of this computational efficiency, simplicity, and possibly better empirical performance, they are adopted in many decision-making scenarios [7, 137]. In this section, we propose Safe-LinTS, the safe version of LinTS. The pseudocode of Safe-LinTS is given in Algorithm 3.

Algorithm 3 Safe-LinTS

1: Input: $\tau^s_i$, $x^s_i$, $\tau$, $L$, $S$, $R$, $\mathcal{P}_{TS}$, $\mathcal{P}^c_{TS}$
2: for $t = 1, 2, \ldots, T$ do
3: Compute $\hat{\mu}_t$ via (2.27) & $\hat{\gamma}_{i,t}$ via (2.29)
4: Construct $\beta_t$ in (2.28) & $\beta^i_t$ $\forall i \in \mathcal{M}$ in (2.30)
5: Construct $D^{\text{safe}}_t$ according to (2.31)
6: Sample $\eta_t \sim \mathcal{P}_{TS}$ and $\eta^c_t \sim \mathcal{P}^c_{TS}$
7: Compute $\tilde{\mu}_t = \hat{\mu}_t + \beta_t V_t^{-1/2}\eta_t$
8: Compute $\tilde{\omega}_{i,t} = \beta^i_t A_{i,t}^{-1/2}\eta^c_t$, $\forall i \in \mathcal{M}$
9: Find $x_t = \arg\max_{i\in\mathcal{M},\, x\in\hat{\Gamma}_{i,t}} \tilde{\mu}_t^\top x + \tilde{\omega}_{i,t}^\top(x - x^s_i)$
10: Play $x_t$ and observe reward $r_t$

The construction of Safe-LinTS follows that of Safe-OFUL in the estimation of the reward and safety parameters and the construction of the safe set (Lines 3–5). After this, it draws two random perturbations $\eta_t \in \mathbb{R}^d$ and $\eta^c_t \in \mathbb{R}^d$ from i.i.d. distributions $\mathcal{P}_{TS}$ and $\mathcal{P}^c_{TS}$, respectively (to be characterized shortly). Among these perturbations, Safe-LinTS uses $\eta_t$ in the standard way to sample a reward parameter, while it uses $\eta^c_t$ in a novel way to expand the estimated safe set so as to satisfy optimistic action selection.

The main novelty in the design of Safe-LinTS lies in this decoupling of the exploration for the reward parameter and the safety functions. In particular, the prior work in safe linear bandits [203] relies on using the same Gram matrix to learn both the safety and reward parameters simultaneously. However, learning in the affine setting involves separate Gram matrices; thus, Safe-LinTS explicitly balances the exploration trade-off between learning the unknown reward parameter and the safety parameters, ensuring safety and optimism for the entire horizon.

To this end, $\mathcal{P}_{TS}$ and $\mathcal{P}^c_{TS}$ are chosen to satisfy certain concentration and anti-concentration properties. In particular, for some $\delta \in (0,1)$ and constants $c, c'$, Safe-LinTS selects $\mathcal{P}_{TS}$ such that $\mathbb{P}\left(\|\eta_t\|_2 \leq \sqrt{cd\log(c'd/\delta)}\right) \geq 1-\frac{\delta}{2}$ and $\mathbb{P}(u^\top\eta_t \geq 1) = p_1 > 0$ for any $u \in \mathbb{R}^d$ with $\|u\| = 1$. Similarly, $\mathcal{P}^c_{TS}$ is chosen such that $\mathbb{P}\left(\|\eta^c_t\|_2 \leq \frac{2LS_\gamma}{\tau-\tau^s_*}\sqrt{cd\log(c'd/\delta)}\right) \geq 1-\frac{\delta}{2}$ and $\mathbb{P}\left(u^\top\eta^c_t \geq \frac{2LS_\gamma}{\tau-\tau^s_*}\right) = p_2 > 0$, where $S_\gamma > \max_{i\in\mathcal{M}}\|\gamma_i\|$ and $\tau^s_* = \max_{i\in\mathcal{M}}\tau^s_i$. These requirements imply that the distributions should concentrate with high probability, yet still provide a certain amount of exploration (anti-concentration), which is crucial for achieving low regret.
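As a quick sanity check on these two requirements, the following sketch, a hypothetical illustration rather than part of the analysis, estimates the anti-concentration probability $p_1 = \mathbb{P}(u^\top\eta_t \geq 1)$ and the concentration coverage for the standard Gaussian candidate $\mathcal{P}_{TS} = \mathcal{N}(0, I)$ by Monte Carlo; since $u^\top\eta_t \sim \mathcal{N}(0,1)$ for any unit vector $u$, the estimate should land near $1 - \Phi(1) \approx 0.159$. The constants $c$, $c'$ below are illustrative choices, not the ones used in the proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 100_000
u = np.zeros(d); u[0] = 1.0           # any fixed unit vector
eta = rng.standard_normal((n, d))      # eta_t ~ N(0, I), the candidate P_TS

# anti-concentration: p_1 = P(u^T eta >= 1); for N(0, I) this is 1 - Phi(1) ~ 0.159
p1_hat = np.mean(eta @ u >= 1.0)

# concentration: ||eta||_2 <= sqrt(c * d * log(c' * d / delta)) with high probability
c, c_prime, delta = 2.0, 4.0, 0.05     # illustrative constants, not the thesis' choices
radius = np.sqrt(c * d * np.log(c_prime * d / delta))
coverage = np.mean(np.linalg.norm(eta, axis=1) <= radius)

print(f"estimated p_1 = {p1_hat:.3f}, coverage within radius = {coverage:.4f}")
```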

Natural candidates for $\mathcal{P}_{TS}$ and $\mathcal{P}^c_{TS}$ are $\mathcal{N}(0, I)$ and $\mathcal{N}\left(0, \frac{2LS_\gamma}{\tau-\tau^s_*}I\right)$, respectively.

Safe-LinTS uses $\eta_t \sim \mathcal{P}_{TS}$ to sample $\tilde{\mu}_t$ around $\hat{\mu}_t$, which provides the balance between exploration and exploitation while learning the unknown reward parameter, i.e., $\tilde{\mu}_t = \hat{\mu}_t + \beta_t V_t^{-1/2}\eta_t$. It then uses $\eta^c_t \sim \mathcal{P}^c_{TS}$ to sample $\tilde{\omega}_{i,t} = \beta^i_t A_{i,t}^{-1/2}\eta^c_t$, $\forall i \in \mathcal{M}$, which will be used to provide the exploration needed to expand the estimated safe set to include higher-rewarding actions, i.e., optimistic actions.

At the final step, Safe-LinTS picks an action $x_t$ from $D^{\text{safe}}_t$ by maximizing $\tilde{\mu}_t^\top x + \tilde{\omega}_{i,t}^\top(x - x^s_i)$. Note that this is a linear objective with transparent exploration goals. In particular, the reward exploration is similar to LinTS in Abeille and Lazaric [5], whereas the second term adds exploration along the safety constraints using the known safe actions. Notice that this approach generalizes the algorithm proposed in [203], whose setting is a special case of the SLB framework considered in this study.
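For concreteness, here is a minimal Python sketch of one decision step of Safe-LinTS (Lines 6–9 of Algorithm 3), restricted to a finite set of candidate actions per constraint set so that the linear maximization reduces to enumeration; the function and argument names, the finite-candidate restriction, and the Gaussian perturbation scales are assumptions for illustration, not the thesis' implementation.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def safe_lints_step(mu_hat, beta_t, V_t, beta_i, A_i, x_safe, candidates, c_scale, rng):
    """One round of Safe-LinTS over finite candidate sets (hypothetical sketch).

    candidates[i]: (n_i, d) array of actions currently deemed safe for constraint i.
    c_scale: scale of the constraint perturbation, e.g. based on 2*L*S_gamma/(tau - tau_*^s).
    """
    d = mu_hat.shape[0]
    eta = rng.standard_normal(d)               # eta_t ~ P_TS, e.g. N(0, I)
    eta_c = c_scale * rng.standard_normal(d)   # eta_t^c ~ P_TS^c, a scaled Gaussian
    mu_tilde = mu_hat + beta_t * inv_sqrt(V_t) @ eta           # Line 7
    best_val, best_x = -np.inf, None
    for i, X in candidates.items():
        omega_tilde = beta_i[i] * inv_sqrt(A_i[i]) @ eta_c     # Line 8
        vals = X @ mu_tilde + (X - x_safe[i]) @ omega_tilde    # Line 9 objective
        j = int(np.argmax(vals))
        if vals[j] > best_val:
            best_val, best_x = vals[j], X[j]
    return best_x
```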

Theorem 2.3.8. Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$, the regret of Safe-LinTS is $R_T = \tilde{O}\left(d^{3/2}\sqrt{|\mathcal{M}|T}\right)$.

The proof and the exact expressions are given in Appendix A.2.2. In the proof, we first show that Safe-LinTS selects safe (via Lemma 2.3.4) and optimistic actions with probability at least $p_1 p_2$ by showing that $\eta_t$ and $\eta^c_t$ provide sufficient exploration. Finally, we use the regret decomposition in [5] to give the regret upper bound.

Notably, this result matches the regret upper bound in [203] for their setting. In the exact regret expression, the leading term carries a factor of $1/(p_1 p_2)$, i.e., the inverse of the optimistic action probability. This relation is similar to that of $k_{i_{\max}}$ in Safe-OFUL. In particular, similar to Safe-OFUL, for a smaller worst-case safety gap of the known safe actions, Safe-LinTS needs to explore more to learn the optimal safe action, which results in increased regret through $p_2$.
