STOCHASTIC LINEAR BANDITS WITH PRACTICAL CONCERNS
2.3 Stochastic Linear Bandits with Unknown Safety Constraints
2.3.3 Theoretical guarantees for Safe-OFUL
Before presenting the theoretical guarantees, we place the following technical assumption on the safety feedback sets that the optimal safe action belongs to, denoted as $\Xi_{m^*}$.
Assumption 2.3.5 (Star convex optimal constraint sets). $\Xi_{m^*}$ is star convex around the known safe action $x^s_{m^*}$, such that the convex combination $\alpha x^* + (1-\alpha)x^s_{m^*} \in \Xi_{m^*}$, $\forall \alpha \in [0,1]$.
Note that since the constraint sets are localized around a particular safe action $x^s_m$, this assumption is reasonable in the safe stochastic linear bandit framework, and it is weaker than the prior work, e.g., [13], where the entire space of actions is considered to be star convex. In the regret analysis of Safe-OFUL, we follow the standard analysis of UCB and decompose the regret $R_T$ into two terms: (i) $\sum_{t=1}^{T}\big(\theta^\top x^* - \mathrm{ucb}(x_t, m_t, t)\big)$ and (ii) $\sum_{t=1}^{T}\big(\mathrm{ucb}(x_t, m_t, t) - \theta^\top x_t\big)$.
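For concreteness, this decomposition can be written out as follows, assuming the standard pseudo-regret definition $R_T = \sum_{t=1}^{T}\theta^\top(x^* - x_t)$ (with $m_t$ the constraint set from which $x_t$ is selected):

```latex
% Regret decomposition for Safe-OFUL, assuming the standard pseudo-regret
% definition R_T = \sum_{t=1}^{T} \theta^\top (x^* - x_t).
R_T = \underbrace{\sum_{t=1}^{T}\big(\theta^\top x^* - \mathrm{ucb}(x_t, m_t, t)\big)}_{\text{(i), shown non-positive via optimism (Theorem 2.3.6)}}
    + \underbrace{\sum_{t=1}^{T}\big(\mathrm{ucb}(x_t, m_t, t) - \theta^\top x_t\big)}_{\text{(ii), controlled by the confidence widths}}.
```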
In the unconstrained setting, the optimism principle is satisfied by construction, since the optimal action belongs to the decision set $D_0$, so that (i) is non-positive. However, in the safe stochastic linear bandit framework, the optimal safe action $x^*$ may not belong to the constructed safe set $D^{\mathrm{safe}}_t$ over which optimistic action selection happens. Thus, we first show that the new construction of $\mathrm{ucb}$ in (2.32) still provides optimistic actions.
Theorem 2.3.6 (Optimism). For all $m \in \mathcal{M}$, setting $\nu_m \geq 2L_s/(\tau - c^s_m)$ guarantees optimism with high probability:
$$\max_{m \in \mathcal{M},\; x \in \hat{\Xi}_{m,t}} \mathrm{ucb}(x, m, t) \;\geq\; \theta^\top x^*, \quad \forall t.$$
The proof is given in Appendix A.2.1. To sketch the proof idea, we consider two cases, depending on whether $x^* \in D^{\mathrm{safe}}_t$ or not. If yes, standard UCB arguments guarantee that Safe-OFUL selects optimistic actions. If not, we show that the additional exploration bonus $\nu_m \beta^m_t \|x - x^s_m\|_{A_{m,t}^{-1}}$ ensures optimistic action selection for the given choice of $\nu_m$. This shows that adjusting the additional exploration bonus around the known safe actions ensures that the relevant constraint regions are well explored, i.e., $x^*$ eventually belongs to $D^{\mathrm{safe}}_t$.
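To make the role of this bonus concrete, the following minimal Python sketch evaluates an optimistic index of the assumed form: since (2.32) is not reproduced in this excerpt, the standard linear-UCB reward term below is an assumption, while the last term is the exploration bonus from the proof sketch above; all function and variable names are illustrative.

```python
import numpy as np

def optimistic_index(x, theta_hat, beta_t, V_t,
                     x_safe_m, nu_m, beta_m_t, A_m_t):
    """Illustrative optimistic index for a candidate action x under constraint m.

    Sketch only: the first two terms (a standard linear-UCB reward index) are
    an assumption; the last term is the additional exploration bonus
    nu_m * beta^m_t * ||x - x^s_m||_{A_{m,t}^{-1}} around the known safe action.
    """
    V_inv = np.linalg.inv(V_t)
    A_inv = np.linalg.inv(A_m_t)
    reward_ucb = theta_hat @ x + beta_t * np.sqrt(x @ V_inv @ x)   # assumed standard UCB part
    diff = x - x_safe_m
    safe_bonus = nu_m * beta_m_t * np.sqrt(diff @ A_inv @ diff)    # bonus around known safe action
    return reward_ucb + safe_bonus
```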
The choice of $\nu_m$ highlights the key challenge in our proposed stochastic linear bandit framework. In contrast to prior works, the agent gets feedback from a constraint only if it plays an action within the associated feedback set. Therefore, while aiming to learn the underlying reward function, Safe-OFUL needs to cautiously choose actions from the constraint sets where it wants to learn the constraints, at the cost of not receiving any feedback from the non-active constraints. The new $\mathrm{ucb}$ term in (2.32) captures this trade-off, and selecting $\nu_m$ as in Theorem 2.3.6 balances it effectively. In particular, this exploration bonus is inversely proportional to the gap between the safety threshold and the value of the known safe action. Intuitively, this means that if the known safe action is close to violation, Safe-OFUL needs to explore more, i.e., act more optimistically, to learn the optimal safe action. We pay an extra price in regret due to this additional effort.
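As a quick numerical illustration of this inverse dependence, using the choice $\nu_m = 2L_s/(\tau - c^s_m)$ from Theorem 2.3.6 with made-up constants:

```python
# Illustrative only: nu_m = 2 * L_s / (tau - c_s_m), evaluated for two
# hypothetical known safe actions with different safety gaps.
L_s, tau = 1.0, 0.5                      # placeholder constant and safety threshold
for c_s_m in (0.0, 0.45):                # safety values of two known safe actions
    gap = tau - c_s_m                    # safety gap of the known safe action
    nu_m = 2 * L_s / gap                 # exploration-bonus multiplier
    print(f"gap = {gap:.2f} -> nu_m = {nu_m:.1f}")
# A small gap (0.05) yields nu_m = 40.0, i.e., a much larger exploration bonus
# than the comfortable gap of 0.50 (nu_m = 4.0).
```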
Theorem 2.3.7 (Regret Bound). Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then for any $\delta \in (0,1)$ and $\nu_m = 2L_s/(\tau - c^s_m)$, with probability at least $1-2\delta$, the regret of Safe-OFUL is $R_T \leq R_r + R_\gamma$, where
$$R_r = 2\beta_T\sqrt{2dT\log\!\big((1+TL^2/(\lambda d))/\delta\big)} \quad \text{and} \quad R_\gamma = \big(\nu_{\max}\beta^{\max}_T + 2\big)\sqrt{2|\mathcal{M}|dT\log\!\big((1+TL^2/(\lambda d))/\delta\big)},$$
for $\beta^{\max}_T = \max_{m \in \mathcal{M}} \beta^m_T$ and $\nu_{\max} = \max_{m \in \mathcal{M}} \nu_m$.
The proof is given in Appendix A.2.1. In the proof, since (i) is non-positive via Theorem 2.3.6, we study (ii) and decompose it into two terms: $R_r$ results from learning the unknown reward parameter, and $R_\gamma$ is due to learning the $|\mathcal{M}|$ different constraints. Notice that $R_\gamma$ scales with the hardest constraint, i.e., the one that requires the most exploration, through $\beta^{\max}_T$ and $\nu_{\max}$. Moreover, the regret rate of Safe-OFUL matches the prior unconstrained UCB results [3] and single linear constrained UCB results [12, 216], where the additional price of learning under $|\mathcal{M}|$ distinct constraints with local safety feedback, which generalizes the prior work, appears as $\sqrt{|\mathcal{M}|}$.
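Suppressing logarithmic factors, the bound of Theorem 2.3.7 can be summarized as follows; the $\sqrt{|\mathcal{M}|}$ factor enters only through $R_\gamma$:

```latex
% Order-of-magnitude summary of Theorem 2.3.7, suppressing logarithmic factors.
R_T \;\le\; R_r + R_\gamma
    \;=\; \tilde{O}\!\Big(\beta_T\sqrt{dT} \;+\; \nu_{\max}\,\beta^{\max}_T\sqrt{|\mathcal{M}|\,dT}\Big).
```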
2.3.4 Safe-LinTS
In many scenarios, solving the bilinear optimization problem of UCB-type algorithms, i.e., Line 6 of Algorithm 2, can be computationally challenging. To this end, Thompson Sampling (TS)-based methods have been proposed, e.g., LinTS [5, 10]. These approaches sample a model within the constructed confidence set of plausible models and find the optimal action with respect to this sampled model. Therefore, they consider a linear optimization problem for decision-making, which can be solved efficiently. Because of this computational efficiency, simplicity, and possibly better empirical performance, they are adopted in many decision-making scenarios [7, 137]. In this section, we propose Safe-LinTS, the safe version of LinTS. The pseudocode of Safe-LinTS is given in Algorithm 3.
Algorithm 3 Safe-LinTS
1: Input: $c^s_m$, $x^s_m$, $\tau$, $\delta$, $\lambda$, $T$, $\mathcal{P}^{rw}$, $\mathcal{P}^{sf}$
2: for $t = 1, 2, \ldots, T$ do
3:    Compute $\hat{\theta}_t$ via (2.27) & $\hat{\gamma}_{m,t}$ via (2.29)
4:    Construct $\beta_t$ in (2.28) & $\beta^m_t$, $\forall m \in \mathcal{M}$, in (2.30)
5:    Construct $D^{\mathrm{safe}}_t$ according to (2.31)
6:    Sample $\eta_t \sim \mathcal{P}^{rw}$ and $\eta^s_t \sim \mathcal{P}^{sf}$
7:    Compute $\tilde{\theta}_t = \hat{\theta}_t + \beta_t V_t^{-1/2}\eta_t$
8:    Compute $\tilde{\eta}_{m,t} = \beta^m_t A_{m,t}^{-1/2}\eta^s_t$, $\forall m \in \mathcal{M}$
9:    Find $x_t = \arg\max_{m \in \mathcal{M},\, x \in \hat{\Xi}_{m,t}} \tilde{\theta}_t^\top x + \tilde{\eta}_{m,t}^\top(x - x^s_m)$
10:   Play $x_t$ and observe reward $r_t$
The construction of Safe-LinTS parallels that of Safe-OFUL in the estimation of the reward and safety parameters and in the construction of the safe set (Lines 3–5). After this, it draws two random perturbations $\eta_t \in \mathbb{R}^d$ and $\eta^s_t \in \mathbb{R}^d$ from i.i.d. distributions $\mathcal{P}^{rw}$ and $\mathcal{P}^{sf}$, respectively (to be characterized shortly). While Safe-LinTS uses $\eta_t$ in a standard way to sample a reward parameter, it uses $\eta^s_t$ in a novel way to expand the estimated safe set so as to satisfy optimistic action selection.

The main novelty in the design of Safe-LinTS lies in this decoupling of the exploration for the reward parameter and for the safety functions. In particular, the prior work in safe linear bandits [203] relies on using the same Gram matrix to learn both the safety and reward parameters simultaneously. However, learning in the affine setting involves separate Gram matrices; thus, Safe-LinTS explicitly balances the exploration trade-off between learning the unknown reward parameter and the safety parameters, ensuring safety and optimism for the entire horizon.
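A minimal sketch of this bookkeeping, assuming least-squares-style rank-one updates, that reward feedback is observed at every round, and that constraint $m$'s Gram matrix is updated only when the played action lies in its feedback set (the local-feedback model described above); the class and variable names are illustrative:

```python
import numpy as np

class GramMatrices:
    """Separate design (Gram) matrices for the reward and for each constraint.

    Sketch only: the reward matrix V is updated at every round, while A[m] is
    updated only when the played action falls inside constraint m's feedback
    set, reflecting the local safety feedback assumed in this framework.
    """

    def __init__(self, d, constraints, lam=1.0):
        self.V = lam * np.eye(d)                             # reward Gram matrix V_t
        self.A = {m: lam * np.eye(d) for m in constraints}   # per-constraint A_{m,t}

    def update(self, x, active_constraints):
        self.V += np.outer(x, x)                             # reward feedback is always observed
        for m in active_constraints:                         # only active constraints give feedback
            self.A[m] += np.outer(x, x)
```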
To this end, $\mathcal{P}^{rw}$ and $\mathcal{P}^{sf}$ are chosen to satisfy certain concentration and anti-concentration properties. In particular, for some $\delta \in (0,1)$ and constants $c, c'$, Safe-LinTS selects $\mathcal{P}^{rw}$ such that $\mathbb{P}\big(\|\eta_t\|_2 \leq \sqrt{cd\log(c'd/\delta)}\big) \geq 1 - \delta/2$ and $\mathbb{P}(u^\top \eta_t \geq 1) = p_1 > 0$ for any $u \in \mathbb{R}^d$ with $\|u\| = 1$. Similarly, $\mathcal{P}^{sf}$ is chosen such that $\mathbb{P}\big(\|\eta^s_t\|_2 \leq \frac{2L_s\rho_\gamma}{\tau - c^*_s}\sqrt{cd\log(c'd/\delta)}\big) \geq 1 - \delta/2$ and $\mathbb{P}\big(u^\top \eta^s_t \geq \frac{2L_s\rho_\gamma}{\tau - c^*_s}\big) = p_2 > 0$, where $\rho_\gamma > \max_{m \in \mathcal{M}} \|\gamma_m\|$ and $c^*_s = \max_{m \in \mathcal{M}} c^s_m$. These requirements state that the distributions should concentrate with high probability, yet still provide a certain amount of exploration (anti-concentration), which is crucial in achieving low regret.

Natural candidates for $\mathcal{P}^{rw}$ and $\mathcal{P}^{sf}$ are $\mathcal{N}(0, I)$ and $\mathcal{N}\big(0, \frac{2L_s\rho_\gamma}{\tau - c^*_s} I\big)$, respectively.
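A small sketch of drawing the two perturbations from these Gaussian candidates; the covariance scaling follows the expression above, and the numerical constants are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                                  # ambient dimension (placeholder)
L_s, rho_gamma, tau, c_s_star = 1.0, 1.0, 0.5, 0.2     # placeholder problem constants

scale = 2 * L_s * rho_gamma / (tau - c_s_star)         # covariance scaling for the safety perturbation
eta_t = rng.standard_normal(d)                         # eta_t ~ N(0, I): reward perturbation
eta_s_t = rng.multivariate_normal(np.zeros(d), scale * np.eye(d))   # safety perturbation
```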
Safe-LinTS uses $\eta_t \sim \mathcal{P}^{rw}$ to sample $\tilde{\theta}_t$ around $\hat{\theta}_t$, which provides the balance between exploration and exploitation while learning the unknown reward parameter, i.e., $\tilde{\theta}_t = \hat{\theta}_t + \beta_t V_t^{-1/2}\eta_t$. It then uses $\eta^s_t \sim \mathcal{P}^{sf}$ to sample $\tilde{\eta}_{m,t} = \beta^m_t A_{m,t}^{-1/2}\eta^s_t$, $\forall m \in \mathcal{M}$, which is used to provide the exploration needed to expand the estimated safe set to include higher-rewarding actions, i.e., optimistic actions.
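Given the sampled perturbations, Lines 7–8 of Algorithm 3 amount to the following computation; this is a sketch that forms the inverse matrix square roots via an eigendecomposition and assumes $V_t$ and $A_{m,t}$ are symmetric positive definite, with illustrative names throughout:

```python
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, U = np.linalg.eigh(M)
    return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

def perturbed_parameters(theta_hat, beta_t, V_t, eta_t, beta_m_t, A_m_t, eta_s_t):
    """Sketch of Lines 7-8 of Algorithm 3 for a single constraint m."""
    theta_tilde = theta_hat + beta_t * inv_sqrt(V_t) @ eta_t   # sampled reward parameter
    eta_tilde_m = beta_m_t * inv_sqrt(A_m_t) @ eta_s_t         # safe-set expansion term
    return theta_tilde, eta_tilde_m
```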
At the final step, Safe-LinTS picks an action $x_t$ from $D^{\mathrm{safe}}_t$ by maximizing $\tilde{\theta}_t^\top x + \tilde{\eta}_{m,t}^\top(x - x^s_m)$. Note that this is a linear objective with transparent exploration goals. In particular, the reward exploration is similar to LinTS in Abeille and Lazaric [5], whereas the second term adds exploration along the safety constraints using the known safe actions. Notice that this approach generalizes the algorithm proposed in [203], whose setting is a special case of the SLB framework considered in this study.
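When each $\hat{\Xi}_{m,t}$ is represented by a finite set of safe candidate actions, Line 9 of Algorithm 3 reduces to an enumeration; the sketch below uses this finite-candidate simplification, which is an illustrative assumption rather than the general implementation:

```python
import numpy as np

def select_action(candidates, theta_tilde, eta_tilde, x_safe):
    """Line 9 of Algorithm 3 over finite candidate sets (illustrative).

    candidates[m]: array of shape (n_m, d) of safe candidate actions in Xi_hat_{m,t}
    eta_tilde[m]:  perturbation-based exploration term for constraint m
    x_safe[m]:     known safe action x^s_m
    """
    best_val, best_x = -np.inf, None
    for m, X in candidates.items():
        vals = X @ theta_tilde + (X - x_safe[m]) @ eta_tilde[m]   # linear objective per candidate
        i = int(np.argmax(vals))
        if vals[i] > best_val:
            best_val, best_x = vals[i], X[i]
    # Ties are broken arbitrarily; in general Xi_hat_{m,t} need not be finite.
    return best_x
```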
Theorem 2.3.8. Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$, the regret of Safe-LinTS is $R_T = \tilde{O}\big(d^{3/2}\sqrt{|\mathcal{M}|T}\big)$.

The proof and the exact expressions are given in Appendix A.2.2. In the proof, we first show that Safe-LinTS selects safe (via Lemma 2.3.4) and optimistic actions with probability at least $p_1 p_2$, by showing that $\eta_t$ and $\eta^s_t$ provide sufficient exploration. Finally, we use the regret decomposition in [5] to give the regret upper bound. Notably, this result matches the regret upper bound in [203] for their setting. In the exact regret expression, the leading term contains $1/(p_1 p_2)$, i.e., the inverse of the probability of selecting an optimistic action. This relation is similar to that of $\nu_{\max}$ in Safe-OFUL. In particular, similar to Safe-OFUL, for a smaller worst-case safety gap of the known safe actions, Safe-LinTS needs to explore more to learn the optimal safe action, which results in increased regret through $p_2$.