2.3 Stochastic Linear Bandits with Unknown Safety Constraints
between exploration and exploitation while learning the unknown reward parameter, i.e., $\tilde{\theta}_t = \hat{\theta}_t + \beta_t V_t^{-1/2}\eta_t$. It then uses $\eta^i_t \sim \mathcal{P}^{TS}$ to sample $\tilde{\gamma}_{i,t} = \beta^i_t A_{i,t}^{-1/2}\eta^i_t$, $\forall i \in \mathcal{M}$, which will be used to provide the exploration needed to expand the estimated safe set to include higher-rewarding, i.e., optimistic, actions. At the final step, Safe-LinTS picks an action $x_t$ from $D^{\text{safe}}_t$ by maximizing $\tilde{\theta}_t^\top x + \tilde{\gamma}_{i,t}^\top(x - x^i_s)$. Note that this is a linear objective with transparent exploration goals. In particular, the reward exploration is similar to LinTS in Abeille and Lazaric [5], whereas the second term adds exploration along the safety constraints using the known safe actions. Notice that this approach generalizes the algorithm proposed in [203], whose setting is a special case of the SLB framework considered in this study.
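For concreteness, the following is a minimal numerical sketch of this sampling-and-maximization step over a finite set of candidate actions. The Gaussian draws standing in for the perturbation distribution, and all function and variable names (`safe_lints_step`, `constraint_conf`, etc.), are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def _inv_sqrt(G):
    """Return a matrix S with S @ S.T = G^{-1}; any such S suffices for the perturbation."""
    return np.linalg.inv(np.linalg.cholesky(G)).T

def safe_lints_step(theta_hat, V, beta_r, constraint_conf, candidates, x_safe, rng=None):
    """One Safe-LinTS decision (illustrative sketch, not the exact implementation).

    theta_hat       : (d,) RLS estimate of the reward parameter
    V               : (d, d) regularized Gram matrix for the reward estimate
    beta_r          : confidence radius beta_t for the reward estimate
    constraint_conf : dict i -> (beta_i, A_i), per-constraint radius and Gram matrix
    candidates      : list of (x, i) pairs, action x in region i, currently deemed safe
    x_safe          : dict i -> known safe action x_s^i
    """
    rng = rng or np.random.default_rng()
    d = theta_hat.shape[0]
    # Perturbed reward parameter: theta_tilde = theta_hat + beta_t V^{-1/2} eta
    theta_tilde = theta_hat + beta_r * _inv_sqrt(V) @ rng.standard_normal(d)
    # Per-constraint exploration directions: gamma_tilde_i = beta_t^i A_{i,t}^{-1/2} eta_t^i
    gamma_tilde = {i: b * _inv_sqrt(A) @ rng.standard_normal(d)
                   for i, (b, A) in constraint_conf.items()}
    # Maximize the linear objective theta_tilde^T x + gamma_tilde_i^T (x - x_s^i)
    def objective(pair):
        x, i = pair
        return theta_tilde @ x + gamma_tilde[i] @ (x - x_safe[i])
    return max(candidates, key=objective)[0]
```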
Theorem 2.3.8. Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$, the regret of Safe-LinTS is $R_T = \tilde{O}\big(d^{3/2}\sqrt{|\mathcal{M}|T}\big)$.

The proof and the exact expressions are given in Appendix A.2.2. In the proof, we first show that Safe-LinTS selects safe (via Lemma 2.3.4) and optimistic actions with probability at least $p_1 p_2$ by showing that $\eta_t$ and $\eta^i_t$ provide sufficient exploration. Finally, we use the regret decomposition in [5] to obtain the regret upper bound. Notably, this result matches the regret upper bound in [203] for their setting. In the exact regret expression, the leading term carries a factor of $1/(p_1 p_2)$, i.e., the inverse of the probability of selecting an optimistic action. This dependence mirrors the dependence on the worst-case safety gap of the known safe actions in Safe-OFUL: as in Safe-OFUL, for a smaller worst-case safety gap of the known safe actions, Safe-LinTS needs to explore more to learn the optimal safe action, which results in increased regret through $p_2$.
2.3.5 Linear Bandits with Nonlinear Constraints

As in the affine case, the agent is given a known safe action for each constraint, $x^i_s \in D^{\text{safe}}_0$, $\forall i \in \mathcal{M}$, as well as their safety values $c_i(x^i_s) = c^i_s < \tau$. Finally, we adopt the following simple regularity assumption on the nonlinear constraints.
Assumption 2.3.9 (Smooth & Lipschitz Safety Constraints). $c_i(x)$ is $M$-smooth and $S$-Lipschitz, $\forall x \in \Theta_i$, $\forall i \in \mathcal{M}$.
The local smoothness assumption is a rather weak assumption [24], while the local Lipschitzness is the nonlinear analog of Assumption 2.3.2 for affine constraints. The setting characterized above subsumes and generalizes the affine case in Section 2.3.1. Using the first-order Taylor expansion about the known safe actions, we obtain $c_i(x) = c_i(x^i_s) + \nabla c_i(x^i_s)^\top (x - x^i_s) + e_i(x)$, where $e_i(x)$ represents the remainder term. Notice that for small enough $\delta_i$, this expansion behaves very similarly to the affine functions studied in the previous sections, which motivates our algorithm design below. To avoid any further structural assumptions and keep the setting as general as possible, while keeping the problem tractable, we assume a safety gap for the optimal safe action to account for the function approximation errors.

Assumption 2.3.10 (Safety gap for optimal action). The optimal safe action $x^*$ has a safety gap of at least $\Delta$ from the constraint boundary, i.e., $c_{i^*}(x^*) \leq \tau - \Delta$, such that $\Delta > M\delta_i^2$.

This is a mild assumption since, unlike with linear constraints, for a nonlinear function the optimal action need not lie on the constraint boundary. Moreover, this assumption holds in many safe decision-making tasks, where the optimal safe action may itself be a significantly safe one; yet, to learn this action, one might need to consider a higher threshold during the learning process.
Safe-OFUL/LinTS with Pure Exploration
We propose an extension of our prior algorithms to achieve safe and effective decision-making for the SLB with multiple nonlinear safety constraints. Due to Assumption 2.3.9, there exists a safe ball of actions around each $x^i_s$, $\forall i \in \mathcal{M}$, i.e., $c_i(x_t) \leq \tau$ if $x_t \in \{x \in \Theta_i : \|x - x^i_s\|_2 < \delta_i\}$ for $\delta_i \leq (\tau - c^i_s)/(S + M\delta_i)$; indeed, for any such $x_t$, $S$-Lipschitzness gives $c_i(x_t) \leq c^i_s + S\delta_i \leq \tau$. The existence of this ball helps the agent estimate the gradient of the nonlinear constraint function around the known safe actions $x^i_s$. The main idea in our algorithm design is to learn the first-order function approximation in each $\Theta_i$, while taking the estimation error into account, so that the agent can eventually reach the optimal action $x^*$ without violating safety.
Algorithm 4 Safe-LinUCB/LinTS with Pure Exploration
1: Input: $c^i_s$, $x^i_s$, $\lambda$, $M$, $S$, $\Delta$, $\delta_i$
2: for $i \in \mathcal{M}$ do
3:   for $t = 1, 2, \ldots, T'$ do
4:     Play $x_t = \arg\max_{x \in D^w_i} \|x - x^i_s\|_{A^{-1}_{i,t}}$
5:   Construct $D^{\text{safe}}_{iT'}$
6: Run Safe-LinUCB/LinTS for the remainder with $D^{\text{safe}}_{|\mathcal{M}|T'}$
The algorithm consists of two phases: (i) Pure Exploration and (ii) Safe-LinUCB/LinTS. The pseudocode is given in Algorithm 4.
Pure Exploration: In this phase, the agent samples $T'$ actions from each constraint set $\Theta_i$. It uniformly excites all directions by playing $x_t = \arg\max_{x \in D^w_i} \|x - x^i_s\|_{A^{-1}_{i,t}}$ for $T'$ steps, where $D^w_i$ is the $(d-1)$-dimensional boundary surface of the $\delta_i$-ball around the known safe action $x^i_s$, defined as $D^w_i = \{x \in \Theta_i : \|x - x^i_s\|_2 = \delta_i\}$, and $A_{i,t} = \lambda I + \sum_{n=1}^{n_i(t)} (x_n - x^i_s)(x_n - x^i_s)^\top$. By construction of $D^w_i$, the agent achieves safe exploration. Moreover, this exploration strategy ensures that the agent always picks the direction corresponding to the smallest eigenvalue of $A_{i,t}$, resulting in persistent excitation of all directions, since all actions in $D^w_i$ have the same norm.
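As a rough illustration of this exploration rule, the sketch below picks, among a finite random sample of points on the $\delta_i$-sphere around $x^i_s$, the one with the largest $\|x - x^i_s\|_{A^{-1}_{i,t}}$ value and updates the Gram matrix. The discretization of the sphere and all names are assumptions made for this example only; the sketch also ignores the restriction to $\Theta_i$ for simplicity.

```python
import numpy as np

def explore_sphere(x_s, delta, A, num_candidates=256, rng=None):
    """One pure-exploration step around a known safe action (illustrative sketch).

    x_s   : (d,) known safe action for constraint i
    delta : radius delta_i of the safe ball around x_s
    A     : (d, d) current regularized Gram matrix A_{i,t}
    Returns the chosen action and the updated Gram matrix.
    """
    rng = rng or np.random.default_rng()
    d = x_s.shape[0]
    # Finite proxy for the boundary sphere D^w_i: random unit directions scaled by delta.
    dirs = rng.standard_normal((num_candidates, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    candidates = x_s + delta * dirs
    # Pick the candidate maximizing ||x - x_s||_{A^{-1}}, i.e. the least-excited direction.
    diffs = candidates - x_s
    scores = np.einsum("nd,dk,nk->n", diffs, np.linalg.inv(A), diffs)
    x_t = candidates[np.argmax(scores)]
    # Rank-one update of the Gram matrix with the played direction.
    v = x_t - x_s
    return x_t, A + np.outer(v, v)

# Usage sketch with illustrative values:
# d, lam, delta = 3, 1.0, 0.1
# A = lam * np.eye(d)
# x_t, A = explore_sphere(np.zeros(d), delta, A)
```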
At the end of this phase, the algorithm estimates the gradient of each constraint function using RLS, i.e., $\nabla\hat{c}_{it} = A^{-1}_{i,t}\sum_{n=1}^{n_i(t)} y^i_n (x_n - x^i_s)$, where $y^i_t = \hat{y}^i_t - c^i_s$, $\forall t$. Note that $n_i(t)$ is equal to $T'$ for all $i$ at the end of this phase. Next, the algorithm further decomposes $\nabla\hat{c}_{it} = \nabla\hat{c}^{LS}_{it} + \hat{e}^i_t$, where $\nabla\hat{c}^{LS}_{it} = A^{-1}_{i,t}\sum_{n=1}^{n_i(t)} (x_n - x^i_s)\big(\nabla c_i(x^i_s)^\top (x_n - x^i_s) + \epsilon^i_n\big)$ and $\hat{e}^i_t = A^{-1}_{i,t}\sum_{n=1}^{n_i(t)} (x_n - x^i_s)\, e_i(x_n)$. Notice that the expression for $\nabla\hat{c}^{LS}_{it}$ is the nonlinear analog of (2.29).
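A minimal sketch of this regularized least-squares gradient estimate, computed from the logged pure-exploration data, is given below; the array names and the way the observations are passed in are assumptions for illustration only.

```python
import numpy as np

def rls_gradient_estimate(x_s, c_s, xs_played, ys_observed, lam=1.0):
    """RLS estimate of the constraint gradient at a known safe action (sketch).

    x_s         : (d,) known safe action x_s^i
    c_s         : scalar safety value c_s^i = c_i(x_s^i)
    xs_played   : (n, d) actions played during pure exploration
    ys_observed : (n,) noisy constraint observations at those actions
    lam         : ridge regularization lambda
    """
    diffs = xs_played - x_s                              # (x_n - x_s^i)
    targets = ys_observed - c_s                          # y_n^i = yhat_n^i - c_s^i
    A = lam * np.eye(x_s.shape[0]) + diffs.T @ diffs     # Gram matrix A_{i,t}
    grad_hat = np.linalg.solve(A, diffs.T @ targets)     # A^{-1} sum_n y_n^i (x_n - x_s^i)
    return grad_hat, A
```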
Thus, the algorithm builds confidence sets around the estimates $\nabla\hat{c}^{LS}_{it}$, $\forall i \in \mathcal{M}$: $\mathcal{C}^i_t = \{v \in \mathbb{R}^d : \|v - \nabla\hat{c}^{LS}_{it}\|_{A_{i,t}} \leq \beta^i_t\}$, with $\beta^i_t = \sigma\sqrt{d\log\big(|\mathcal{M}|(1+T'\delta_i^2/\lambda)/\delta\big)} + \lambda^{1/2}S$. It also defines the event $E^i_{\nabla} = \{\nabla c_i(x^i_s) \in \mathcal{C}^i_t\}$, which holds with probability at least $1-\delta$ for all $t>0$ and $i \in \mathcal{M}$.

Safety Construction: Next, conditioned on the joint event $E_{\nabla} := \bigcap_{i\in\mathcal{M}} E^i_{\nabla}$, the algorithm aims to satisfy the safety constraints when picking actions. To achieve this, it conservatively constructs a safe set of actions $\hat{\Theta}^i_t = \{x \in \Theta_i : \nabla\hat{c}_{it}^{\,\top}(x - x^i_s) + \frac{\Delta}{2} \leq \tau - c^i_s\}$, and sets $D^{\text{safe}}_t = \bigcup_{i\in\mathcal{M}} \hat{\Theta}^i_t$.
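This conservative construction can be sketched as a simple filter over candidate actions within a region $\Theta_i$, using the estimated gradient and the $\Delta/2$ margin from above; the candidate array and all names below are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def estimated_safe_set(candidates, x_s, c_s, grad_hat, tau, Delta):
    """Conservative per-region safe set (illustrative sketch).

    Keeps the candidates x (assumed to lie in Theta_i) whose first-order safety
    estimate, padded with the Delta/2 margin, stays below the threshold tau:
        grad_hat^T (x - x_s) + Delta / 2 <= tau - c_s
    candidates : (n, d) candidate actions
    """
    margin = grad_hat @ (candidates - x_s).T + Delta / 2.0
    return candidates[margin <= tau - c_s]
```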
Figure 2.4: $D_0$ and $D^{\text{safe}}_0$, respectively, for affine constraints. Different colors represent different feedback sets $\Theta_i$.

Figure 2.5: $D_0$ and $D^{\text{safe}}_0$, respectively, for nonlinear ($\ell_2$-norm bound) constraints. Different colors represent different feedback sets $\Theta_i$.
Theorem 2.3.11. Suppose Assumptions 2.3.9 & 2.3.10 hold. For any $\delta \in (0,1)$, after $T'$ time steps of pure exploration per constraint set, we have (i) $x^* \in D^{\text{safe}}_{|\mathcal{M}|T'}$ and (ii) $D^{\text{safe}}_{|\mathcal{M}|T'} \subseteq D^{\text{safe}}_0$ with probability at least $1-\delta$, if
$$\frac{T'}{\log 2T'} \;\geq\; \left(\frac{2M^4\delta_i^2}{(\Delta - M\delta_i^2)^2}\right)^{2}.$$
The proof is in Appendix A.2.3. The main idea of the proof is to show that we can control the error from the nonlinearity using smoothness and simultaneously learn the gradient at each known safe action by uniformly playing actions around and close to it. We then build $D^{\text{safe}}_{|\mathcal{M}|T'}$ using the estimated gradients $\nabla\hat{c}_{it}$ and add an error margin to compensate for the smoothness approximation error away from $x^i_s$. After this phase, the agent executes the previously proposed algorithms using $D^{\text{safe}}_{|\mathcal{M}|T'}$.
Corollary 2.3.12 (Regret Bound). Suppose Assumptions 2.3.9 & 2.3.10 hold. Then, for the duration $T'$ given in Theorem 2.3.11 and any $\delta \in (0,1)$, with probability at least $1-2\delta$, the regret of Algorithm 4 is $\tilde{O}\big(|\mathcal{M}|T' + \sqrt{T}\big)$.
Figure 2.6: Left: Cumulative regret of Safe-OFUL (Safe-LinUCB) and Safe-LinTS for the setting in Figure 2.4 (solid line is the average, shaded region is one standard deviation). Right: Cumulative regret of Algorithm 4 (Safe-OFUL with initial pure exploration) for the setting in Figure 2.5.

Figure 2.7: Cumulative regret in the loan approval problem. Comparison of Safe-OFUL (Safe-LinUCB) with and without additional exploration.