
STOCHASTIC LINEAR BANDITS WITH PRACTICAL CONCERNS

2.3 Stochastic Linear Bandits with Unknown Safety Constraints

2.3.5 Linear Bandits with Nonlinear Constraints

between exploration and exploitation while learning the unknown reward parameter, i.e., $\tilde{\mu}_t = \hat{\mu}_t + \beta_t V_t^{-1/2}\eta_t$. It then uses $\eta^c_t \sim \mathcal{P}^c_{TS}$ to sample $\tilde{\omega}_{i,t} = \beta^i_t A_{i,t}^{-1/2}\eta^c_t$, $\forall i \in \mathcal{M}$, which will be used to provide the exploration needed to expand the estimated safe set to include higher-rewarding actions, i.e., optimistic actions.

At the final step, Safe-LinTS picks an action $x_t$ from $D^{\text{safe}}_t$ by maximizing $\tilde{\mu}_t^\top x + \tilde{\omega}_{i,t}^\top (x - x^s_i)$. Note that this is a linear objective with transparent exploration goals.

In particular, the reward exploration is similar to LinTS in Abeille and Lazaric [5], whereas the second term adds exploration along the safety constraints using the known safe actions. Notice that this approach generalizes the algorithm proposed in [203], whose setting is a special case of the SLB framework considered in this study.
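To make the decision rule concrete, the following is a minimal sketch of a single Safe-LinTS step. It assumes Gaussian perturbations for $\mathcal{P}_{TS}$ and $\mathcal{P}^c_{TS}$, a finite list of candidate actions already known to lie in $D^{\text{safe}}_t$, and that the second term uses the constraint region containing each candidate; all names (safe_lints_step, candidates, regions, etc.) are illustrative rather than the thesis implementation.

import numpy as np

def safe_lints_step(mu_hat, V, beta_t, x_safe, A, beta_i, candidates, regions, rng):
    """One Safe-LinTS decision over a finite set of candidates from the estimated safe set."""
    d = mu_hat.shape[0]
    # Perturbed reward parameter: tilde_mu_t = mu_hat_t + beta_t * V_t^{-1/2} eta_t
    eta = rng.standard_normal(d)
    V_inv_sqrt = np.linalg.inv(np.linalg.cholesky(V)).T  # a matrix square root of V^{-1}
    tilde_mu = mu_hat + beta_t * (V_inv_sqrt @ eta)
    # Constraint-side perturbations: tilde_omega_{i,t} = beta^i_t * A_{i,t}^{-1/2} eta^c_t
    eta_c = rng.standard_normal(d)
    tilde_omega = []
    for A_i, b_i in zip(A, beta_i):
        A_inv_sqrt = np.linalg.inv(np.linalg.cholesky(A_i)).T
        tilde_omega.append(b_i * (A_inv_sqrt @ eta_c))
    # Linear objective tilde_mu^T x + tilde_omega_i^T (x - x^s_i) over the safe candidates
    def score(k):
        x, i = candidates[k], regions[k]
        return tilde_mu @ x + tilde_omega[i] @ (x - x_safe[i])
    return candidates[max(range(len(candidates)), key=score)]

Here the filtering of candidates into $D^{\text{safe}}_t$ is assumed to have been done beforehand, as in the previous subsections.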

Theorem 2.3.8. Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the regret of Safe-LinTS is $R_T = \tilde{O}\big(d^{3/2}\sqrt{|\mathcal{M}|T}\big)$.

The proof and the exact expressions are given in Appendix A.2.2. In the proof, we first show that Safe-LinTS selects safe (via Lemma 2.3.4) and optimistic actions with probability at least $p_1 p_2$, by showing that $\eta_t$ and $\eta^c_t$ provide sufficient exploration. Finally, we use the regret decomposition in [5] to give the regret upper bound.

Notably, this result matches the regret upper bound in [203] for their setting. In the exact regret expression, the leading term has a $1/(p_1 p_2)$ factor, i.e., the inverse of the optimistic-action probability. This relation is similar to that of $k^i_{\max}$ in Safe-OFUL. In particular, similar to Safe-OFUL, for a smaller worst-case safety gap of the known safe actions, Safe-LinTS needs to explore more to learn the optimal safe action, which results in increased regret through $p_2$.

As in the affine setting, the agent has access to a known safe action for each constraint, $x^s_i \in D^{\text{safe}}_0$, $\forall i \in \mathcal{M}$, as well as their safety values $f_i(x^s_i) = \tau^s_i < \tau$. Finally, we adopt the following simple regularity assumption on the nonlinear constraints.

Assumption 2.3.9 (Smooth & Lipschitz Safety Constraints). $f_i(x)$ is $\zeta$-smooth and $S$-Lipschitz, $\forall x \in \Gamma_i$, $\forall i \in \mathcal{M}$.

The local smoothness assumption is a rather weak assumption [24], while the local Lipschitzness is the nonlinear analog of Assumption 2.3.2 with affine constraints. The setting characterized above subsumes and generalizes the affine case in Section 2.3.1. Using the first-order Taylor expansion about the known safe actions, we obtain $f_i(x) = f_i(x^s_i) + \nabla f_i(x^s_i)^\top (x - x^s_i) + \epsilon_i(x)$, where $\epsilon_i(x)$ represents the remainder term. Notice that for small enough $\delta_f$, this expansion behaves very similarly to the affine functions studied in the previous sections, which motivates our algorithm design. To avoid any further structural assumptions and keep the setting as general as possible, while keeping the problem tractable, we assume a safety gap for the optimal safe action to account for the function approximation errors.

Assumption 2.3.10 (Safety gap for optimal action). The optimal safe action $x^*$ has at least a $\Delta$ safety gap from the constraint boundary, i.e., $f_{i^*}(x^*) \le \tau - \Delta$, such that $\Delta > \zeta\delta_f^2$.

This is a mild assumption since, unlike with linear constraints, for a nonlinear function the optimal action need not lie at the boundary. Moreover, this assumption holds in many safe decision-making tasks, where the optimal safe action may lie well inside the safe region, yet to learn this action one might need to consider a higher threshold during the learning process.
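For intuition, a standard consequence of $\zeta$-smoothness (a sketch, assuming actions stay within distance $\delta_f$ of $x^s_i$) bounds the Taylor remainder as
\[
|\epsilon_i(x)| = \big|f_i(x) - f_i(x^s_i) - \nabla f_i(x^s_i)^\top (x - x^s_i)\big| \le \tfrac{\zeta}{2}\,\|x - x^s_i\|_2^2 \le \tfrac{\zeta\delta_f^2}{2},
\]
so requiring $\Delta > \zeta\delta_f^2$ in Assumption 2.3.10 leaves slack both for this remainder and for the gradient estimation error.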

Safe-OFUL/LinTS with Pure Exploration

We propose an extension of our prior algorithms to achieve safe and effective decision-making for the SLB with multiple nonlinear safety constraints. Due to Assumption 2.3.9, we know that there exists a safe ball of actions around each $x^s_i$, $\forall i \in \mathcal{M}$, i.e., $f_i(x_t) \le \tau$ if $x_t \in \{x \in \Gamma_i : \|x - x^s_i\|_2 < \delta_r\}$ for $\delta_r \le (\tau - \tau^s_i)/(S + \zeta\delta_f)$. The existence of this ball helps the agent to estimate the gradient of the nonlinear function around the known safe actions $x^s_i$. The main idea in our algorithm design is to learn the first-order function approximation in each $\Gamma_i$, while taking into account the estimation error, so that the agent can eventually get to the optimal action $x^*$ without violating safety.
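One way to see why this radius keeps the ball safe (a sketch, assuming $\|x - x^s_i\|_2 \le \delta_r \le \delta_f$) combines the Lipschitz and smoothness bounds:
\[
f_i(x) \;\le\; f_i(x^s_i) + \nabla f_i(x^s_i)^\top(x - x^s_i) + \tfrac{\zeta}{2}\|x - x^s_i\|_2^2
\;\le\; \tau^s_i + S\,\delta_r + \zeta\,\delta_f\,\delta_r
\;=\; \tau^s_i + (S + \zeta\delta_f)\,\delta_r \;\le\; \tau,
\]
where the last step uses $\delta_r \le (\tau - \tau^s_i)/(S + \zeta\delta_f)$.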

Algorithm 4 Safe-LinUCB/LinTS with Pure Exploration
1: Input: $\tau^s_i$, $x^s_i$, $\tau$, $\zeta$, $S$, $\Delta$, $\delta_f$
2: for $i \in \mathcal{M}$ do
3:   for $t = 1, 2, \ldots, T'$ do
4:     Play $x_t = \arg\max_{x \in D^w_i} \|x - x^s_i\|_{A^{-1}_{i,t}}$
5: Construct $D^{\text{safe}}_{|\mathcal{M}|T'}$
6: Run Safe-LinUCB/LinTS for the remainder of the horizon with $D^{\text{safe}}_{|\mathcal{M}|T'}$

The algorithm consists of two phases: (i) Pure Exploration and (ii) Safe-LinUCB/LinTS. The pseudocode is given in Algorithm 4.

Pure Exploration: In this phase, the agent samples $T'$ actions from each constraint set $\Gamma_i$. It uniformly excites all directions by playing $x_t = \arg\max_{x \in D^w_i} \|x - x^s_i\|_{A^{-1}_{i,t}}$ for $T'$ steps, where $D^w_i$ is the $(d-1)$-dimensional boundary surface of the $\delta_r$-ball around the known safe action $x^s_i$, defined as $D^w_i = \{x \in \Gamma_i : \|x - x^s_i\|_2 = \delta_r\}$, and $A_{i,t} = \lambda I + \sum_{k=1}^{N_i(t)} (x_k - x^s_i)(x_k - x^s_i)^\top$. By construction of $D^w_i$, the agent achieves safe exploration. Moreover, this exploration strategy ensures that the agent always picks the direction of the smallest eigenvalue, resulting in persistent excitation in all directions, since actions in $D^w_i$ have the same norm.
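As a concrete illustration, the following is a minimal sketch of this exploration loop for a single constraint set, assuming the full $\delta_r$-sphere around $x^s_i$ lies inside $\Gamma_i$ (as guaranteed by the choice of $\delta_r$ above); the function and variable names are illustrative.

import numpy as np

def pure_exploration_action(x_s, A, delta_r):
    # argmax over the delta_r-sphere of ||x - x_s||_{A^{-1}} is attained along
    # the eigenvector associated with the smallest eigenvalue of A.
    eigvals, eigvecs = np.linalg.eigh(A)
    v_min = eigvecs[:, np.argmin(eigvals)]
    return x_s + delta_r * v_min

def run_pure_exploration(x_s, delta_r, T_prime, lam=1.0):
    d = x_s.shape[0]
    A = lam * np.eye(d)              # A_{i,t} = lambda*I + sum_k (x_k - x_s)(x_k - x_s)^T
    played = []
    for _ in range(T_prime):
        x = pure_exploration_action(x_s, A, delta_r)
        played.append(x)
        diff = x - x_s
        A += np.outer(diff, diff)    # rank-one design-matrix update
    return A, played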

At the end of this phase, the algorithm estimates the gradient of the constraint functions using RLS, such that $\widehat{\nabla f}_{i,t} = A^{-1}_{i,t}\sum_{k=1}^{N_i(t)} y^i_k (x_k - x^s_i)$ for $y^i_t = \tilde{y}^i_t - \tau^s_i$, $\forall t$. Note that $N_i(t)$ is equal to $T'$ for all $i$ at the end of this phase. Next, the algorithm further decomposes $\widehat{\nabla f}_{i,t} = \widehat{\nabla f}^{LS}_{i,t} + \hat{\epsilon}_{i,t}$, where $\widehat{\nabla f}^{LS}_{i,t} = A^{-1}_{i,t}\sum_{\tau=1}^{N_i(t)} (x_\tau - x^s_i)\big(\nabla f_i(x^s_i)^\top (x_\tau - x^s_i) + \eta^i_\tau\big)$ and $\hat{\epsilon}_{i,t} = A^{-1}_{i,t}\sum_{\tau=1}^{N_i(t)} (x_\tau - x^s_i)\,\epsilon_i(x_\tau)$. Notice that the expression for $\widehat{\nabla f}^{LS}_{i,t}$ is the nonlinear analog of (2.29).
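A minimal sketch of this RLS estimate, assuming the played actions and their observed (noisy) safety values $\tilde{y}^i_k$ for a single constraint are available; names are illustrative.

import numpy as np

def estimate_gradient(x_s, played, y_tilde, tau_s, lam=1.0):
    # RLS estimate: grad_hat = A^{-1} * sum_k (y_tilde_k - tau_s) * (x_k - x_s)
    d = x_s.shape[0]
    A = lam * np.eye(d)
    b = np.zeros(d)
    for x, y in zip(played, y_tilde):
        diff = x - x_s
        A += np.outer(diff, diff)
        b += (y - tau_s) * diff      # centered observation y^i_k = tilde_y^i_k - tau^s_i
    return np.linalg.solve(A, b)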

Thus, the algorithm builds confidence sets around the estimates $\widehat{\nabla f}^{LS}_{i,t}$, $\forall i \in \mathcal{M}$: $\mathcal{C}^i_t = \{v \in \mathbb{R}^d : \|v - \widehat{\nabla f}^{LS}_{i,t}\|_{A_{i,t}} \le \beta^i_t\}$, with $\beta^i_t = R\sqrt{d\log\big(|\mathcal{M}|(1 + T'L^2/\lambda)/\delta\big)} + \lambda^{1/2} S$. It also defines the event $\mathcal{E}^{\nabla f_i}_t = \{\nabla f_i(x^s_i) \in \mathcal{C}^i_t\}$, which holds with probability at least $1-\delta$, for all $t > 0$ and $i \in \mathcal{M}$.

Safety Construction: Next, conditioned on the joint event $\mathcal{E}^{\nabla f}_t := \bigcap_{i \in \mathcal{M}} \mathcal{E}^{\nabla f_i}_t$, the algorithm aims to satisfy the safety constraints when picking actions. To achieve this, it conservatively constructs a safe set of actions $\hat{\Gamma}_{i,t} = \{x \in \Gamma_i : \widehat{\nabla f}^\top_{i,t}(x - x^s_i) + \tfrac{\Delta}{2} \le \tau - \tau^s_i\}$, where $D^{\text{safe}}_{|\mathcal{M}|T'} = \bigcup_{i \in \mathcal{M}} \hat{\Gamma}_{i,t}$.

Figure 2.4: $D_0$ and $D^{\text{safe}}_0$, respectively, for affine constraints. Different colors represent different feedback sets $\Gamma_i$.

Figure 2.5: $D_0$ and $D^{\text{safe}}_0$, respectively, for nonlinear ($\ell_2$-norm bound) constraints. Different colors represent different feedback sets $\Gamma_i$.

Theorem 2.3.11. Suppose Assumptions 2.3.9 & 2.3.10 hold. For any $\delta \in (0,1)$, after $T'$ time steps of pure exploration per constraint set, we have (i) $x^* \in D^{\text{safe}}_{|\mathcal{M}|T'}$ and (ii) $D^{\text{safe}}_{|\mathcal{M}|T'} \subseteq D^{\text{safe}}_0$ with probability at least $1-\delta$, if

\[
\frac{T'}{\log^{2} T'} \;\ge\; 2d\left(\frac{4\delta_f^{2}}{(\Delta - \zeta\delta_f^{2})^{2}}\right)^{2}.
\]

The proof is in Appendix A.2.3. The main idea of the proof is to show that we can control the error from the nonlinearity using smoothness and simultaneously learn the gradient at that point by uniformly playing actions around and close to the known safe actions. We then build $D^{\text{safe}}_{|\mathcal{M}|T'}$ using $\widehat{\nabla f}_i(x^s_i)$ and add an error margin to compensate for the smoothness approximation error away from $x^s_i$. After this phase, the agent executes the previously proposed algorithms using $D^{\text{safe}}_{|\mathcal{M}|T'}$.

Corollary 2.3.12 (Regret Bound). Suppose Assumptions 2.3.9 & 2.3.10 hold. Then, for the duration $T'$ given in Theorem 2.3.11 and any $\delta \in (0,1)$, with probability at least $1-2\delta$, the regret of Algorithm 4 is $\tilde{O}\big(|\mathcal{M}|T' + \sqrt{T}\big)$.

Figure 2.6: Left: Cumulative regret of Safe-OFUL (Safe-LinUCB) and Safe-LinTS for the setting in Figure 2.4 (solid line is the average, shaded region is one standard deviation). Right: Cumulative regret of Algorithm 4 (Safe-OFUL with initial pure exploration) for the setting in Figure 2.5.

Figure 2.7: Cumulative regret in the loan approval problem. Comparison of Safe-OFUL (Safe-LinUCB) with and without additional exploration.
