
STOCHASTIC LINEAR BANDITS WITH PRACTICAL CONCERNS

2.3 Stochastic Linear Bandits with Unknown Safety Constraints

2.3.5 Linear Bandits with Nonlinear Constraints

between exploration and exploitation while learning the unknown reward parameter, i.e., $\tilde{\mu}_t = \hat{\mu}_t + \beta_t V_t^{-1/2}\eta_t$. It then uses $\eta^c_t \sim \mathcal{P}^c_{TS}$ to sample $\tilde{\omega}_{i,t} = \beta^i_t A_{i,t}^{-1/2}\eta^c_t$, $\forall i \in \mathcal{M}$, which will be used to provide the exploration needed to expand the estimated safe set to include higher-rewarding actions, i.e., optimistic actions.

At the final step, Safe-LinTS picks an action $x_t$ from $D^{\text{safe}}_t$ by maximizing $\tilde{\mu}_t^\top x + \tilde{\omega}_{i,t}^\top (x - x^s_i)$. Note that this is a linear objective with transparent exploration goals.

In particular, the reward exploration is similar to LinTS in Abeille and Lazaric [5], whereas the second term adds exploration along the safety constraints using the known safe actions. Notice that this approach generalizes the algorithm proposed in [203], whose setting is a special case of the SLB framework considered in this study.
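To make the decision rule concrete, the following is a minimal sketch of a single Safe-LinTS step. It assumes Gaussian perturbations for $\mathcal{P}_{TS}$ and $\mathcal{P}^c_{TS}$, a finite list of candidate actions already known to lie in $D^{\text{safe}}_t$, and that the second term uses the constraint region containing each candidate; all names (safe_lints_step, candidates, regions, etc.) are illustrative rather than the thesis implementation.

import numpy as np

def safe_lints_step(mu_hat, V, beta_t, x_safe, A, beta_i, candidates, regions, rng):
    """One Safe-LinTS decision over a finite set of candidates from the estimated safe set."""
    d = mu_hat.shape[0]
    # Perturbed reward parameter: tilde_mu_t = mu_hat_t + beta_t * V_t^{-1/2} eta_t
    eta = rng.standard_normal(d)
    V_inv_sqrt = np.linalg.inv(np.linalg.cholesky(V)).T  # a matrix square root of V^{-1}
    tilde_mu = mu_hat + beta_t * (V_inv_sqrt @ eta)
    # Constraint-side perturbations: tilde_omega_{i,t} = beta^i_t * A_{i,t}^{-1/2} eta^c_t
    eta_c = rng.standard_normal(d)
    tilde_omega = []
    for A_i, b_i in zip(A, beta_i):
        A_inv_sqrt = np.linalg.inv(np.linalg.cholesky(A_i)).T
        tilde_omega.append(b_i * (A_inv_sqrt @ eta_c))
    # Linear objective tilde_mu^T x + tilde_omega_i^T (x - x^s_i) over the safe candidates
    def score(k):
        x, i = candidates[k], regions[k]
        return tilde_mu @ x + tilde_omega[i] @ (x - x_safe[i])
    return candidates[max(range(len(candidates)), key=score)]

Here the filtering of candidates into $D^{\text{safe}}_t$ is assumed to have been done beforehand, as in the previous subsections.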

Theorem 2.3.8. Suppose Assumptions 2.3.1–2.3.3 and 2.3.5 hold. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the regret of Safe-LinTS is $R_T = \tilde{O}\big(d^{3/2}\sqrt{|\mathcal{M}|T}\big)$.

The proof and the exact expressions are given in Appendix A.2.2. In the proof, we first show that Safe-LinTS selects safe (via Lemma 2.3.4) and optimistic actions with probability at least $p_1 p_2$, by showing that $\eta_t$ and $\eta^c_t$ provide sufficient exploration. Finally, we use the regret decomposition in [5] to give the regret upper bound.

Notably, this result matches the regret upper bound in [203] for their setting. In the exact regret expression, the leading term has a $1/(p_1 p_2)$ factor, i.e., the inverse of the optimistic-action probability. This relation is similar to that of $k^i_{\max}$ in Safe-OFUL. In particular, similar to Safe-OFUL, for a smaller worst-case safety gap of the known safe actions, Safe-LinTS needs to explore more to learn the optimal safe action, which results in increased regret through $p_2$.

As in the affine setting, the agent has access to a known safe action for each constraint, $x^s_i \in D^{\text{safe}}_0$, $\forall i \in \mathcal{M}$, as well as their safety values $f_i(x^s_i) = \tau^s_i < \tau$. Finally, we adopt the following simple regularity assumption on the nonlinear constraints.

Assumption 2.3.9 (Smooth & Lipschitz Safety Constraints). $f_i(x)$ is $\zeta$-smooth and $S$-Lipschitz, $\forall x \in \Gamma_i$, $\forall i \in \mathcal{M}$.

The local smoothness assumption is a rather weak assumption [24], while the local Lipschitzness is the nonlinear analog of Assumption 2.3.2 with affine constraints. The setting characterized above subsumes and generalizes the affine case in Section 2.3.1. Using the first-order Taylor expansion about the known safe actions, we obtain $f_i(x) = f_i(x^s_i) + \nabla f_i(x^s_i)^\top (x - x^s_i) + \epsilon_i(x)$, where $\epsilon_i(x)$ represents the remainder term. Notice that for small enough $\delta_f$, this expansion behaves very similarly to the affine functions studied in the previous sections, which motivates our algorithm design. To avoid any further structural assumptions and keep the setting as general as possible, while keeping the problem tractable, we assume a safety gap for the optimal safe action to account for the function approximation errors.

Assumption 2.3.10 (Safety gap for optimal action). The optimal safe action $x^*$ has at least a $\Delta$ safety gap from the constraint boundary, i.e., $f_{i^*}(x^*) \le \tau - \Delta$, such that $\Delta > \zeta\delta_f^2$.

This is a mild assumption since, unlike with linear constraints, for a nonlinear function the optimal action need not lie at the boundary. Moreover, this assumption holds in many safe decision-making tasks, where the optimal safe action may lie well inside the safe region, yet to learn this action one might need to consider a higher threshold during the learning process.
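For intuition, a standard consequence of $\zeta$-smoothness (a sketch, assuming actions stay within distance $\delta_f$ of $x^s_i$) bounds the Taylor remainder as
\[
|\epsilon_i(x)| = \big|f_i(x) - f_i(x^s_i) - \nabla f_i(x^s_i)^\top (x - x^s_i)\big| \le \tfrac{\zeta}{2}\,\|x - x^s_i\|_2^2 \le \tfrac{\zeta\delta_f^2}{2},
\]
so requiring $\Delta > \zeta\delta_f^2$ in Assumption 2.3.10 leaves slack both for this remainder and for the gradient estimation error.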

Safe-OFUL/LinTS with Pure Exploration

We propose an extension of our prior algorithms to achieve safe and effective decision-making for the SLB with multiple nonlinear safety constraints. Due to Assumption 2.3.9, we know that there exists a safe ball of actions around each $x^s_i$, $\forall i \in \mathcal{M}$, i.e., $f_i(x_t) \le \tau$ if $x_t \in \{x \in \Gamma_i : \|x - x^s_i\|_2 < \delta_r\}$ for $\delta_r \le (\tau - \tau^s_i)/(S + \zeta\delta_f)$. The existence of this ball helps the agent to estimate the gradient of the nonlinear function around the known safe actions $x^s_i$. The main idea in our algorithm design is to learn the first-order function approximation in each $\Gamma_i$, while taking into account the estimation error, so that the agent can eventually get to the optimal action $x^*$ without violating safety.
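One way to see why this radius keeps the ball safe (a sketch, assuming $\|x - x^s_i\|_2 \le \delta_r \le \delta_f$) combines the Lipschitz and smoothness bounds:
\[
f_i(x) \;\le\; f_i(x^s_i) + \nabla f_i(x^s_i)^\top(x - x^s_i) + \tfrac{\zeta}{2}\|x - x^s_i\|_2^2
\;\le\; \tau^s_i + S\,\delta_r + \zeta\,\delta_f\,\delta_r
\;=\; \tau^s_i + (S + \zeta\delta_f)\,\delta_r \;\le\; \tau,
\]
where the last step uses $\delta_r \le (\tau - \tau^s_i)/(S + \zeta\delta_f)$.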

Algorithm 4 Safe-LinUCB/LinTS with Pure Exploration
1: Input: $\tau^s_i$, $x^s_i$, $\tau$, $\zeta$, $S$, $\Delta$, $\delta_f$
2: for $i \in \mathcal{M}$ do
3:   for $t = 1, 2, \ldots, T'$ do
4:     Play $x_t = \arg\max_{x \in D^w_i} \|x - x^s_i\|_{A^{-1}_{i,t}}$
5: Construct $D^{\text{safe}}_{|\mathcal{M}|T'}$
6: Run Safe-LinUCB/LinTS for the remainder of the horizon with $D^{\text{safe}}_{|\mathcal{M}|T'}$

The algorithm consists of two phases: (i) Pure Exploration and (ii) Safe-LinUCB/LinTS. The pseudocode is given in Algorithm 4.

Pure Exploration: In this phase, the agent samples $T'$ actions from each constraint set $\Gamma_i$. It uniformly excites all directions by playing $x_t = \arg\max_{x \in D^w_i} \|x - x^s_i\|_{A^{-1}_{i,t}}$ for $T'$ steps, where $D^w_i$ is the $(d-1)$-dimensional boundary surface of the $\delta_r$-ball around the known safe action $x^s_i$, defined as $D^w_i = \{x \in \Gamma_i : \|x - x^s_i\|_2 = \delta_r\}$, and $A_{i,t} = \lambda I + \sum_{k=1}^{N_i(t)} (x_k - x^s_i)(x_k - x^s_i)^\top$. By construction of $D^w_i$, the agent achieves safe exploration. Moreover, this exploration strategy ensures that the agent always picks the direction of the smallest eigenvalue, resulting in persistent excitation in all directions, since actions in $D^w_i$ have the same norm.
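As a concrete illustration, the following is a minimal sketch of this exploration loop for a single constraint set, assuming the full $\delta_r$-sphere around $x^s_i$ lies inside $\Gamma_i$ (as guaranteed by the choice of $\delta_r$ above); the function and variable names are illustrative.

import numpy as np

def pure_exploration_action(x_s, A, delta_r):
    # argmax over the delta_r-sphere of ||x - x_s||_{A^{-1}} is attained along
    # the eigenvector associated with the smallest eigenvalue of A.
    eigvals, eigvecs = np.linalg.eigh(A)
    v_min = eigvecs[:, np.argmin(eigvals)]
    return x_s + delta_r * v_min

def run_pure_exploration(x_s, delta_r, T_prime, lam=1.0):
    d = x_s.shape[0]
    A = lam * np.eye(d)              # A_{i,t} = lambda*I + sum_k (x_k - x_s)(x_k - x_s)^T
    played = []
    for _ in range(T_prime):
        x = pure_exploration_action(x_s, A, delta_r)
        played.append(x)
        diff = x - x_s
        A += np.outer(diff, diff)    # rank-one design-matrix update
    return A, played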

At the end of this phase, the algorithm estimates the gradient of the constraint functions using RLS, such that $\widehat{\nabla f}_{i,t} = A^{-1}_{i,t}\sum_{k=1}^{N_i(t)} y^i_k (x_k - x^s_i)$ for $y^i_t = \tilde{y}^i_t - \tau^s_i$, $\forall t$. Note that $N_i(t)$ is equal to $T'$ for all $i$ at the end of this phase. Next, the algorithm further decomposes $\widehat{\nabla f}_{i,t} = \widehat{\nabla f}^{LS}_{i,t} + \hat{\epsilon}_{i,t}$, where $\widehat{\nabla f}^{LS}_{i,t} = A^{-1}_{i,t}\sum_{\tau=1}^{N_i(t)} (x_\tau - x^s_i)\big(\nabla f_i(x^s_i)^\top (x_\tau - x^s_i) + \eta^i_\tau\big)$ and $\hat{\epsilon}_{i,t} = A^{-1}_{i,t}\sum_{\tau=1}^{N_i(t)} (x_\tau - x^s_i)\,\epsilon_i(x_\tau)$. Notice that the expression for $\widehat{\nabla f}^{LS}_{i,t}$ is the nonlinear analog of (2.29).
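A minimal sketch of this RLS estimate, assuming the played actions and their observed (noisy) safety values $\tilde{y}^i_k$ for a single constraint are available; names are illustrative.

import numpy as np

def estimate_gradient(x_s, played, y_tilde, tau_s, lam=1.0):
    # RLS estimate: grad_hat = A^{-1} * sum_k (y_tilde_k - tau_s) * (x_k - x_s)
    d = x_s.shape[0]
    A = lam * np.eye(d)
    b = np.zeros(d)
    for x, y in zip(played, y_tilde):
        diff = x - x_s
        A += np.outer(diff, diff)
        b += (y - tau_s) * diff      # centered observation y^i_k = tilde_y^i_k - tau^s_i
    return np.linalg.solve(A, b)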

Thus, the algorithm builds confidence sets around the estimates $\widehat{\nabla f}^{LS}_{i,t}$, $\forall i \in \mathcal{M}$: $\mathcal{C}^i_t = \{v \in \mathbb{R}^d : \|v - \widehat{\nabla f}^{LS}_{i,t}\|_{A_{i,t}} \le \beta^i_t\}$, with $\beta^i_t = R\sqrt{d\log\big(|\mathcal{M}|(1 + T'L^2/\lambda)/\delta\big)} + \lambda^{1/2} S$. It also defines the event $\mathcal{E}^{\nabla f_i}_t = \{\nabla f_i(x^s_i) \in \mathcal{C}^i_t\}$, which holds with probability at least $1-\delta$, for all $t > 0$ and $i \in \mathcal{M}$.

Safety Construction: Next, conditioned on the joint event $\mathcal{E}^{\nabla f}_t := \bigcap_{i \in \mathcal{M}} \mathcal{E}^{\nabla f_i}_t$, the algorithm aims to satisfy the safety constraints when picking actions. To achieve this, it conservatively constructs a safe set of actions $\hat{\Gamma}_{i,t} = \{x \in \Gamma_i : \widehat{\nabla f}^\top_{i,t}(x - x^s_i) + \tfrac{\Delta}{2} \le \tau - \tau^s_i\}$, where $D^{\text{safe}}_{|\mathcal{M}|T'} = \bigcup_{i \in \mathcal{M}} \hat{\Gamma}_{i,t}$.

Figure 2.4: $D_0$ and $D^{\text{safe}}_0$, respectively, for affine constraints. Different colors represent different feedback sets $\Gamma_i$.

Figure 2.5: $D_0$ and $D^{\text{safe}}_0$, respectively, for nonlinear ($\ell_2$-norm bound) constraints. Different colors represent different feedback sets $\Gamma_i$.

Theorem 2.3.11. Suppose Assumptions 2.3.9 & 2.3.10 hold. For any $\delta \in (0,1)$, after $T'$ time steps of pure exploration per constraint set, we have (i) $x^* \in D^{\text{safe}}_{|\mathcal{M}|T'}$ and (ii) $D^{\text{safe}}_{|\mathcal{M}|T'} \subseteq D^{\text{safe}}_0$ with probability at least $1-\delta$, if

\[
\frac{T'}{\log^{2} T'} \;\ge\; 2d\left(\frac{4\delta_f^{2}}{(\Delta - \zeta\delta_f^{2})^{2}}\right)^{2}.
\]

The proof is in Appendix A.2.3. The main idea of the proof is to show that we can control the error from the nonlinearity using smoothness and simultaneously learn the gradient at that point by uniformly playing actions around and close to the known safe actions. We then build $D^{\text{safe}}_{|\mathcal{M}|T'}$ using $\widehat{\nabla f}_i(x^s_i)$ and add an error margin to compensate for the smoothness approximation error away from $x^s_i$. After this phase, the agent executes the previously proposed algorithms using $D^{\text{safe}}_{|\mathcal{M}|T'}$.

Corollary 2.3.12 (Regret Bound). Suppose Assumptions 2.3.9 & 2.3.10 hold. Then, for the duration $T'$ given in Theorem 2.3.11 and any $\delta \in (0,1)$, with probability at least $1-2\delta$, the regret of Algorithm 4 is $\tilde{O}\big(|\mathcal{M}|T' + \sqrt{T}\big)$.

Figure 2.6: Left: Cumulative regret of Safe-OFUL (Safe-LinUCB) and Safe-LinTS for the setting in Figure 2.4 (solid line is the average, shaded region is one standard deviation). Right: Cumulative regret of Algorithm 4 (Safe-OFUL with initial pure exploration) for the setting in Figure 2.5.

Figure 2.7: Cumulative regret in the loan approval problem. Comparison of Safe-OFUL (Safe-LinUCB) with and without additional exploration.
