
2.2 Stochastic Linear Bandits with Hidden Low-Rank Structure

2.2.3 Theoretical Analysis of PSLB

In this section, we state the regret upper bound of PSLB and provide the theoretical components that build up to this result. Recalling the quantities defined in (2.4), define $\Upsilon$ such that
\[
\Upsilon = \mathcal{O}\!\left( \left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\left(\frac{\Gamma\sqrt{m\alpha}}{\sqrt{K}\sqrt{\lambda+\sigma^{2}}} + m\right) \right). \tag{2.5}
\]

It represents the overall effect of deploying subspace recovery on the regret, in terms of the structural properties of the stochastic linear bandit setting.

Theorem 2.2.3 (Regret Upper Bound of PSLB). Fix any $\delta \in (0,1)$. Assume that for all $\hat{x}_{t,i} \in D_t$, $\hat{x}_{t,i}^\top \theta \in [-1,1]$. Under Assumptions 2.2.1 & 2.2.2, $\forall t \geq 1$, with probability at least $1-6\delta$, the regret of PSLB satisfies
\[
R_t = \min\left\{ \tilde{\mathcal{O}}\!\left(\Upsilon\sqrt{t}\right),\ \tilde{\mathcal{O}}\!\left(d\sqrt{t}\right) \right\}. \tag{2.6}
\]

The proof of the theorem involves three main pieces: the projection error analysis, the construction of projected confidence sets, and the regret analysis.

Projection Error Analysis

Consider the matrix $\hat{V}_t^\top V$ and its $i$th singular value denoted as $\sigma_i(\hat{V}_t^\top V)$, such that $\sigma_1(\hat{V}_t^\top V) \geq \ldots \geq \sigma_m(\hat{V}_t^\top V)$. Using the definition of the aperture of two linear manifolds [11], we write the following equivalence:
\begin{align}
\|\hat{P}_t - P\|_2 &= \max\left\{ \max_{x \in \mathrm{span}(V),\, \|x\|_2 = 1} \|(I_d - \hat{P}_t)x\|_2,\ \max_{y \in \mathrm{span}(\hat{V}_t),\, \|y\|_2 = 1} \|(I_d - P)y\|_2 \right\} \nonumber \\
&= \max\left\{ \|(I_d - \hat{V}_t\hat{V}_t^\top)V\|_2,\ \|(I_d - VV^\top)\hat{V}_t\|_2 \right\} \nonumber \\
&= \sqrt{\lambda_{\max}\!\left( V^\top (I_d - \hat{V}_t\hat{V}_t^\top)(I_d - \hat{V}_t\hat{V}_t^\top) V \right)} \tag{2.7} \\
&= \sqrt{\lambda_{\max}\!\left( I_m - (\hat{V}_t^\top V)^\top (\hat{V}_t^\top V) \right)} \nonumber \\
&= \sqrt{1 - \sigma_m^2}, \quad \text{where } \sigma_m \text{ is the smallest singular value of } \hat{V}_t^\top V, \nonumber \\
&= \sqrt{1 - \cos^2\Theta_m\big(\mathrm{span}(V), \mathrm{span}(\hat{V}_t)\big)} = \sin\Theta_m, \tag{2.8}
\end{align}

where (2.7) follows since $V$ and $\hat{V}_t$ have the same dimensions, and (2.8) follows from the fact that $\cos\Theta_i(\mathrm{span}(V), \mathrm{span}(\hat{V}_t)) = \sigma_i(\hat{V}_t^\top V)$, where $\Theta_m$ is the largest principal angle between the column spans of $V$ and $\hat{V}_t$. Thus, bounding the projection error between two projection matrices is equivalent to bounding the sine of the largest principal angle between the subspaces onto which they project. In light of this relation, and the prior analysis of the Davis–Kahan $\sin\Theta$ theorem [68], we provide the following lemma on the concentration of the sine and the finite sample projection error.
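Before stating the lemma, the equivalence above can be checked numerically: for two equidimensional subspaces, the spectral norm of the projection difference matches $\sqrt{1-\sigma_m^2}$ computed from $\hat{V}^\top V$. The sketch below is illustrative only and assumes NumPy; the dimensions and random bases are arbitrary stand-ins, not quantities from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 3

# Two random m-dimensional subspaces of R^d, represented by orthonormal bases.
V, _ = np.linalg.qr(rng.standard_normal((d, m)))        # "true" basis V
V_hat, _ = np.linalg.qr(rng.standard_normal((d, m)))    # "estimated" basis V_hat

P = V @ V.T              # projection onto span(V)
P_hat = V_hat @ V_hat.T  # projection onto span(V_hat)

# Left-hand side of the equivalence: ||P_hat - P||_2 (spectral norm).
lhs = np.linalg.norm(P_hat - P, ord=2)

# Right-hand side: sine of the largest principal angle,
# i.e., sqrt(1 - sigma_m^2) with sigma_m the smallest singular value of V_hat^T V.
sigma = np.linalg.svd(V_hat.T @ V, compute_uv=False)
rhs = np.sqrt(1.0 - sigma.min() ** 2)

print(lhs, rhs)  # the two values agree up to numerical precision
```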

Lemma 2.2.4 (Finite Sample Projection Error). Fix any $\delta \in (0,1)$. Let $t_{w,\delta} = n_\delta / K$. Suppose Assumption 2.2.1 holds. Then with probability at least $1-3\delta$, $\forall t \geq t_{w,\delta}$,
\[
\|\hat{P}_t - P\|_2 \leq \frac{\phi_\delta}{\sqrt{t}}, \quad \text{where } \phi_\delta = 2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}}. \tag{2.9}
\]

Lemma 2.2.4 improves the existing bounds on the projection error (Corollary 2.9 in Vaswani and Narayanamurthy [284]) by using the matrix Chernoff inequality [268]. It also provides the precise problem-dependent quantities in the bound, which are required for determining the minimum number of samples needed to construct tight confidence sets via subspace estimation. The formal and detailed version of Lemma 2.2.4 and the details of the proof are provided in Appendix A.1.1.

Note that, as discussed in Section 2.2.2, we define the confidence set $\mathcal{C}_{p,t}$ in (2.9) for all $t \geq t_{w,\delta}$. Due to the equivalence $\|\hat{P}_t - P\|_2 = \sin\Theta_m$, for all $t \geq 1$ we have $\|\hat{P}_t - P\|_2 \leq 1$. Therefore, any projection error bound greater than 1 is vacuous. Consequently, with high probability, the bound on the projection error in (2.9) becomes less than 1 when $t \geq t_{w,\delta}$. After round $t_{w,\delta}$, PSLB starts to produce non-trivial confidence sets $\mathcal{C}_{p,t}$ around $\hat{P}_t$. However, note that $t_{w,\delta}$ can be significantly large for problems whose latent structures are hard to recover, e.g., when $\alpha$ is linear in $d$.
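As a concrete reading of this warm-up behavior, the bound $\phi_\delta/\sqrt{t}$ in (2.9) drops below 1 once $t > \phi_\delta^2$, so the number of rounds before the projection-error bound becomes non-vacuous can be gauged directly from $(\Gamma, \alpha, K, d, \delta)$. The sketch below is a rough illustration under assumed parameter values; it is not the formal definition of $t_{w,\delta}$.

```python
import numpy as np

def phi_delta(Gamma, alpha, K, d, delta):
    """Projection-error coefficient from (2.9): 2*Gamma*sqrt((alpha/K)*log(2d/delta))."""
    return 2.0 * Gamma * np.sqrt((alpha / K) * np.log(2.0 * d / delta))

# Assumed, illustrative problem parameters (not from the thesis).
Gamma, alpha, K, d, delta = 2.0, 5.0, 50, 100, 0.05

phi = phi_delta(Gamma, alpha, K, d, delta)
# The bound phi/sqrt(t) is below 1 (non-vacuous) once t exceeds phi^2.
warmup = int(np.ceil(phi ** 2))
print(f"phi_delta = {phi:.2f}, bound non-vacuous after ~{warmup} rounds")
```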

The term $\phi_\delta$ in Lemma 2.2.4 also provides several important intuitions about the subspace estimation problem in terms of the problem structure. Recalling the definition of $\Gamma$ in (2.4), as $g_\psi$ decreases, the projection error shrinks since the underlying subspace becomes more distinguishable. Conversely, as $g_x$ diverges from 1, it becomes harder to recover the underlying $m$-dimensional subspace. Additionally, since $\alpha$ is the maximum of the effective dimensions of the true action vector and the perturbation vector, a large $\alpha$ makes the subspace recovery harder and the projection error bound looser, whereas observing more action vectors, $K$, in each round produces a tighter bound on $\|\hat{P}_t - P\|_2$. The effects of these structural properties on the subspace estimation translate to the confidence set construction and ultimately to the regret upper bound.

Projected Confidence Sets

In this section, we analyze the construction of $\mathcal{C}_{m,t}$ and $\mathcal{C}_{d,t}$. For any round $t \geq 1$, define $\hat{\Sigma}_t = \sum_{i=1}^{t} \hat{X}_i \hat{X}_i^\top = \mathbf{X}_t \mathbf{X}_t^\top$. At round $t$, let $A_t \coloneqq \hat{P}_t(\hat{\Sigma}_{t-1} + \lambda I_d)\hat{P}_t$ for $\lambda > 0$. Let $B_t$ be a symmetric matrix such that $A_t = \hat{V}_t B_t \hat{V}_t^\top$. Notice that $B_t$ is a full rank $m \times m$ matrix. The rewards obtained up to round $t$ are denoted as $\mathbf{r}_{t-1}$. At round $t$, after estimating the projection matrix $\hat{P}_t$ associated with the underlying subspace, PSLB finds $\theta_t$, an estimate of $\theta$, while, with high probability, $\theta$ lives within the estimated subspace. Therefore, $\theta_t$ is the solution to the following Tikhonov-regularized least squares problem with regularization parameters $\lambda > 0$ and $\hat{P}_t$,
\[
\theta_t = \operatorname*{arg\,min}_{\theta} \ \big\|(\hat{P}_t \mathbf{X}_{t-1})^\top \theta - \mathbf{r}_{t-1}\big\|_2^2 + \lambda \|\hat{P}_t \theta\|_2^2 .
\]
Notice that the regularization is applied along the estimated subspace. Solving for $\theta$ gives $\theta_t = A_t^{\dagger} \hat{P}_t \mathbf{X}_{t-1} \mathbf{r}_{t-1}$, where $A_t^{\dagger}$ denotes the Moore–Penrose pseudoinverse of $A_t$.
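For concreteness, the projected estimate can be formed exactly as in the closed form above: project the supervised actions with $\hat{P}_t$, build $A_t$, and apply its pseudoinverse. The following sketch assumes NumPy and uses randomly generated data standing in for the actions $\mathbf{X}_{t-1}$, rewards $\mathbf{r}_{t-1}$, and estimated basis $\hat{V}_t$; it illustrates the formula and is not the PSLB implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n, lam = 20, 3, 200, 1.0

V_hat, _ = np.linalg.qr(rng.standard_normal((d, m)))  # estimated basis (stand-in)
P_hat = V_hat @ V_hat.T                               # estimated projection matrix

X = rng.standard_normal((d, n))   # chosen action vectors X_{t-1} (columns)
r = rng.standard_normal(n)        # observed rewards r_{t-1}

Sigma_hat = X @ X.T                                   # empirical second-moment matrix
A = P_hat @ (Sigma_hat + lam * np.eye(d)) @ P_hat     # A_t = P_hat (Sigma + lam I) P_hat

# theta_t = A^+ P_hat X r : regularized least squares restricted to span(V_hat).
# rcond zeroes out directions orthogonal to span(V_hat), where A is numerically zero.
theta_t = np.linalg.pinv(A, rcond=1e-10) @ (P_hat @ X @ r)

# theta_t lies in the estimated subspace: projecting it changes nothing.
assert np.allclose(P_hat @ theta_t, theta_t)
print(theta_t[:5])
```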

Let $S_t \coloneqq \sum_{i=1}^{t} \hat{P}_t \hat{X}_{i-1} \eta_{i-1} = \hat{P}_t \mathbf{X}_{t-1} \boldsymbol{\eta}_{t-1}$. Before presenting the confidence set construction, we provide a self-normalized bound on $S_t$.

Theorem 2.2.5 (Self-Normalized Bound for Vector-Valued Martingales). For any $\delta \in (0,1)$, with probability at least $1-\delta$, for all $t \geq 1$,
\[
\|S_t\|^2_{A_t^{\dagger}} \leq 2R^2 \log\!\left( \frac{\det(B_t)^{1/2} \det(\lambda I_m)^{-1/2}}{\delta} \right).
\]

This result is a self-normalized bound for vector-valued martingales similar to that of Abbasi-Yadkori et al. [3], and it can be considered as the projected version of their Theorem 1. The proof of Theorem 2.2.5 is given in Appendix A.1.2. Define $L$ such that for all $t \geq 1$ and $i \in [K]$, $\|\hat{x}_{t,i}\|_2 \leq L$, and let $\gamma = \frac{L^2}{\lambda \log(1 + L^2/\lambda)}$. Consider the following lemmas, which will be useful in proving the confidence set construction; their proofs are given in Appendix A.1.3.

Lemma 2.2.6. Suppose Assumptions 2.2.1 & 2.2.2 hold. Then, $\det(B_t) \leq \left( \lambda + \frac{tL^2}{m} \right)^m$.

Lemma 2.2.7. Suppose Assumptions 2.2.1 & 2.2.2 hold. Then,
\[
\big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \leq L \sqrt{t}\, \sqrt{\gamma m}\, \sqrt{\log\!\left(1 + \frac{tL^2}{m\lambda}\right)}.
\]

The following theorem gives the construction of the projected confidence set, $\mathcal{C}_{m,t}$.

Theorem 2.2.8 (Projected Confidence Set Construction, $\mathcal{C}_{m,t}$). Fix any $\delta \in (0,1)$. Let Assumptions 2.2.1 & 2.2.2 hold, and suppose that $\forall t \geq 1$ and $i \in [K]$, $\|\hat{x}_{t,i}\|_2 \leq L$. If $\|\theta\|_2 \leq S$, then, with probability at least $1-4\delta$, $\forall t \geq t_{w,\delta}$, $\theta$ lies in the set $\mathcal{C}_{m,t} = \left\{ \theta' \in \mathbb{R}^d : \|\theta_t - \theta'\|_{A_t} \leq \beta_{t,\delta} \right\}$, where
\[
\beta_{t,\delta} = R\sqrt{2\log\frac{1}{\delta} + m\log\!\left(1 + \frac{tL^2}{m\lambda}\right)} + L S \phi_\delta \sqrt{\gamma m \log\!\left(1 + \frac{tL^2}{m\lambda}\right)} + S\sqrt{\lambda}. \tag{2.10}
\]

Proof. From the definition of $\theta_t$ and $\mathbf{r}_t$, we get the following:

\begin{align*}
\theta_t &= A_t^{\dagger} S_t + A_t^{\dagger} \hat{P}_t \hat{\Sigma}_{t-1} P \theta \qquad \text{since } \theta \in \mathrm{span}(V) \\
&= A_t^{\dagger} S_t + A_t^{\dagger} \left( \hat{P}_t \hat{\Sigma}_{t-1}\big(\hat{P}_t + P - \hat{P}_t\big) + \lambda \hat{P}_t - \lambda \hat{P}_t \right) \theta \\
&= A_t^{\dagger} S_t + \hat{P}_t \theta + A_t^{\dagger} \big( \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t) \big) \theta - \lambda A_t^{\dagger} \theta .
\end{align*}
Using this, we derive the following for $x = A_t(\theta_t - \theta)$, noting that $x \in \mathrm{span}(\hat{V}_t)$ and hence $x^\top \hat{P}_t \theta = x^\top \theta$:
\begin{align*}
x^\top \theta_t - x^\top \theta &= x^\top A_t^{\dagger} S_t + x^\top A_t^{\dagger} \big( \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t) \big) \theta - \lambda x^\top A_t^{\dagger} \theta \\
&= \langle x, S_t \rangle_{A_t^{\dagger}} + \big\langle x, \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t)\theta \big\rangle_{A_t^{\dagger}} - \lambda \langle x, \theta \rangle_{A_t^{\dagger}} .
\end{align*}

Using the Cauchy–Schwarz inequality, we can upper bound the magnitude of the difference as follows:
\begin{align}
|x^\top \theta_t - x^\top \theta| &\leq \|x\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t)\theta \big\|_{A_t^{\dagger}} + \lambda \|\theta\|_{A_t^{\dagger}} \right) \nonumber \\
&\leq \|x\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t)\theta \big\|_2 + \sqrt{\lambda}\,\|\theta\|_2 \right) \tag{2.11} \\
&\leq \|x\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \|P - \hat{P}_t\|_2 \|\theta\|_2 + \sqrt{\lambda}\,\|\theta\|_2 \right). \nonumber
\end{align}
Plugging in $x = A_t(\theta_t - \theta)$, we get

\[
\|\theta_t - \theta\|^2_{A_t} \leq \|A_t(\theta_t - \theta)\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \|P - \hat{P}_t\|_2 \|\theta\|_2 + \sqrt{\lambda}\,\|\theta\|_2 \right).
\]
Since $\|A_t(\theta_t - \theta)\|_{A_t^{\dagger}} = \|\theta_t - \theta\|_{A_t}$, dividing both sides by $\|\theta_t - \theta\|_{A_t}$ and using the fact that $\|\theta\|_2 \leq S$ gives
\[
\|\theta_t - \theta\|_{A_t} \leq \|S_t\|_{A_t^{\dagger}} + S \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \|P - \hat{P}_t\|_2 + S\sqrt{\lambda}. \tag{2.12}
\]

Notice that the first term is the projected version of Theorem 1 in [3], and the second term is the additional term appearing in the confidence interval construction due to the non-zero projection error. As can be seen, with knowledge of the true projection matrix, the confidence interval reduces to the one in [3] with $d$ replaced by $m$.
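Concretely, if the true projection matrix were known ($\hat{P}_t = P$, so the projection error term in (2.12) vanishes), the radius in (2.10) specializes to the $m$-dimensional analogue of the OFUL radius. The display below is this specialization written out for emphasis; it is a direct consequence of (2.10), not a separate equation from the thesis:
\[
\beta_{t,\delta}\big|_{\hat{P}_t = P} = R\sqrt{2\log\frac{1}{\delta} + m\log\!\left(1 + \frac{tL^2}{m\lambda}\right)} + S\sqrt{\lambda}.
\]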

Using Theorem 2.2.5 and Lemma 2.2.4, we get:
\[
\|\theta_t - \theta\|_{A_t} \leq R\sqrt{2\log\!\left( \frac{\det(B_t)^{1/2}\det(\lambda I_m)^{-1/2}}{\delta} \right)} + \frac{S\phi_\delta}{\sqrt{t}} \big\| (A_t^{\dagger})^{1/2}\hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 + S\sqrt{\lambda}.
\]
Finally, combining this with Lemma 2.2.6 and Lemma 2.2.7 gives the statement of Theorem 2.2.8. $\square$

Notice that the overall proof follows machinery similar to that of [3]. Specifically, the first term of $\beta_{t,\delta}$ in (2.10) is derived similarly via a self-normalized tail inequality, Theorem 2.2.5. However, since at each round PSLB projects the supervised actions onto an estimated $m$-dimensional subspace to estimate $\theta$, $d$ is replaced by $m$ in the bound using Lemma 2.2.6. While enjoying the benefit of projection, this construction of the confidence set suffers from the finite sample projection error, i.e., the uncertainty in the subspace estimation. This effect is observed via the second term in (2.10), which involves the confidence bound for the estimated projection matrix, $\phi_\delta$. This term is critical in determining the tightness of the confidence set on $\theta$. As discussed before, $\phi_\delta$ reflects the difficulty of subspace recovery for the given problem, and it depends on the underlying structure of the problem and the SLB. This shows that as estimating the underlying subspace gets easier, a projection-based approach to constructing the confidence sets on $\theta$ provides tighter bounds.

In order to tolerate the possible difficulty of subspace recovery, PSLB also constructs $\mathcal{C}_{d,t}$, the confidence set for $\theta$ built without subspace recovery. The construction of $\mathcal{C}_{d,t}$ follows OFUL [3]. Let $Z_t = \hat{\Sigma}_{t-1} + \lambda I_d$. The algorithm finds $\hat{\theta}_t$, the $\ell_2$-regularized least squares estimate of $\theta$ in the ambient space.

The construction of $\mathcal{C}_{d,t}$ is done under the same assumptions as Theorem 2.2.8, such that with probability at least $1-\delta$, $\theta$ lies in the set $\mathcal{C}_{d,t} = \left\{ \theta' \in \mathbb{R}^d : \|\hat{\theta}_t - \theta'\|_{Z_t} \leq \Omega_{t,\delta} \right\}$, where
\[
\Omega_{t,\delta} = R\sqrt{2\log\frac{1}{\delta} + d\log\!\left(1 + \frac{tL^2}{d\lambda}\right)} + S\sqrt{\lambda}.
\]
The search for an optimistic parameter vector happens in $\mathcal{C}_{m,t} \cap \mathcal{C}_{d,t}$. Notice that $\theta \in \mathcal{C}_{m,t} \cap \mathcal{C}_{d,t}$ with probability at least $1-5\delta$. Optimistically choosing the pair $(\hat{X}_t, \tilde{\theta}_t)$ within the described confidence sets gives PSLB a way to tolerate the possibility of failure in recovering an underlying structure. If the confidence set $\mathcal{C}_{m,t}$ is loose, or PSLB is not able to recover an underlying structure, then $\mathcal{C}_{d,t}$ provides the useful confidence set to obtain desirable learning behavior.
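To make the trade-off between the two sets tangible, the radii $\beta_{t,\delta}$ from (2.10) and $\Omega_{t,\delta}$ above can be evaluated side by side: when the subspace is easy to recover ($\phi_\delta$ small, $m \ll d$), the projected radius is much smaller, while for hard recovery the ambient radius can win. The sketch below plugs assumed, illustrative parameter values into the two formulas; it is an illustration, not part of PSLB itself.

```python
import numpy as np

def beta_radius(t, R, S, L, lam, m, phi_delta, delta):
    """Projected confidence radius beta_{t,delta} from (2.10)."""
    gamma = L**2 / (lam * np.log1p(L**2 / lam))
    log_term = np.log1p(t * L**2 / (m * lam))
    return (R * np.sqrt(2 * np.log(1 / delta) + m * log_term)
            + L * S * phi_delta * np.sqrt(gamma * m * log_term)
            + S * np.sqrt(lam))

def omega_radius(t, R, S, L, lam, d, delta):
    """Ambient (OFUL-style) confidence radius Omega_{t,delta}."""
    return (R * np.sqrt(2 * np.log(1 / delta) + d * np.log1p(t * L**2 / (d * lam)))
            + S * np.sqrt(lam))

# Assumed, illustrative values (not from the thesis).
t, R, S, L, lam, delta = 10_000, 1.0, 1.0, 1.0, 1.0, 0.05
d, m = 200, 5

for phi in (0.5, 20.0):  # easy vs. hard subspace recovery
    print(f"phi_delta={phi:5.1f}  beta={beta_radius(t, R, S, L, lam, m, phi, delta):8.2f}"
          f"  Omega={omega_radius(t, R, S, L, lam, d, delta):8.2f}")
```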

Regret Analysis

PSLB uses the intersection of $\mathcal{C}_{m,t}$ and $\mathcal{C}_{d,t}$ as the confidence set at round $t$. Using only $\mathcal{C}_{d,t}$ is equivalent to following OFUL, and the regret analysis can be found in [3]. The regret analysis of using only the projected confidence set $\mathcal{C}_{m,t}$ is the main contribution of this work. The following lemmas will be key in the regret analysis.

Lemma 2.2.9. At round $k$, for any $\hat{x} \in D_k$, if $\nu \in \mathcal{C}_k$, then $\big|(\hat{P}_k \hat{x})^\top(\nu - \theta_k)\big| \leq \beta_{k,\delta} \|\hat{x}\|_{A_k^{\dagger}}$.

Define $t_{r,\delta}$ such that
\[
t_{r,\delta} = 1 + \left( \frac{2m}{2m-1} \cdot \frac{4L^2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}} + \sqrt{2L(\lambda+\sigma^2)\log\frac{m}{\delta}}}{\lambda+\sigma^2} \right)^{2}.
\]

Lemma 2.2.10. For all $t \geq t_{w,\delta}$, with probability at least $1-\delta$,
\[
\lambda_m\big(\hat{P}_t \hat{\Sigma}_{t-1} \hat{P}_t\big) \geq (t-1)(\lambda+\sigma^2) - \sqrt{t-1}\left( 4L^2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}} + \sqrt{2L(\lambda+\sigma^2)\log\frac{m}{\delta}} \right). \tag{2.13}
\]
Also, for all $t \geq t_{r,\delta}$, with probability at least $1-\delta$,
\[
\lambda_m\big(\hat{P}_t \hat{\Sigma}_{t-1} \hat{P}_t\big) \geq \frac{\lambda+\sigma^2}{2m}\,(t-1). \tag{2.14}
\]

The proofs of Lemmas 2.2.9 and 2.2.10 are in Appendix A.1.4. The following theorem gives the regret upper bound for using only the projected confidence set $\mathcal{C}_{m,t}$.

Theorem 2.2.11 (Regret Upper Bound of Using Only $\mathcal{C}_{m,t}$). Fix any $\delta \in (0,1)$. Assume that for all $\hat{x}_{t,i} \in D_t$, $\hat{x}_{t,i}^\top \theta \in [-1,1]$. Under Assumptions 2.2.1 & 2.2.2, $\forall t \geq 1$, with probability at least $1-6\delta$, the regret of using only $\mathcal{C}_{m,t}$ satisfies
\[
R_{t,\mathcal{C}_{m,t}} \leq \tilde{\mathcal{O}}\!\left( \left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\left(\frac{\Gamma\sqrt{m\alpha}}{\sqrt{K}\sqrt{\lambda+\sigma^2}} + m\right) \sqrt{t} \right). \tag{2.15}
\]

Proof. The instantaneous regret of the algorithm at the $i$th round, $l_i = \hat{X}_i^{*\top}\theta - \hat{X}_i^\top\theta$, can be decomposed as follows:
\begin{align}
\hat{X}_i^{*\top}\theta - \hat{X}_i^\top\theta &\leq (\tilde{P}_i \hat{X}_i)^\top \tilde{\theta}_i - (P\hat{X}_i)^\top \theta \tag{2.16} \\
&= \hat{X}_i^\top(\tilde{P}_i - \hat{P}_i + \hat{P}_i)\tilde{\theta}_i - \hat{X}_i^\top(\hat{P}_i + P - \hat{P}_i)\theta \nonumber \\
&= (\hat{P}_i\hat{X}_i)^\top(\tilde{\theta}_i - \theta_i) + (\hat{P}_i\hat{X}_i)^\top(\theta_i - \theta) + \big((\hat{P}_i - P)\hat{X}_i\big)^\top\theta + \big((\tilde{P}_i - \hat{P}_i)\hat{X}_i\big)^\top\tilde{\theta}_i \nonumber \\
&\leq 2\beta_{i,\delta}\|\hat{X}_i\|_{A_i^{\dagger}} + 2LS\|\hat{P}_i - P\|_2, \tag{2.17}
\end{align}
where (2.16) follows since $(\tilde{P}_i, \hat{X}_i, \tilde{\theta}_i)$ is optimistic, and (2.17) holds for all $i$ with probability at least $1-4\delta$ due to Lemma 2.2.9 and Theorem 2.2.8. Combining this decomposition with the fact that $l_i \leq 2$, we get
\begin{align}
l_i &\leq 2\min\!\left( \beta_{i,\delta}\|\hat{X}_i\|_{A_i^{\dagger}} + LS\|\hat{P}_i - P\|_2,\ 1 \right) \tag{2.18} \\
&\leq 2\beta_{i,\delta}\min\!\left( \|\hat{X}_i\|_{A_i^{\dagger}},\ 1 \right) + 2LS\min\!\left( \|\hat{P}_i - P\|_2,\ 1 \right). \nonumber
\end{align}

Now we can provide an upper bound on the regret. For all $t \geq 1$, with probability at least $1-5\delta$,
\begin{align}
R_t &\leq \sum_{i=1}^{t} \left[ 2\beta_{i,\delta}\min\!\big( \|\hat{X}_i\|_{A_i^{\dagger}}, 1 \big) + 2LS\min\!\big( \|\hat{P}_i - P\|_2, 1 \big) \right] \nonumber \\
&= 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + \sum_{i=1}^{t} 2\beta_{i,\delta}\min\!\big( \|\hat{X}_i\|_{A_i^{\dagger}}, 1 \big) \nonumber \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\beta_{t,\delta}\sum_{i=1}^{t} \min\!\big( \|\hat{X}_i\|_{A_i^{\dagger}}, 1 \big) \tag{2.19} \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\beta_{t,\delta}\sqrt{t\sum_{i=1}^{t} \min\!\big( \|\hat{X}_i\|^2_{A_i^{\dagger}}, 1 \big)} \nonumber \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\sqrt{t}\,\beta_{t,\delta}\sqrt{\sum_{i=1}^{t} \min\!\big( \lambda_{\max}(A_i^{\dagger})L^2, 1 \big)} \tag{2.20} \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\sqrt{t}\,\beta_{t,\delta}\sqrt{\sum_{i=1}^{t} \min\!\left( \frac{L^2}{\lambda + \lambda_m(\hat{P}_i\hat{\Sigma}_{i-1}\hat{P}_i)},\ 1 \right)} \tag{2.21} \\
&\leq 2LS\left( t_{w,\delta} + 2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}} \sum_{i=t_{w,\delta}}^{t} \frac{1}{\sqrt{i}} \right) + 2L\sqrt{t}\,\beta_{t,\delta}\sqrt{\frac{t_{r,\delta}}{\lambda} + \frac{2m}{\lambda+\sigma^2}\sum_{i=t_{r,\delta}}^{t} \frac{1}{i}}, \tag{2.22}
\end{align}
where (2.19) follows from the fact that $\beta_{1,\delta} \leq \cdots \leq \beta_{t,\delta}$. Since $\|x\|^2_M \leq \lambda_{\max}(M)\|x\|^2_2$, we get (2.20). The maximum eigenvalue of $A_t^{\dagger}$ is the inverse of the $m$th (smallest nonzero) eigenvalue of $A_t$, which equals $\lambda + \lambda_m(\hat{P}_t\hat{\Sigma}_{t-1}\hat{P}_t)$; thus (2.21) is obtained. Recall that $\|\hat{P}_i - P\|_2 < 1$ for $i \geq t_{w,\delta}$. Using Lemma 2.2.4 and the second statement of Lemma 2.2.10, we get (2.22). Finally, Lemma A.1.5 provides the following regret upper bound:
\[
R_t \leq 2LS\, t_{w,\delta} + 4LS\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}}\left(2\sqrt{t} - 2\sqrt{t_{w,\delta}+1} + 1\right) + 2L\sqrt{t}\,\beta_{t,\delta}\sqrt{\frac{t_{r,\delta}}{\lambda} + \frac{2m + 2m\log t - 2m\log(t_{r,\delta}+1)}{\lambda+\sigma^2}}. \tag{2.23}
\]

Recall that $\beta_{t,\delta} = \mathcal{O}\!\left( \Gamma\sqrt{\frac{\alpha m}{K}\log t} + \sqrt{m\log t} \right)$. Therefore, the last term dominates the asymptotic upper bound on the regret. Using the definition of $t_{r,\delta}$ and the fact that $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$ for $a, b > 0$, we get that the regret of the algorithm is
\begin{align*}
R_t &= \mathcal{O}\!\left( \frac{\sqrt{m}}{\lambda+\sigma^2}\left( \Gamma\sqrt{\frac{\alpha}{K}} + \frac{\alpha\Gamma^2}{K} \right)\sqrt{t\log t} + \frac{m}{\sqrt{\lambda+\sigma^2}}\left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\sqrt{t}\log t \right) \\
&= \tilde{\mathcal{O}}\!\left( \left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\left( \frac{\Gamma\sqrt{m\alpha}}{\sqrt{K}\sqrt{\lambda+\sigma^2}} + m \right)\sqrt{t} \right) = \tilde{\mathcal{O}}\big(\Upsilon\sqrt{t}\big). \qquad \square
\end{align*}

Proof of Theorem 2.2.3: Using the intersection of $\mathcal{C}_{m,t}$ and $\mathcal{C}_{d,t}$ as the confidence set at round $t$ gives PSLB the ability to obtain the lowest possible instantaneous regret among both confidence sets. Therefore, the regret of PSLB is upper bounded by the minimum of the regret upper bounds of the individual strategies. Thus, Theorem 2.2.11 and Theorem 3 of Abbasi-Yadkori et al. [3] give the statement of Theorem 2.2.3. $\square$

Interpreting the Regret Bound

$\Upsilon$ is the reflection of the finite sample projection error at the beginning of the algorithm. It captures the difficulty of subspace recovery based on the structural properties of the problem and determines the regret of deploying projection-based methods in SLBs. Recall that $\alpha$ is the maximum of the effective dimensions of the true action vectors and the perturbation vectors. Depending on the structure of the problem, $\alpha$ can be $\mathcal{O}(d)$, e.g., the perturbation can be uniform in all dimensions, which prevents the projection error from shrinking; this causes $\Upsilon = \mathcal{O}(d\sqrt{m})$, resulting in $\tilde{\mathcal{O}}(d\sqrt{mt})$ regret. The eigengap within the true action vectors, $g_x$, and the eigengap between the true action vectors and the perturbation vectors, $g_\psi$, are critical factors that determine the identifiability of the hidden subspace. As $\sigma^2$ increases, the subspace recovery becomes harder since the effect of the perturbation increases. Conversely, as $\lambda$ increases, the underlying subspace becomes easier to identify. These effects are significant and translate to the regret of PSLB via $\Gamma$ in $\Upsilon$.

Moreover, having only finitely many samples to estimate the subspace affects the regret bound through $\Upsilon$. Due to the nature of the SLB, i.e., finitely many action vectors in the decision sets, this is unavoidable. Note that if the decision set contained infinitely many actions, the subspace recovery would be accomplished perfectly. Thus, the problem would reduce to an $m$-dimensional SLB, which has a regret upper bound of $\tilde{\mathcal{O}}(m\sqrt{t})$. This behavior can be seen in $\Upsilon$: as $K \to \infty$, $\Upsilon = \mathcal{O}(m)$, which gives the regret upper bound of $\tilde{\mathcal{O}}(m\sqrt{t})$ as expected.
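These two regimes can be read off (2.5) directly by plugging in parameters. The short sketch below evaluates $\Upsilon$ up to constants (dropping the $\mathcal{O}(\cdot)$) for an easy-recovery setting with large $K$ and a hard setting with $\alpha$ on the order of $d$; all numerical values are assumed for illustration only.

```python
import numpy as np

def upsilon(Gamma, alpha, K, m, lam, sigma2):
    """Evaluate Upsilon from (2.5) up to constants (the O(.) is dropped)."""
    return (1 + Gamma * np.sqrt(alpha / K)) * (
        Gamma * np.sqrt(m * alpha) / (np.sqrt(K) * np.sqrt(lam + sigma2)) + m
    )

d, m, Gamma, lam, sigma2 = 200, 5, 2.0, 1.0, 1.0

# Easy recovery: small effective dimension alpha, many actions per round -> close to m.
print("easy:", upsilon(Gamma, alpha=m, K=10_000, m=m, lam=lam, sigma2=sigma2))

# Hard recovery: alpha on the order of d, few actions per round -> much larger.
print("hard:", upsilon(Gamma, alpha=d, K=50, m=m, lam=lam, sigma2=sigma2))
```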

Theorem 2.2.3 states that if the underlying structure is easily recoverable, e.g., $\Upsilon = \mathcal{O}(m)$, then using PCA-based dimension reduction and the corresponding construction of confidence sets provides a substantially better regret upper bound for large $d$. If that is not the case, then due to the best-of-both-worlds approach provided by PSLB, the agent still obtains the best possible regret upper bound. Note that the bound for using only $\mathcal{C}_{m,t}$ is a worst-case bound, and as we present in Section 2.2.4, in practice PSLB can give significantly better results.
