
2.2 Stochastic Linear Bandits with Hidden Low-Rank Structure

2.2.3 Theoretical Analysis of PSLB

In this section, we state the regret upper bound of PSLB and provide the theoretical components that build up to this result. Recalling the quantities defined in (2.4), define $\Upsilon$ such that
\[
\Upsilon = \mathcal{O}\!\left( \left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\left(\frac{\Gamma\sqrt{m\alpha}}{\sqrt{K}\sqrt{\lambda+\sigma^{2}}} + m\right) \right). \tag{2.5}
\]

It represents the overall effect of deploying subspace recovery on the regret, in terms of the structural properties of the stochastic linear bandit setting.

Theorem 2.2.3 (Regret Upper Bound of PSLB). Fix any $\delta \in (0,1)$. Assume that for all $\hat{x}_{t,i} \in D_t$, $\hat{x}_{t,i}^\top \theta \in [-1,1]$. Under Assumptions 2.2.1 & 2.2.2, $\forall t \geq 1$, with probability at least $1-6\delta$, the regret of PSLB satisfies
\[
R_t = \min\left\{ \tilde{\mathcal{O}}\!\left(\Upsilon\sqrt{t}\right),\ \tilde{\mathcal{O}}\!\left(d\sqrt{t}\right) \right\}. \tag{2.6}
\]

The proof of the theorem involves three main pieces: the projection error analysis, the construction of projected confidence sets, and the regret analysis.

Projection Error Analysis

Consider the matrix $\hat{V}_t^\top V$ and its $i$th singular value denoted as $\sigma_i(\hat{V}_t^\top V)$, such that $\sigma_1(\hat{V}_t^\top V) \geq \ldots \geq \sigma_m(\hat{V}_t^\top V)$. Using the definition of the aperture of two linear manifolds [11], we write the following equivalence:
\begin{align}
\|\hat{P}_t - P\|_2 &= \max\left\{ \max_{x \in \mathrm{span}(V),\, \|x\|_2 = 1} \|(I_d - \hat{P}_t)x\|_2,\ \max_{y \in \mathrm{span}(\hat{V}_t),\, \|y\|_2 = 1} \|(I_d - P)y\|_2 \right\} \nonumber \\
&= \max\left\{ \|(I_d - \hat{V}_t\hat{V}_t^\top)V\|_2,\ \|(I_d - VV^\top)\hat{V}_t\|_2 \right\} \nonumber \\
&= \sqrt{\lambda_{\max}\!\left( V^\top (I_d - \hat{V}_t\hat{V}_t^\top)(I_d - \hat{V}_t\hat{V}_t^\top) V \right)} \tag{2.7} \\
&= \sqrt{\lambda_{\max}\!\left( I_m - (\hat{V}_t^\top V)^\top (\hat{V}_t^\top V) \right)} \nonumber \\
&= \sqrt{1 - \sigma_m^2}, \quad \text{where } \sigma_m \text{ is the smallest singular value of } \hat{V}_t^\top V, \nonumber \\
&= \sqrt{1 - \cos^2\Theta_m\big(\mathrm{span}(V), \mathrm{span}(\hat{V}_t)\big)} = \sin\Theta_m, \tag{2.8}
\end{align}

where (2.7) follows since $V$ and $\hat{V}_t$ have the same dimensions, and (2.8) follows from the fact that $\cos\Theta_i(\mathrm{span}(V), \mathrm{span}(\hat{V}_t)) = \sigma_i(\hat{V}_t^\top V)$, where $\Theta_m$ is the largest principal angle between the column spans of $V$ and $\hat{V}_t$. Thus, bounding the projection error between two projection matrices is equivalent to bounding the sine of the largest principal angle between the subspaces onto which they project. In light of this relation, and the prior analysis of the Davis–Kahan $\sin\Theta$ theorem [68], we provide the following lemma on the concentration of the sine and the finite sample projection error.
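Before stating the lemma, the equivalence above can be checked numerically: for two equidimensional subspaces, the spectral norm of the projection difference matches $\sqrt{1-\sigma_m^2}$ computed from $\hat{V}^\top V$. The sketch below is illustrative only and assumes NumPy; the dimensions and random bases are arbitrary stand-ins, not quantities from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 3

# Two random m-dimensional subspaces of R^d, represented by orthonormal bases.
V, _ = np.linalg.qr(rng.standard_normal((d, m)))        # "true" basis V
V_hat, _ = np.linalg.qr(rng.standard_normal((d, m)))    # "estimated" basis V_hat

P = V @ V.T              # projection onto span(V)
P_hat = V_hat @ V_hat.T  # projection onto span(V_hat)

# Left-hand side of the equivalence: ||P_hat - P||_2 (spectral norm).
lhs = np.linalg.norm(P_hat - P, ord=2)

# Right-hand side: sine of the largest principal angle,
# i.e., sqrt(1 - sigma_m^2) with sigma_m the smallest singular value of V_hat^T V.
sigma = np.linalg.svd(V_hat.T @ V, compute_uv=False)
rhs = np.sqrt(1.0 - sigma.min() ** 2)

print(lhs, rhs)  # the two values agree up to numerical precision
```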

Lemma 2.2.4 (Finite Sample Projection Error). Fix any $\delta \in (0,1)$. Let $t_{w,\delta} = n_\delta / K$. Suppose Assumption 2.2.1 holds. Then with probability at least $1-3\delta$, $\forall t \geq t_{w,\delta}$,
\[
\|\hat{P}_t - P\|_2 \leq \frac{\phi_\delta}{\sqrt{t}}, \quad \text{where } \phi_\delta = 2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}}. \tag{2.9}
\]

Lemma 2.2.4 improves the existing bounds on the projection error (Corollary 2.9 in Vaswani and Narayanamurthy [284]) by using the matrix Chernoff inequality [268]. It also provides the precise problem-dependent quantities in the bound, which are required for determining the minimum number of samples needed to construct tight confidence sets via subspace estimation. The formal and detailed version of Lemma 2.2.4 and the details of the proof are provided in Appendix A.1.1.

Note that, as discussed in Section 2.2.2, we define the confidence set $\mathcal{C}_{p,t}$ in (2.9) for all $t \geq t_{w,\delta}$. Due to the equivalence $\|\hat{P}_t - P\|_2 = \sin\Theta_m$, for all $t \geq 1$ we have $\|\hat{P}_t - P\|_2 \leq 1$. Therefore, any projection error bound greater than 1 is vacuous. Consequently, with high probability, the bound on the projection error in (2.9) becomes less than 1 when $t \geq t_{w,\delta}$. After round $t_{w,\delta}$, PSLB starts to produce non-trivial confidence sets $\mathcal{C}_{p,t}$ around $\hat{P}_t$. However, note that $t_{w,\delta}$ can be significantly large for problems whose latent structures are hard to recover, e.g., when $\alpha$ is linear in $d$.
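As a concrete reading of this warm-up behavior, the bound $\phi_\delta/\sqrt{t}$ in (2.9) drops below 1 once $t > \phi_\delta^2$, so the number of rounds before the projection-error bound becomes non-vacuous can be gauged directly from $(\Gamma, \alpha, K, d, \delta)$. The sketch below is a rough illustration under assumed parameter values; it is not the formal definition of $t_{w,\delta}$.

```python
import numpy as np

def phi_delta(Gamma, alpha, K, d, delta):
    """Projection-error coefficient from (2.9): 2*Gamma*sqrt((alpha/K)*log(2d/delta))."""
    return 2.0 * Gamma * np.sqrt((alpha / K) * np.log(2.0 * d / delta))

# Assumed, illustrative problem parameters (not from the thesis).
Gamma, alpha, K, d, delta = 2.0, 5.0, 50, 100, 0.05

phi = phi_delta(Gamma, alpha, K, d, delta)
# The bound phi/sqrt(t) is below 1 (non-vacuous) once t exceeds phi^2.
warmup = int(np.ceil(phi ** 2))
print(f"phi_delta = {phi:.2f}, bound non-vacuous after ~{warmup} rounds")
```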

The term $\phi_\delta$ in Lemma 2.2.4 also provides several important intuitions about the subspace estimation problem in terms of the problem structure. Recalling the definition of $\Gamma$ in (2.4), as $g_\psi$ decreases, the projection error shrinks since the underlying subspace becomes more distinguishable. Conversely, as $g_x$ diverges from 1, it becomes harder to recover the underlying $m$-dimensional subspace. Additionally, since $\alpha$ is the maximum of the effective dimensions of the true action vector and the perturbation vector, a large $\alpha$ makes the subspace recovery harder and the projection error bound looser, whereas observing more action vectors, $K$, in each round produces a tighter bound on $\|\hat{P}_t - P\|_2$. The effects of these structural properties on the subspace estimation translate to the confidence set construction and ultimately to the regret upper bound.

Projected Confidence Sets

In this section, we analyze the construction of $\mathcal{C}_{m,t}$ and $\mathcal{C}_{d,t}$. For any round $t \geq 1$, define $\hat{\Sigma}_t = \sum_{i=1}^{t} \hat{X}_i \hat{X}_i^\top = \mathbf{X}_t \mathbf{X}_t^\top$. At round $t$, let $A_t \coloneqq \hat{P}_t(\hat{\Sigma}_{t-1} + \lambda I_d)\hat{P}_t$ for $\lambda > 0$. Let $B_t$ be a symmetric matrix such that $A_t = \hat{V}_t B_t \hat{V}_t^\top$. Notice that $B_t$ is a full rank $m \times m$ matrix. The rewards obtained up to round $t$ are denoted as $\mathbf{r}_{t-1}$. At round $t$, after estimating the projection matrix $\hat{P}_t$ associated with the underlying subspace, PSLB finds $\theta_t$, an estimate of $\theta$, while, with high probability, $\theta$ lives within the estimated subspace. Therefore, $\theta_t$ is the solution to the following Tikhonov-regularized least squares problem with regularization parameters $\lambda > 0$ and $\hat{P}_t$,
\[
\theta_t = \operatorname*{arg\,min}_{\theta} \ \big\|(\hat{P}_t \mathbf{X}_{t-1})^\top \theta - \mathbf{r}_{t-1}\big\|_2^2 + \lambda \|\hat{P}_t \theta\|_2^2 .
\]
Notice that the regularization is applied along the estimated subspace. Solving for $\theta$ gives $\theta_t = A_t^{\dagger} \hat{P}_t \mathbf{X}_{t-1} \mathbf{r}_{t-1}$, where $A_t^{\dagger}$ denotes the Moore–Penrose pseudoinverse of $A_t$.
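For concreteness, the projected estimate can be formed exactly as in the closed form above: project the supervised actions with $\hat{P}_t$, build $A_t$, and apply its pseudoinverse. The following sketch assumes NumPy and uses randomly generated data standing in for the actions $\mathbf{X}_{t-1}$, rewards $\mathbf{r}_{t-1}$, and estimated basis $\hat{V}_t$; it illustrates the formula and is not the PSLB implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n, lam = 20, 3, 200, 1.0

V_hat, _ = np.linalg.qr(rng.standard_normal((d, m)))  # estimated basis (stand-in)
P_hat = V_hat @ V_hat.T                               # estimated projection matrix

X = rng.standard_normal((d, n))   # chosen action vectors X_{t-1} (columns)
r = rng.standard_normal(n)        # observed rewards r_{t-1}

Sigma_hat = X @ X.T                                   # empirical second-moment matrix
A = P_hat @ (Sigma_hat + lam * np.eye(d)) @ P_hat     # A_t = P_hat (Sigma + lam I) P_hat

# theta_t = A^+ P_hat X r : regularized least squares restricted to span(V_hat).
# rcond zeroes out directions orthogonal to span(V_hat), where A is numerically zero.
theta_t = np.linalg.pinv(A, rcond=1e-10) @ (P_hat @ X @ r)

# theta_t lies in the estimated subspace: projecting it changes nothing.
assert np.allclose(P_hat @ theta_t, theta_t)
print(theta_t[:5])
```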

Let $S_t \coloneqq \sum_{i=1}^{t} \hat{P}_t \hat{X}_{i-1} \eta_{i-1} = \hat{P}_t \mathbf{X}_{t-1} \boldsymbol{\eta}_{t-1}$. Before presenting the confidence set construction, we provide a self-normalized bound on $S_t$.

Theorem 2.2.5 (Self-Normalized Bound for Vector-Valued Martingales). For any $\delta \in (0,1)$, with probability at least $1-\delta$, for all $t \geq 1$,
\[
\|S_t\|^2_{A_t^{\dagger}} \leq 2R^2 \log\!\left( \frac{\det(B_t)^{1/2} \det(\lambda I_m)^{-1/2}}{\delta} \right).
\]

This result is a self-normalized bound for vector-valued martingales similar to that of Abbasi-Yadkori et al. [3], and it can be considered as the projected version of their Theorem 1. The proof of Theorem 2.2.5 is given in Appendix A.1.2. Define $L$ such that for all $t \geq 1$ and $i \in [K]$, $\|\hat{x}_{t,i}\|_2 \leq L$, and let $\gamma = \frac{L^2}{\lambda \log(1 + L^2/\lambda)}$. Consider the following lemmas, which will be useful in proving the confidence set construction; their proofs are given in Appendix A.1.3.

Lemma 2.2.6. Suppose Assumptions 2.2.1 & 2.2.2 hold. Then, $\det(B_t) \leq \left( \lambda + \frac{tL^2}{m} \right)^m$.

Lemma 2.2.7. Suppose Assumptions 2.2.1 & 2.2.2 hold. Then,
\[
\big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \leq L \sqrt{t}\, \sqrt{\gamma m}\, \sqrt{\log\!\left(1 + \frac{tL^2}{m\lambda}\right)}.
\]

The following theorem gives the construction of the projected confidence set, $\mathcal{C}_{m,t}$.

Theorem 2.2.8 (Projected Confidence Set Construction, $\mathcal{C}_{m,t}$). Fix any $\delta \in (0,1)$. Let Assumptions 2.2.1 & 2.2.2 hold, and suppose that $\forall t \geq 1$ and $i \in [K]$, $\|\hat{x}_{t,i}\|_2 \leq L$. If $\|\theta\|_2 \leq S$, then, with probability at least $1-4\delta$, $\forall t \geq t_{w,\delta}$, $\theta$ lies in the set $\mathcal{C}_{m,t} = \left\{ \theta' \in \mathbb{R}^d : \|\theta_t - \theta'\|_{A_t} \leq \beta_{t,\delta} \right\}$, where
\[
\beta_{t,\delta} = R\sqrt{2\log\frac{1}{\delta} + m\log\!\left(1 + \frac{tL^2}{m\lambda}\right)} + L S \phi_\delta \sqrt{\gamma m \log\!\left(1 + \frac{tL^2}{m\lambda}\right)} + S\sqrt{\lambda}. \tag{2.10}
\]

Proof. From the definition of $\theta_t$ and $\mathbf{r}_t$, we get the following:

\begin{align*}
\theta_t &= A_t^{\dagger} S_t + A_t^{\dagger} \hat{P}_t \hat{\Sigma}_{t-1} P \theta \qquad \text{since } \theta \in \mathrm{span}(V) \\
&= A_t^{\dagger} S_t + A_t^{\dagger} \left( \hat{P}_t \hat{\Sigma}_{t-1}\big(\hat{P}_t + P - \hat{P}_t\big) + \lambda \hat{P}_t - \lambda \hat{P}_t \right) \theta \\
&= A_t^{\dagger} S_t + \hat{P}_t \theta + A_t^{\dagger} \big( \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t) \big) \theta - \lambda A_t^{\dagger} \theta .
\end{align*}
Using this, we derive the following for $x = A_t(\theta_t - \theta)$, noting that $x \in \mathrm{span}(\hat{V}_t)$ and hence $x^\top \hat{P}_t \theta = x^\top \theta$:
\begin{align*}
x^\top \theta_t - x^\top \theta &= x^\top A_t^{\dagger} S_t + x^\top A_t^{\dagger} \big( \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t) \big) \theta - \lambda x^\top A_t^{\dagger} \theta \\
&= \langle x, S_t \rangle_{A_t^{\dagger}} + \big\langle x, \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t)\theta \big\rangle_{A_t^{\dagger}} - \lambda \langle x, \theta \rangle_{A_t^{\dagger}} .
\end{align*}

Using the Cauchy–Schwarz inequality, we can upper bound the magnitude of the difference as follows:
\begin{align}
|x^\top \theta_t - x^\top \theta| &\leq \|x\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t)\theta \big\|_{A_t^{\dagger}} + \lambda \|\theta\|_{A_t^{\dagger}} \right) \nonumber \\
&\leq \|x\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1}(P - \hat{P}_t)\theta \big\|_2 + \sqrt{\lambda}\,\|\theta\|_2 \right) \tag{2.11} \\
&\leq \|x\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \|P - \hat{P}_t\|_2 \|\theta\|_2 + \sqrt{\lambda}\,\|\theta\|_2 \right). \nonumber
\end{align}
Plugging in $x = A_t(\theta_t - \theta)$, we get

\[
\|\theta_t - \theta\|^2_{A_t} \leq \|A_t(\theta_t - \theta)\|_{A_t^{\dagger}} \left( \|S_t\|_{A_t^{\dagger}} + \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \|P - \hat{P}_t\|_2 \|\theta\|_2 + \sqrt{\lambda}\,\|\theta\|_2 \right).
\]
Since $\|A_t(\theta_t - \theta)\|_{A_t^{\dagger}} = \|\theta_t - \theta\|_{A_t}$, dividing both sides by $\|\theta_t - \theta\|_{A_t}$ and using the fact that $\|\theta\|_2 \leq S$ gives
\[
\|\theta_t - \theta\|_{A_t} \leq \|S_t\|_{A_t^{\dagger}} + S \big\| (A_t^{\dagger})^{1/2} \hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 \|P - \hat{P}_t\|_2 + S\sqrt{\lambda}. \tag{2.12}
\]

Notice that the first term is the projected version of Theorem 1 in [3], and the second term is the additional term appearing in the confidence interval construction due to the non-zero projection error. As can be seen, with knowledge of the true projection matrix, the confidence interval reduces to the one in [3] with $d$ replaced by $m$.
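Concretely, if the true projection matrix were known ($\hat{P}_t = P$, so the projection error term in (2.12) vanishes), the radius in (2.10) specializes to the $m$-dimensional analogue of the OFUL radius. The display below is this specialization written out for emphasis; it is a direct consequence of (2.10), not a separate equation from the thesis:
\[
\beta_{t,\delta}\big|_{\hat{P}_t = P} = R\sqrt{2\log\frac{1}{\delta} + m\log\!\left(1 + \frac{tL^2}{m\lambda}\right)} + S\sqrt{\lambda}.
\]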

Using Theorem 2.2.5 and Lemma 2.2.4, we get:
\[
\|\theta_t - \theta\|_{A_t} \leq R\sqrt{2\log\!\left( \frac{\det(B_t)^{1/2}\det(\lambda I_m)^{-1/2}}{\delta} \right)} + \frac{S\phi_\delta}{\sqrt{t}} \big\| (A_t^{\dagger})^{1/2}\hat{P}_t \hat{\Sigma}_{t-1} \big\|_2 + S\sqrt{\lambda}.
\]
Finally, combining this with Lemma 2.2.6 and Lemma 2.2.7 gives the statement of Theorem 2.2.8. $\square$

Notice that the overall proof follows machinery similar to that of [3]. Specifically, the first term of $\beta_{t,\delta}$ in (2.10) is derived similarly via a self-normalized tail inequality, Theorem 2.2.5. However, since at each round PSLB projects the supervised actions onto an estimated $m$-dimensional subspace to estimate $\theta$, $d$ is replaced by $m$ in the bound using Lemma 2.2.6. While enjoying the benefit of projection, this construction of the confidence set suffers from the finite sample projection error, i.e., the uncertainty in the subspace estimation. This effect is observed via the second term in (2.10), which involves the confidence bound for the estimated projection matrix, $\phi_\delta$. This term is critical in determining the tightness of the confidence set on $\theta$. As discussed before, $\phi_\delta$ reflects the difficulty of subspace recovery for the given problem, and it depends on the underlying structure of the problem and the SLB. This shows that as estimating the underlying subspace gets easier, a projection-based approach to constructing the confidence sets on $\theta$ provides tighter bounds.

In order to tolerate the possible difficulty of subspace recovery, PSLB also constructs $\mathcal{C}_{d,t}$, the confidence set for $\theta$ built without subspace recovery. The construction of $\mathcal{C}_{d,t}$ follows OFUL [3]. Let $Z_t = \hat{\Sigma}_{t-1} + \lambda I_d$. The algorithm finds $\hat{\theta}_t$, the $\ell_2$-regularized least squares estimate of $\theta$ in the ambient space.

The construction of $\mathcal{C}_{d,t}$ is done under the same assumptions as Theorem 2.2.8, such that with probability at least $1-\delta$, $\theta$ lies in the set $\mathcal{C}_{d,t} = \left\{ \theta' \in \mathbb{R}^d : \|\hat{\theta}_t - \theta'\|_{Z_t} \leq \Omega_{t,\delta} \right\}$, where
\[
\Omega_{t,\delta} = R\sqrt{2\log\frac{1}{\delta} + d\log\!\left(1 + \frac{tL^2}{d\lambda}\right)} + S\sqrt{\lambda}.
\]
The search for an optimistic parameter vector happens in $\mathcal{C}_{m,t} \cap \mathcal{C}_{d,t}$. Notice that $\theta \in \mathcal{C}_{m,t} \cap \mathcal{C}_{d,t}$ with probability at least $1-5\delta$. Optimistically choosing the pair $(\hat{X}_t, \tilde{\theta}_t)$ within the described confidence sets gives PSLB a way to tolerate the possibility of failure in recovering an underlying structure. If the confidence set $\mathcal{C}_{m,t}$ is loose, or PSLB is not able to recover an underlying structure, then $\mathcal{C}_{d,t}$ provides the useful confidence set to obtain desirable learning behavior.
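To make the trade-off between the two sets tangible, the radii $\beta_{t,\delta}$ from (2.10) and $\Omega_{t,\delta}$ above can be evaluated side by side: when the subspace is easy to recover ($\phi_\delta$ small, $m \ll d$), the projected radius is much smaller, while for hard recovery the ambient radius can win. The sketch below plugs assumed, illustrative parameter values into the two formulas; it is an illustration, not part of PSLB itself.

```python
import numpy as np

def beta_radius(t, R, S, L, lam, m, phi_delta, delta):
    """Projected confidence radius beta_{t,delta} from (2.10)."""
    gamma = L**2 / (lam * np.log1p(L**2 / lam))
    log_term = np.log1p(t * L**2 / (m * lam))
    return (R * np.sqrt(2 * np.log(1 / delta) + m * log_term)
            + L * S * phi_delta * np.sqrt(gamma * m * log_term)
            + S * np.sqrt(lam))

def omega_radius(t, R, S, L, lam, d, delta):
    """Ambient (OFUL-style) confidence radius Omega_{t,delta}."""
    return (R * np.sqrt(2 * np.log(1 / delta) + d * np.log1p(t * L**2 / (d * lam)))
            + S * np.sqrt(lam))

# Assumed, illustrative values (not from the thesis).
t, R, S, L, lam, delta = 10_000, 1.0, 1.0, 1.0, 1.0, 0.05
d, m = 200, 5

for phi in (0.5, 20.0):  # easy vs. hard subspace recovery
    print(f"phi_delta={phi:5.1f}  beta={beta_radius(t, R, S, L, lam, m, phi, delta):8.2f}"
          f"  Omega={omega_radius(t, R, S, L, lam, d, delta):8.2f}")
```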

Regret Analysis

PSLB uses the intersection of $\mathcal{C}_{m,t}$ and $\mathcal{C}_{d,t}$ as the confidence set at round $t$. Using only $\mathcal{C}_{d,t}$ is equivalent to following OFUL, and the regret analysis can be found in [3]. The regret analysis of using only the projected confidence set $\mathcal{C}_{m,t}$ is the main contribution of this work. The following lemmas will be key in the regret analysis.

Lemma 2.2.9. At round $k$, for any $\hat{x} \in D_k$, if $\nu \in \mathcal{C}_k$, then $\big|(\hat{P}_k \hat{x})^\top(\nu - \theta_k)\big| \leq \beta_{k,\delta} \|\hat{x}\|_{A_k^{\dagger}}$.

Define $t_{r,\delta}$ such that
\[
t_{r,\delta} = 1 + \left( \frac{2m}{2m-1} \cdot \frac{4L^2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}} + \sqrt{2L(\lambda+\sigma^2)\log\frac{m}{\delta}}}{\lambda+\sigma^2} \right)^{2}.
\]

Lemma 2.2.10. For all $t \geq t_{w,\delta}$, with probability at least $1-\delta$,
\[
\lambda_m\big(\hat{P}_t \hat{\Sigma}_{t-1} \hat{P}_t\big) \geq (t-1)(\lambda+\sigma^2) - \sqrt{t-1}\left( 4L^2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}} + \sqrt{2L(\lambda+\sigma^2)\log\frac{m}{\delta}} \right). \tag{2.13}
\]
Also, for all $t \geq t_{r,\delta}$, with probability at least $1-\delta$,
\[
\lambda_m\big(\hat{P}_t \hat{\Sigma}_{t-1} \hat{P}_t\big) \geq \frac{\lambda+\sigma^2}{2m}\,(t-1). \tag{2.14}
\]

The proofs of Lemmas 2.2.9 and 2.2.10 are in Appendix A.1.4. The following theorem gives the regret upper bound for using only the projected confidence set $\mathcal{C}_{m,t}$.

Theorem 2.2.11 (Regret Upper Bound of Using Only $\mathcal{C}_{m,t}$). Fix any $\delta \in (0,1)$. Assume that for all $\hat{x}_{t,i} \in D_t$, $\hat{x}_{t,i}^\top \theta \in [-1,1]$. Under Assumptions 2.2.1 & 2.2.2, $\forall t \geq 1$, with probability at least $1-6\delta$, the regret of using only $\mathcal{C}_{m,t}$ satisfies
\[
R_{t,\mathcal{C}_{m,t}} \leq \tilde{\mathcal{O}}\!\left( \left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\left(\frac{\Gamma\sqrt{m\alpha}}{\sqrt{K}\sqrt{\lambda+\sigma^2}} + m\right) \sqrt{t} \right). \tag{2.15}
\]

Proof. The instantaneous regret of the algorithm at the $i$th round, $l_i = \hat{X}_i^{*\top}\theta - \hat{X}_i^\top\theta$, can be decomposed as follows:
\begin{align}
\hat{X}_i^{*\top}\theta - \hat{X}_i^\top\theta &\leq (\tilde{P}_i \hat{X}_i)^\top \tilde{\theta}_i - (P\hat{X}_i)^\top \theta \tag{2.16} \\
&= \hat{X}_i^\top(\tilde{P}_i - \hat{P}_i + \hat{P}_i)\tilde{\theta}_i - \hat{X}_i^\top(\hat{P}_i + P - \hat{P}_i)\theta \nonumber \\
&= (\hat{P}_i\hat{X}_i)^\top(\tilde{\theta}_i - \theta_i) + (\hat{P}_i\hat{X}_i)^\top(\theta_i - \theta) + \big((\hat{P}_i - P)\hat{X}_i\big)^\top\theta + \big((\tilde{P}_i - \hat{P}_i)\hat{X}_i\big)^\top\tilde{\theta}_i \nonumber \\
&\leq 2\beta_{i,\delta}\|\hat{X}_i\|_{A_i^{\dagger}} + 2LS\|\hat{P}_i - P\|_2, \tag{2.17}
\end{align}
where (2.16) follows since $(\tilde{P}_i, \hat{X}_i, \tilde{\theta}_i)$ is optimistic, and (2.17) holds for all $i$ with probability at least $1-4\delta$ due to Lemma 2.2.9 and Theorem 2.2.8. Combining this decomposition with the fact that $l_i \leq 2$, we get
\begin{align}
l_i &\leq 2\min\!\left( \beta_{i,\delta}\|\hat{X}_i\|_{A_i^{\dagger}} + LS\|\hat{P}_i - P\|_2,\ 1 \right) \tag{2.18} \\
&\leq 2\beta_{i,\delta}\min\!\left( \|\hat{X}_i\|_{A_i^{\dagger}},\ 1 \right) + 2LS\min\!\left( \|\hat{P}_i - P\|_2,\ 1 \right). \nonumber
\end{align}

Now we can provide an upper bound on the regret. For all $t \geq 1$, with probability at least $1-5\delta$,
\begin{align}
R_t &\leq \sum_{i=1}^{t} \left[ 2\beta_{i,\delta}\min\!\big( \|\hat{X}_i\|_{A_i^{\dagger}}, 1 \big) + 2LS\min\!\big( \|\hat{P}_i - P\|_2, 1 \big) \right] \nonumber \\
&= 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + \sum_{i=1}^{t} 2\beta_{i,\delta}\min\!\big( \|\hat{X}_i\|_{A_i^{\dagger}}, 1 \big) \nonumber \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\beta_{t,\delta}\sum_{i=1}^{t} \min\!\big( \|\hat{X}_i\|_{A_i^{\dagger}}, 1 \big) \tag{2.19} \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\beta_{t,\delta}\sqrt{t\sum_{i=1}^{t} \min\!\big( \|\hat{X}_i\|^2_{A_i^{\dagger}}, 1 \big)} \nonumber \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\sqrt{t}\,\beta_{t,\delta}\sqrt{\sum_{i=1}^{t} \min\!\big( \lambda_{\max}(A_i^{\dagger})L^2, 1 \big)} \tag{2.20} \\
&\leq 2LS\sum_{i=1}^{t} \min\!\big( \|\hat{P}_i - P\|_2, 1 \big) + 2\sqrt{t}\,\beta_{t,\delta}\sqrt{\sum_{i=1}^{t} \min\!\left( \frac{L^2}{\lambda + \lambda_m(\hat{P}_i\hat{\Sigma}_{i-1}\hat{P}_i)},\ 1 \right)} \tag{2.21} \\
&\leq 2LS\left( t_{w,\delta} + 2\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}} \sum_{i=t_{w,\delta}}^{t} \frac{1}{\sqrt{i}} \right) + 2L\sqrt{t}\,\beta_{t,\delta}\sqrt{\frac{t_{r,\delta}}{\lambda} + \frac{2m}{\lambda+\sigma^2}\sum_{i=t_{r,\delta}}^{t} \frac{1}{i}}, \tag{2.22}
\end{align}
where (2.19) follows from the fact that $\beta_{1,\delta} \leq \cdots \leq \beta_{t,\delta}$. Since $\|x\|^2_M \leq \lambda_{\max}(M)\|x\|^2_2$, we get (2.20). The maximum eigenvalue of $A_t^{\dagger}$ is the inverse of the $m$th (smallest nonzero) eigenvalue of $A_t$, which equals $\lambda + \lambda_m(\hat{P}_t\hat{\Sigma}_{t-1}\hat{P}_t)$; thus (2.21) is obtained. Recall that $\|\hat{P}_i - P\|_2 < 1$ for $i \geq t_{w,\delta}$. Using Lemma 2.2.4 and the second statement of Lemma 2.2.10, we get (2.22). Finally, Lemma A.1.5 provides the following regret upper bound:
\[
R_t \leq 2LS\, t_{w,\delta} + 4LS\Gamma\sqrt{\frac{\alpha}{K}\log\frac{2d}{\delta}}\left(2\sqrt{t} - 2\sqrt{t_{w,\delta}+1} + 1\right) + 2L\sqrt{t}\,\beta_{t,\delta}\sqrt{\frac{t_{r,\delta}}{\lambda} + \frac{2m + 2m\log t - 2m\log(t_{r,\delta}+1)}{\lambda+\sigma^2}}. \tag{2.23}
\]

Recall that $\beta_{t,\delta} = \mathcal{O}\!\left( \Gamma\sqrt{\frac{\alpha m}{K}\log t} + \sqrt{m\log t} \right)$. Therefore, the last term dominates the asymptotic upper bound on the regret. Using the definition of $t_{r,\delta}$ and the fact that $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$ for $a, b > 0$, we get that the regret of the algorithm is
\begin{align*}
R_t &= \mathcal{O}\!\left( \frac{\sqrt{m}}{\lambda+\sigma^2}\left( \Gamma\sqrt{\frac{\alpha}{K}} + \frac{\alpha\Gamma^2}{K} \right)\sqrt{t\log t} + \frac{m}{\sqrt{\lambda+\sigma^2}}\left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\sqrt{t}\log t \right) \\
&= \tilde{\mathcal{O}}\!\left( \left(1+\Gamma\sqrt{\frac{\alpha}{K}}\right)\left( \frac{\Gamma\sqrt{m\alpha}}{\sqrt{K}\sqrt{\lambda+\sigma^2}} + m \right)\sqrt{t} \right) = \tilde{\mathcal{O}}\big(\Upsilon\sqrt{t}\big). \qquad \square
\end{align*}

Proof of Theorem 2.2.3: Using the intersection of $\mathcal{C}_{m,t}$ and $\mathcal{C}_{d,t}$ as the confidence set at round $t$ gives PSLB the ability to obtain the lowest possible instantaneous regret among both confidence sets. Therefore, the regret of PSLB is upper bounded by the minimum of the regret upper bounds of the individual strategies. Thus, Theorem 2.2.11 and Theorem 3 of Abbasi-Yadkori et al. [3] give the statement of Theorem 2.2.3. $\square$

Interpreting the Regret Bound

$\Upsilon$ is the reflection of the finite sample projection error at the beginning of the algorithm. It captures the difficulty of subspace recovery based on the structural properties of the problem and determines the regret of deploying projection-based methods in SLBs. Recall that $\alpha$ is the maximum of the effective dimensions of the true action vectors and the perturbation vectors. Depending on the structure of the problem, $\alpha$ can be $\mathcal{O}(d)$, e.g., the perturbation can be uniform in all dimensions, which prevents the projection error from shrinking; this causes $\Upsilon = \mathcal{O}(d\sqrt{m})$, resulting in $\tilde{\mathcal{O}}(d\sqrt{mt})$ regret. The eigengap within the true action vectors, $g_x$, and the eigengap between the true action vectors and the perturbation vectors, $g_\psi$, are critical factors that determine the identifiability of the hidden subspace. As $\sigma^2$ increases, the subspace recovery becomes harder since the effect of the perturbation increases. Conversely, as $\lambda$ increases, the underlying subspace becomes easier to identify. These effects are significant and translate to the regret of PSLB via $\Gamma$ in $\Upsilon$.

Moreover, having only finitely many samples to estimate the subspace affects the regret bound through $\Upsilon$. Due to the nature of the SLB, i.e., finitely many action vectors in the decision sets, this is unavoidable. Note that if the decision set contained infinitely many actions, the subspace recovery would be accomplished perfectly. Thus, the problem would reduce to an $m$-dimensional SLB, which has a regret upper bound of $\tilde{\mathcal{O}}(m\sqrt{t})$. This behavior can be seen in $\Upsilon$: as $K \to \infty$, $\Upsilon = \mathcal{O}(m)$, which gives the regret upper bound of $\tilde{\mathcal{O}}(m\sqrt{t})$ as expected.
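These two regimes can be read off (2.5) directly by plugging in parameters. The short sketch below evaluates $\Upsilon$ up to constants (dropping the $\mathcal{O}(\cdot)$) for an easy-recovery setting with large $K$ and a hard setting with $\alpha$ on the order of $d$; all numerical values are assumed for illustration only.

```python
import numpy as np

def upsilon(Gamma, alpha, K, m, lam, sigma2):
    """Evaluate Upsilon from (2.5) up to constants (the O(.) is dropped)."""
    return (1 + Gamma * np.sqrt(alpha / K)) * (
        Gamma * np.sqrt(m * alpha) / (np.sqrt(K) * np.sqrt(lam + sigma2)) + m
    )

d, m, Gamma, lam, sigma2 = 200, 5, 2.0, 1.0, 1.0

# Easy recovery: small effective dimension alpha, many actions per round -> close to m.
print("easy:", upsilon(Gamma, alpha=m, K=10_000, m=m, lam=lam, sigma2=sigma2))

# Hard recovery: alpha on the order of d, few actions per round -> much larger.
print("hard:", upsilon(Gamma, alpha=d, K=50, m=m, lam=lam, sigma2=sigma2))
```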

Theorem 2.2.3 states that if the underlying structure is easily recoverable, e.g., $\Upsilon = \mathcal{O}(m)$, then using PCA-based dimension reduction and the corresponding construction of confidence sets provides a substantially better regret upper bound for large $d$. If that is not the case, then due to the best-of-both-worlds approach provided by PSLB, the agent still obtains the best possible regret upper bound. Note that the bound for using only $\mathcal{C}_{m,t}$ is a worst-case bound, and as we present in Section 2.2.4, in practice PSLB can give significantly better results.
