
B.1 Facts about Convergence in Distribution

Before proceeding with the asymptotic consistency proofs, two facts about convergence in distribution are reviewed; these will be applied later.

Recall that for a random variable $X$ and a sequence of random variables $(X_n)$, $n \in \mathbb{N}$, $X_n \xrightarrow{D} X$ denotes that $X_n$ converges to $X$ in distribution, while $X_n \xrightarrow{P} X$ denotes that $X_n$ converges to $X$ in probability.

Fact 8 (Billingsley, 1968). For random variables $\boldsymbol{x}, \boldsymbol{x}_n \in \mathbb{R}^d$, where $n \in \mathbb{N}$, and any continuous function $g: \mathbb{R}^d \longrightarrow \mathbb{R}$, if $\boldsymbol{x}_n \xrightarrow{D} \boldsymbol{x}$, then $g(\boldsymbol{x}_n) \xrightarrow{D} g(\boldsymbol{x})$.

Fact 9 (Billingsley, 1968). For random variables $\boldsymbol{x}_n \in \mathbb{R}^d$, $n \in \mathbb{N}$, and a constant vector $\boldsymbol{c} \in \mathbb{R}^d$, $\boldsymbol{x}_n \xrightarrow{D} \boldsymbol{c}$ is equivalent to $\boldsymbol{x}_n \xrightarrow{P} \boldsymbol{c}$. Convergence in probability means that for any $\varepsilon > 0$, $P(\|\boldsymbol{x}_n - \boldsymbol{c}\|_2 \geq \varepsilon) \longrightarrow 0$ as $n \longrightarrow \infty$.

B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference-Based RL Setting

Let $\boldsymbol{p}^{(j)}$, $\tilde{\boldsymbol{p}}^{(j)}$, $\hat{\boldsymbol{p}}^{(j)}$, and $\hat{\boldsymbol{p}}_0^{(j)}$ denote the true, sampled, posterior mean, and maximum likelihood transition dynamics parameters, respectively (hiding the dependency on the DPS episode $i_1$ or $i_2$ for the latter three quantities); thus, $[\boldsymbol{p}^{(j)}]_k$ denotes the true probability of transitioning from state-action pair $\tilde{s}_j$ to the $k$th state, and analogously for the $k$th elements of $\tilde{\boldsymbol{p}}^{(j)}$, $\hat{\boldsymbol{p}}^{(j)}$, and $\hat{\boldsymbol{p}}_0^{(j)}$. Then, from the Dirichlet model,

[๐’‘ห†(๐‘—)]๐‘˜ =

๐‘›๐‘— ๐‘˜ +๐›ผ๐‘— ๐‘˜ ,0 ๐‘›๐‘—+ร๐‘†

๐‘š=1๐›ผ๐‘— ๐‘š,0 ,

where the prior for ๐’‘(๐‘—)isร๐‘† 1 ๐‘š=1๐›ผ๐‘— ๐‘š,0

[๐›ผ๐‘—1,0, . . . , ๐›ผ๐‘— ๐‘†,0]๐‘‡ for user-defined hyperparam- eters๐›ผ๐‘— ๐‘˜ ,0> 0. Meanwhile, the maximum likelihood is given by[๐’‘ห†0(๐‘—)]๐‘˜ = max(๐‘›๐‘›๐‘— ๐‘˜

๐‘—,1)

(this is equivalent to[๐’‘ห†(๐‘—)]๐‘˜, except with the prior parameters set to zero). Consider the sampled dynamics at state-action pair หœ๐‘ ๐‘—. For any๐œ€ >0,
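For concreteness, the three estimators can be computed directly; the following minimal sketch (illustrative counts and hyperparameters, not taken from the thesis's experiments) computes the posterior mean, the maximum likelihood estimate, and a posterior sample for a single state-action pair:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4                                  # number of successor states
alpha0 = np.ones(S)                    # user-defined hyperparameters alpha_{jk,0} > 0
counts = np.array([30, 5, 0, 15])      # transition counts n_{jk} from state-action s~_j
n_j = counts.sum()                     # total visits n_j

p_hat = (counts + alpha0) / (n_j + alpha0.sum())  # posterior mean [p-hat^(j)]_k
p_ml = counts / max(n_j, 1)                       # maximum likelihood [p-hat_0^(j)]_k
p_tilde = rng.dirichlet(alpha0 + counts)          # posterior sample [p~^(j)]_k

print("posterior mean:", p_hat)
print("max likelihood:", p_ml)
print("sampled:       ", p_tilde)
```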

๐‘ƒ ||๐’‘หœ(๐‘—)โˆ’ ๐’‘(๐‘—)||1 โ‰ฅ ๐œ€

= ๐‘ƒ ||๐’‘หœ(๐‘—)โˆ’ ๐’‘ห†(๐‘—) + ๐’‘ห†(๐‘—) โˆ’ ๐’‘ห†0(๐‘—)+ ๐’‘ห†0(๐‘—) โˆ’ ๐’‘(๐‘—)||1โ‰ฅ ๐œ€

(๐‘Ž)

โ‰ค ๐‘ƒ ||๐’‘หœ(๐‘—) โˆ’ ๐’‘ห†(๐‘—)||1+ ||๐’‘ห†(๐‘—) โˆ’ ๐’‘ห†0(๐‘—)||1+ ||๐’‘ห†0(๐‘—)โˆ’ ๐’‘(๐‘—)||1 โ‰ฅ ๐œ€

โ‰ค ๐‘ƒ

||๐’‘หœ(๐‘—) โˆ’ ๐’‘ห†(๐‘—)||1 โ‰ฅ ๐œ€ 3

ร˜ ||๐’‘ห†(๐‘—)โˆ’ ๐’‘ห†0(๐‘—)||1 โ‰ฅ ๐œ€ 3

ร˜ ||๐’‘ห†0(๐‘—) โˆ’ ๐’‘(๐‘—)||1โ‰ฅ ๐œ€ 3

(๐‘)

โ‰ค ๐‘ƒ

||๐’‘หœ(๐‘—) โˆ’ ๐’‘ห†(๐‘—)||1 โ‰ฅ ๐œ€ 3

+๐‘ƒ

||๐’‘ห†(๐‘—) โˆ’ ๐’‘ห†0(๐‘—)||1โ‰ฅ ๐œ€ 3

+๐‘ƒ

||๐’‘ห†0(๐‘—) โˆ’ ๐’‘(๐‘—)||1 โ‰ฅ ๐œ€ 3

, (B.2) where (a) holds due to the triangle inequality and (b) follows from the union bound.

This proof will upper-bound each term in Eq. (B.2) in terms of $n_j$ and show that each bound decays as $n_j \longrightarrow \infty$, that is, as $\tilde{s}_j$ is visited infinitely often. For the first term, the bound is obtained via Chebyshev's inequality:

\begin{align*}
P\left(\|\tilde{\boldsymbol{p}}^{(j)} - \hat{\boldsymbol{p}}^{(j)}\|_1 \geq \frac{\varepsilon}{3}\right) &\leq P\left(\bigcup_{k=1}^S \left\{\left|[\tilde{\boldsymbol{p}}^{(j)}]_k - [\hat{\boldsymbol{p}}^{(j)}]_k\right| \geq \frac{\varepsilon}{3S}\right\}\right) \\
&\stackrel{(a)}{\leq} \sum_{k=1}^S P\left(\left|[\tilde{\boldsymbol{p}}^{(j)}]_k - [\hat{\boldsymbol{p}}^{(j)}]_k\right| \geq \frac{\varepsilon}{3S}\right) \stackrel{(b)}{\leq} \sum_{k=1}^S \frac{9S^2}{\varepsilon^2} \operatorname{Var}\left[[\tilde{\boldsymbol{p}}^{(j)}]_k\right],
\end{align*}

where (a) follows from the union bound and (b) is an application of Chebyshev's inequality. For a Dirichlet variable $X$ with parameters $(\alpha_1, \ldots, \alpha_S)$, $\alpha_k > 0$ for each $k$, the variance of the $k$th component $X_k$ is given by:

Var[๐‘‹๐‘˜] = ๐›ผหœ๐‘˜(1โˆ’๐›ผหœ๐‘˜) 1+ร๐‘†

๐‘š=1๐›ผ๐‘š

โ‰ค 1

2โˆ— 1

1+ร๐‘†

๐‘š=1๐›ผ๐‘š ,

where หœ๐›ผ๐‘˜ := ร๐‘†๐›ผ๐‘˜ ๐‘š=1๐›ผ๐‘š

In the DPS algorithm, $\tilde{\boldsymbol{p}}^{(j)}$ is drawn from a Dirichlet distribution with parameters $(\alpha_{j1}, \ldots, \alpha_{jS}) = (\alpha_{j1,0} + n_{j1}, \ldots, \alpha_{jS,0} + n_{jS})$, so that
\begin{align*}
\operatorname{Var}\left[[\tilde{\boldsymbol{p}}^{(j)}]_k\right] \leq \frac{1}{2} \cdot \frac{1}{1 + \sum_{m=1}^S \alpha_{jm}} = \frac{1}{2} \cdot \frac{1}{1 + \sum_{m=1}^S (\alpha_{jm,0} + n_{jm})} \leq \frac{1}{2} \cdot \frac{1}{1 + \sum_{m=1}^S n_{jm}} = \frac{1}{2(1 + n_j)}.
\end{align*}
Thus,
\[
P\left(\|\tilde{\boldsymbol{p}}^{(j)} - \hat{\boldsymbol{p}}^{(j)}\|_1 \geq \frac{\varepsilon}{3}\right) \leq \sum_{k=1}^S \frac{9S^2}{\varepsilon^2} \cdot \frac{1}{2(1 + n_j)} = \frac{9S^3}{2\varepsilon^2(1 + n_j)}.
\]

๐‘ƒ ||๐’‘ห†(๐‘—) โˆ’ ๐’‘ห†0(๐‘—)||1 โ‰ฅ ๐œ€ 3

!

โ‰ค ๐‘ƒ

๐‘†

ร˜

๐‘˜=1

n

[๐’‘ห†(๐‘—)โˆ’ ๐’‘ห†0(๐‘—)]๐‘˜

โ‰ฅ ๐œ€

3๐‘† o

!

(๐‘Ž)

โ‰ค

๐‘†

ร•

๐‘˜=1

๐‘ƒ

[๐’‘ห†(๐‘—)]๐‘˜ โˆ’ [๐’‘ห†0(๐‘—)]๐‘˜

โ‰ฅ ๐œ€

3๐‘† (๐‘)

โ‰ค

๐‘†

ร•

๐‘˜=1

๐‘ƒ

๐›ผ๐‘— ๐‘˜ ,0+ร๐‘†

๐‘š=1๐›ผ๐‘— ๐‘š,0 ๐‘›๐‘— +ร๐‘†

๐‘š=1๐›ผ๐‘— ๐‘š,0

โ‰ฅ ๐œ€ 3๐‘†

! ,

where (a) holds via the union bound and (b) follows because, for $n_j \geq 1$:
\begin{align*}
\left|[\hat{\boldsymbol{p}}^{(j)}]_k - [\hat{\boldsymbol{p}}_0^{(j)}]_k\right| &= \left|\frac{n_{jk} + \alpha_{jk,0}}{n_j + \sum_{m=1}^S \alpha_{jm,0}} - \frac{n_{jk}}{n_j}\right| = \left|\frac{\alpha_{jk,0}}{n_j + \sum_{m=1}^S \alpha_{jm,0}} - \frac{n_{jk} \sum_{m=1}^S \alpha_{jm,0}}{n_j \left(n_j + \sum_{m=1}^S \alpha_{jm,0}\right)}\right| \\
&\leq \frac{\alpha_{jk,0}}{n_j + \sum_{m=1}^S \alpha_{jm,0}} + \frac{n_{jk}}{n_j} \cdot \frac{\sum_{m=1}^S \alpha_{jm,0}}{n_j + \sum_{m=1}^S \alpha_{jm,0}} \leq \frac{\alpha_{jk,0} + \sum_{m=1}^S \alpha_{jm,0}}{n_j + \sum_{m=1}^S \alpha_{jm,0}}.
\end{align*}

For the third term in Eq. (B.2), one can apply the following concentration inequality for Dirichlet variables (see Appendix C.1 in Jaksch, Ortner, and Auer, 2010):

๐‘ƒ(||๐’‘ห†0(๐‘—)โˆ’ ๐’‘(๐‘—)||1 โ‰ฅ ๐œ€) โ‰ค (2๐‘†โˆ’2)exp โˆ’๐‘›๐‘—๐œ€2 2

! . Therefore:

\[
P\left(\|\hat{\boldsymbol{p}}_0^{(j)} - \boldsymbol{p}^{(j)}\|_1 \geq \frac{\varepsilon}{3}\right) \leq \left(2^S - 2\right) \exp\left(-\frac{n_j \varepsilon^2}{18}\right).
\]
Thus, to upper-bound the right-hand side of Eq. (B.2), for any $\varepsilon > 0$:

๐‘ƒ ||๐’‘หœ(๐‘—)โˆ’๐’‘(๐‘—)||1 โ‰ฅ ๐œ€

โ‰ค 9๐‘†3 2๐œ€2(๐‘›๐‘— +1)+

๐‘†

ร•

๐‘˜=1

๐‘ƒ

๐›ผ๐‘— ๐‘˜ ,0+ร๐‘†

๐‘š=1๐›ผ๐‘— ๐‘š,0 ๐‘›๐‘— +ร๐‘†

๐‘š=1๐›ผ๐‘— ๐‘š,0

โ‰ฅ ๐œ€ 3๐‘†

!

+(2๐‘†โˆ’2)exp โˆ’๐‘›๐‘—๐œ€2 18

! .

On the right-hand side, the first and third terms clearly decay as $n_j \longrightarrow \infty$. The middle term is identically zero for $n_j$ large enough: the $\alpha_{jk,0}$ values are user-defined constants, so the event inside each probability is deterministic and eventually false. Given this inequality, it is clear that for any $\varepsilon > 0$, $P\left(\|\tilde{\boldsymbol{p}}^{(j)} - \boldsymbol{p}^{(j)}\|_1 \geq \varepsilon\right) \longrightarrow 0$ as $n_j \longrightarrow \infty$. If every state-action pair is visited infinitely often, then $n_j \longrightarrow \infty$ for each $j$, and therefore $\tilde{\boldsymbol{p}}^{(j)}$ converges in probability to $\boldsymbol{p}^{(j)}$: $\tilde{\boldsymbol{p}}^{(j)} \xrightarrow{P} \boldsymbol{p}^{(j)}$. Convergence in probability implies convergence in distribution, yielding the desired result.
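To see the decay concretely, the full right-hand side can be evaluated directly; the sketch below uses a uniform prior with $S = 4$ and $\varepsilon = 0.5$ (illustrative choices, not from the text). Note how the middle term, a sum of deterministic indicators, switches off once $n_j$ is large enough:

```python
import numpy as np

S, eps = 4, 0.5
alpha0 = np.ones(S)                    # uniform prior hyperparameters

def rhs_bound(n_j):
    first = 9 * S**3 / (2 * eps**2 * (1 + n_j))
    # Middle term: one deterministic indicator per k; zero once the ratio < eps/(3S).
    ratio = (alpha0 + alpha0.sum()) / (n_j + alpha0.sum())
    middle = float(np.sum(ratio >= eps / (3 * S)))
    third = (2**S - 2) * np.exp(-n_j * eps**2 / 18)
    return first + middle + third

for n_j in [10, 100, 1000, 10_000, 100_000]:
    print(f"n_j = {n_j:6d}:  bound = {rhs_bound(n_j):.5f}")
```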

To continue proving that DPS's model of the transition dynamics converges, this analysis uses the fact that the magnitude of the utility estimator $\|\hat{\boldsymbol{r}}_n\|_2$, the mean of the utility posterior sampling distribution, is uniformly upper-bounded; in other words, there exists $b < \infty$ such that $\|\hat{\boldsymbol{r}}_n\|_2 \leq b$ for all $n$.

Lemma 4. When preferences are given by a linear or logistic link function, there exists some $b < \infty$ such that the estimated reward at every DPS trial $n \geq 1$ is bounded: $\|\hat{\boldsymbol{r}}_n\|_2 \leq b$.

Proof. Firstly, if the link function is logistic, the desired result holds automatically by the definition of $\hat{\boldsymbol{r}}_n$ given in Eq. (4.10): the quantity is projected onto the compact set $\Theta \subset \mathbb{R}^d$ of all possible values of $\boldsymbol{r}$, and a compact set in $\mathbb{R}^d$ must be bounded.

Secondly, the result is proven in the case of a linear link function. In this case, recall that the MAP reward estimate $\hat{\boldsymbol{r}}_n$ is the solution to a ridge regression problem:

๐’“ห†๐‘›=arg inf๐’“ (๐‘›โˆ’1

ร•

๐‘–=1

(๐’™๐‘‡๐‘– ๐’“โˆ’๐‘ฆ๐‘–)2+๐œ†||๐’“||22 )

=arg inf๐’“ (๐‘›โˆ’1

ร•

๐‘–=1

(๐’™๐‘‡๐‘– ๐’“โˆ’๐‘ฆ๐‘–)2+ 1

๐‘›โˆ’1๐œ†||๐’“||22 )

. (B.3) The desired result is proven by contradiction. Assuming that there exists no upper bound๐‘, the proof will identify a subsequence(๐’“ห†๐‘›๐‘–)of MAP estimates whose lengths increase unboundedly, but whose directions converge. Then, it will show that such vectors fail to minimize the objective in Eq. (B.3), achieving a contradiction.
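The boundedness claim can be probed empirically for the linear link: with bounded $\boldsymbol{x}_i$ and $y_i$, the closed-form ridge solution of Eq. (B.3), $\hat{\boldsymbol{r}}_n = (X^T X + \lambda I)^{-1} X^T \boldsymbol{y}$, keeps a norm that does not blow up as $n$ grows. A minimal sketch on synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 1.0

for n in [10, 100, 1_000, 10_000]:
    X = rng.uniform(-1.0, 1.0, size=(n, d))     # bounded feature differences x_i
    y = rng.choice([-0.5, 0.5], size=n)         # binary labels y_i in {-1/2, 1/2}
    # Closed-form minimizer of the ridge objective in Eq. (B.3).
    r_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"n = {n:6d}:  ||r_hat||_2 = {np.linalg.norm(r_hat):.4f}")
```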

Firstly, the vectors $\boldsymbol{x}_i = \boldsymbol{x}_{i2} - \boldsymbol{x}_{i1}$ have bounded magnitude: in the bandit case, $\boldsymbol{x}_{i1}, \boldsymbol{x}_{i2} \in \mathcal{A}$, and the action space $\mathcal{A}$ is compact, while in the RL setting, $\|\boldsymbol{x}_{ij}\|_1 = h$ for $j \in \{1, 2\}$. The binary labels $y_i$ are also bounded, as they take values in $\left\{-\frac{1}{2}, \frac{1}{2}\right\}$. Note that for $\boldsymbol{r} = 0$, $\left(\boldsymbol{x}_i^T \boldsymbol{r} - y_i\right)^2 + \frac{1}{n-1} \lambda \|\boldsymbol{r}\|_2^2 = \frac{1}{4}$. Now, assume toward a contradiction that there is no $b < \infty$ such that $\|\hat{\boldsymbol{r}}_n\|_2 \leq b$ for all $n$. Then, the sequence $\hat{\boldsymbol{r}}_1, \hat{\boldsymbol{r}}_2, \ldots$ must have a subsequence indexed by $(n_i)$ such that $\lim_{i \longrightarrow \infty} \|\hat{\boldsymbol{r}}_{n_i}\|_2 = \infty$. Consider the sequence of unit vectors $\frac{\hat{\boldsymbol{r}}_{n_i}}{\|\hat{\boldsymbol{r}}_{n_i}\|_2}$. This sequence lies within the compact set of unit vectors in $\mathbb{R}^d$, so it must have a convergent subsequence; we index this subsequence of $(n_i)$ by $(n_{i_j})$. Then, the sequence $(\hat{\boldsymbol{r}}_{n_{i_j}})$ is such that $\lim_{j \longrightarrow \infty} \|\hat{\boldsymbol{r}}_{n_{i_j}}\|_2 = \infty$ and $\lim_{j \longrightarrow \infty} \frac{\hat{\boldsymbol{r}}_{n_{i_j}}}{\|\hat{\boldsymbol{r}}_{n_{i_j}}\|_2} = \hat{\boldsymbol{r}}_{\mathrm{unit}}$, where $\hat{\boldsymbol{r}}_{\mathrm{unit}} \in \mathbb{R}^d$ is a fixed unit vector.

For any ๐’™๐‘– such that |๐’™๐‘–๐‘‡๐’“ห†๐‘ข๐‘›๐‘–๐‘ก| โ‰  0, lim๐‘›๐‘–

๐‘—โˆ’โ†’โˆž(๐’™๐‘‡๐‘– ๐’“ห†๐‘›๐‘– ๐‘—

โˆ’ ๐‘ฆ๐‘–)2 = โˆž, and thus, the corresponding terms in Eq. (B.3) approach infinity. However, a lower value of the optimization objective in Eq. (B.3) can be realized by replacing ๐’“ห†๐‘›๐‘–

๐‘— with the assignment ๐’“ = 0. Meanwhile, for any ๐’™๐‘– such that |๐’™๐‘‡๐‘– ๐’“ห†| = 0, replacing ๐’“ห†๐‘›๐‘–

๐‘— with

๐’“ = 0 would also decrease the value of the optimization objective in Eq. (B.3).

Therefore, for large ๐‘—, ๐’“ =0results in a smaller objective function value than๐’“ห†๐‘›๐‘– ๐‘—. This is a contradiction, proving that the elements of the sequence ๐’“ห†๐‘›๐‘–

๐‘— cannot have arbitrarily large magnitudes. Thus, the elements of the original sequence ๐’“ห†๐‘– also cannot become arbitrarily large, and||๐’“ห†๐‘–|| โ‰ค ๐‘ for some๐‘ < โˆž.

The next intermediate result relates the matrix $\tilde{M}_n$, defined in Eq. (B.1), and the matrix $M_n = \lambda I + \sum_{i=1}^{n-1} \boldsymbol{x}_i \boldsymbol{x}_i^T$.

Lemma 5. On iteration $n$ of DPS, the posterior covariance matrix for the rewards is $\Sigma^{(n)} = \beta_n(\delta)^2 \tilde{M}_n^{-1}$; if the link function $g$ is linear, then $\tilde{M}_n = M_n$, while if $g$ is logistic, then $\tilde{M}_n = \lambda I + \sum_{i=1}^{n-1} \tilde{g}(2 y_i \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_n) \boldsymbol{x}_i \boldsymbol{x}_i^T$. In both cases, there exist two constants $m_{\min}, m_{\max}$ such that $0 < m_{\min} \leq m_{\max} < \infty$ and $m_{\min} M_n \preceq \tilde{M}_n \preceq m_{\max} M_n$.

Proof. Firstly, if๐‘” is linear, then หœ๐‘€๐‘› = ๐‘€๐‘›, so the desired result clearly holds with ๐‘šmin =๐‘šmax =1.

If๐‘”is logistic, the desired statement is equivalent to:

๐‘šmin ๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘–๐’™๐‘‡๐‘–

!

๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

หœ

๐‘”(2๐‘ฆ๐‘–๐’™๐‘‡๐‘– ๐’“ห†๐‘›)๐’™๐‘–๐’™๐‘‡๐‘– ๐‘šmax ๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘–๐’™๐‘‡๐‘–

! .

By definition of $\tilde{g}$, $\tilde{g}(x) \in (0, \infty)$ for all $x \in \mathbb{R}$. Moreover, the possible inputs to $\tilde{g}$ have bounded magnitude, since they are all of the form $2 y_i \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_n$, in which $|y_i| = \frac{1}{2}$, $\boldsymbol{x}_i$ belongs to a compact set, and $\|\hat{\boldsymbol{r}}_n\|_2 \leq b$ by Lemma 4. Therefore, all possible inputs to $\tilde{g}$ belong to a compact set. A continuous function over a compact set always attains its maximum and minimum values; therefore, there exist values $\tilde{g}_{\min}, \tilde{g}_{\max}$ such that $0 < \tilde{g}_{\min} \leq \tilde{g}(x) \leq \tilde{g}_{\max} < \infty$ for all possible inputs $x$ to $\tilde{g}$.

Therefore, ๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

หœ

๐‘”(2๐‘ฆ๐‘–๐’™๐‘‡๐‘– ๐’“ห†๐‘›)๐’™๐‘–๐’™๐‘‡๐‘– ๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

หœ

๐‘”min๐’™๐‘–๐’™๐‘‡๐‘– min{๐‘”หœmin,1}

"

๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘–๐’™๐‘‡๐‘–

# , and

๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

หœ

๐‘”(2๐‘ฆ๐‘–๐’™๐‘‡๐‘– ๐’“ห†๐‘›)๐’™๐‘–๐’™๐‘‡๐‘– ๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

หœ

๐‘”max๐’™๐‘–๐’™๐‘‡๐‘– max{๐‘”หœmax,1}

"

๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘–๐’™๐‘‡๐‘–

# ,

which proves the desired result for๐‘šmin =min{๐‘”หœmin,1}and๐‘šmax =max{๐‘”หœmax,1}.
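The sandwich relation can be verified numerically. The sketch below takes $\tilde{g}$ to be the derivative of the logistic sigmoid purely as an illustrative stand-in (the thesis's exact $\tilde{g}$ is defined elsewhere), and checks both Loewner inequalities by testing that the eigenvalues of the differences are nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 4, 200, 1.0

def g_tilde(x):
    # Illustrative stand-in for g~: derivative of the logistic sigmoid, in (0, 1/4].
    return np.exp(-np.abs(x)) / (1.0 + np.exp(-np.abs(x)))**2

X = rng.uniform(-1.0, 1.0, size=(n, d))          # bounded x_i
y = rng.choice([-0.5, 0.5], size=n)              # labels y_i
r_hat = rng.uniform(-1.0, 1.0, size=d)           # a bounded reward estimate

w = g_tilde(2.0 * y * (X @ r_hat))               # weights g~(2 y_i x_i^T r_hat)
M = lam * np.eye(d) + X.T @ X                    # M_n
M_tilde = lam * np.eye(d) + (X * w[:, None]).T @ X   # M~_n
m_min, m_max = min(w.min(), 1.0), max(w.max(), 1.0)

# A <= B in the Loewner order iff all eigenvalues of B - A are nonnegative.
print(np.linalg.eigvalsh(M_tilde - m_min * M).min() >= -1e-9)   # m_min M_n <= M~_n
print(np.linalg.eigvalsh(m_max * M - M_tilde).min() >= -1e-9)   # M~_n <= m_max M_n
```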

To finish proving convergence of the transition dynamics Bayesian model, Lemma 6 demonstrates that every state-action pair is visited infinitely often.

Lemma 6. Under DPS with preference-based RL, assume that the dynamics are modeled via a Dirichlet model and that the utilities are modeled via either the linear or logistic link function, with posterior sampling distributions given in Eq. (4.13). Then, every state-action pair is visited infinitely often.

This consistency result also holds when removing the $\beta_n(\delta)$ and $\beta_n^0(\delta)$ factors from the distributions in Eq. (4.13).

Proof. The proof proceeds by assuming that there exists a state-action pair that is visited only finitely many times. This assumption will lead to a contradiction¹: once this state-action pair is no longer visited, the posterior sampling distribution for the utilities $\boldsymbol{r}$ is no longer updated with respect to it. Then, DPS is guaranteed to eventually sample a high enough reward for this state-action pair that the resultant policy will prioritize visiting it.

First, note that DPS is guaranteed to reach at least one state-action pair infinitely often: given the problem's finite state and action spaces, at least one state-action pair must be visited infinitely often during DPS execution. If not all state-action pairs are visited infinitely often, there must exist a state-action pair $(s, a)$ such that $s$ is visited infinitely often, while $(s, a)$ is not. Otherwise, if all actions are selected infinitely often in all infinitely-visited states, the finitely-visited states are unreachable (in which case these states are irrelevant to the learning process and regret minimization, and can be ignored). Without loss of generality, this state-action pair $(s, a)$ is labeled as $\tilde{s}_1$. To reach a contradiction, it suffices to show that $\tilde{s}_1$ is visited infinitely often.

¹Note that in finite-horizon MDPs, the concept of visiting a state finitely many times is not the same as that of a transient state in an infinite Markov chain, because: 1) due to the finite horizon, the state is resampled from the initial state distribution $p_0(s)$ every $h$ timesteps, and 2) the policy, which determines which state-action pairs can be reached in an episode, is also resampled every $h$ timesteps.

Let ๐’“1 be the utility vector with a reward of 1 in state-action pair หœ๐‘ 1 and rewards of zero elsewhere. From Definition 6, ๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1) is the policy that maximizes the expected number of visits to หœ๐‘ 1under dynamics ๐’‘หœand utility vector๐’“1:

๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1) =argmax๐œ‹๐‘‰(๐’‘หœ,๐’“1, ๐œ‹),

where๐‘‰(๐’‘หœ,๐’“1, ๐œ‹) is the expected total reward of a length-โ„Žtrajectory under ๐’‘หœ,๐’“1, and๐œ‹, or equivalently (by definition of ๐’“1), the expected number of visits to state- action หœ๐‘ 1.

Next, it will be shown that there exists a $\rho > 0$ such that $P(\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)) > \rho$ for all possible values of $\tilde{\boldsymbol{p}}$. That is, for any sampled parameters $\tilde{\boldsymbol{p}}$, the probability of selecting policy $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ is uniformly lower-bounded, implying that DPS must eventually select $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$.

Let หœ๐‘Ÿ๐‘— be the sampled utility (also referred to as reward) associated with state- action pair หœ๐‘ ๐‘— in a particular DPS episode, for each state-action ๐‘— โˆˆ {1, . . . , ๐‘‘}, with ๐‘‘ = ๐‘† ๐ด. The proof will show that conditioned on ๐’‘, there existsหœ ๐‘ฃ > 0 such that if หœ๐‘Ÿ1 exceeds max{๐‘ฃ๐‘Ÿหœ2, ๐‘ฃ๐‘Ÿหœ3, . . . , ๐‘ฃ๐‘Ÿหœ๐‘‘}, then value iteration returns the policy ๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1), which is the policy maximizing the expected amount of time spent in

หœ

๐‘ 1. This can be seen by setting ๐‘ฃ := ๐œŒโ„Ž

1, where โ„Žis the time horizon and ๐œŒ1 is the expected number of visits to หœ๐‘ 1under๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1). Under this definition of๐‘ฃ, the event {๐‘Ÿหœ1 โ‰ฅmax{๐‘ฃ๐‘Ÿหœ2, ๐‘ฃ๐‘Ÿหœ3, . . . , ๐‘ฃ๐‘Ÿหœ๐‘‘}}is equivalent to{๐‘Ÿหœ1๐œŒ1 โ‰ฅ โ„Žmax{๐‘Ÿหœ2,๐‘Ÿหœ3, . . . ,๐‘Ÿหœ๐‘‘}}; the latter inequality implies that given ๐’‘หœand๐’“, the expected reward accumulated solelyหœ in state-action หœ๐‘ 1exceeds the reward gained by repeatedly (during all โ„Žtime-steps) visiting the state-action pair in the set {๐‘ หœ2, . . . ,๐‘ หœ๐‘‘} having the highest sampled reward. Clearly, in this situation, value iteration results in the policy๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1).

Next, it is shown that $v = \frac{h}{\rho_1}$ is continuous in the sampled dynamics $\tilde{\boldsymbol{p}}$ by showing that $\rho_1$ is continuous in $\tilde{\boldsymbol{p}}$. Recall that $\rho_1$ is defined as the expected number of visits to $\tilde{s}_1$ under $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$. This is equivalent to the expected reward for following $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ under dynamics $\tilde{\boldsymbol{p}}$ and rewards $\boldsymbol{r}_1$:
\begin{align}
\rho_1 = V(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)) = \max_\pi V(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi). \tag{B.4}
\end{align}
The value of any policy $\pi$ is continuous in the transition dynamics parameters, so $V(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi)$ is continuous in $\tilde{\boldsymbol{p}}$. The maximum in Eq. (B.4) is taken over the finite set of deterministic policies; because a maximum over a finite number of continuous functions is also continuous, $\rho_1$ is continuous in $\tilde{\boldsymbol{p}}$.

Next, recall that a continuous function on a compact set achieves its maximum and minimum values on that set. The set of all possible dynamics parameters $\tilde{\boldsymbol{p}}$ is such that for each state-action pair $j$, $\sum_{k=1}^S p_{jk} = 1$ and $p_{jk} \geq 0$ for all $k$; the set of all possible vectors $\tilde{\boldsymbol{p}}$ is clearly closed and bounded, and hence compact. Therefore, $v$ achieves its maximum and minimum values on this set, and for any $\tilde{\boldsymbol{p}}$, $v \in [v_{\min}, v_{\max}]$, where $v_{\min} > 0$ ($v$ is nonnegative by definition, and $v = 0$ is impossible, as it would imply that $\tilde{s}_1$ is unreachable).

Then, ๐‘ƒ(๐œ‹ = ๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1)) can be expressed in terms of ๐‘ฃ and the parameters of the reward posterior. Firstly,

๐‘ƒ(๐œ‹ =๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1)) โ‰ฅ ๐‘ƒ(๐‘Ÿหœ1 >max{๐‘ฃ๐‘Ÿหœ2, ๐‘ฃ๐‘Ÿหœ3, . . . , ๐‘ฃ๐‘Ÿหœ๐‘‘}) โ‰ฅ

๐‘‘

ร–

๐‘—=2

๐‘ƒ(๐‘Ÿหœ1> ๐‘ฃ๐‘Ÿหœ๐‘—).

In the ๐‘›๐‘ก โ„Ž DPS iteration, the sampled rewards are drawn from a jointly Gaussian posterior: ๐’“หœ โˆผ N (๐(๐‘›),ฮฃ(๐‘›)) for some ๐(๐‘›) and ฮฃ(๐‘›), where [๐(๐‘›)]๐‘— = ๐œ‡(๐‘›)

๐‘— and

[ฮฃ(๐‘›)]๐‘— ๐‘˜ = ฮฃ(๐‘›)

๐‘— ๐‘˜ . Then,(๐‘Ÿหœ1โˆ’๐‘ฃ๐‘Ÿหœ๐‘—) โˆผ N (๐œ‡(

๐‘›) 1 โˆ’๐‘ฃ ๐œ‡(

๐‘›)

๐‘— , ฮฃ11(๐‘›)+๐‘ฃ2ฮฃ(๐‘— ๐‘—๐‘›)โˆ’2๐‘ฃฮฃ(1๐‘›)

๐‘— ), so that:

๐‘ƒ(๐œ‹๐‘›1 =๐œ‹๐‘ฃ๐‘–(๐’‘หœ,๐’“1)) โ‰ฅ

๐‘‘

ร–

๐‘—=2

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

1โˆ’ฮฆยฉ

ยญ

ยญ

ยซ

โˆ’๐œ‡(๐‘›)

1 +๐‘ฃ ๐œ‡(๐‘›)

๐‘—

q ฮฃ(๐‘›)

11 +๐‘ฃ2ฮฃ(๐‘›)๐‘— ๐‘— โˆ’2๐‘ฃฮฃ(๐‘›)

1๐‘—

ยช

ยฎ

ยฎ

ยฌ

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

=

๐‘‘

ร–

๐‘—=2

ฮฆยฉ

ยญ

ยญ

ยซ

๐œ‡(

๐‘›) 1 โˆ’๐‘ฃ ๐œ‡(

๐‘›) ๐‘—

q ฮฃ(๐‘›)

11 +๐‘ฃ2ฮฃ(๐‘›)

๐‘— ๐‘— โˆ’2๐‘ฃฮฃ(๐‘›)

1๐‘—

ยช

ยฎ

ยฎ

ยฌ

, (B.5)

where $\Phi$ is the standard Gaussian cumulative distribution function. For the right-hand expression in Eq. (B.5) to have a lower bound greater than zero, the argument of $\Phi(\cdot)$ must be lower-bounded. It suffices to upper-bound the numerator's magnitude and to lower-bound the denominator above zero for each product factor $j$ and over all iterations $n$.
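The product of Gaussian CDFs on the right-hand side of Eq. (B.5) is straightforward to evaluate from the posterior parameters; a sketch with illustrative $\boldsymbol{\mu}^{(n)}$, $\Sigma^{(n)}$, and $v$ (all values hypothetical):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, v = 5, 2.0
mu = np.array([0.5, 0.1, -0.2, 0.3, 0.0])     # posterior mean mu^(n)
A = rng.uniform(-1.0, 1.0, size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)             # a positive definite posterior covariance

lower_bound = 1.0
for j in range(1, d):                         # j = 2, ..., d in the text's 1-based indexing
    mean = mu[0] - v * mu[j]                  # mean of r~_1 - v r~_j
    var = Sigma[0, 0] + v**2 * Sigma[j, j] - 2.0 * v * Sigma[0, j]
    lower_bound *= norm.cdf(mean / np.sqrt(var))
print(f"product lower bound on P(pi = pi_vi): {lower_bound:.4f}")
```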

The numerator can be upper-bounded using Lemma 4. Since $\boldsymbol{\mu}^{(n)}$ equals $\hat{\boldsymbol{r}}_n$ at iteration $n$, $\|\boldsymbol{\mu}^{(n)}\|_2 \leq b$; therefore, $|\mu_1^{(n)}|, |\mu_j^{(n)}| \leq b$. Because $0 < v \leq v_{\max}$,
\[
|\mu_1^{(n)} - v\mu_j^{(n)}| \leq |\mu_1^{(n)}| + v|\mu_j^{(n)}| \leq (1 + v_{\max})b.
\]

To lower-bound the denominator, first note that it is equal to $\sqrt{\boldsymbol{w}_j^T \Sigma^{(n)} \boldsymbol{w}_j}$, in which $\boldsymbol{w}_j \in \mathbb{R}^d$ is defined as the vector with 1 in the first position, $-v$ in the $j$th position for some $j \in \{2, \ldots, d\}$, and zero elsewhere:
\begin{align}
\boldsymbol{w}_j := [1, 0, \ldots, 0, -v, 0, \ldots, 0]^T. \tag{B.6}
\end{align}
Equivalently, it must be shown that $\boldsymbol{w}_j^T \Sigma^{(n)} \boldsymbol{w}_j$ is lower-bounded above zero. By Lemma 5, it holds that $\Sigma^{(n)} \succeq \frac{\beta_n(\delta)^2}{m_{\max}} M_n^{-1}$, implying that $\boldsymbol{w}_j^T \Sigma^{(n)} \boldsymbol{w}_j \geq \frac{\beta_n(\delta)^2}{m_{\max}} \boldsymbol{w}_j^T M_n^{-1} \boldsymbol{w}_j$. Because $m_{\max}$ is a constant and $\beta_n(\delta)$, defined in Eq. (4.4), is non-decreasing in $n$, it suffices to prove that $\boldsymbol{w}_j^T M_n^{-1} \boldsymbol{w}_j$ is lower-bounded above zero. (Thus, the result holds regardless of the presence of $\beta_n(\delta)$ in the utility sampling distribution.)

Recall from Definition 7 that the eigenvectors of $M_n^{-1}$ are $\boldsymbol{u}_1^{(n)}, \ldots, \boldsymbol{u}_d^{(n)}$, with corresponding eigenvalues $\bigl(\nu_1^{(n)}\bigr)^{-1}, \ldots, \bigl(\nu_d^{(n)}\bigr)^{-1}$. The vector $\boldsymbol{w}_j$ can be written in terms of the orthonormal basis formed by the eigenvectors $\{\boldsymbol{u}_k^{(n)}\}$:
\begin{align}
\boldsymbol{w}_j = \sum_{k=1}^d \alpha_k^{(n)} \boldsymbol{u}_k^{(n)}, \tag{B.7}
\end{align}

for some coefficients $\alpha_k^{(n)} \in \mathbb{R}$. Using Eq. (B.7), the quantity to be lower-bounded can now be written as:

๐’˜๐‘‡๐‘—๐‘€โˆ’1

๐‘› ๐’˜๐‘— =

๐‘‘

ร•

๐‘˜=1

๐›ผ(

๐‘›) ๐‘˜ ๐’–(๐‘›)๐‘‡

๐‘˜

! ๐‘‘ ร•

๐‘™=1

1 ๐œˆ(

๐‘›) ๐‘™

๐’–(๐‘›)

๐‘™ ๐’–(๐‘›)๐‘‡

๐‘™

! ๐‘‘ ร•

๐‘š=1

๐›ผ(

๐‘›) ๐‘š ๐’–(๐‘š๐‘›)

!

(๐‘Ž)=

๐‘‘

ร•

๐‘˜=1

๐›ผ(๐‘›)

๐‘˜

2 1 ๐œˆ(๐‘›)

๐‘˜ (๐‘)

โ‰ฅ

๐›ผ(๐‘›)

๐‘˜0

2 1 ๐œˆ(๐‘›)

๐‘˜0

, (B.8)

where equality (a) follows by orthonormality of the eigenvector basis, and (b) holds for any $k_0 \in \{1, \ldots, d\}$ due to positivity of the eigenvalues $(\nu_k^{(n)})^{-1}$. Therefore, to show that the denominator is bounded away from zero, it suffices to show that for every $n$, there exists some $k_0$ such that $\left(\alpha_{k_0}^{(n)}\right)^2 \left(\nu_{k_0}^{(n)}\right)^{-1}$ is bounded away from zero.
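The chain in Eq. (B.8) is simply the spectral decomposition of the quadratic form; it can be confirmed numerically on a random instance (illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, v = 5, 1.0, 2.0
X = rng.uniform(-1.0, 1.0, size=(50, d))
M = lam * np.eye(d) + X.T @ X                  # M_n

nu, U = np.linalg.eigh(M)                      # eigenvalues nu_k, orthonormal eigenvectors u_k
w = np.zeros(d)
w[0], w[2] = 1.0, -v                           # w_j per Eq. (B.6), with j = 3 here

alpha = U.T @ w                                # coefficients alpha_k of w_j in the eigenbasis
lhs = w @ np.linalg.solve(M, w)                # w_j^T M_n^{-1} w_j
rhs = np.sum(alpha**2 / nu)                    # sum_k alpha_k^2 / nu_k, as in Eq. (B.8)
print(np.isclose(lhs, rhs), lhs)
```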

To prove the previous statement, note that by definition of $M_n$, the eigenvalues $(\nu_k^{(n)})^{-1}$ are non-increasing in $n$. Below, the proof will show that for any eigenvalue $(\nu_k^{(n)})^{-1}$ such that $\lim_{n \longrightarrow \infty} (\nu_k^{(n)})^{-1} = 0$, the first element of its corresponding eigenvector, $\left[\boldsymbol{u}_k^{(n)}\right]_1$, also converges to zero. Since the first element of $\boldsymbol{w}_j$ equals 1, Eq. (B.6) implies that there must exist some $k_0$ such that $\left[\boldsymbol{u}_{k_0}^{(n)}\right]_1 \not\longrightarrow 0$ and $\alpha_{k_0}^{(n)}$ is bounded away from 0; if these implications did not hold, then $\boldsymbol{w}_j$ would not have a value of 1 in its first element, contradicting its definition. These observations imply that for every $n$, there must be some $k_0$ such that as $n \longrightarrow \infty$, $(\nu_{k_0}^{(n)})^{-1} \not\longrightarrow 0$ and $\alpha_{k_0}^{(n)}$ is bounded away from zero.

Let๐‘‹๐‘›denote the observation matrix after๐‘›โˆ’1 observations:๐‘‹๐‘› := h

๐’™1 . . . ๐’™๐‘›โˆ’1

i๐‘‡ . Then, ๐‘€โˆ’1

๐‘› = (๐‘‹๐‘‡

๐‘›๐‘‹๐‘›+๐œ† ๐ผ)โˆ’1. The matrices ๐‘€โˆ’1

๐‘› and ๐‘‹๐‘‡

๐‘›๐‘‹๐‘› have the same eigen- vectors. Meanwhile, for each eigenvalue (๐œˆ(

๐‘›)

๐‘– )โˆ’1of ๐‘€โˆ’1

๐‘› , ๐‘‹๐‘‡

๐‘›๐‘‹๐‘› has an eigenvalue ๐œ‰(๐‘›)

๐‘– :=๐œˆ(๐‘›)

๐‘– โˆ’๐œ†โ‰ฅ 0 corresponding to the same eigenvector. We aim to characterize the eigenvectors of๐‘€โˆ’1

๐‘› whose eigenvalues approach zero. Since these eigenvectors are identical to those of๐‘‹๐‘‡

๐‘›๐‘‹๐‘›whose eigenvalues approach infinity, the latter can be considered instead.

Without loss of generality, assume that all finitely-visited state-action pairs (including $\tilde{s}_1$) occur in the first $m < n - 1$ iterations, and index these finitely-visited state-action pairs from 1 to $r \geq 1$, so that the finitely-visited state-actions are $\{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_r\}$. Let $X_{1:m} \in \mathbb{R}^{m \times d}$ denote the matrix containing the first $m$ rows of $X_n$, while $X_{m+1:n} \in \mathbb{R}^{(n-1-m) \times d}$ denotes the remaining rows of $X_n$. With this notation,

๐‘‹๐‘‡

๐‘›๐‘‹๐‘›=

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘–๐’™๐‘‡๐‘– = ๐‘‹๐‘‡

1:๐‘š๐‘‹1:๐‘š +๐‘‹๐‘‡

๐‘š+1:๐‘›๐‘‹๐‘š+1:๐‘›.

Because the first $r$ state-action pairs, $\{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_r\}$, are unvisited after iteration $m$, the first $r$ elements of $\boldsymbol{x}_i$ are zero for all $i > m$. Therefore, $X_{m+1:n}^T X_{m+1:n}$ can be written in the following block matrix form:

๐‘‹๐‘‡

๐‘š+1:๐‘›๐‘‹๐‘š+1:๐‘›=

"

๐‘‚๐‘Ÿร—๐‘Ÿ ๐‘‚๐‘Ÿร—(๐‘‘โˆ’๐‘Ÿ)

๐‘‚(๐‘‘โˆ’๐‘Ÿ)ร—๐‘Ÿ ๐ด๐‘›

# ,

where ๐‘‚๐‘Žร—๐‘ denotes the all-zero matrix with dimensions ๐‘Ž ร— ๐‘. The matrix ๐ด๐‘› includes elements that are unbounded as ๐‘› โˆ’โ†’ โˆž. In particular, the diagonal elements of ๐ด๐‘› approach infinity as๐‘› โˆ’โ†’ โˆž. The matrix ๐‘‹๐‘‡

๐‘›๐‘‹๐‘› can be written in the following block matrix form:

๐‘‹๐‘‡

๐‘›๐‘‹๐‘› =๐‘‹๐‘‡

1:๐‘š๐‘‹1:๐‘š+ ๐‘‹๐‘‡

๐‘š+1:๐‘›๐‘‹๐‘š+1:๐‘›

=

"

[๐‘‹๐‘‡

1:๐‘š

๐‘‹1:๐‘š](1:๐‘Ÿ ,1:๐‘Ÿ) [๐‘‹๐‘‡

1:๐‘š

๐‘‹1:๐‘š](1:๐‘Ÿ ,๐‘Ÿ+1:๐‘‘)

[๐‘‹๐‘‡

1:๐‘š๐‘‹1:๐‘š](๐‘Ÿ+1:๐‘‘ ,1:๐‘Ÿ) [๐‘‹๐‘‡

1:๐‘š๐‘‹1:๐‘š](๐‘Ÿ+1:๐‘‘ ,๐‘Ÿ+1:๐‘‘)+ ๐ด๐‘›

# :=

"

๐ต ๐ถ

๐ถ๐‘‡ ๐ท๐‘›

# , where๐‘€(๐‘Ž:๐‘,๐‘:๐‘‘) denotes the submatrix of๐‘€ obtained by extracting rows๐‘Žthrough ๐‘and columns๐‘through๐‘‘. Because matrices๐ตand๐ถonly depend upon๐‘‹1:๐‘š, they are fixed as๐‘›increases, while matrix๐ท๐‘›contains values that grow towards infinity with increasing ๐‘›. In particular, all elements along ๐ท๐‘›โ€™s diagonal are unbounded.

Intuitively, in the limit, $B$ and $C$ are close to zero compared to $D_n$, and $X_n^T X_n$ (when normalized) increasingly resembles a matrix in which only the bottom-right block is nonzero. This intuitive notion is formalized next.

Consider an eigenpair $\left(\boldsymbol{u}_i^{(n)}, \xi_i^{(n)}\right)$ of $X_n^T X_n$ such that $\lim_{n \longrightarrow \infty} \xi_i^{(n)} = \infty$. The following argument shows that the first element of $\boldsymbol{u}_i^{(n)}$ must approach 0. Letting $\boldsymbol{u}_i^{(n)} = \left[\boldsymbol{z}_i^{(n)T}\ \boldsymbol{q}_i^{(n)T}\right]^T$, where $\boldsymbol{z}_i^{(n)} \in \mathbb{R}^r$ and $\boldsymbol{q}_i^{(n)} \in \mathbb{R}^{d-r}$:
\[
\left(X_n^T X_n\right) \boldsymbol{u}_i^{(n)} = \begin{bmatrix} B & C \\ C^T & D_n \end{bmatrix} \begin{bmatrix} \boldsymbol{z}_i^{(n)} \\ \boldsymbol{q}_i^{(n)} \end{bmatrix} = \begin{bmatrix} B\boldsymbol{z}_i^{(n)} + C\boldsymbol{q}_i^{(n)} \\ C^T\boldsymbol{z}_i^{(n)} + D_n\boldsymbol{q}_i^{(n)} \end{bmatrix} = \xi_i^{(n)} \begin{bmatrix} \boldsymbol{z}_i^{(n)} \\ \boldsymbol{q}_i^{(n)} \end{bmatrix}.
\]

Dividing both sides by $\xi_i^{(n)}$:
\[
\frac{1}{\xi_i^{(n)}} X_n^T X_n \begin{bmatrix} \boldsymbol{z}_i^{(n)} \\ \boldsymbol{q}_i^{(n)} \end{bmatrix} = \begin{bmatrix} \frac{1}{\xi_i^{(n)}}\left(B\boldsymbol{z}_i^{(n)} + C\boldsymbol{q}_i^{(n)}\right) \\ \frac{1}{\xi_i^{(n)}}\left(C^T\boldsymbol{z}_i^{(n)} + D_n\boldsymbol{q}_i^{(n)}\right) \end{bmatrix} = \begin{bmatrix} \boldsymbol{z}_i^{(n)} \\ \boldsymbol{q}_i^{(n)} \end{bmatrix}.
\]

In the upper matrix block, $\lim_{n \longrightarrow \infty} \xi_i^{(n)} = \infty$; $B$ and $C$ are fixed as $n$ increases; and $\boldsymbol{z}_i^{(n)}$ and $\boldsymbol{q}_i^{(n)}$ have upper-bounded elements because $\boldsymbol{u}_i^{(n)}$ is a unit vector. Thus,
\[
\lim_{n \longrightarrow \infty} \boldsymbol{z}_i^{(n)} = \lim_{n \longrightarrow \infty} \frac{1}{\xi_i^{(n)}} \left(B\boldsymbol{z}_i^{(n)} + C\boldsymbol{q}_i^{(n)}\right) = 0.
\]
In particular, the first element of $\boldsymbol{z}_i^{(n)}$ converges to zero, implying that the same is true of $\boldsymbol{u}_i^{(n)}$.
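This limiting behavior is easy to observe in simulation: as the lower-right block grows, the eigenvectors associated with the diverging eigenvalues concentrate on the frequently-visited coordinates, and their first elements shrink toward zero. A sketch with synthetic blocks (all sizes and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 2, 6                                   # r finitely-visited coordinates out of d
Xm = rng.uniform(-1.0, 1.0, size=(20, d))
base = Xm.T @ Xm                              # X_{1:m}^T X_{1:m}: supplies B, C, and part of D_n

for n in [10, 100, 1_000, 10_000]:
    G = base.copy()
    G[r:, r:] += n * np.eye(d - r)            # A_n: diagonal growth on the visited block
    xi, U = np.linalg.eigh(G)                 # eigenpairs of X_n^T X_n
    diverging = xi > 0.5 * xi.max()           # eigenvalues growing like n
    print(f"n = {n:6d}:  max |first element| of diverging eigvecs = "
          f"{np.abs(U[0, diverging]).max():.2e}")
```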

As justified above, this result implies that for each iteration $n$, there exists an index $k_0 \in \{1, \ldots, d\}$ such that the right-hand side of Eq. (B.8) has a lower bound above zero. This completes the proof that the denominator in Eq. (B.5) does not decay to zero. As a result, there exists some $\rho > 0$ such that $P(\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)) \geq \rho > 0$.

In consequence, DPS is guaranteed to infinitely often sample pairs $(\tilde{\boldsymbol{p}}, \pi)$ such that $\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$; that is, DPS infinitely often samples policies that prioritize reaching $\tilde{s}_1$ as quickly as possible, and such a policy always takes action $a$ in state $s$. Furthermore, because $s$ is visited infinitely often, either a) $p_0(s) > 0$, or b) the infinitely-visited state-action pairs include a path with a nonzero probability of reaching $s$. In case a), since the initial state distribution is fixed, the MDP will infinitely often begin in state $s$ under the policy $\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$, so $\tilde{s}_1$ will be visited infinitely often. In case b), due to Lemma 3, the transition dynamics parameters for state-actions along the path to $s$ converge to their true values (intuitively, the algorithm knows how to reach $s$); in episodes with the policy $\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$, which is selected infinitely often, DPS is thus guaranteed to reach $\tilde{s}_1$ infinitely often. This is a contradiction, proving that every state-action pair must be visited infinitely often.

The direct combination of Lemmas 3 and 6 proves asymptotic consistency of the transition dynamics model:

Proposition 3. Assume that DPS is executed in the preference-based RL setting, with transition dynamics modeled via a Dirichlet model, utilities modeled via either the linear or logistic link function, and utility posterior sampling distributions given in Eq. (4.13). Then, the sampled transition dynamics $\tilde{\boldsymbol{p}}_{i1}, \tilde{\boldsymbol{p}}_{i2}$ converge in distribution to the true dynamics $\boldsymbol{p}$: $\tilde{\boldsymbol{p}}_{i1}, \tilde{\boldsymbol{p}}_{i2} \xrightarrow{D} \boldsymbol{p}$. This consistency result also holds when removing the $\beta_n(\delta)$ factors from the distributions in Eq. (4.13).
