• Tidak ada hasil yang ditemukan

Asymptotic Consistency of the Utilities in DPS

Dalam dokumen Applications to Exoskeleton Gait Optimization (Halaman 188-200)

Chapter VI: Conclusions and Future Directions

B.3 Asymptotic Consistency of the Utilities in DPS

Proposition 3.Assume that DPS is executed in the preference-based RL setting, with transition dynamics modeled via a Dirichlet model, utilities modeled via either the linear or logistic link function, and utility posterior sampling distributions given in Eq.(4.13). Then, the sampled transition dynamics ๐’‘หœ๐‘–1, ๐’‘หœ๐‘–2converge in distribution to the true dynamics, ๐’‘หœ๐‘–1, ๐’‘หœ๐‘–2

โˆ’โ†’๐ท ๐’‘. This consistency result also holds when removing the๐›ฝ๐‘›(๐›ฟ)factors from the distributions in Eq.(4.13).

Because the probability distribution of||๐’›๐‘–||2is fixed, there exists some fixed๐‘Ž > 0 such that with probability at least 1โˆ’๐›ฟ, ||๐’›๐‘–||2 โ‰ค ๐‘Ž. So, for each๐‘–, with probability at least 1โˆ’๐›ฟ,

||๐’“หœ๐‘–1โˆ’๐’“ห†๐‘–||๐‘€หœ

๐‘– = ๐›ฝ๐‘–(๐›ฟ) ||๐’›๐‘–||2 โ‰ค ๐›ฝ๐‘–(๐›ฟ)๐‘Ž .

The previous statement can be combined with the high-probability inequality||๐’“ห†๐‘–โˆ’ ๐’“||๐‘€หœ

๐‘– โ‰ค ๐‘šmax๐›ฝ๐‘–(๐›ฟ)to obtain that for each๐‘–, with probability at least 1โˆ’๐›ฟ,

||๐’“หœ๐‘–1โˆ’๐’“||๐‘€หœ

๐‘– โ‰ค ||๐’“หœ๐‘–1โˆ’๐’“ห†๐‘–||๐‘€หœ

๐‘– + ||๐’“ห†๐‘–โˆ’๐’“||๐‘€หœ

๐‘– โ‰ค (๐‘Ž+๐‘šmax)๐›ฝ๐‘–(๐›ฟ).

Taking squares and dividing by๐›ฝ๐‘–(๐›ฟ) yields that for each๐‘–, with probability at least 1โˆ’๐›ฟ,

1

๐›ฝ๐‘–(๐›ฟ)2(๐’“หœ๐‘–1โˆ’๐’“)๐‘‡๐‘€หœ๐‘–(๐’“หœ๐‘–1โˆ’๐’“) โ‰ค (๐‘Ž+๐‘šmax)2. By assumption, ๐œ†

(๐‘–) ๐‘‘

๐›ฝ๐‘–(๐›ฟ)2

โˆ’โ†’ โˆž๐ท as ๐‘– โˆ’โ†’ โˆž. Recall from Definition 7 that ๐’—(๐‘—๐‘–), ๐‘— โˆˆ {1, . . . , ๐‘‘}, represent the eigenvectors of หœ๐‘€๐‘– corresponding to the eigenvalues๐œ†(๐‘–)

๐‘— . Then, with probability at least 1โˆ’๐›ฟ for each๐‘–:

1

๐›ฝ๐‘–(๐›ฟ)2(๐’“หœ๐‘–1โˆ’ ๐’“)๐‘‡๐‘€หœ๐‘–(๐’“หœ๐‘–1โˆ’๐’“) = 1

๐›ฝ๐‘–(๐›ฟ)2(๐’“หœ๐‘–1โˆ’๐’“)๐‘‡ ยฉ

ยญ

ยซ

๐‘‘

ร•

๐‘—=1

๐œ†(๐‘–)

๐‘— ๐’—(๐‘–)๐‘— ๐’—(๐‘–)๐‘‡๐‘— ยช

ยฎ

ยฌ

(๐’“หœ๐‘–1โˆ’๐’“)

= 1 ๐›ฝ๐‘–(๐›ฟ)2

๐‘‘

ร•

๐‘—=1

๐œ†(

๐‘–) ๐‘—

(๐’“หœ๐‘–1โˆ’ ๐’“)๐‘‡๐’—(๐‘—๐‘–)2

โ‰ค (๐‘Ž+๐‘šmax)2. (B.11)

Since ๐œ†

(๐‘–) ๐‘—

๐›ฝ๐‘–(๐›ฟ)2 โˆ’โ†’ โˆž as ๐‘– โˆ’โ†’ โˆž for each ๐‘—, and ๐’—(๐‘–)๐‘— is an orthonormal basis, the constant bound of(๐‘Ž+๐‘šmax)2in Eq. (B.11) is violated if we do not have๐’“หœ๐‘–1โˆ’๐’“ โˆ’โ†’๐ท 0.

Eq. (B.11) must hold with probability at least 1โˆ’๐›ฟindependently for each iteration ๐‘–, with the (1โˆ’๐›ฟ)-probability due entirely to randomness in the posterior sampling distribution, given by Eq. (B.9). Therefore, it follows that๐’“หœ๐‘–1

โˆ’โ†’๐ท ๐’“. The proof that ๐’“หœ๐‘–2

โˆ’โ†’๐ท ๐’“with high probability is identical.

The analysis will show convergence in distribution of the reward samples,๐’“หœ๐‘–1,๐’“หœ๐‘–2

โˆ’โ†’๐ท

๐’“, by applying Lemma 7 and demonstrating that 1

๐›ฝ๐‘–(๐›ฟ)2๐œ†(๐‘–)

๐‘‘ โˆ’โ†’ โˆžas๐‘– โˆ’โ†’ โˆž. This result is proven by contradiction: intuitively, if 1

๐›ฝ๐‘–(๐›ฟ)2๐œ†(

๐‘–)

๐‘‘ is upper-bounded, then

DPS has a lower-bounded probability of selecting policies that increase๐œ†(๐‘–)

๐‘‘ . First, Lemma 8 demonstrates that under the contradiction hypothesis, there is a non- decaying probability of sampling rewards ๐’“หœ๐‘–1,๐’“หœ๐‘–2 that are highly-aligned with the eigenvector๐’—(๐‘–)

๐‘‘ of หœ๐‘€โˆ’1

๐‘– corresponding to its largest eigenvalue, (๐œ†(

๐‘–) ๐‘‘ )โˆ’1. Lemma 8. Assume that for a given iteration๐‘–, ๐›ฝ๐‘–(๐›ฟ)2

๐œ†(๐‘–)

๐‘‘

โˆ’1

โ‰ฅ ๐›ผ. Then for any ๐‘Ž > 0, the reward samples๐’“หœ๐‘–1,๐’“หœ๐‘–2satisfy:

๐‘ƒ

หœ

๐’“๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ โ‰ฅ ๐‘Žmax

๐‘— < ๐‘‘

|๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— |

โ‰ฅ ๐‘(๐‘Ž) > 0, (B.12) ๐‘ƒ

๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)๐‘‘ โ‰ค โˆ’๐‘Žmax

๐‘— < ๐‘‘

|๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)๐‘— |

โ‰ฅ ๐‘(๐‘Ž) > 0, (B.13) where๐‘:R+ โˆ’โ†’R+is a continuous, monotonically-decreasing function.

Proof. Recall that the reward samples ๐’“หœ๐‘–1,๐’“หœ๐‘–2 are drawn according to the distribu- tion in Eq. (B.9). The proof begins by demonstrating that the reward samples can equivalently be expressed as:

๐’“หœ๐‘–1= ๐’“ห†๐‘–+๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘—=1

๐œ†(

๐‘–) ๐‘—

โˆ’12

๐‘ง๐‘– ๐‘—๐’—(๐‘—๐‘–), ๐‘ง๐‘– ๐‘— โˆผ N (0,1) i.i.d., (B.14) and similarly for ๐’“หœ๐‘–2. Similarly to Eq. (B.9), the expression in Eq. (B.14) has a multivariate Gaussian distribution. One can take the expectation and covariance of Eq. (B.14) with respect to the variables{๐‘ง๐‘– ๐‘—}to show that they match the expressions in Eq. (B.9):

E

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ห†

๐’“๐‘–+๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘—=1

๐œ†(๐‘–)

๐‘—

โˆ’12 ๐‘ง๐‘– ๐‘—๐’—(๐‘–)๐‘—

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

= ๐’“ห†๐‘–+๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘—=1

๐œ†(๐‘–)

๐‘—

โˆ’12

E[๐‘ง๐‘– ๐‘—]๐’—(๐‘–)๐‘— = ๐’“ห†๐‘–,

Cov

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

๐’“ห†๐‘–+๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘—=1

๐œ†(

๐‘–) ๐‘—

โˆ’12

๐‘ง๐‘– ๐‘—๐’—(๐‘—๐‘–)

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

(๐‘Ž)= E

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

ยฉ

ยญ

ยซ ๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘—=1

๐œ†(

๐‘–) ๐‘—

โˆ’12

๐‘ง๐‘– ๐‘—๐’—(๐‘—๐‘–)ยช

ยฎ

ยฌ ๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘˜=1

๐œ†(

๐‘–) ๐‘˜

โˆ’12

๐‘ง๐‘– ๐‘˜๐’—(๐‘–)

๐‘˜

!๐‘‡

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

= ๐›ฝ๐‘–(๐›ฟ)2

๐‘‘

ร•

๐‘—=1 ๐‘‘

ร•

๐‘˜=1

๐œ†(๐‘–)

๐‘—

โˆ’12 ๐œ†(๐‘–)

๐‘˜

โˆ’12

๐’—(๐‘–)๐‘— ๐’—(๐‘–)๐‘‡

๐‘˜ E[๐‘ง๐‘– ๐‘—๐‘ง๐‘– ๐‘˜]

(๐‘)

= ๐›ฝ๐‘–(๐›ฟ)2

๐‘‘

ร•

๐‘—=1

๐œ†(๐‘–)

๐‘—

โˆ’1

๐’—(๐‘–)๐‘— ๐’—(๐‘–)๐‘‡๐‘— = ๐›ฝ๐‘–(๐›ฟ)2๐‘€หœโˆ’1

๐‘– , which match the expectation and covariance in Eq. (B.9). In the above, (a) applies the definition Cov[๐’™] =E[(๐’™โˆ’E[๐’™]) (๐’™โˆ’E[๐’™])๐‘‡], and (b) holds becauseE[๐‘ง๐‘– ๐‘—๐‘ง๐‘– ๐‘˜] = Cov[๐‘ง๐‘– ๐‘—๐‘ง๐‘– ๐‘˜] =๐›ฟ๐‘— ๐‘˜, where๐›ฟ๐‘— ๐‘˜ is the Kronecker delta function.

Next, it is shown that the probability that๐’“หœ๐‘–1is arbitrarily-aligned with๐’—(๐‘–)๐‘‘ is lower- bounded above zero: that is, there exists ๐‘ : R+ โˆ’โ†’ R+ such that for any ๐‘Ž > 0, ๐‘ƒ(๐’“หœ๐‘‡

๐‘–1๐’—(๐‘–)

๐‘‘ โ‰ฅ ๐‘Žmax๐‘— < ๐‘‘|๐’“หœ๐‘‡

๐‘–1๐’—(๐‘—๐‘–)|) โ‰ฅ ๐‘(๐‘Ž) > 0. This can be shown by bounding the terms |๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— |, ๐‘— < ๐‘‘, and ๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ . Firstly, the term |๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— |, ๐‘— < ๐‘‘, can be upper- bounded:

|๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— | (=๐‘Ž) ห†

๐’“๐‘‡๐‘–๐’—(๐‘–)๐‘— +๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘˜=1

๐œ†(๐‘–)

๐‘˜

โˆ’12

๐‘ง๐‘– ๐‘˜๐’—(๐‘–)๐‘‡๐‘˜ ๐’—(๐‘–)๐‘—

(๐‘)

= ห†

๐’“๐‘‡๐‘– ๐’—(๐‘–)๐‘— +๐›ฝ๐‘–(๐›ฟ) ๐œ†(๐‘–)

๐‘—

โˆ’12 ๐‘ง๐‘– ๐‘—

โ‰ค |๐’“ห†๐‘‡๐‘– ๐’—(๐‘—๐‘–)| + ๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘—

โˆ’12

|๐‘ง๐‘– ๐‘—| (

๐‘)

โ‰ค ||๐’“ห†๐‘–||2||๐’—(๐‘—๐‘–)||2+๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘—

โˆ’12

|๐‘ง๐‘– ๐‘—|

(๐‘‘)

โ‰ค ๐‘+๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘—

โˆ’12

|๐‘ง๐‘– ๐‘—|,

where (a) applies Eq. (B.14), (b) follows from orthonormality of the eigenbasis, (c) follows from the Cauchy-Schwarz inequality, and (d) uses Lemma 4, which states that||๐’“ห†๐‘–||2 โ‰ค ๐‘. Similarly,๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ can be lower-bounded:

๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)

๐‘‘

(๐‘Ž)= ๐’“ห†๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘ +๐›ฝ๐‘–(๐›ฟ)

๐‘‘

ร•

๐‘—=1

๐œ†(

๐‘–) ๐‘—

โˆ’12

๐‘ง๐‘– ๐‘—๐’—(๐‘—๐‘–)๐‘‡๐’—(๐‘–)

๐‘‘ (๐‘)

= ๐’“ห†๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘ +๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘‘

โˆ’12

๐‘ง๐‘–1

โ‰ฅ โˆ’|๐’“ห†๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘ | +๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘‘

โˆ’12

๐‘ง๐‘–1

(๐‘)

โ‰ฅ โˆ’||๐’“ห†๐‘–||2||๐’—(๐‘–)

๐‘‘ ||2+๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘‘

โˆ’12

๐‘ง๐‘–1

(๐‘‘)

โ‰ฅ โˆ’๐‘+๐›ฝ๐‘–(๐›ฟ) ๐œ†(

๐‘–) ๐‘‘

โˆ’12

๐‘ง๐‘–1,

where as before, (a) applies Eq. (B.14), (b) follows from orthonormality of the eigen- basis, (c) follows from the Cauchy-Schwarz inequality, and (d) holds via Lemma 4.

Given these upper and lower bounds, the probability๐‘ƒ

๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ โ‰ฅ ๐‘Žmax๐‘— < ๐‘‘|๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— | can be lower-bounded:

๐‘ƒ

หœ

๐’“๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ โ‰ฅ ๐‘Žmax

๐‘— < ๐‘‘

|๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— | (๐‘Ž)

โ‰ฅ ๐‘ƒ

โˆ’๐‘+๐›ฝ๐‘–(๐›ฟ) ๐œ†(๐‘–)

๐‘‘

โˆ’12

๐‘ง๐‘–1 โ‰ฅ ๐‘Žmax

๐‘— < ๐‘‘

๐‘+๐›ฝ๐‘–(๐›ฟ) ๐œ†(๐‘–)

๐‘—

โˆ’12

|๐‘ง๐‘– ๐‘—|

=๐‘ƒ

ยฉ

ยญ

ยญ

ยซ ๐‘ง๐‘–1 โ‰ฅ

๐‘ q

๐œ†(๐‘–)

๐‘‘

๐›ฝ๐‘–(๐›ฟ) +๐‘Žmax

๐‘— < ๐‘‘

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐‘

q ๐œ†(๐‘–)

๐‘‘

๐›ฝ๐‘–(๐›ฟ) + vu ut๐œ†(

๐‘–) ๐‘‘

๐œ†(

๐‘–) ๐‘—

|๐‘ง๐‘– ๐‘—|

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ยช

ยฎ

ยฎ

ยฌ

(๐‘)

โ‰ฅ ๐‘ƒ

๐‘ง๐‘–1 โ‰ฅ ๐‘

โˆš ๐›ผ

+๐‘Žmax

๐‘— < ๐‘‘

๐‘

โˆš ๐›ผ

+ |๐‘ง๐‘– ๐‘—|

=๐‘ƒ

๐‘ง๐‘–1 โ‰ฅ ๐‘(1+๐‘Ž)

โˆš ๐›ผ

+๐‘Žmax

๐‘— < ๐‘‘

|๐‘ง๐‘– ๐‘—|

:=๐‘(๐‘Ž) > 0, where (a) results from the upper and lower bounds derived above, and (b) follows because ๐œ†

(๐‘–) ๐‘‘

๐œ†(๐‘–)

๐‘—

โ‰ค 1 and ๐›ฝ๐‘–(๐›ฟ)

๐œ†(๐‘–)

๐‘‘

โˆ’12

โ‰ฅ โˆš

๐›ผby assumption. The function๐‘(๐‘Ž) >0 is continuous and decreasing in๐‘Ž.

By identical arguments, ๐‘ƒ(๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)๐‘‘ โ‰ค โˆ’๐‘Žmax๐‘— < ๐‘‘|๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)๐‘— |) โ‰ฅ ๐‘(๐‘Ž). Thus, for any ๐‘Ž > 0 and set of eigenvectors๐’—(๐‘–)๐‘— :

๐‘ƒ(๐’“หœ๐‘‡๐‘–1๐’—๐‘‘(๐‘–) โ‰ฅ ๐‘Žmax

๐‘— < ๐‘‘

|๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— |) โ‰ฅ ๐‘(๐‘Ž) > 0, ๐‘ƒ(๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)

๐‘‘ โ‰ค โˆ’๐‘Žmax

๐‘— < ๐‘‘

|๐’“หœ๐‘‡๐‘–2๐’—(๐‘—๐‘–)|) โ‰ฅ ๐‘(๐‘Ž) >0.

This alignment between sampled rewards ๐’“หœ๐‘–1, ๐’“หœ๐‘–2 and๐’—๐‘‘(๐‘–) can also be expressed as follows by taking๐‘Žhigh enough:

Lemma 9. Assume that for a given iteration๐‘–, ๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(๐‘–)

๐‘‘

โˆ’1

โ‰ฅ ๐›ผ. Then, for any ๐œ€ > 0, the following events have nonzero, uniformly-lower-bounded probability:

๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 โˆ’๐’—(๐‘–)๐‘‘ 2

< ๐œ€and

๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2 โˆ’ (โˆ’๐’—๐‘‘(๐‘–)) 2

< ๐œ€.

Proof. By Lemma 8, Eq.s (B.12) and (B.13) both hold. The events in Eq.s (B.12) and (B.13), {๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ โ‰ฅ ๐‘Žmax๐‘— < ๐‘‘|๐’“หœ๐‘‡๐‘–1๐’—(๐‘–)๐‘— |} and {๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)๐‘‘ โ‰ค โˆ’๐‘Žmax๐‘— < ๐‘‘|๐’“หœ๐‘‡๐‘–2๐’—(๐‘–)๐‘— |}, will be referred to as events๐ด(๐‘Ž)and๐ต(๐‘Ž), respectively. From Lemma 8,๐ด(๐‘Ž)and๐ต(๐‘Ž) have positive, lower-bounded probability for any๐‘Ž. The proof argues that regardless of the specific eigenbasis, the values of๐‘Ž required for the result to hold have some fixed upper bound.

First, note that under events ๐ด(๐‘Ž) and ๐ต(๐‘Ž), as ๐‘Ž โˆ’โ†’ โˆž, ||๐’“หœ๐’“หœ๐‘–1

๐‘–1||2 โˆ’โ†’ ๐’—(๐‘–)๐‘‘ and

๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2 โˆ’โ†’ โˆ’๐’—(๐‘–)๐‘‘ . Let๐œ€ > 0. Under event๐ด(๐‘Ž)for sufficiently-large๐‘Ž,

๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 โˆ’๐’—(๐‘–)๐‘‘ 2

<

๐œ€. Define๐‘Žmin,1(๐œ€,๐’—1(๐‘–), . . . ,๐’—๐‘‘(๐‘–)) as the minimum value of๐‘Žsuch that๐ด(๐‘Ž)implies

๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 โˆ’๐’—(๐‘–)๐‘‘ 2

โ‰ค ๐œ€

2, given the eigenbasis {๐’—1(๐‘–), . . . ,๐’—๐‘‘(๐‘–)}. Because the inequality defining๐ด(๐‘Ž)is continuous in๐‘Ž,๐’“หœ๐‘–1, and the eigenbasis{๐’—1(๐‘–), . . . ,๐’—(๐‘–)๐‘‘ }, the function ๐‘Žmin,1(๐œ€,๐’—(๐‘–)

1 , . . . ,๐’—(๐‘–)

๐‘‘ )is also continuous in the eigenbasis{๐’—(๐‘–)

1 , . . . ,๐’—(๐‘–)

๐‘‘ }. Because ๐‘Žmin,1(๐œ€,๐’—(๐‘–)

1 , . . . ,๐’—(๐‘–)

๐‘‘ )is positive for all{๐’—(๐‘–)

1 , . . . ,๐’—(๐‘–)

๐‘‘ }, and the set of all eigenbases {๐’—(๐‘–)

1 , . . . ,๐’—(๐‘–)

๐‘‘ }is compact, there exists๐‘Žmin,1(๐œ€)such that for any eigenbasis, if๐ด(๐‘Ž) holds for๐‘Ž โ‰ฅ ๐‘Žmin,1(๐œ€), then

๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 โˆ’๐’—(๐‘–)

๐‘‘

2

< ๐œ€.

By the same arguments, there exists ๐‘Žmin,2(๐œ€) such that for any eigenbasis, if ๐ต(๐‘Ž) holds for ๐‘Ž โ‰ฅ ๐‘Žmin,2(๐œ€), then

๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2 โˆ’ (โˆ’๐’—(๐‘–)๐‘‘ ) 2

< ๐œ€. Taking ๐‘Žmin(๐œ€) :=

max{๐‘Žmin,1(๐œ€), ๐‘Žmin,2(๐œ€)}, then for any๐‘Ž โ‰ฅ ๐‘Žmin(๐œ€), under events ๐ด(๐‘Ž) and ๐ต(๐‘Ž), both

๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 โˆ’๐’—(๐‘–)๐‘‘ 2

< ๐œ€and

๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2 โˆ’ (โˆ’๐’—(๐‘–)๐‘‘ ) 2

< ๐œ€hold.

The next step shows that given sampled rewards๐’“หœ๐‘–1,๐’“หœ๐‘–2that are highly-aligned with the eigenvector ๐’—(๐‘–)

๐‘‘ of หœ๐‘€๐‘– as in Lemma 9, there is a lower-bounded probability of sampling actions that have non-zero projection onto this eigenvector.

Importantly, for each possible eigenvector ๐’—(๐‘–)๐‘‘ , the analysis will assume that there exist two actions๐’™1,๐’™2 โˆˆ Asuch that:

| (๐’™2โˆ’๐’™1)๐‘‡๐’—(๐‘–)

๐‘‘ | โ‰ฅ ๐‘ > 0, (B.15)

where the constant๐‘is independent of the eigenvector๐’—(๐‘–)

๐‘‘ . If this is not satisfied, then๐’™๐‘‡๐’—(๐‘–)

๐‘‘ is the same (or arbitrarily-similar) for all actions ๐’™ โˆˆ A. In this case, the eigenvector ๐’—(๐‘–)๐‘‘ lies along a dimension that is irrelevant for learning the utilities๐’“. The corresponding eigenvalue๐œ†(๐‘–)

๐‘‘ is necessarily 0, since all vectors ๐’™๐‘– are orthogonal to ๐’—๐‘‘(๐‘–). Thus, the learning problem would remain equivalent if A were reduced to a lower-dimensional subspace to which ๐’—(๐‘–)๐‘‘ is orthogonal. Therefore, without loss of generality, this proof assumes that no such eigenvectors exist, that is, Eq. (B.15) is satisfied for all eigenvectors๐’—(๐‘–)

๐‘‘ of หœ๐‘€๐‘–. Lemma 10 (Generalized linear dueling bandit). Assume that there exists ๐‘–0 such that for๐‘– > ๐‘–0, ๐›ฝ๐‘–(๐›ฟ)2

๐œ†(๐‘–)

๐‘‘

โˆ’1

โ‰ฅ ๐›ผ. Then, there exists a constant ๐‘0 > 0 such that for๐‘– > ๐‘–0:

E h

๐’™๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘

i

โ‰ฅ ๐‘0> 0, (B.16)

where๐‘0> 0depends only on the true utility parameters๐’“, so that in particular, Eq.

(B.16)holds for any eigenvector๐’—(๐‘–)

๐‘‘ .

Proof. By Lemma 9, for any๐œ€ >0, the following events have nonzero, uniformly- lower-bounded probability:

๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 โˆ’๐’—(๐‘–)

๐‘‘

2

< ๐œ€and

๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2 โˆ’ (โˆ’๐’—(๐‘–)

๐‘‘ ) 2

< ๐œ€. Recall that in the bandit setting, actions are selected as follows in each iteration๐‘–:

๐’™๐‘–1=argsup

๐’™โˆˆA

๐’™๐‘‡๐’“หœ๐‘–1, and ๐’™๐‘–2 =argsup

๐’™โˆˆA

๐’™๐‘‡๐’“หœ๐‘–2. (B.17)

Note that for any๐’™ โˆˆ Aand unit vector ๐’“, ๐‘“1(๐’“) :=๐’™๐‘‡๐’“ is uniformly-continuous in ๐’“: if ||๐’“1โˆ’๐’“2||2 โ‰ค ๐›ฟ, then,

|๐‘“1(๐’“1) โˆ’ ๐‘“1(๐’“2) | โ‰ค |๐’™๐‘‡(๐’“1โˆ’๐’“2) | โ‰ค ||๐’™||2||๐’“1โˆ’๐’“2||2โ‰ค ๐ฟ ๐›ฟ,

where๐’™ โ‰ค ๐ฟ for all๐’™ โˆˆ A. Therefore, for any๐’—๐‘‘(๐‘–) and๐œ€0 > 0, there exists๐œ€1such that when๐œ€ < ๐œ€1:

๐’™๐‘‡๐‘–1๐’—(๐‘–)๐‘‘ โˆ’๐’™๐‘‡๐‘–1 ๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2

< ๐œ€0, and

๐’™๐‘‡๐‘–2(โˆ’๐’—(๐‘–)๐‘‘ ) โˆ’๐’™๐‘‡๐‘–2 ๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2

< ๐œ€0. (B.18) Similarly, for any unit vector ๐’“, ๐‘“2(๐’“) :=sup๐’™โˆˆA๐’™๐‘‡๐’“ is uniformly-continuous in ๐’“, since a continuous function over a compact set is uniformly-continuous, and the set of all unit vectors is compact. Therefore, for any๐’—(๐‘–)๐‘‘ and๐œ€0 > 0, there exists๐œ€2such that when๐œ€ < ๐œ€2:

sup

๐’™โˆˆA

๐’™๐‘‡๐’—๐‘‘(๐‘–) โˆ’ sup

๐’™โˆˆA

๐’™๐‘‡ ๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2

< ๐œ€0, and (B.19)

sup

๐’™โˆˆA

๐’™๐‘‡(โˆ’๐’—(๐‘–)

๐‘‘ ) โˆ’ sup

๐’™โˆˆA

๐’™๐‘‡ ๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2

< ๐œ€0.

Letting ๐œ€ โ‰ค min{๐œ€1, ๐œ€2}, |๐’™๐‘‡๐‘– ๐’—(๐‘–)๐‘‘ | can be lower-bounded as follows, where ๐‘ is defined as in Eq. (B.15):

0 < ๐‘ โ‰ค sup

๐’™1,๐’™2โˆˆA

| (๐’™2โˆ’๐’™1)๐‘‡๐’—(๐‘–)

๐‘‘ |=

sup

๐’™โˆˆA

๐’™๐‘‡๐’—(๐‘–)

๐‘‘ โˆ’ inf

๐’™โˆˆA๐’™๐‘‡๐’—(๐‘–)

๐‘‘

=

sup

๐’™โˆˆA

๐’™๐‘‡๐’—(๐‘–)๐‘‘ +sup

๐’™โˆˆA

๐’™๐‘‡(โˆ’๐’—(๐‘–)๐‘‘ )

(๐‘Ž)

โ‰ค 2๐œ€0+

sup

๐’™โˆˆA

๐’™๐‘‡ ๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 +sup

๐’™โˆˆA

๐’™๐‘‡ ๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2

(๐‘)= 2๐œ€0+

๐’™๐‘‡๐‘–1 ๐’“หœ๐‘–1

||๐’“หœ๐‘–1||2 +๐’™๐‘‡๐‘–2 ๐’“หœ๐‘–2

||๐’“หœ๐‘–2||2

(๐‘)

โ‰ค 4๐œ€0+ ๐’™๐‘‡๐‘–1๐’—(๐‘–)

๐‘‘ +๐’™๐‘‡๐‘–2(โˆ’๐’—(๐‘–)

๐‘‘ )

=4๐œ€0+ |๐’™๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘ |, where (a), (b), and (c) apply Eq.s (B.19), (B.17), and (B.18), respectively.

Let๐œ€0< ๐‘

8. Then,|๐’™๐‘‡๐‘–๐’—(๐‘–)

๐‘‘ | โ‰ฅ ๐‘

2 > 0 when the events of Lemma 9 hold. Because the latter have a non-decaying probability of occurrence ๐œŒ >0:

E h

๐’™๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘

i

โ‰ฅ ๐œŒ ๐‘

8 =๐‘0> 0.

With this result, one can finally prove the desired contradiction.

Lemma 11. As๐‘– โˆ’โ†’ โˆž, ๐›ฝ๐‘–(๐›ฟ)2

๐œ†(

๐‘–) ๐‘‘

โˆ’โ†’๐ท 0, where๐œ†(

๐‘–)

๐‘‘ is the minimum eigenvalue of๐‘€หœ๐‘–.

Proof. First, one can show via contradiction that lim inf

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(๐‘–)

๐‘‘

โˆ’1

=0. Assume that:

lim inf

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(๐‘–)

๐‘‘

โˆ’1

=2๐›ผ > 0. (B.20)

If Eq. (B.20) is true, then there exists๐‘–0such that for all๐‘– โ‰ฅ๐‘–0, ๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(๐‘–)

๐‘‘

โˆ’1

โ‰ฅ ๐›ผ. The following assumes that ๐‘– โ‰ฅ ๐‘–0. Since ๐›ฝ๐‘–(๐›ฟ) increases at most logarithmically in๐‘–, it suffices to show that๐œ†(๐‘–)

๐‘‘ increases at least linearly on average to achieve a contradiction with Eq. (B.20).

Under the contradiction hypothesis, Lemmas 8 and 10 both hold. Due to Lemma 10, DPS will infinitely often, and at a non-decaying rate, sample pairs๐’™๐‘–1,๐’™๐‘–2such that |๐’™๐‘–๐‘‡๐’—๐‘‘(๐‘–)| is lower-bounded away from zero. At iteration ๐‘›, this proof analyzes the effect of this guarantee upon๐œ†(๐‘›)

๐‘‘ . Using that๐œ†(๐‘›)

๐‘‘ corresponds to the eigenvector ๐’—๐‘‘(๐‘›) of หœ๐‘€๐‘›:

๐œ†(๐‘›)

๐‘‘ =๐’—๐‘‘(๐‘›)๐‘‡๐‘€หœ๐‘›๐’—(๐‘›)๐‘‘

(๐‘Ž)

โ‰ฅ ๐’—(๐‘›)๐‘‡๐‘‘ (๐‘šmin๐‘€๐‘›))๐’—(๐‘›)๐‘‘ (=๐‘) ๐‘šmin๐’—(๐‘›)๐‘‡๐‘‘ ๐œ† ๐ผ+

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘–๐’™๐‘‡๐‘–

! ๐’—(๐‘›)๐‘‘

โ‰ฅ ๐‘šmin

"

๐œ†+

๐‘›โˆ’1

ร•

๐‘–=1

๐’™๐‘‡๐‘– ๐’—(๐‘›)

๐‘‘

2#

, (B.21)

where (a) follows from Lemma 5 and (b) from the definition of๐‘€๐‘›. Note that while the right-hand side expression in Eq. (B.21) depends upon ๐’™๐‘‡๐‘–๐’—(๐‘›)

๐‘‘ for ๐‘– < ๐‘›, Eq.

(B.16) depends upon๐’™๐‘‡๐‘– ๐’—(๐‘–)

๐‘‘ . Clearly, if๐’—(๐‘–)

๐‘‘ were constant in๐‘–, then the combination of Eq.s (B.16) and (B.21) would suffice to prove that ๐œ†(

๐‘›)

๐‘‘ โˆ’โ†’ โˆž with at least a linear average rate; however,๐’—(๐‘–)

๐‘‘ can vary with๐‘–over the space of unit vectors inR๐‘‘. This proof leverages that ๐’—(๐‘–)๐‘‘ is a unit vector, that the set of unit vectors in R๐‘‘ is compact, and that any infinite cover of a compact set has a finite subcover. In particular, for any๐œ€ >0, there exist sets๐‘†1, . . . , ๐‘†๐พ โŠ‚ R๐‘‘, ๐พ <โˆž, such that:

1. For๐’— โˆˆR๐‘‘such that||๐’—||2=1,๐’— โˆˆ๐‘†๐‘˜ for some๐‘˜ โˆˆ {1, . . . , ๐พ}, and 2. If๐’—1,๐’—2 โˆˆ๐‘†๐‘˜, then||๐’—1โˆ’๐’—2|| < ๐œ€.

It will be shown that there exists a sequence (๐‘›๐‘–) โˆˆ N such that ๐’—(๐‘›๐‘–)

๐‘‘ โˆˆ ๐‘†๐‘˜ for

fixed ๐‘˜ โˆˆ {1, . . . , ๐พ}, with the events ๐’—(๐‘›๐‘–)

๐‘‘ โˆˆ ๐‘†๐‘˜ corresponding to the indices (๐‘›๐‘–) occurring at some non-decaying frequency. Then, by appropriately choosing๐œ€, the

proof will use Eq. (B.21) and the mutual proximity of the vectors๐’—๐‘‘(๐‘›๐‘–) to show that ๐œ†(

๐‘›)

๐‘‘ increases with an at-least linear rate.

Observe that for any number of total iterations ๐‘, there exists an integer ๐‘˜ โˆˆ {1, . . . , ๐พ}such that ๐’—(๐‘–)๐‘‘ โˆˆ ๐‘†๐‘˜ during at least ๐‘๐พ iterations. Because๐พ is a constant and ๐‘๐พ is linear in๐‘, the number of iterations in which๐’—๐‘‘(๐‘–) โˆˆ๐‘†๐‘˜ is at least linear in ๐‘ for some ๐‘˜. The right-hand sum in Eq. (B.21) can then be divided according to the indices (๐‘›๐‘–) and the remaining indices:

๐œ†

(๐‘›๐‘—)

๐‘‘ โ‰ฅ ๐‘šmin

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐œ†+

๐‘›๐‘—โˆ’1

ร•

๐‘–=1

๐’™๐‘‡๐‘– ๐’—(๐‘›๐‘‘๐‘–)2๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

=๐‘šmin

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐œ†+

๐‘—โˆ’1

ร•

๐‘–=1

๐’™๐‘‡๐‘›

๐‘–๐’—(๐‘›๐‘–)

๐‘‘

2

+

๐‘›๐‘—โˆ’1

ร•

๐‘—=1;๐‘—โˆ‰{๐‘›1,๐‘›2,...,๐‘›๐‘—โˆ’1}

๐’™๐‘‡๐‘—๐’—(๐‘›๐‘–)

๐‘‘

2๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

. (B.22)

The latter sum in Eq. (B.22) is non-decreasing in ๐‘›๐‘—, as all of its terms are non- negative. In the former sum,||๐’—(๐‘›๐‘—)

๐‘‘ โˆ’๐’—(๐‘›๐‘–)

๐‘‘ ||< ๐œ€for each๐‘– โˆˆ {1, . . . , ๐‘—โˆ’1}. Defining ๐œน :=๐’—(๐‘›๐‘—)

๐‘‘ โˆ’๐’—(๐‘›๐‘–)

๐‘‘ , so that ||๐œน||2 โ‰ค ๐œ€:

๐’™๐‘‡๐‘›๐‘–๐’—๐‘‘(๐‘›๐‘–)2

= ๐’™๐‘‡๐‘›๐‘–

๐’—(๐‘›๐‘‘๐‘–) +๐œน 2 =

๐’™๐‘‡๐‘›๐‘–๐’—(๐‘›๐‘‘๐‘–) +๐’™๐‘‡๐‘›๐‘–๐œน2

โ‰ฅ

๐’™๐‘‡๐‘›๐‘–๐’—(๐‘›๐‘‘๐‘–) โˆ’

๐’™๐‘‡๐‘›๐‘–๐œน

2

. (B.23) By the Cauchy-Schwarz inequality,

๐’™๐‘‡๐‘›

๐‘–๐œน

โ‰ค ||๐œน||2โˆ— ||๐’™๐‘›๐‘–||2 โ‰ค 2๐œ€ โ„Ž, whereโ„Ž is the trajectory horizon. Because Eq. (B.16) requires thatE

h ๐’™๐‘‡๐‘›

๐‘–๐’—(๐‘›๐‘–)

๐‘‘

i

โ‰ฅ ๐‘0> 0, one can choose๐œ€small enough thatE

h

๐’™๐‘‡๐‘›๐‘–๐’—(๐‘›๐‘–)

๐‘‘

โˆ’

๐’™๐‘‡๐‘›๐‘–๐œน i

โ‰ฅ ๐‘0โˆ’2๐œ€ โ„Ž โ‰ฅ ๐‘00 > 0, and:

E

๐’™๐‘‡๐‘›๐‘–๐’—(๐‘›๐‘‘๐‘–) 2 (๐‘Ž)

โ‰ฅ E

๐’™๐‘‡๐‘›๐‘–๐’—(๐‘›๐‘‘๐‘–) โˆ’

๐’™๐‘‡๐‘›๐‘–๐œน

2 (๐‘)

โ‰ฅ E h

๐’™๐‘‡๐‘›๐‘–๐’—(๐‘›๐‘‘๐‘–) โˆ’

๐’™๐‘‡๐‘›๐‘–๐œน

i2

โ‰ฅ (๐‘00)2> 0, where (a) takes expectations of both sides of Eq. (B.23), and (b) follows from Jensenโ€™s inequality. Merging this result with Eq. (B.22) implies that๐œ†

(๐‘›๐‘—)

๐‘‘ is expected to increase at least linearly on average, according to the positive constant๐‘00, over the indices (๐‘›๐‘–). Recall that there always exists an๐‘†๐‘˜ such that the number of times when๐’—(๐‘–)๐‘‘ โˆˆ ๐‘†๐‘˜ is at least linear in the total number of iterations ๐‘. Thus, the rate at which indices (๐‘›๐‘–) occur is always (at least) linear in ๐‘ on average, and ๐œ†

(๐‘›๐‘—) ๐‘‘

increases at least linearly in๐‘ in expectation.

One can now demonstrate that lim inf

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(

๐‘–) ๐‘‘

=0 holds: the numerator of ๐›ฝ๐‘–(๐›ฟ)2

๐œ†(

๐‘–) ๐‘‘

is the square of a quantity that increases at most logarithmically in๐‘–, while the denominator

increases at least linearly in ๐‘– on average. This contradicts the assumption in Eq.

(B.20), and thereby proves that lim inf

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(

๐‘–) ๐‘‘

โˆ’1

=0 must hold.

Finally, one can leverage that lim inf

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(

๐‘–) ๐‘‘

โˆ’1

=0 to show that lim

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(

๐‘–) ๐‘‘

โˆ’1

=0. Consider the following two possible cases: 1)๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(๐‘–)

๐‘‘

โˆ’1

converges to zero in probability, and 2) ๐›ฝ๐‘–(๐›ฟ)2

๐œ†(๐‘–)

๐‘‘

โˆ’1

does not converge to zero in probability. In case 1), because convergence in probability implies convergence in distribution, the desired result holds.

In case 2), there exists some๐œ€ > 0 such that๐‘ƒ

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(

๐‘–) ๐‘‘

โˆ’1

โ‰ฅ ๐œ€

6

โˆ’โ†’ 0. In this case, one can apply the same arguments used to show that lim inf

๐‘–โˆ’โ†’โˆž

๐›ฝ๐‘–(๐›ฟ)2 ๐œ†(

๐‘–) ๐‘‘

โˆ’1

=

0, but specifically over time indices where ๐›ฝ๐‘–(๐›ฟ)2

๐œ†(๐‘–)

๐‘‘

โˆ’1

โ‰ฅ ๐œ€. Due to the non- convergence in probability, these time indices must occur at some non-decaying rate;

hence, the same analysis applies. Thus,๐œ†(

๐‘–)

๐‘‘ increases in๐‘–with at least a minimum linear average rate, while ๐›ฝ๐‘–(๐›ฟ) increases at most logarithmically in๐‘–. This violates the non-convergence assumption of case 2), resulting in a contradiction. As a result, only case 1) can hold.

Combining Lemmas 7 and 11 leads to the desired result:

Proposition 4.When running DPS in the generalized linear dueling bandit setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq.(4.13), then with probability1โˆ’๐›ฟ, the sampled utilities ๐’“หœ๐‘–1,๐’“หœ๐‘–2converge in distribution to their true values, ๐’“หœ๐‘–1,๐’“หœ๐‘–2

โˆ’โ†’๐ท ๐’“, as๐‘– โˆ’โ†’

โˆž.

Asymptotic Consistency of the Utilities for Preference-Based RL

Proposition 4 can also be extended to the preference-based RL setting; in fact, several of the previous lemmas will be re-used in this analysis. The main complication is that policies are now selected rather than actions, and each policy corresponds to a distribution over possible trajectories.

The next result enables the proof to leverage convergence in distribution of the dy- namics samples, ๐’‘หœ๐‘–1, ๐’‘หœ๐‘–2

โˆ’โ†’๐ท ๐’‘ (as guaranteed by Proposition 3), in characterizing the impact of sampled policies upon convergence of the reward model.

Lemma 12. Let๐‘“ :R๐‘†

2๐ดร—R๐‘† ๐ด โˆ’โ†’Rbe a function of transition dynamics ๐’‘ โˆˆR๐‘†

2๐ด

and reward vector ๐’“ โˆˆ R๐‘† ๐ด, ๐‘“(๐’‘,๐’“), where ๐‘“ is continuous in ๐’‘ and uniformly- continuous in ๐’“. Assume that Proposition 3 holds: ๐’‘หœ๐‘–1, ๐’‘หœ๐‘–2

โˆ’โ†’๐ท ๐’‘. Then, for any ๐›ฟ, ๐œ€ > 0, there exists๐‘–0such that for๐‘– > ๐‘–0, |๐‘“(๐’‘,๐’“) โˆ’ ๐‘“(๐’‘หœ๐‘– ๐‘—,๐’“) | < ๐œ€ for anyunit vector ๐’“and ๐‘— โˆˆ {1,2}with probability at least1โˆ’๐›ฟ.

Proof. Without loss of generality, the result is shown for๐‘— =1 (the steps are identical for ๐‘— = 1 and ๐‘— = 2). Applying Proposition 3, ๐’‘หœ๐‘–1

โˆ’โ†’๐ท ๐’‘. By continuity of ๐‘“, one can apply Fact 8 from Appendix B.1 to obtain that ๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’โ†’๐ท ๐‘“(๐’‘,๐’“) for any ๐’“. Further applying Fact 9 from Appendix B.1, ๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’โ†’๐‘ƒ ๐‘“(๐’‘,๐’“)for any๐’“. By definition of convergence in probability, given๐›ฟ, there exists๐‘–๐’“ such that for๐‘– โ‰ฅ ๐‘–๐’“:

|๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’ ๐‘“(๐’‘,๐’“) | < 1

3๐œ€with probability at least 1โˆ’๐›ฟ. (B.24) To obtain a high-probability bound that applies over all unit vectors ๐’“, one can use compactness of the set of unit vectors. Any infinite cover of a compact set has a finite subcover; in particular, for any๐›ฟ0> 0, the set of unit vectors inR๐‘‘ has a finite cover of the form {B (๐’“1, ๐›ฟ0), . . . ,B (๐’“๐พ, ๐›ฟ0)}, where {๐’“1, . . . ,๐’“๐พ} are unit vectors, andB (๐’“, ๐›ฟ0) :={๐’“0โˆˆR๐‘‘| ||๐’“0โˆ’ ๐’“||2 < ๐›ฟ0} is the๐‘‘-dimensional sphere of radius ๐›ฟ0 centered at ๐’“. Thus, there exists a finite set of unit vectorsU = {๐’“1, . . . ,๐’“๐พ} such that for any unit vector ๐’“0, ||๐’“๐‘–โˆ’ ๐’“0||2 < ๐›ฟ0for some๐‘– โˆˆ {1, . . . , ๐พ}. Because ๐‘“ is uniformly-continuous in ๐’“, for any transition dynamics ๐’‘, there exists๐›ฟ๐’‘ > 0 such that for any two unit vectors ๐’“,๐’“0such that||๐’“โˆ’๐’“0||2< ๐›ฟ๐’‘:

|๐‘“(๐’‘,๐’“) โˆ’ ๐‘“(๐’‘,๐’“0) | <

1

3๐œ€ . (B.25)

Without loss of generality, for each ๐’‘, define๐›ฟ๐’‘ :=sup๐‘ฅ such that||๐’“โˆ’ ๐’“0||2 < ๐‘ฅ implies |๐‘“(๐’‘,๐’“) โˆ’ ๐‘“(๐’‘,๐’“0) | โ‰ค 1

6๐œ€. Then, because ๐‘“ is continuous in ๐’‘, ๐›ฟ๐’‘ is also continuous in ๐’‘. Because the set of all possible transition probability vectors ๐’‘ is compact, and a continuous function over a compact set achieves its minimum value, there exists๐›ฟmin > 0 such that๐›ฟ๐’‘ โ‰ฅ ๐›ฟmin > 0 over all ๐’‘. LetUbe defined such that ๐›ฟ0โ‰ค ๐›ฟmin; then, for any unit vector ๐’“0, there exists๐’“ โˆˆ Usuch that||๐’“โˆ’๐’“0||2 < ๐›ฟmin, and thus Eq. (B.25) holds for any ๐’‘.

By Eq. (B.24), for each ๐’“๐‘— โˆˆ U, there exists๐‘–๐’“

๐‘— such that for๐‘– โ‰ฅ ๐‘–๐’“

๐‘—: ||๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’ ๐‘“(๐’‘,๐’“) ||2< 1

3๐œ€with probability at least 1โˆ’๐›ฟ. BecauseUis a finite set, there exists ๐‘–0> max{๐‘–๐’“

1, . . . , ๐‘–๐’“

๐พ}such that for ๐’“ โˆˆ Uand๐‘– > ๐‘–0:

|๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’ ๐‘“(๐’‘,๐’“) | <

1

3๐œ€for each๐’“ โˆˆ Uwith probability at least 1โˆ’๐›ฟ. (B.26)

Therefore, for any unit vector๐’“0, there exists๐’“ โˆˆ Usuch that||๐’“โˆ’๐’“0||2< ๐›ฟ0โ‰ค ๐›ฟmin, and with probability at least 1โˆ’๐›ฟfor๐‘– > ๐‘–0:

|๐‘“(๐’‘,๐’“0) โˆ’ ๐‘“(๐’‘หœ๐‘–1,๐’“0) | = |๐‘“(๐’‘,๐’“0) โˆ’ ๐‘“(๐’‘,๐’“) + ๐‘“(๐’‘,๐’“) โˆ’ ๐‘“(๐’‘หœ๐‘–1,๐’“) + ๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’ ๐‘“(๐’‘หœ๐‘–1,๐’“0) |

(๐‘Ž)

โ‰ค |๐‘“(๐’‘,๐’“0) โˆ’ ๐‘“(๐’‘,๐’“) | + |๐‘“(๐’‘,๐’“) โˆ’ ๐‘“(๐’‘หœ๐‘–1,๐’“) | + |๐‘“(๐’‘หœ๐‘–1,๐’“) โˆ’ ๐‘“(๐’‘หœ๐‘–1,๐’“0) |

(๐‘)

โ‰ค 1 3๐œ€+ 1

3๐œ€+ 1 3๐œ€=๐œ€,

where (a) holds due to the triangle inequality, and (b) holds via Eq.s (B.25) and (B.26), where the proof showed that there exists๐›ฟminsuch that 0 < ๐›ฟmin โ‰ค ๐›ฟ๐’‘for all possible transition dynamics parameters ๐’‘.

Applying Lemma 12 to the two functions ๐‘‰(๐’‘,๐’“, ๐œ‹๐‘ฃ๐‘–(๐’‘,๐’“)) = max๐œ‹0๐‘‰(๐’‘,๐’“, ๐œ‹0) and๐‘‰(๐’‘,๐’“, ๐œ‹), for any fixed policy๐œ‹, yields the following result.

Lemma 13. For any๐œ€, ๐›ฟ >0, any policy๐œ‹, and any unit reward vector๐’“, both of the following hold with probability at least1โˆ’๐›ฟfor sufficiently-large๐‘– and ๐‘— โˆˆ {1,2}:

|๐‘‰(๐’‘,๐’“, ๐œ‹) โˆ’๐‘‰(๐’‘หœ๐‘– ๐‘—,๐’“, ๐œ‹) | < ๐œ€

|๐‘‰(๐’‘,๐’“, ๐œ‹๐‘ฃ๐‘–(๐’‘,๐’“)) โˆ’๐‘‰(๐’‘หœ๐‘– ๐‘—,๐’“, ๐œ‹๐‘ฃ๐‘–(๐’‘หœ๐‘– ๐‘—,๐’“)) | < ๐œ€ .

Proof. Both statements follow by applying Lemma 12. First, consider the function ๐‘“1(๐’‘,๐’“) :=๐‘‰(๐’‘,๐’“, ๐œ‹) for a fixed policy๐œ‹. The value function๐‘‰(๐’‘,๐’“, ๐œ‹) is contin- uous in both ๐’‘ and ๐’“, and furthermore, it is linear in ๐’“. Using these observations, the proof demonstrates that๐‘‰(๐’‘,๐’“, ๐œ‹)is uniformly-continuous in๐’“. Indeed, for any linear function of the form ๐‘”(๐’›) = ๐’‚๐‘‡๐’›, for any ๐œ€0 > 0, for ๐›ฟ0 := ||๐’‚||๐œ€0 , and for any ๐’›1,๐’›2such that ||๐’›1โˆ’๐’›2|| < ๐›ฟ0:

|๐‘”(๐’›1) โˆ’๐‘”(๐’›2) | =|๐’‚๐‘‡(๐’›1โˆ’๐’›2) | โ‰ค ||๐’‚||2||๐’›1โˆ’๐’›2||2 < ||๐’‚||2๐›ฟ0=๐œ€0.

Thus, ๐‘“1 satisfies the conditions of Lemma 12 for any fixed ๐œ‹, so that for ๐‘– > ๐‘–๐œ‹,

|๐‘‰(๐’‘,๐’“, ๐œ‹) โˆ’๐‘‰(๐’‘หœ๐‘– ๐‘—,๐’“, ๐œ‹) | < ๐œ€ with probability at least 1 โˆ’๐›ฟ. Because there are finitely-many deterministic policies๐œ‹, one can set๐‘– > max๐œ‹๐‘–๐œ‹, so that the statement holds jointly over all๐œ‹.

Next, let ๐‘“2(๐’‘,๐’“) =max๐œ‹๐‘‰(๐’‘,๐’“, ๐œ‹)=๐‘‰(๐’‘,๐’“, ๐œ‹๐‘ฃ๐‘–(๐’‘,๐’“)). A maximum over finitely- many continuous functions is continuous, and a maximum over finitely-many uniformly- continuous functions is uniformly-continuous. Therefore, ๐‘“2 also satisfies the con- ditions of Lemma 12.

Dalam dokumen Applications to Exoskeleton Gait Optimization (Halaman 188-200)