B.3 Asymptotic Consistency of the Utilities in DPS
Proposition 3. Assume that DPS is executed in the preference-based RL setting, with transition dynamics modeled via a Dirichlet model, utilities modeled via either the linear or logistic link function, and utility posterior sampling distributions given in Eq. (4.13). Then, the sampled transition dynamics $\tilde{p}_{i1}, \tilde{p}_{i2}$ converge in distribution to the true dynamics $p$: $\tilde{p}_{i1}, \tilde{p}_{i2} \xrightarrow{D} p$. This consistency result also holds when removing the $\beta_i(\delta)$ factors from the distributions in Eq. (4.13).
Because the probability distribution of $\|z_i\|_2$ is fixed, there exists some fixed $c > 0$ such that with probability at least $1 - \delta$, $\|z_i\|_2 \le c$. So, for each $i$, with probability at least $1 - \delta$,
$$\|\hat{r}_{i1} - \bar{r}_i\|_{\bar{\Sigma}_i} = \beta_i(\delta)\, \|z_i\|_2 \le \beta_i(\delta)\, c.$$
The previous statement can be combined with the high-probability inequality $\|\bar{r}_i - r\|_{\bar{\Sigma}_i} \le \sigma_{\max}\, \beta_i(\delta)$ to obtain that for each $i$, with probability at least $1 - \delta$,
$$\|\hat{r}_{i1} - r\|_{\bar{\Sigma}_i} \le \|\hat{r}_{i1} - \bar{r}_i\|_{\bar{\Sigma}_i} + \|\bar{r}_i - r\|_{\bar{\Sigma}_i} \le (c + \sigma_{\max})\, \beta_i(\delta).$$
Taking squares and dividing by $\beta_i(\delta)^2$ yields that for each $i$, with probability at least $1 - \delta$,
$$\frac{1}{\beta_i(\delta)^2} (\hat{r}_{i1} - r)^T \bar{\Sigma}_i (\hat{r}_{i1} - r) \le (c + \sigma_{\max})^2.$$
By assumption, $\frac{\lambda_j^{(i)}}{\beta_i(\delta)^2} \longrightarrow \infty$ as $i \longrightarrow \infty$. Recall from Definition 7 that $v_j^{(i)}$, $j \in \{1, \ldots, n\}$, represent the eigenvectors of $\bar{\Sigma}_i$ corresponding to the eigenvalues $\lambda_j^{(i)}$. Then, with probability at least $1 - \delta$ for each $i$:
$$\frac{1}{\beta_i(\delta)^2} (\hat{r}_{i1} - r)^T \bar{\Sigma}_i (\hat{r}_{i1} - r) = \frac{1}{\beta_i(\delta)^2} (\hat{r}_{i1} - r)^T \left( \sum_{j=1}^{n} \lambda_j^{(i)} v_j^{(i)} v_j^{(i)T} \right) (\hat{r}_{i1} - r)$$
$$= \frac{1}{\beta_i(\delta)^2} \sum_{j=1}^{n} \lambda_j^{(i)} \left( (\hat{r}_{i1} - r)^T v_j^{(i)} \right)^2 \le (c + \sigma_{\max})^2. \quad \text{(B.11)}$$
Since $\lambda_j^{(i)} / \beta_i(\delta)^2 \longrightarrow \infty$ as $i \longrightarrow \infty$ for each $j$, and $\{v_j^{(i)}\}$ is an orthonormal basis, the constant bound of $(c + \sigma_{\max})^2$ in Eq. (B.11) is violated if we do not have $\hat{r}_{i1} - r \xrightarrow{D} 0$.

Eq. (B.11) must hold with probability at least $1 - \delta$ independently for each iteration $i$, with the $(1 - \delta)$-probability due entirely to randomness in the posterior sampling distribution, given by Eq. (B.9). Therefore, it follows that $\hat{r}_{i1} \xrightarrow{D} r$. The proof that $\hat{r}_{i2} \xrightarrow{D} r$ is identical.
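The concentration argument above can be sanity-checked numerically. The following sketch (all constants, including the eigenvalue schedule, `beta`, and `sigma_max`, are invented toy values, not taken from the text) draws rewards from $\mathcal{N}(\bar{r}_i, \beta_i(\delta)^2 \bar{\Sigma}_i^{-1})$ with the posterior mean placed at the edge of its confidence region, and checks that samples approach $r$ as the eigenvalues grow relative to $\beta_i(\delta)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
r_true = rng.normal(size=n)

def sample_reward(lam, beta, sigma_max):
    # Sigma_bar = lam * I, so the sampling covariance is (beta**2 / lam) * I.
    # Place the posterior mean at the edge of the confidence region:
    # ||r_bar - r||_{Sigma_bar} = sigma_max * beta.
    direction = np.ones(n) / np.sqrt(n)
    r_bar = r_true + sigma_max * beta / np.sqrt(lam) * direction
    return r_bar + beta / np.sqrt(lam) * rng.normal(size=n)

beta, sigma_max = 2.0, 1.0
errors = []
for lam in [1e0, 1e2, 1e4, 1e6]:  # lam / beta**2 -> infinity
    draws = [sample_reward(lam, beta, sigma_max) for _ in range(200)]
    errors.append(np.mean([np.linalg.norm(d - r_true) for d in draws]))
```

As the ratio grows, the mean distance to $r$ collapses, mirroring the role of the constant $(c + \sigma_{\max})^2$ bound in forcing convergence.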
The analysis will show convergence in distribution of the reward samples, $\hat{r}_{i1}, \hat{r}_{i2} \xrightarrow{D} r$, by applying Lemma 7 and demonstrating that $\frac{1}{\beta_i(\delta)^2} \lambda_j^{(i)} \longrightarrow \infty$ as $i \longrightarrow \infty$. This result is proven by contradiction: intuitively, if $\frac{1}{\beta_i(\delta)^2} \lambda_j^{(i)}$ is upper-bounded, then DPS has a lower-bounded probability of selecting policies that increase $\lambda_j^{(i)}$.

First, Lemma 8 demonstrates that under the contradiction hypothesis, there is a non-decaying probability of sampling rewards $\hat{r}_{i1}, \hat{r}_{i2}$ that are highly-aligned with the eigenvector $v_1^{(i)}$ of $\bar{\Sigma}_i^{-1}$ corresponding to its largest eigenvalue, $\left( \lambda_1^{(i)} \right)^{-1}$; here and below, the eigenvalues are indexed so that $\lambda_1^{(i)} \le \cdots \le \lambda_n^{(i)}$.

Lemma 8. Assume that for a given iteration $i$, $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} \ge \alpha$. Then for any $m > 0$, the reward samples $\hat{r}_{i1}, \hat{r}_{i2}$ satisfy:
$$P\left( \hat{r}_{i1}^T v_1^{(i)} \ge m \max_{1 < k \le n} \left| \hat{r}_{i1}^T v_k^{(i)} \right| \right) \ge p(m) > 0, \quad \text{(B.12)}$$
$$P\left( \hat{r}_{i2}^T v_1^{(i)} \le -m \max_{1 < k \le n} \left| \hat{r}_{i2}^T v_k^{(i)} \right| \right) \ge p(m) > 0, \quad \text{(B.13)}$$
where $p : \mathbb{R}^+ \longrightarrow \mathbb{R}^+$ is a continuous, monotonically-decreasing function.
Proof. Recall that the reward samples $\hat{r}_{i1}, \hat{r}_{i2}$ are drawn according to the distribution in Eq. (B.9). The proof begins by demonstrating that the reward samples can equivalently be expressed as:
$$\hat{r}_{i1} = \bar{r}_i + \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} z_{ij} v_j^{(i)}, \quad z_{ij} \sim \mathcal{N}(0, 1) \text{ i.i.d.}, \quad \text{(B.14)}$$
and similarly for $\hat{r}_{i2}$. Similarly to Eq. (B.9), the expression in Eq. (B.14) has a multivariate Gaussian distribution. One can take the expectation and covariance of Eq. (B.14) with respect to the variables $\{z_{ij}\}$ to show that they match the expressions in Eq. (B.9):
$$\mathbb{E}\left[ \bar{r}_i + \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} z_{ij} v_j^{(i)} \right] = \bar{r}_i + \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} \mathbb{E}[z_{ij}]\, v_j^{(i)} = \bar{r}_i,$$
$$\mathrm{Cov}\left[ \bar{r}_i + \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} z_{ij} v_j^{(i)} \right] \stackrel{(a)}{=} \mathbb{E}\left[ \left( \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} z_{ij} v_j^{(i)} \right) \left( \beta_i(\delta) \sum_{k=1}^{n} \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} z_{ik} v_k^{(i)} \right)^T \right]$$
$$= \beta_i(\delta)^2 \sum_{j=1}^{n} \sum_{k=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} v_j^{(i)} v_k^{(i)T}\, \mathbb{E}[z_{ij} z_{ik}] \stackrel{(b)}{=} \beta_i(\delta)^2 \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-1} v_j^{(i)} v_j^{(i)T} = \beta_i(\delta)^2\, \bar{\Sigma}_i^{-1},$$
which match the expectation and covariance in Eq. (B.9). In the above, (a) applies the definition $\mathrm{Cov}[X] = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T]$, and (b) holds because $\mathbb{E}[z_{ij} z_{ik}] = \mathrm{Cov}[z_{ij}, z_{ik}] = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta function.
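The equivalence just derived can be checked empirically. In this sketch (the random matrix standing in for $\bar{\Sigma}_i$, the value of $\beta$, and the sample count are arbitrary), rewards are drawn via the eigenbasis form of Eq. (B.14) and their empirical moments are compared against $\bar{r}_i$ and $\beta_i(\delta)^2 \bar{\Sigma}_i^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 3, 1.5

# Random symmetric positive-definite Sigma_bar and its eigendecomposition.
A = rng.normal(size=(n, n))
Sigma_bar = A @ A.T + n * np.eye(n)
lam, V = np.linalg.eigh(Sigma_bar)   # columns of V are the eigenvectors v_j
r_bar = rng.normal(size=n)

# Eq. (B.14): r_hat = r_bar + beta * sum_j lam_j^{-1/2} z_j v_j
N = 200_000
Z = rng.normal(size=(N, n))
draws = r_bar + beta * (Z / np.sqrt(lam)) @ V.T

emp_mean = draws.mean(axis=0)
emp_cov = np.cov(draws, rowvar=False)
target_cov = beta**2 * np.linalg.inv(Sigma_bar)
```

The empirical mean matches $\bar{r}_i$ and the empirical covariance matches $\beta^2 \bar{\Sigma}_i^{-1}$, as the derivation predicts.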
Next, it is shown that the probability that $\hat{r}_{i1}$ is arbitrarily-aligned with $v_1^{(i)}$ is lower-bounded above zero: that is, there exists $p : \mathbb{R}^+ \longrightarrow \mathbb{R}^+$ such that for any $m > 0$, $P\left( \hat{r}_{i1}^T v_1^{(i)} \ge m \max_{1 < k \le n} |\hat{r}_{i1}^T v_k^{(i)}| \right) \ge p(m) > 0$. This can be shown by bounding the terms $|\hat{r}_{i1}^T v_k^{(i)}|$, $k > 1$, and $\hat{r}_{i1}^T v_1^{(i)}$. Firstly, the term $|\hat{r}_{i1}^T v_k^{(i)}|$, $k > 1$, can be upper-bounded:
$$\left| \hat{r}_{i1}^T v_k^{(i)} \right| \stackrel{(a)}{=} \left| \bar{r}_i^T v_k^{(i)} + \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} z_{ij} v_j^{(i)T} v_k^{(i)} \right| \stackrel{(b)}{=} \left| \bar{r}_i^T v_k^{(i)} + \beta_i(\delta) \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} z_{ik} \right|$$
$$\le \left| \bar{r}_i^T v_k^{(i)} \right| + \beta_i(\delta) \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} |z_{ik}| \stackrel{(c)}{\le} \|\bar{r}_i\|_2 \|v_k^{(i)}\|_2 + \beta_i(\delta) \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} |z_{ik}| \stackrel{(d)}{\le} b + \beta_i(\delta) \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} |z_{ik}|,$$
where (a) applies Eq. (B.14), (b) follows from orthonormality of the eigenbasis, (c) follows from the Cauchy-Schwarz inequality, and (d) uses Lemma 4, which states that $\|\bar{r}_i\|_2 \le b$. Similarly, $\hat{r}_{i1}^T v_1^{(i)}$ can be lower-bounded:
$$\hat{r}_{i1}^T v_1^{(i)} \stackrel{(a)}{=} \bar{r}_i^T v_1^{(i)} + \beta_i(\delta) \sum_{j=1}^{n} \left( \lambda_j^{(i)} \right)^{-\frac{1}{2}} z_{ij} v_j^{(i)T} v_1^{(i)} \stackrel{(b)}{=} \bar{r}_i^T v_1^{(i)} + \beta_i(\delta) \left( \lambda_1^{(i)} \right)^{-\frac{1}{2}} z_{i1}$$
$$\ge -\left| \bar{r}_i^T v_1^{(i)} \right| + \beta_i(\delta) \left( \lambda_1^{(i)} \right)^{-\frac{1}{2}} z_{i1} \stackrel{(c)}{\ge} -\|\bar{r}_i\|_2 \|v_1^{(i)}\|_2 + \beta_i(\delta) \left( \lambda_1^{(i)} \right)^{-\frac{1}{2}} z_{i1} \stackrel{(d)}{\ge} -b + \beta_i(\delta) \left( \lambda_1^{(i)} \right)^{-\frac{1}{2}} z_{i1},$$
where as before, (a) applies Eq. (B.14), (b) follows from orthonormality of the eigenbasis, (c) follows from the Cauchy-Schwarz inequality, and (d) holds via Lemma 4.
Given these upper and lower bounds, the probability $P\left( \hat{r}_{i1}^T v_1^{(i)} \ge m \max_{1 < k \le n} |\hat{r}_{i1}^T v_k^{(i)}| \right)$ can be lower-bounded:
$$P\left( \hat{r}_{i1}^T v_1^{(i)} \ge m \max_{1 < k \le n} \left| \hat{r}_{i1}^T v_k^{(i)} \right| \right) \stackrel{(a)}{\ge} P\left( -b + \beta_i(\delta) \left( \lambda_1^{(i)} \right)^{-\frac{1}{2}} z_{i1} \ge m \max_{1 < k \le n} \left[ b + \beta_i(\delta) \left( \lambda_k^{(i)} \right)^{-\frac{1}{2}} |z_{ik}| \right] \right)$$
$$= P\left( z_{i1} \ge \frac{b \sqrt{\lambda_1^{(i)}}}{\beta_i(\delta)} + m \max_{1 < k \le n} \left[ \frac{b \sqrt{\lambda_1^{(i)}}}{\beta_i(\delta)} + \sqrt{\frac{\lambda_1^{(i)}}{\lambda_k^{(i)}}}\, |z_{ik}| \right] \right)$$
$$\stackrel{(b)}{\ge} P\left( z_{i1} \ge \frac{b}{\sqrt{\alpha}} + m \max_{1 < k \le n} \left[ \frac{b}{\sqrt{\alpha}} + |z_{ik}| \right] \right) = P\left( z_{i1} \ge \frac{b(1 + m)}{\sqrt{\alpha}} + m \max_{1 < k \le n} |z_{ik}| \right) =: p(m) > 0,$$
where (a) results from the upper and lower bounds derived above, and (b) follows because $\frac{\lambda_1^{(i)}}{\lambda_k^{(i)}} \le 1$ and $\beta_i(\delta) \left( \lambda_1^{(i)} \right)^{-\frac{1}{2}} \ge \sqrt{\alpha}$ by assumption. The function $p(m) > 0$ is continuous and decreasing in $m$.
By identical arguments, $P\left( \hat{r}_{i2}^T v_1^{(i)} \le -m \max_{1 < k \le n} |\hat{r}_{i2}^T v_k^{(i)}| \right) \ge p(m)$. Thus, for any $m > 0$ and set of eigenvectors $v_j^{(i)}$:
$$P\left( \hat{r}_{i1}^T v_1^{(i)} \ge m \max_{1 < k \le n} |\hat{r}_{i1}^T v_k^{(i)}| \right) \ge p(m) > 0, \quad P\left( \hat{r}_{i2}^T v_1^{(i)} \le -m \max_{1 < k \le n} |\hat{r}_{i2}^T v_k^{(i)}| \right) \ge p(m) > 0.$$
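The lower bound $p(m)$ can be estimated by Monte Carlo. In this sketch, the values $b = 1$, $\alpha = 4$, $n = 4$, and the grid of $m$ values are arbitrary illustrations, not quantities fixed by the proof:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_estimate(m, b=1.0, alpha=4.0, n=4, trials=200_000):
    # p(m) = P( z_1 >= b(1+m)/sqrt(alpha) + m * max_{k>1} |z_k| ),
    # with z_1, ..., z_n i.i.d. standard normal (Lemma 8's final bound).
    z = rng.normal(size=(trials, n))
    threshold = b * (1 + m) / np.sqrt(alpha) + m * np.abs(z[:, 1:]).max(axis=1)
    return (z[:, 0] >= threshold).mean()

estimates = [p_estimate(m) for m in (0.5, 1.0, 2.0)]
```

Consistent with the lemma, the estimates are strictly positive and decrease as the required alignment factor $m$ grows.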
This alignment between the sampled rewards $\hat{r}_{i1}, \hat{r}_{i2}$ and $v_1^{(i)}$ can also be expressed as follows, by taking $m$ high enough:
Lemma 9. Assume that for a given iteration $i$, $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} \ge \alpha$. Then, for any $\varepsilon > 0$, the following events have nonzero, uniformly-lower-bounded probability:
$$\left\| \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} - v_1^{(i)} \right\|_2 < \varepsilon \quad \text{and} \quad \left\| \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} - \left( -v_1^{(i)} \right) \right\|_2 < \varepsilon.$$
Proof. By Lemma 8, Eq.s (B.12) and (B.13) both hold. The events in Eq.s (B.12) and (B.13), $\{ \hat{r}_{i1}^T v_1^{(i)} \ge m \max_{1 < k \le n} |\hat{r}_{i1}^T v_k^{(i)}| \}$ and $\{ \hat{r}_{i2}^T v_1^{(i)} \le -m \max_{1 < k \le n} |\hat{r}_{i2}^T v_k^{(i)}| \}$, will be referred to as events $A(m)$ and $B(m)$, respectively. From Lemma 8, $A(m)$ and $B(m)$ have positive, lower-bounded probability for any $m$. The proof argues that regardless of the specific eigenbasis, the values of $m$ required for the result to hold have some fixed upper bound.

First, note that under events $A(m)$ and $B(m)$, as $m \longrightarrow \infty$, $\frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} \longrightarrow v_1^{(i)}$ and $\frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} \longrightarrow -v_1^{(i)}$. Let $\varepsilon > 0$. Under event $A(m)$ for sufficiently-large $m$, $\left\| \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} - v_1^{(i)} \right\|_2 < \varepsilon$. Define $m_{\min,1}(\varepsilon, v_1^{(i)}, \ldots, v_n^{(i)})$ as the minimum value of $m$ such that $A(m)$ implies $\left\| \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} - v_1^{(i)} \right\|_2 \le \frac{\varepsilon}{2}$, given the eigenbasis $\{v_1^{(i)}, \ldots, v_n^{(i)}\}$. Because the inequality defining $A(m)$ is continuous in $m$, $\hat{r}_{i1}$, and the eigenbasis $\{v_1^{(i)}, \ldots, v_n^{(i)}\}$, the function $m_{\min,1}(\varepsilon, v_1^{(i)}, \ldots, v_n^{(i)})$ is also continuous in the eigenbasis $\{v_1^{(i)}, \ldots, v_n^{(i)}\}$. Because $m_{\min,1}(\varepsilon, v_1^{(i)}, \ldots, v_n^{(i)})$ is positive for all $\{v_1^{(i)}, \ldots, v_n^{(i)}\}$, and the set of all eigenbases $\{v_1^{(i)}, \ldots, v_n^{(i)}\}$ is compact, there exists $m_{\min,1}(\varepsilon)$ such that for any eigenbasis, if $A(m)$ holds for $m \ge m_{\min,1}(\varepsilon)$, then $\left\| \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} - v_1^{(i)} \right\|_2 < \varepsilon$.

By the same arguments, there exists $m_{\min,2}(\varepsilon)$ such that for any eigenbasis, if $B(m)$ holds for $m \ge m_{\min,2}(\varepsilon)$, then $\left\| \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} - \left( -v_1^{(i)} \right) \right\|_2 < \varepsilon$. Taking $m_{\min}(\varepsilon) := \max\{ m_{\min,1}(\varepsilon), m_{\min,2}(\varepsilon) \}$, then for any $m \ge m_{\min}(\varepsilon)$, under events $A(m)$ and $B(m)$, both $\left\| \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} - v_1^{(i)} \right\|_2 < \varepsilon$ and $\left\| \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} - \left( -v_1^{(i)} \right) \right\|_2 < \varepsilon$ hold.
The next step shows that given sampled rewards $\hat{r}_{i1}, \hat{r}_{i2}$ that are highly-aligned with the eigenvector $v_1^{(i)}$ of $\bar{\Sigma}_i$ as in Lemma 9, there is a lower-bounded probability of sampling actions that have non-zero projection onto this eigenvector.

Importantly, for each possible eigenvector $v_1^{(i)}$, the analysis will assume that there exist two actions $x_1, x_2 \in \mathcal{A}$ such that:
$$\left| (x_2 - x_1)^T v_1^{(i)} \right| \ge \kappa > 0, \quad \text{(B.15)}$$
where the constant $\kappa$ is independent of the eigenvector $v_1^{(i)}$. If this is not satisfied, then $x^T v_1^{(i)}$ is the same (or arbitrarily-similar) for all actions $x \in \mathcal{A}$. In this case, the eigenvector $v_1^{(i)}$ lies along a dimension that is irrelevant for learning the utilities $r$. The corresponding eigenvalue $\lambda_1^{(i)}$ is necessarily 0, since all vectors $x_j$ are orthogonal to $v_1^{(i)}$. Thus, the learning problem would remain equivalent if $\mathcal{A}$ were reduced to a lower-dimensional subspace to which $v_1^{(i)}$ is orthogonal. Therefore, without loss of generality, this proof assumes that no such eigenvectors exist, that is, Eq. (B.15) is satisfied for all eigenvectors $v_1^{(i)}$ of $\bar{\Sigma}_i$.

Lemma 10 (Generalized linear dueling bandit). Assume that there exists $i_0$ such that for $i > i_0$, $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} \ge \alpha$. Then, there exists a constant $c_0 > 0$ such that for $i > i_0$:
$$\mathbb{E}\left[ \left| x_i^T v_1^{(i)} \right| \right] \ge c_0 > 0, \quad \text{(B.16)}$$
where $x_i := x_{i1} - x_{i2}$, and $c_0 > 0$ depends only on the true utility parameters $r$, so that in particular, Eq. (B.16) holds for any eigenvector $v_1^{(i)}$.
Proof. By Lemma 9, for any $\varepsilon > 0$, the following events have nonzero, uniformly-lower-bounded probability: $\left\| \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} - v_1^{(i)} \right\|_2 < \varepsilon$ and $\left\| \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} - \left( -v_1^{(i)} \right) \right\|_2 < \varepsilon$. Recall that in the bandit setting, actions are selected as follows in each iteration $i$:
$$x_{i1} = \arg\sup_{x \in \mathcal{A}} x^T \hat{r}_{i1}, \quad \text{and} \quad x_{i2} = \arg\sup_{x \in \mathcal{A}} x^T \hat{r}_{i2}. \quad \text{(B.17)}$$
Note that for any $x \in \mathcal{A}$ and unit vector $v$, $f_1(v) := x^T v$ is uniformly-continuous in $v$: if $\|v_1 - v_2\|_2 \le \delta'$, then
$$|f_1(v_1) - f_1(v_2)| \le |x^T (v_1 - v_2)| \le \|x\|_2 \|v_1 - v_2\|_2 \le L \delta',$$
where $\|x\|_2 \le L$ for all $x \in \mathcal{A}$. Therefore, for any $v_1^{(i)}$ and $\varepsilon_0 > 0$, there exists $\varepsilon_1$ such that when $\varepsilon < \varepsilon_1$:
$$\left| x_{i1}^T v_1^{(i)} - x_{i1}^T \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} \right| < \varepsilon_0, \quad \text{and} \quad \left| x_{i2}^T \left( -v_1^{(i)} \right) - x_{i2}^T \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} \right| < \varepsilon_0. \quad \text{(B.18)}$$
Similarly, for any unit vector $v$, $f_2(v) := \sup_{x \in \mathcal{A}} x^T v$ is uniformly-continuous in $v$, since a continuous function over a compact set is uniformly-continuous, and the set of all unit vectors is compact. Therefore, for any $v_1^{(i)}$ and $\varepsilon_0 > 0$, there exists $\varepsilon_2$ such that when $\varepsilon < \varepsilon_2$:
$$\left| \sup_{x \in \mathcal{A}} x^T v_1^{(i)} - \sup_{x \in \mathcal{A}} x^T \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} \right| < \varepsilon_0, \quad \text{and} \quad \left| \sup_{x \in \mathcal{A}} x^T \left( -v_1^{(i)} \right) - \sup_{x \in \mathcal{A}} x^T \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} \right| < \varepsilon_0. \quad \text{(B.19)}$$
Letting $\varepsilon \le \min\{\varepsilon_1, \varepsilon_2\}$, $\left| x_i^T v_1^{(i)} \right|$ can be lower-bounded as follows, where $\kappa$ is defined as in Eq. (B.15):
$$0 < \kappa \le \sup_{x_1, x_2 \in \mathcal{A}} \left| (x_2 - x_1)^T v_1^{(i)} \right| = \sup_{x \in \mathcal{A}} x^T v_1^{(i)} - \inf_{x \in \mathcal{A}} x^T v_1^{(i)} = \sup_{x \in \mathcal{A}} x^T v_1^{(i)} + \sup_{x \in \mathcal{A}} x^T \left( -v_1^{(i)} \right)$$
$$\stackrel{(a)}{\le} 2\varepsilon_0 + \sup_{x \in \mathcal{A}} x^T \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} + \sup_{x \in \mathcal{A}} x^T \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2} \stackrel{(b)}{=} 2\varepsilon_0 + x_{i1}^T \frac{\hat{r}_{i1}}{\|\hat{r}_{i1}\|_2} + x_{i2}^T \frac{\hat{r}_{i2}}{\|\hat{r}_{i2}\|_2}$$
$$\stackrel{(c)}{\le} 4\varepsilon_0 + x_{i1}^T v_1^{(i)} + x_{i2}^T \left( -v_1^{(i)} \right) = 4\varepsilon_0 + \left| x_i^T v_1^{(i)} \right|,$$
where (a), (b), and (c) apply Eq.s (B.19), (B.17), and (B.18), respectively.
Let $\varepsilon_0 < \frac{\kappa}{8}$. Then, $\left| x_i^T v_1^{(i)} \right| \ge \frac{\kappa}{2} > 0$ when the events of Lemma 9 hold. Because the latter have a non-decaying probability of occurrence $p > 0$:
$$\mathbb{E}\left[ \left| x_i^T v_1^{(i)} \right| \right] \ge \frac{p \kappa}{8} = c_0 > 0.$$
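The argmax-alignment mechanism behind this bound can be illustrated in a toy two-dimensional bandit (a sketch; the circular action set, the noise level, and the assertion threshold are invented for illustration): when the two sampled reward vectors are nearly aligned with $\pm v$, the selected actions' difference projects strongly onto $v$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite action set: points on the unit circle, so Eq. (B.15) holds with
# kappa close to 2 for every unit vector v.
angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
actions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

projections = []
for _ in range(100):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    # Sampled rewards nearly aligned with +v and -v (the events of Lemma 9).
    r1 = v + 0.05 * rng.normal(size=2)
    r2 = -v + 0.05 * rng.normal(size=2)
    x1 = actions[np.argmax(actions @ r1)]   # Eq. (B.17)
    x2 = actions[np.argmax(actions @ r2)]
    projections.append(abs((x1 - x2) @ v))

min_proj = min(projections)
```

Every selected pair has $|(x_{i1} - x_{i2})^T v|$ well above zero, matching the role of $c_0$ in Eq. (B.16).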
With this result, one can finally prove the desired contradiction.
Lemma 11. As $i \longrightarrow \infty$, $\frac{\beta_i(\delta)^2}{\lambda_1^{(i)}} \xrightarrow{D} 0$, where $\lambda_1^{(i)}$ is the minimum eigenvalue of $\bar{\Sigma}_i$.
Proof. First, one can show via contradiction that $\liminf_{i \to \infty} \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} = 0$. Assume that:
$$\liminf_{i \to \infty} \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} = 2\alpha > 0. \quad \text{(B.20)}$$
If Eq. (B.20) is true, then there exists $i_0$ such that for all $i \ge i_0$, $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} \ge \alpha$. The following assumes that $i \ge i_0$. Since $\beta_i(\delta)$ increases at most logarithmically in $i$, it suffices to show that $\lambda_1^{(i)}$ increases at least linearly on average to achieve a contradiction with Eq. (B.20).
Under the contradiction hypothesis, Lemmas 8 and 10 both hold. Due to Lemma 10, DPS will infinitely often, and at a non-decaying rate, sample pairs $x_{i1}, x_{i2}$ such that $\left| x_i^T v_1^{(i)} \right|$ is lower-bounded away from zero. At iteration $i$, this proof analyzes the effect of this guarantee upon $\lambda_1^{(i)}$. Using that $\lambda_1^{(i)}$ corresponds to the eigenvector $v_1^{(i)}$ of $\bar{\Sigma}_i$:
$$\lambda_1^{(i)} = v_1^{(i)T} \bar{\Sigma}_i v_1^{(i)} \stackrel{(a)}{\ge} v_1^{(i)T} \left( \sigma_{\min} M_i \right) v_1^{(i)} \stackrel{(b)}{=} \sigma_{\min}\, v_1^{(i)T} \left( \eta I + \sum_{j=1}^{i-1} x_j x_j^T \right) v_1^{(i)} \ge \sigma_{\min} \left[ \eta + \sum_{j=1}^{i-1} \left( x_j^T v_1^{(i)} \right)^2 \right], \quad \text{(B.21)}$$
where (a) follows from Lemma 5 and (b) from the definition of $M_i$. Note that while the right-hand side expression in Eq. (B.21) depends upon $x_j^T v_1^{(i)}$ for $j < i$, Eq. (B.16) depends upon $x_i^T v_1^{(i)}$. Clearly, if $v_1^{(i)}$ were constant in $i$, then the combination of Eq.s (B.16) and (B.21) would suffice to prove that $\lambda_1^{(i)} \longrightarrow \infty$ with at least a linear average rate; however, $v_1^{(i)}$ can vary with $i$ over the space of unit vectors in $\mathbb{R}^n$.

This proof leverages that $v_1^{(i)}$ is a unit vector, that the set of unit vectors in $\mathbb{R}^n$ is compact, and that any infinite cover of a compact set has a finite subcover. In particular, for any $\varepsilon > 0$, there exist sets $S_1, \ldots, S_K \subset \mathbb{R}^n$, $K < \infty$, such that:
1. For $v \in \mathbb{R}^n$ such that $\|v\|_2 = 1$, $v \in S_k$ for some $k \in \{1, \ldots, K\}$, and
2. If $v_1, v_2 \in S_k$, then $\|v_1 - v_2\|_2 < \varepsilon$.
It will be shown that there exists a sequence $(i_a) \subset \mathbb{N}$ such that $v_1^{(i_a)} \in S_k$ for fixed $k \in \{1, \ldots, K\}$, with the events $v_1^{(i_a)} \in S_k$ corresponding to the indices $(i_a)$ occurring at some non-decaying frequency. Then, by appropriately choosing $\varepsilon$, the proof will use Eq. (B.21) and the mutual proximity of the vectors $v_1^{(i_a)}$ to show that $\lambda_1^{(i)}$ increases with an at-least linear rate.

Observe that for any number of total iterations $i$, there exists an integer $k \in \{1, \ldots, K\}$ such that $v_1^{(i)} \in S_k$ during at least $\frac{i}{K}$ iterations. Because $K$ is a constant and $\frac{i}{K}$ is linear in $i$, the number of iterations in which $v_1^{(i)} \in S_k$ is at least linear in $i$ for some $k$. The right-hand sum in Eq. (B.21) can then be divided according to the indices $(i_a)$ and the remaining indices:
$$\lambda_1^{(i_a)} \ge \sigma_{\min} \left[ \eta + \sum_{j=1}^{i_a - 1} \left( x_j^T v_1^{(i_a)} \right)^2 \right] = \sigma_{\min} \left[ \eta + \sum_{b=1}^{a-1} \left( x_{i_b}^T v_1^{(i_a)} \right)^2 + \sum_{j=1;\; j \notin \{i_1, i_2, \ldots, i_{a-1}\}}^{i_a - 1} \left( x_j^T v_1^{(i_a)} \right)^2 \right]. \quad \text{(B.22)}$$
The latter sum in Eq. (B.22) is non-decreasing in $i_a$, as all of its terms are non-negative. In the former sum, $\left\| v_1^{(i_a)} - v_1^{(i_b)} \right\|_2 < \varepsilon$ for each $b \in \{1, \ldots, a-1\}$. Defining $\nu := v_1^{(i_a)} - v_1^{(i_b)}$, so that $\|\nu\|_2 \le \varepsilon$:
$$\left( x_{i_b}^T v_1^{(i_a)} \right)^2 = \left( x_{i_b}^T \left( v_1^{(i_b)} + \nu \right) \right)^2 = \left( x_{i_b}^T v_1^{(i_b)} + x_{i_b}^T \nu \right)^2 \ge \left( \left| x_{i_b}^T v_1^{(i_b)} \right| - \left| x_{i_b}^T \nu \right| \right)^2. \quad \text{(B.23)}$$
By the Cauchy-Schwarz inequality, $\left| x_{i_b}^T \nu \right| \le \|\nu\|_2 \left\| x_{i_b} \right\|_2 \le 2 \varepsilon h$, where $h$ is the trajectory horizon. Because Eq. (B.16) requires that $\mathbb{E}\left[ \left| x_{i_b}^T v_1^{(i_b)} \right| \right] \ge c_0 > 0$, one can choose $\varepsilon$ small enough that $\mathbb{E}\left[ \left| x_{i_b}^T v_1^{(i_b)} \right| - \left| x_{i_b}^T \nu \right| \right] \ge c_0 - 2\varepsilon h \ge c_0' > 0$, and:
$$\mathbb{E}\left[ \left( x_{i_b}^T v_1^{(i_a)} \right)^2 \right] \stackrel{(a)}{\ge} \mathbb{E}\left[ \left( \left| x_{i_b}^T v_1^{(i_b)} \right| - \left| x_{i_b}^T \nu \right| \right)^2 \right] \stackrel{(b)}{\ge} \mathbb{E}\left[ \left| x_{i_b}^T v_1^{(i_b)} \right| - \left| x_{i_b}^T \nu \right| \right]^2 \ge (c_0')^2 > 0,$$
where (a) takes expectations of both sides of Eq. (B.23), and (b) follows from Jensen's inequality. Merging this result with Eq. (B.22) implies that $\lambda_1^{(i_a)}$ is expected to increase at least linearly on average, according to the positive constant $(c_0')^2$, over the indices $(i_a)$. Recall that there always exists an $S_k$ such that the number of times when $v_1^{(i)} \in S_k$ is at least linear in the total number of iterations $i$. Thus, the rate at which indices $(i_a)$ occur is always (at least) linear in $i$ on average, and $\lambda_1^{(i_a)}$ increases at least linearly in $i$ in expectation.
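The linear-growth mechanism of Eq. (B.21) can be simulated directly (a sketch; the values of $\eta$, $c_0$, the dimension, and the noise scale are invented): repeatedly adding rank-one terms $x_j x_j^T$ whose projection onto the current minimum-eigenvalue eigenvector is bounded below pushes the minimum eigenvalue of $M_i = \eta I + \sum_j x_j x_j^T$ upward at a roughly linear rate.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eta, c0 = 3, 1.0, 0.5
M = eta * np.eye(n)
lam_min_history = []

for i in range(200):
    lam, V = np.linalg.eigh(M)
    v_min = V[:, 0]  # eigenvector of the current minimum eigenvalue
    # Any x with (x^T v_min)^2 near c0 suffices; add noise in other directions.
    x = np.sqrt(c0) * v_min + 0.1 * rng.normal(size=n)
    M = M + np.outer(x, x)
    lam_min_history.append(np.linalg.eigvalsh(M)[0])
```

Because each update is positive semi-definite, the minimum eigenvalue never decreases, and targeting it each round yields growth at a rate of roughly $c_0 / n$ per iteration.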
One can now demonstrate that $\liminf_{i \to \infty} \frac{\beta_i(\delta)^2}{\lambda_1^{(i)}} = 0$ holds: the numerator of $\frac{\beta_i(\delta)^2}{\lambda_1^{(i)}}$ is the square of a quantity that increases at most logarithmically in $i$, while the denominator increases at least linearly in $i$ on average. This contradicts the assumption in Eq. (B.20), and thereby proves that $\liminf_{i \to \infty} \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} = 0$ must hold.
Finally, one can leverage that $\liminf_{i \to \infty} \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} = 0$ to show that $\lim_{i \to \infty} \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} = 0$. Consider the following two possible cases: 1) $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1}$ converges to zero in probability, and 2) $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1}$ does not converge to zero in probability. In case 1), because convergence in probability implies convergence in distribution, the desired result holds.
In case 2), there exists some $\varepsilon > 0$ such that $P\left( \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} \ge \varepsilon \right) \not\longrightarrow 0$. In this case, one can apply the same arguments used to show that $\liminf_{i \to \infty} \beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} = 0$, but specifically over time indices where $\beta_i(\delta)^2 \left( \lambda_1^{(i)} \right)^{-1} \ge \varepsilon$. Due to the non-convergence in probability, these time indices must occur at some non-decaying rate; hence, the same analysis applies. Thus, $\lambda_1^{(i)}$ increases in $i$ with at least a minimum linear average rate, while $\beta_i(\delta)$ increases at most logarithmically in $i$. This violates the non-convergence assumption of case 2), resulting in a contradiction. As a result, only case 1) can hold.
Combining Lemmas 7 and 11 leads to the desired result:

Proposition 4. When running DPS in the generalized linear dueling bandit setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq. (4.13), then with probability $1 - \delta$, the sampled utilities $\hat{r}_{i1}, \hat{r}_{i2}$ converge in distribution to their true values, $\hat{r}_{i1}, \hat{r}_{i2} \xrightarrow{D} r$, as $i \longrightarrow \infty$.
Asymptotic Consistency of the Utilities for Preference-Based RL
Proposition 4 can also be extended to the preference-based RL setting; in fact, several of the previous lemmas will be re-used in this analysis. The main complication is that policies are now selected rather than actions, and each policy corresponds to a distribution over possible trajectories.
The next result enables the proof to leverage convergence in distribution of the dynamics samples, $\tilde{p}_{i1}, \tilde{p}_{i2} \xrightarrow{D} p$ (as guaranteed by Proposition 3), in characterizing the impact of sampled policies upon convergence of the reward model.
Lemma 12. Let $f : \mathbb{R}^{S^2 A} \times \mathbb{R}^{SA} \longrightarrow \mathbb{R}$ be a function of transition dynamics $p \in \mathbb{R}^{S^2 A}$ and reward vector $r \in \mathbb{R}^{SA}$, $f(p, r)$, where $f$ is continuous in $p$ and uniformly-continuous in $r$. Assume that Proposition 3 holds: $\tilde{p}_{i1}, \tilde{p}_{i2} \xrightarrow{D} p$. Then, for any $\delta, \varepsilon > 0$, there exists $i_0$ such that for $i > i_0$, $|f(p, v) - f(\tilde{p}_{ij}, v)| < \varepsilon$ for any unit vector $v$ and $j \in \{1, 2\}$ with probability at least $1 - \delta$.

Proof. Without loss of generality, the result is shown for $j = 1$ (the steps are identical for $j = 1$ and $j = 2$). Applying Proposition 3, $\tilde{p}_{i1} \xrightarrow{D} p$. By continuity of $f$, one can apply Fact 8 from Appendix B.1 to obtain that $f(\tilde{p}_{i1}, v) \xrightarrow{D} f(p, v)$ for any $v$. Further applying Fact 9 from Appendix B.1, $f(\tilde{p}_{i1}, v) \xrightarrow{P} f(p, v)$ for any $v$. By definition of convergence in probability, given $\delta$, there exists $i_v$ such that for $i \ge i_v$:
$$|f(\tilde{p}_{i1}, v) - f(p, v)| < \tfrac{1}{3}\varepsilon \text{ with probability at least } 1 - \delta. \quad \text{(B.24)}$$
To obtain a high-probability bound that applies over all unit vectors $v$, one can use compactness of the set of unit vectors. Any infinite cover of a compact set has a finite subcover; in particular, for any $\delta_0 > 0$, the set of unit vectors in $\mathbb{R}^{SA}$ has a finite cover of the form $\{\mathcal{B}(v_1, \delta_0), \ldots, \mathcal{B}(v_K, \delta_0)\}$, where $\{v_1, \ldots, v_K\}$ are unit vectors, and $\mathcal{B}(v, \delta_0) := \{v' \in \mathbb{R}^{SA} \mid \|v' - v\|_2 < \delta_0\}$ is the ball of radius $\delta_0$ centered at $v$. Thus, there exists a finite set of unit vectors $\mathcal{U} = \{v_1, \ldots, v_K\}$ such that for any unit vector $v'$, $\|v_k - v'\|_2 < \delta_0$ for some $k \in \{1, \ldots, K\}$. Because $f$ is uniformly-continuous in $r$, for any transition dynamics $p$, there exists $\delta_p > 0$ such that for any two unit vectors $v, v'$ such that $\|v - v'\|_2 < \delta_p$:
$$|f(p, v) - f(p, v')| < \tfrac{1}{3}\varepsilon. \quad \text{(B.25)}$$
Without loss of generality, for each $p$, define $\delta_p := \sup x$ such that $\|v - v'\|_2 < x$ implies $|f(p, v) - f(p, v')| \le \tfrac{1}{6}\varepsilon$. Then, because $f$ is continuous in $p$, $\delta_p$ is also continuous in $p$. Because the set of all possible transition probability vectors $p$ is compact, and a continuous function over a compact set achieves its minimum value, there exists $\delta_{\min} > 0$ such that $\delta_p \ge \delta_{\min} > 0$ over all $p$. Let $\mathcal{U}$ be defined such that $\delta_0 \le \delta_{\min}$; then, for any unit vector $v'$, there exists $v \in \mathcal{U}$ such that $\|v - v'\|_2 < \delta_{\min}$, and thus Eq. (B.25) holds for any $p$.
By Eq. (B.24), for each $v_k \in \mathcal{U}$, there exists $i_{v_k}$ such that for $i \ge i_{v_k}$: $|f(\tilde{p}_{i1}, v_k) - f(p, v_k)| < \tfrac{1}{3}\varepsilon$ with probability at least $1 - \delta$. Because $\mathcal{U}$ is a finite set, there exists $i_0 > \max\{i_{v_1}, \ldots, i_{v_K}\}$ such that for $v \in \mathcal{U}$ and $i > i_0$:
$$|f(\tilde{p}_{i1}, v) - f(p, v)| < \tfrac{1}{3}\varepsilon \text{ for each } v \in \mathcal{U} \text{ with probability at least } 1 - \delta. \quad \text{(B.26)}$$
Therefore, for any unit vector $v'$, there exists $v \in \mathcal{U}$ such that $\|v - v'\|_2 < \delta_0 \le \delta_{\min}$, and with probability at least $1 - \delta$ for $i > i_0$:
$$|f(p, v') - f(\tilde{p}_{i1}, v')| = |f(p, v') - f(p, v) + f(p, v) - f(\tilde{p}_{i1}, v) + f(\tilde{p}_{i1}, v) - f(\tilde{p}_{i1}, v')|$$
$$\stackrel{(a)}{\le} |f(p, v') - f(p, v)| + |f(p, v) - f(\tilde{p}_{i1}, v)| + |f(\tilde{p}_{i1}, v) - f(\tilde{p}_{i1}, v')| \stackrel{(b)}{\le} \tfrac{1}{3}\varepsilon + \tfrac{1}{3}\varepsilon + \tfrac{1}{3}\varepsilon = \varepsilon,$$
where (a) holds due to the triangle inequality, and (b) holds via Eq.s (B.25) and (B.26), where the proof showed that there exists $\delta_{\min}$ such that $0 < \delta_{\min} \le \delta_p$ for all possible transition dynamics parameters $p$.
Applying Lemma 12 to the two functions $f(p, r) = V(p, r, \pi^*(p, r)) = \max_{\pi'} V(p, r, \pi')$ and $f(p, r) = V(p, r, \pi)$, for any fixed policy $\pi$, yields the following result.
Lemma 13. For any $\varepsilon, \delta > 0$, any policy $\pi$, and any unit reward vector $v$, both of the following hold with probability at least $1 - \delta$ for sufficiently-large $i$ and $j \in \{1, 2\}$:
$$|V(p, v, \pi) - V(\tilde{p}_{ij}, v, \pi)| < \varepsilon,$$
$$|V(p, v, \pi^*(p, v)) - V(\tilde{p}_{ij}, v, \pi^*(\tilde{p}_{ij}, v))| < \varepsilon.$$
Proof. Both statements follow by applying Lemma 12. First, consider the function $f_1(p, v) := V(p, v, \pi)$ for a fixed policy $\pi$. The value function $V(p, v, \pi)$ is continuous in both $p$ and $v$, and furthermore, it is linear in $v$. Using these observations, the proof demonstrates that $V(p, v, \pi)$ is uniformly-continuous in $v$. Indeed, for any linear function of the form $g(v) = c^T v$, for any $\varepsilon_0 > 0$, for $\delta_0 := \frac{\varepsilon_0}{\|c\|_2}$, and for any $v_1, v_2$ such that $\|v_1 - v_2\|_2 < \delta_0$:
$$|g(v_1) - g(v_2)| = |c^T (v_1 - v_2)| \le \|c\|_2 \|v_1 - v_2\|_2 < \|c\|_2 \delta_0 = \varepsilon_0.$$
Thus, $f_1$ satisfies the conditions of Lemma 12 for any fixed $\pi$, so that for $i > i_\pi$, $|V(p, v, \pi) - V(\tilde{p}_{ij}, v, \pi)| < \varepsilon$ with probability at least $1 - \delta$. Because there are finitely-many deterministic policies $\pi$, one can set $i > \max_\pi i_\pi$, so that the statement holds jointly over all $\pi$.
Next, let $f_2(p, v) = \max_\pi V(p, v, \pi) = V(p, v, \pi^*(p, v))$. A maximum over finitely-many continuous functions is continuous, and a maximum over finitely-many uniformly-continuous functions is uniformly-continuous. Therefore, $f_2$ also satisfies the conditions of Lemma 12.
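The continuity properties invoked here can be checked numerically on a small random finite-horizon MDP (a sketch; the MDP sizes, horizon, and perturbation magnitude are invented): slightly perturbing the transition dynamics perturbs both the fixed-policy value and the optimal value only slightly, which is the behavior Lemma 13 formalizes.

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, h = 4, 3, 10

def normalize(P):
    return P / P.sum(axis=2, keepdims=True)

def value(P, r, policy=None):
    # Finite-horizon value from state 0; greedy-optimal if no policy is given.
    V = np.zeros(S)
    for _ in range(h):
        Q = r + np.einsum('sap,p->sa', P, V)
        V = Q.max(axis=1) if policy is None else Q[np.arange(S), policy]
    return V[0]

P = normalize(rng.random((S, A, S)))
r = rng.random((S, A))
pi = rng.integers(0, A, size=S)

P_tilde = normalize(P + 1e-5 * rng.random((S, A, S)))  # nearby dynamics
d_fixed = abs(value(P, r, pi) - value(P_tilde, r, pi))
d_opt = abs(value(P, r) - value(P_tilde, r))
```

Both differences stay far below the scale of the values themselves, matching the two inequalities of Lemma 13 for dynamics close to $p$.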