B.1 Facts about Convergence in Distribution
Before proceeding with the asymptotic consistency proofs, two facts about convergence in distribution are reviewed; these will be applied later.
Recall that for a random variable $X$ and a sequence of random variables $(X_n)$, $n \in \mathbb{N}$, $X_n \xrightarrow{D} X$ denotes that $X_n$ converges to $X$ in distribution, while $X_n \xrightarrow{P} X$ denotes that $X_n$ converges to $X$ in probability.
Fact 8 (Billingsley, 1968). For random variables $X, X_n \in \mathbb{R}^d$, where $n \in \mathbb{N}$, and any continuous function $f : \mathbb{R}^d \rightarrow \mathbb{R}$, if $X_n \xrightarrow{D} X$, then $f(X_n) \xrightarrow{D} f(X)$.
Fact 9 (Billingsley, 1968). For random variables $X_n \in \mathbb{R}^d$, $n \in \mathbb{N}$, and a constant vector $\boldsymbol{c} \in \mathbb{R}^d$, $X_n \xrightarrow{D} \boldsymbol{c}$ is equivalent to $X_n \xrightarrow{P} \boldsymbol{c}$. Convergence in probability means that for any $\delta > 0$, $P(\|X_n - \boldsymbol{c}\|_2 \geq \delta) \longrightarrow 0$ as $n \longrightarrow \infty$.
B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference-Based RL Setting
Let $\boldsymbol{p}(i)$, $\tilde{\boldsymbol{p}}(i)$, $\hat{\boldsymbol{p}}(i)$, and $\hat{\boldsymbol{p}}_0(i)$ denote the true, sampled, posterior mean, and maximum-likelihood transition dynamics parameters, respectively (hiding the dependency on the DPS episode for the latter three quantities); thus, $[\boldsymbol{p}(i)]_j$ denotes the true probability of transitioning from state-action pair $\hat{s}_i$ to the $j$th state, and analogously for the $j$th elements of $\tilde{\boldsymbol{p}}(i)$, $\hat{\boldsymbol{p}}(i)$, and $\hat{\boldsymbol{p}}_0(i)$. Then, from the Dirichlet model,
$$[\hat{\boldsymbol{p}}(i)]_j = \frac{n_{ij} + \alpha_{ij,0}}{n_i + \sum_{j'=1}^{S} \alpha_{ij',0}},$$
where $n_i$ is the number of visits to $\hat{s}_i$, $n_{ij}$ is the number of observed transitions from $\hat{s}_i$ to state $j$, and the Dirichlet prior over $\boldsymbol{p}(i)$ has parameters $[\alpha_{i1,0}, \ldots, \alpha_{iS,0}]^T$ and mean $\left(\sum_{j'=1}^{S}\alpha_{ij',0}\right)^{-1}[\alpha_{i1,0}, \ldots, \alpha_{iS,0}]^T$, for user-defined hyperparameters $\alpha_{ij,0} > 0$. Meanwhile, the maximum-likelihood estimate is given by $[\hat{\boldsymbol{p}}_0(i)]_j = \frac{n_{ij}}{\max(n_i, 1)}$; this is equivalent to $[\hat{\boldsymbol{p}}(i)]_j$, except with the prior parameters set to zero.
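For illustration, the following minimal Python sketch computes the posterior mean, the maximum-likelihood estimate, and one posterior sample of the transition probabilities for a single state-action pair; the variable names are illustrative assumptions and this is not the thesis's implementation.

    import numpy as np

    def dirichlet_dynamics(counts, alpha0, seed=0):
        """Dirichlet transition model for one state-action pair s_hat_i.

        counts : length-S array; counts[j] = n_ij, observed transitions to state j
        alpha0 : length-S array; alpha0[j] = alpha_{ij,0}, prior hyperparameters (> 0)
        """
        rng = np.random.default_rng(seed)
        n_i = counts.sum()
        posterior = alpha0 + counts            # Dirichlet posterior parameters
        p_hat = posterior / posterior.sum()    # posterior mean [p_hat(i)]_j
        p_mle = counts / max(n_i, 1)           # maximum-likelihood estimate [p_hat_0(i)]_j
        p_sample = rng.dirichlet(posterior)    # sampled dynamics p_tilde(i) used by DPS
        return p_hat, p_mle, p_sample

    # Example: S = 3 successor states, n_i = 10 visits, uniform prior alpha_{ij,0} = 1.
    p_hat, p_mle, p_sample = dirichlet_dynamics(np.array([6, 3, 1]), np.ones(3))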
Consider the sampled dynamics at state-action pair $\hat{s}_i$. For any $\delta > 0$,
$$
\begin{aligned}
P\left(\|\tilde{\boldsymbol{p}}(i) - \boldsymbol{p}(i)\|_1 \geq \delta\right)
&= P\left(\|\tilde{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}(i) + \hat{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}_0(i) + \hat{\boldsymbol{p}}_0(i) - \boldsymbol{p}(i)\|_1 \geq \delta\right) \\
&\stackrel{(a)}{\leq} P\left(\|\tilde{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}(i)\|_1 + \|\hat{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}_0(i)\|_1 + \|\hat{\boldsymbol{p}}_0(i) - \boldsymbol{p}(i)\|_1 \geq \delta\right) \\
&\leq P\left(\left\{\|\tilde{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}(i)\|_1 \geq \tfrac{\delta}{3}\right\} \cup \left\{\|\hat{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}_0(i)\|_1 \geq \tfrac{\delta}{3}\right\} \cup \left\{\|\hat{\boldsymbol{p}}_0(i) - \boldsymbol{p}(i)\|_1 \geq \tfrac{\delta}{3}\right\}\right) \\
&\stackrel{(b)}{\leq} P\left(\|\tilde{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}(i)\|_1 \geq \tfrac{\delta}{3}\right) + P\left(\|\hat{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}_0(i)\|_1 \geq \tfrac{\delta}{3}\right) + P\left(\|\hat{\boldsymbol{p}}_0(i) - \boldsymbol{p}(i)\|_1 \geq \tfrac{\delta}{3}\right), \quad \text{(B.2)}
\end{aligned}
$$
where (a) holds due to the triangle inequality and (b) follows from the union bound.
This proof will upper-bound each term in Eq. (B.2) in terms of $n_i$ and show that each bound decays as $n_i \longrightarrow \infty$, that is, as $\hat{s}_i$ is visited infinitely often. For the first term, this bound is achieved via Chebyshev's inequality:
$$
P\left(\|\tilde{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}(i)\|_1 \geq \tfrac{\delta}{3}\right)
\leq P\left(\bigcup_{j=1}^{S}\left\{\left|[\tilde{\boldsymbol{p}}(i)]_j - [\hat{\boldsymbol{p}}(i)]_j\right| \geq \tfrac{\delta}{3S}\right\}\right)
\stackrel{(a)}{\leq} \sum_{j=1}^{S} P\left(\left|[\tilde{\boldsymbol{p}}(i)]_j - [\hat{\boldsymbol{p}}(i)]_j\right| \geq \tfrac{\delta}{3S}\right)
\stackrel{(b)}{\leq} \sum_{j=1}^{S} \frac{9S^2}{\delta^2}\,\mathrm{Var}\left[[\tilde{\boldsymbol{p}}(i)]_j\right],
$$
where (a) follows from the union bound and (b) is an application of Chebyshev's inequality. For a Dirichlet random variable $X$ with parameters $(\alpha_1, \ldots, \alpha_S)$, $\alpha_j > 0$ for each $j$, the variance of the $j$th component $X_j$ is given by:
$$\mathrm{Var}[X_j] = \frac{\bar{\alpha}_j(1 - \bar{\alpha}_j)}{1 + \sum_{j'=1}^{S}\alpha_{j'}} \leq \frac{1}{2}\cdot\frac{1}{1 + \sum_{j'=1}^{S}\alpha_{j'}},$$
where $\bar{\alpha}_j := \alpha_j / \sum_{j'=1}^{S}\alpha_{j'}$.
In the DPS algorithm, $\tilde{\boldsymbol{p}}(i)$ is drawn from a Dirichlet distribution with parameters $(\alpha_{i1}, \ldots, \alpha_{iS}) = (\alpha_{i1,0} + n_{i1}, \ldots, \alpha_{iS,0} + n_{iS})$, so that
$$\mathrm{Var}\left[[\tilde{\boldsymbol{p}}(i)]_j\right] \leq \frac{1}{2}\cdot\frac{1}{1 + \sum_{j'=1}^{S}\alpha_{ij'}} = \frac{1}{2}\cdot\frac{1}{1 + \sum_{j'=1}^{S}\left(\alpha_{ij',0} + n_{ij'}\right)} \leq \frac{1}{2}\cdot\frac{1}{1 + \sum_{j'=1}^{S} n_{ij'}} = \frac{1}{2(1 + n_i)}.$$
Thus,
$$P\left(\|\tilde{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}(i)\|_1 \geq \tfrac{\delta}{3}\right) \leq \sum_{j=1}^{S}\frac{9S^2}{\delta^2}\cdot\frac{1}{2(1 + n_i)} = \frac{9S^3}{2\delta^2(1 + n_i)}.$$
Considering the second term in Eq. (B.2),
$$
P\left(\|\hat{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}_0(i)\|_1 \geq \tfrac{\delta}{3}\right)
\leq P\left(\bigcup_{j=1}^{S}\left\{\left|[\hat{\boldsymbol{p}}(i) - \hat{\boldsymbol{p}}_0(i)]_j\right| \geq \tfrac{\delta}{3S}\right\}\right)
\stackrel{(a)}{\leq} \sum_{j=1}^{S} P\left(\left|[\hat{\boldsymbol{p}}(i)]_j - [\hat{\boldsymbol{p}}_0(i)]_j\right| \geq \tfrac{\delta}{3S}\right)
\stackrel{(b)}{\leq} \sum_{j=1}^{S} P\left(\frac{\alpha_{ij,0} + \sum_{j'=1}^{S}\alpha_{ij',0}}{n_i + \sum_{j'=1}^{S}\alpha_{ij',0}} \geq \frac{\delta}{3S}\right),
$$
where (a) holds via the union bound and (b) follows for $n_i \geq 1$ because, in that case,
$$
\begin{aligned}
\left|[\hat{\boldsymbol{p}}(i)]_j - [\hat{\boldsymbol{p}}_0(i)]_j\right|
&= \left|\frac{n_{ij} + \alpha_{ij,0}}{n_i + \sum_{j'=1}^{S}\alpha_{ij',0}} - \frac{n_{ij}}{n_i}\right|
= \left|\frac{\alpha_{ij,0}}{n_i + \sum_{j'=1}^{S}\alpha_{ij',0}} - \frac{n_{ij}\sum_{j'=1}^{S}\alpha_{ij',0}}{n_i\left(n_i + \sum_{j'=1}^{S}\alpha_{ij',0}\right)}\right| \\
&\leq \frac{\alpha_{ij,0}}{n_i + \sum_{j'=1}^{S}\alpha_{ij',0}} + \frac{n_{ij}}{n_i}\cdot\frac{\sum_{j'=1}^{S}\alpha_{ij',0}}{n_i + \sum_{j'=1}^{S}\alpha_{ij',0}}
\leq \frac{\alpha_{ij,0} + \sum_{j'=1}^{S}\alpha_{ij',0}}{n_i + \sum_{j'=1}^{S}\alpha_{ij',0}}.
\end{aligned}
$$
For the third term in Eq. (B.2), one can apply the following concentration inequality for Dirichlet variables (see Appendix C.1 in Jaksch, Ortner, and Auer, 2010):
$$P\left(\|\hat{\boldsymbol{p}}_0(i) - \boldsymbol{p}(i)\|_1 \geq \varepsilon\right) \leq \left(2^S - 2\right)\exp\left(-\frac{n_i\varepsilon^2}{2}\right).$$
Therefore:
$$P\left(\|\hat{\boldsymbol{p}}_0(i) - \boldsymbol{p}(i)\|_1 \geq \tfrac{\delta}{3}\right) \leq \left(2^S - 2\right)\exp\left(-\frac{n_i\delta^2}{18}\right).$$
! . Thus, to upper-bound the right-hand side of Eq. (B.2), for any๐ > 0:
๐ ||๐ห(๐)โ๐(๐)||1 โฅ ๐
โค 9๐3 2๐2(๐๐ +1)+
๐
ร
๐=1
๐
๐ผ๐ ๐ ,0+ร๐
๐=1๐ผ๐ ๐,0 ๐๐ +ร๐
๐=1๐ผ๐ ๐,0
โฅ ๐ 3๐
!
+(2๐โ2)exp โ๐๐๐2 18
! .
On the right-hand side, the first and third terms clearly decay as $n_i \longrightarrow \infty$. The middle term is identically zero for $n_i$ large enough, since the $\alpha_{ij,0}$ values are user-defined constants. Given this inequality, it is clear that for any $\delta > 0$, $P\left(\|\tilde{\boldsymbol{p}}(i) - \boldsymbol{p}(i)\|_1 \geq \delta\right) \longrightarrow 0$ as $n_i \longrightarrow \infty$. If every state-action pair is visited infinitely often, then $n_i \longrightarrow \infty$ for each $i$, and therefore $\tilde{\boldsymbol{p}}(i)$ converges in probability to $\boldsymbol{p}(i)$: $\tilde{\boldsymbol{p}}(i) \xrightarrow{P} \boldsymbol{p}(i)$. Convergence in probability implies convergence in distribution, which gives the desired result.
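To build intuition for how quickly this bound decays, the three terms can be evaluated numerically. The sketch below is illustrative only: it assumes a uniform prior and particular values of $S$ and $\delta$, and writes the middle term as an explicit indicator; for small $n_i$ the bound exceeds 1 and is vacuous, becoming informative only once $n_i$ is large.

    import numpy as np

    def consistency_bound(n_i, S=5, delta=0.2, alpha0=1.0):
        """Right-hand side of the bound on P(||p_tilde(i) - p(i)||_1 >= delta)."""
        A = S * alpha0                                        # sum_j alpha_{ij,0} under a uniform prior
        term1 = 9 * S**3 / (2 * delta**2 * (1 + n_i))         # Chebyshev / posterior-variance term
        term2 = S * float((alpha0 + A) / (n_i + A) >= delta / (3 * S))  # prior-vs-MLE indicator term
        term3 = (2**S - 2) * np.exp(-n_i * delta**2 / 18)     # Dirichlet concentration term
        return term1 + term2 + term3

    for n_i in [10**3, 10**5, 10**7]:
        print(n_i, consistency_bound(n_i))   # the bound shrinks toward zero as n_i grows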
To continue proving that DPS's model of the transition dynamics converges, this analysis uses the fact that the magnitude of the utility estimator $\|\hat{\boldsymbol{r}}_t\|_2$, the mean of the utility posterior sampling distribution, is uniformly upper-bounded; in other words, there exists $U < \infty$ such that $\|\hat{\boldsymbol{r}}_t\|_2 \leq U$.
Lemma 4. When preferences are given by a linear or logistic link function, there exists some $U < \infty$ such that, across all $t \geq 1$, the estimated reward at DPS trial $t$ is bounded by $U$: $\|\hat{\boldsymbol{r}}_t\|_2 \leq U$.
Proof. Firstly, if the link function is logistic, the desired result holds automatically by the definition of $\hat{\boldsymbol{r}}_t$ given in Eq. (4.10): the quantity is projected onto the compact set $\Theta \subset \mathbb{R}^d$ of all possible values of $\boldsymbol{r}$, and a compact set in $\mathbb{R}^d$ must be bounded.
Secondly, the result is proven in the case of a linear link function. In this case, recall that the MAP reward estimate $\hat{\boldsymbol{r}}_t$ is the solution to a ridge regression problem:
$$\hat{\boldsymbol{r}}_t = \operatorname{arg\,inf}_{\boldsymbol{r}} \left\{\sum_{\tau=1}^{t-1}\left(\boldsymbol{x}_\tau^T\boldsymbol{r} - y_\tau\right)^2 + \lambda\|\boldsymbol{r}\|_2^2\right\} = \operatorname{arg\,inf}_{\boldsymbol{r}} \left\{\sum_{\tau=1}^{t-1}\left[\left(\boldsymbol{x}_\tau^T\boldsymbol{r} - y_\tau\right)^2 + \frac{\lambda}{t-1}\|\boldsymbol{r}\|_2^2\right]\right\}. \quad \text{(B.3)}$$
The desired result is proven by contradiction. Assuming that there exists no such upper bound $U$, the proof will identify a subsequence of MAP estimates whose lengths increase unboundedly, but whose directions converge. Then, it will be shown that such vectors fail to minimize the objective in Eq. (B.3), yielding a contradiction.
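Since Eq. (B.3) is a standard ridge regression problem, its minimizer has the usual closed form $(X^T X + \lambda I)^{-1}X^T\boldsymbol{y}$; the following minimal sketch (illustrative names only, not the thesis's code) computes it from the duel features $\boldsymbol{x}_\tau$ and labels $y_\tau$.

    import numpy as np

    def ridge_map_estimate(X, y, lam=1.0):
        """MAP estimate r_hat_t = argmin_r sum_tau (x_tau^T r - y_tau)^2 + lam * ||r||^2.

        X : (t-1, d) array whose rows are the duel feature differences x_tau
        y : (t-1,) array of preference labels in {-1/2, +1/2}
        """
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Example with three observed duels in d = 2 dimensions.
    X = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]])
    y = np.array([0.5, -0.5, 0.5])
    r_hat = ridge_map_estimate(X, y)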
Firstly, the vectors $\boldsymbol{x}_\tau = \boldsymbol{x}_{\tau 2} - \boldsymbol{x}_{\tau 1}$ have bounded magnitude: in the bandit case, $\boldsymbol{x}_{\tau 1}, \boldsymbol{x}_{\tau 2} \in \mathcal{A}$, and the action space $\mathcal{A}$ is compact, while in the RL setting, $\|\boldsymbol{x}_{\tau 1}\|_1 = \|\boldsymbol{x}_{\tau 2}\|_1 = h$. The binary labels $y_\tau$ are also bounded, as they take values in $\left\{-\frac{1}{2}, \frac{1}{2}\right\}$. Note that for $\boldsymbol{r} = 0$, each term $\left(\boldsymbol{x}_\tau^T\boldsymbol{r} - y_\tau\right)^2 + \frac{\lambda}{t-1}\|\boldsymbol{r}\|_2^2$ equals $\frac{1}{4}$. Toward the contradiction, assume that there is no $U < \infty$ such that $\|\hat{\boldsymbol{r}}_t\|_2 \leq U$ for all $t$. Then, the sequence $\hat{\boldsymbol{r}}_1, \hat{\boldsymbol{r}}_2, \ldots$ must have a subsequence indexed by $(t_k)$ such that $\lim_{k\to\infty}\|\hat{\boldsymbol{r}}_{t_k}\|_2 = \infty$. Consider the sequence of unit vectors $\hat{\boldsymbol{r}}_{t_k}/\|\hat{\boldsymbol{r}}_{t_k}\|_2$. This sequence lies within the compact set of unit vectors in $\mathbb{R}^d$, so it must have a convergent subsequence; this subsequence of $(t_k)$ is indexed by $(t_{k_l})$. Then, the sequence $(\hat{\boldsymbol{r}}_{t_{k_l}})$ is such that $\lim_{l\to\infty}\|\hat{\boldsymbol{r}}_{t_{k_l}}\|_2 = \infty$ and $\lim_{l\to\infty}\hat{\boldsymbol{r}}_{t_{k_l}}/\|\hat{\boldsymbol{r}}_{t_{k_l}}\|_2 = \hat{\boldsymbol{r}}_{\mathrm{unit}}$, where $\hat{\boldsymbol{r}}_{\mathrm{unit}} \in \mathbb{R}^d$ is a fixed unit vector.
For any $\boldsymbol{x}_\tau$ such that $|\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_{\mathrm{unit}}| \neq 0$, $\lim_{l\to\infty}\left(\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_{t_{k_l}} - y_\tau\right)^2 = \infty$, and thus, the corresponding terms in Eq. (B.3) approach infinity. However, a lower value of the optimization objective in Eq. (B.3) can be realized by replacing $\hat{\boldsymbol{r}}_{t_{k_l}}$ with the assignment $\boldsymbol{r} = 0$. Meanwhile, for any $\boldsymbol{x}_\tau$ such that $|\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_{\mathrm{unit}}| = 0$, replacing $\hat{\boldsymbol{r}}_{t_{k_l}}$ with $\boldsymbol{r} = 0$ would also decrease the value of the optimization objective in Eq. (B.3). Therefore, for large $l$, $\boldsymbol{r} = 0$ results in a smaller objective value than $\hat{\boldsymbol{r}}_{t_{k_l}}$. This is a contradiction, proving that the elements of the sequence $(\hat{\boldsymbol{r}}_{t_{k_l}})$ cannot have arbitrarily large magnitudes. Thus, the elements of the original sequence $(\hat{\boldsymbol{r}}_t)$ also cannot become arbitrarily large, and $\|\hat{\boldsymbol{r}}_t\|_2 \leq U$ for some $U < \infty$.
The next intermediate result relates the matrix $\tilde{M}_t$, defined in Eq. (B.1), to the matrix $M_t = \lambda I + \sum_{\tau=1}^{t-1}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T$.
Lemma 5. On iteration $t$ of DPS, the posterior covariance matrix for the rewards is $\Sigma^{(t)} = \beta_t(\delta)^2\tilde{M}_t^{-1}$; if the link function $g$ is linear, then $\tilde{M}_t = M_t$, while if $g$ is logistic, then $\tilde{M}_t = \lambda I + \sum_{\tau=1}^{t-1}\dot{g}\left(2y_\tau\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_t\right)\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T$. In both cases, there exist two constants $\psi_{\min}, \psi_{\max}$ such that $0 < \psi_{\min} \leq \psi_{\max} < \infty$ and $\psi_{\min}M_t \preceq \tilde{M}_t \preceq \psi_{\max}M_t$.
Proof. Firstly, if $g$ is linear, then $\tilde{M}_t = M_t$, so the desired result clearly holds with $\psi_{\min} = \psi_{\max} = 1$.
If $g$ is logistic, the desired statement is equivalent to:
$$\psi_{\min}\left(\lambda I + \sum_{\tau=1}^{t-1}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T\right) \preceq \lambda I + \sum_{\tau=1}^{t-1}\dot{g}\left(2y_\tau\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_t\right)\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T \preceq \psi_{\max}\left(\lambda I + \sum_{\tau=1}^{t-1}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T\right).$$
By definition of $\dot{g}$, $\dot{g}(x) \in (0, \infty)$ for all $x \in \mathbb{R}$. Moreover, the domain of $\dot{g}$ has bounded magnitude, since all possible inputs to $\dot{g}$ are of the form $2y_\tau\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_t$, in which $|y_\tau| = \frac{1}{2}$, $\boldsymbol{x}_\tau$ belongs to a compact set, and $\|\hat{\boldsymbol{r}}_t\|_2 \leq U$ by Lemma 4. Therefore, all possible inputs to $\dot{g}$ belong to a compact set. A continuous function over a compact set always attains its maximum and minimum values; therefore, there exist values $\dot{g}_{\min}, \dot{g}_{\max}$ such that $0 < \dot{g}_{\min} \leq \dot{g}(x) \leq \dot{g}_{\max} < \infty$ for all possible inputs $x$ to $\dot{g}$.
Therefore,
$$\lambda I + \sum_{\tau=1}^{t-1}\dot{g}\left(2y_\tau\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_t\right)\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T \succeq \lambda I + \sum_{\tau=1}^{t-1}\dot{g}_{\min}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T \succeq \min\{\dot{g}_{\min}, 1\}\left[\lambda I + \sum_{\tau=1}^{t-1}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T\right],$$
and
$$\lambda I + \sum_{\tau=1}^{t-1}\dot{g}\left(2y_\tau\boldsymbol{x}_\tau^T\hat{\boldsymbol{r}}_t\right)\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T \preceq \lambda I + \sum_{\tau=1}^{t-1}\dot{g}_{\max}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T \preceq \max\{\dot{g}_{\max}, 1\}\left[\lambda I + \sum_{\tau=1}^{t-1}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T\right],$$
which proves the desired result for $\psi_{\min} = \min\{\dot{g}_{\min}, 1\}$ and $\psi_{\max} = \max\{\dot{g}_{\max}, 1\}$.
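For a given data set, the semidefinite ordering asserted by Lemma 5 can be verified numerically by checking that $\tilde{M}_t - \psi_{\min}M_t$ and $\psi_{\max}M_t - \tilde{M}_t$ have nonnegative eigenvalues. The sketch below is illustrative only: all names are assumptions, and the logistic link's derivative is taken to be the standard sigmoid derivative.

    import numpy as np

    def sigmoid_deriv(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)              # derivative of the logistic link; lies in (0, 1/4]

    def check_sandwich(X, y, r_hat, lam=1.0):
        """Check psi_min * M_t <= M_tilde_t <= psi_max * M_t in the Loewner order."""
        d = X.shape[1]
        M = lam * np.eye(d) + X.T @ X
        w = sigmoid_deriv(2 * y * (X @ r_hat))        # weights g_dot(2 y_tau x_tau^T r_hat)
        M_tilde = lam * np.eye(d) + (X * w[:, None]).T @ X
        # Empirical min/max of the observed weights suffice for this data set.
        psi_min, psi_max = min(w.min(), 1.0), max(w.max(), 1.0)
        lower_ok = np.all(np.linalg.eigvalsh(M_tilde - psi_min * M) >= -1e-9)
        upper_ok = np.all(np.linalg.eigvalsh(psi_max * M - M_tilde) >= -1e-9)
        return lower_ok, upper_ok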
To finish proving convergence of the transition dynamics Bayesian model, Lemma 6 demonstrates that every state-action pair is visited infinitely often.
Lemma 6. Under DPS with preference-based RL, assume that the dynamics are modeled via a Dirichlet model and that the utilities are modeled via either the linear or logistic link function, with posterior sampling distributions given in Eq. (4.13).
Then, every state-action pair is visited infinitely often.
This consistency result also holds when removing the $\beta_t(\delta)$ and $\beta^0_t(\delta)$ factors from the distributions in Eq. (4.13).
Proof. The proof proceeds by assuming that there exists a state-action pair that is visited only finitely-many times. This assumption will lead to a contradiction1: once this state-action pair is no longer visited, the posterior sampling distribution for the utilities ๐ is no longer updated with respect to it. Then, DPS is guaranteed to eventually sample a high enough reward for this state-action that the resultant policy will prioritize visiting it.
First, note that DPS is guaranteed to reach at least one state-action pair infinitely often: given the problem's finite state and action spaces, at least one state-action pair must be visited infinitely often during DPS execution. If not every state-action pair is visited infinitely often, there must exist a state-action pair $(s, a)$ such that $s$ is visited infinitely often while $(s, a)$ is not; otherwise, if all actions were selected infinitely often in all infinitely-visited states, the finitely-visited states would be unreachable (in which case these states are irrelevant to the learning process and regret minimization, and can be ignored). Without loss of generality, this state-action pair $(s, a)$ is labeled as $\hat{s}_1$. To reach a contradiction, it suffices to show that $\hat{s}_1$ is visited infinitely often.

1 Note that in finite-horizon MDPs, the concept of visiting a state finitely-many times is not the same as that of a transient state in an infinite Markov chain, because: 1) due to the finite horizon, the state is resampled from the initial state distribution $p_0(s)$ every $h$ timesteps, and 2) the policy, which determines which state-action pairs can be reached in an episode, is also resampled every $h$ timesteps.
Let $\boldsymbol{r}_1$ be the utility vector with a reward of 1 in state-action pair $\hat{s}_1$ and rewards of zero elsewhere. From Definition 6, $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ is the policy that maximizes the expected number of visits to $\hat{s}_1$ under dynamics $\tilde{\boldsymbol{p}}$ and utility vector $\boldsymbol{r}_1$:
$$\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1) = \operatorname{arg\,max}_\pi V(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi),$$
where $V(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi)$ is the expected total reward of a length-$h$ trajectory under dynamics $\tilde{\boldsymbol{p}}$, utilities $\boldsymbol{r}_1$, and policy $\pi$, or equivalently (by definition of $\boldsymbol{r}_1$), the expected number of visits to state-action pair $\hat{s}_1$.
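For concreteness, $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ can be obtained by ordinary finite-horizon value iteration with the indicator reward $\boldsymbol{r}_1$; the following sketch uses assumed names and array shapes (the thesis's own planner may differ in detail) and returns a time-dependent greedy policy together with the optimal values.

    import numpy as np

    def max_visit_policy(P, target_sa, h):
        """Finite-horizon value iteration maximizing expected visits to one state-action pair.

        P         : (S, A, S) array of sampled transition probabilities p_tilde
        target_sa : tuple (s, a) identifying the state-action pair s_hat_1
        h         : episode horizon
        """
        S, A, _ = P.shape
        R = np.zeros((S, A))
        R[target_sa] = 1.0                         # utility vector r_1: reward 1 at s_hat_1, 0 elsewhere
        V = np.zeros(S)
        policy = np.zeros((h, S), dtype=int)
        for step in reversed(range(h)):
            Q = R + np.einsum('saj,j->sa', P, V)   # expected visits-to-go for each (s, a)
            policy[step] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy, V                           # V is the expected number of visits from each start state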
Next, it will be shown that there exists $b > 0$ such that $P(\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)) > b$ for all possible values of $\tilde{\boldsymbol{p}}$. That is, for any sampled dynamics parameters $\tilde{\boldsymbol{p}}$, the probability of selecting the policy $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ is uniformly lower-bounded, implying that DPS must eventually select $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$.
Let $\tilde{r}_i$ denote the sampled utility (also referred to as reward) associated with state-action pair $\hat{s}_i$ in a particular DPS episode, for each state-action pair $i \in \{1, \ldots, d\}$, with $d = SA$. The proof will show that, conditioned on $\tilde{\boldsymbol{p}}$, there exists $v > 0$ such that if $\tilde{r}_1$ exceeds $\max\{v\tilde{r}_2, v\tilde{r}_3, \ldots, v\tilde{r}_d\}$, then value iteration returns the policy $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$, which is the policy maximizing the expected amount of time spent in $\hat{s}_1$. This can be seen by setting $v := \frac{h}{m_1}$, where $h$ is the time horizon and $m_1$ is the expected number of visits to $\hat{s}_1$ under $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$. Under this definition of $v$, the event $\{\tilde{r}_1 \geq \max\{v\tilde{r}_2, v\tilde{r}_3, \ldots, v\tilde{r}_d\}\}$ is equivalent to $\{\tilde{r}_1 m_1 \geq h\max\{\tilde{r}_2, \tilde{r}_3, \ldots, \tilde{r}_d\}\}$; the latter inequality implies that given $\tilde{\boldsymbol{p}}$ and $\tilde{\boldsymbol{r}}$, the expected reward accumulated solely in state-action pair $\hat{s}_1$ exceeds the reward gained by repeatedly (during all $h$ timesteps) visiting the state-action pair in the set $\{\hat{s}_2, \ldots, \hat{s}_d\}$ having the highest sampled reward. Clearly, in this situation, value iteration returns the policy $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$.
Next, it is shown that $v = \frac{h}{m_1}$ is continuous in the sampled dynamics $\tilde{\boldsymbol{p}}$ by showing that $m_1$ is continuous in $\tilde{\boldsymbol{p}}$. Recall that $m_1$ is defined as the expected number of visits to $\hat{s}_1$ under $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$. This is equivalent to the expected reward for following $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ under dynamics $\tilde{\boldsymbol{p}}$ and rewards $\boldsymbol{r}_1$:
$$m_1 = V\left(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)\right) = \max_\pi V\left(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi\right). \quad \text{(B.4)}$$
The value of any policy $\pi$ is continuous in the transition dynamics parameters, so $V(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1, \pi)$ is continuous in $\tilde{\boldsymbol{p}}$. The maximum in Eq. (B.4) is taken over the finite set of deterministic policies; because a maximum over a finite number of continuous functions is also continuous, $m_1$ is continuous in $\tilde{\boldsymbol{p}}$.
Next, recall that a continuous function on a compact set achieves its maximum and minimum values on that set. The set of all possible dynamics parameters $\tilde{\boldsymbol{p}}$ is such that for each state-action pair $i$, $\sum_{j=1}^{S}\tilde{p}_{ij} = 1$ and $\tilde{p}_{ij} \geq 0$ for all $j$; the set of all possible vectors $\tilde{\boldsymbol{p}}$ is clearly closed and bounded, and hence compact. Therefore, $v$ achieves its maximum and minimum values on this set, and for any $\tilde{\boldsymbol{p}}$, $v \in [v_{\min}, v_{\max}]$, where $v_{\min} > 0$ and $v_{\max} < \infty$ ($v$ is positive by definition, and $v = \infty$ is impossible, as it would require $m_1 = 0$, that is, that $\hat{s}_1$ is unreachable).
Then, $P(\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1))$ can be expressed in terms of $v$ and the parameters of the reward posterior. Firstly,
$$P\left(\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)\right) \geq P\left(\tilde{r}_1 > \max\{v\tilde{r}_2, v\tilde{r}_3, \ldots, v\tilde{r}_d\}\right) \geq \prod_{j=2}^{d} P\left(\tilde{r}_1 > v\tilde{r}_j\right).$$
In the $t$th DPS iteration, the sampled rewards are drawn from a jointly Gaussian posterior: $\tilde{\boldsymbol{r}} \sim \mathcal{N}\left(\boldsymbol{\mu}^{(t)}, \Sigma^{(t)}\right)$ for some $\boldsymbol{\mu}^{(t)}$ and $\Sigma^{(t)}$, where $[\boldsymbol{\mu}^{(t)}]_j = \mu^{(t)}_j$ and $[\Sigma^{(t)}]_{jk} = \Sigma^{(t)}_{jk}$. Then, $\tilde{r}_1 - v\tilde{r}_j \sim \mathcal{N}\left(\mu^{(t)}_1 - v\mu^{(t)}_j,\ \Sigma^{(t)}_{11} + v^2\Sigma^{(t)}_{jj} - 2v\Sigma^{(t)}_{1j}\right)$, so that:
$$P\left(\pi_{t1} = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)\right) \geq \prod_{j=2}^{d}\left[1 - \Phi\left(\frac{-\mu^{(t)}_1 + v\mu^{(t)}_j}{\sqrt{\Sigma^{(t)}_{11} + v^2\Sigma^{(t)}_{jj} - 2v\Sigma^{(t)}_{1j}}}\right)\right] = \prod_{j=2}^{d}\Phi\left(\frac{\mu^{(t)}_1 - v\mu^{(t)}_j}{\sqrt{\Sigma^{(t)}_{11} + v^2\Sigma^{(t)}_{jj} - 2v\Sigma^{(t)}_{1j}}}\right), \quad \text{(B.5)}$$
where $\Phi$ is the standard Gaussian cumulative distribution function. For the right-hand expression in Eq. (B.5) to have a lower bound greater than zero, the argument of $\Phi(\cdot)$ must be lower-bounded. It suffices to upper-bound the numerator's magnitude and to lower-bound the denominator above zero for each product factor $j$ and over all iterations $t$.
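Each factor in the product of Eq. (B.5) is a one-dimensional Gaussian probability that can be computed directly from the posterior mean and covariance; a minimal sketch (hypothetical names, illustrative only) of one such factor:

    import numpy as np
    from scipy.stats import norm

    def factor_prob(mu, Sigma, j, v):
        """P(r_tilde_1 > v * r_tilde_j) under r_tilde ~ N(mu, Sigma); j is a 0-based index."""
        mean = mu[0] - v * mu[j]
        var = Sigma[0, 0] + v**2 * Sigma[j, j] - 2 * v * Sigma[0, j]
        return norm.cdf(mean / np.sqrt(var))

    # Lower bound of Eq. (B.5): the product of these factors over the remaining state-action pairs.
    # prob = np.prod([factor_prob(mu, Sigma, j, v) for j in range(1, len(mu))])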
The numerator can be upper-bounded using Lemma 4. Since $\boldsymbol{\mu}^{(t)}$ equals $\hat{\boldsymbol{r}}_t$ at iteration $t$, $\|\boldsymbol{\mu}^{(t)}\|_2 \leq U$; therefore, $|\mu^{(t)}_1|, |\mu^{(t)}_j| \leq U$. Because $0 < v \leq v_{\max}$,
$$\left|\mu^{(t)}_1 - v\mu^{(t)}_j\right| \leq \left|\mu^{(t)}_1\right| + v\left|\mu^{(t)}_j\right| \leq (1 + v_{\max})U.$$
To lower-bound the denominator, first note that it is equal to $\sqrt{\boldsymbol{c}_j^T\Sigma^{(t)}\boldsymbol{c}_j}$, in which $\boldsymbol{c}_j \in \mathbb{R}^d$ is defined as the vector with 1 in the first position, $-v$ in the $j$th position for some $j \in \{2, \ldots, d\}$, and zeros elsewhere:
$$\boldsymbol{c}_j := [1, 0, \ldots, 0, -v, 0, \ldots, 0]^T. \quad \text{(B.6)}$$
Equivalently, it must be shown that $\boldsymbol{c}_j^T\Sigma^{(t)}\boldsymbol{c}_j$ is lower-bounded above zero. By Lemma 5, it holds that $\Sigma^{(t)} \succeq \frac{\beta_t(\delta)^2}{\psi_{\max}}M_t^{-1}$, implying that $\boldsymbol{c}_j^T\Sigma^{(t)}\boldsymbol{c}_j \geq \frac{\beta_t(\delta)^2}{\psi_{\max}}\boldsymbol{c}_j^T M_t^{-1}\boldsymbol{c}_j$. Because $\psi_{\max}$ is a constant and $\beta_t(\delta)$, defined in Eq. (4.4), is non-decreasing in $t$, it suffices to prove that $\boldsymbol{c}_j^T M_t^{-1}\boldsymbol{c}_j$ is lower-bounded above zero. (Thus, the result holds regardless of the presence of $\beta_t(\delta)$ in the utility sampling distribution.)
Recall from Definition 7 that the eigenvectors of $M_t^{-1}$ are $\boldsymbol{u}^{(t)}_1, \ldots, \boldsymbol{u}^{(t)}_d$, with corresponding eigenvalues $\left(\lambda^{(t)}_1\right)^{-1}, \ldots, \left(\lambda^{(t)}_d\right)^{-1}$. The vector $\boldsymbol{c}_j$ can be written in terms of the orthonormal basis formed by the eigenvectors $\{\boldsymbol{u}^{(t)}_i\}$:
$$\boldsymbol{c}_j = \sum_{i=1}^{d}\alpha^{(t)}_i\boldsymbol{u}^{(t)}_i, \quad \text{(B.7)}$$
for some coefficients $\alpha^{(t)}_i \in \mathbb{R}$. Using Eq. (B.7), the quantity to be lower-bounded can now be written as:
$$\boldsymbol{c}_j^T M_t^{-1}\boldsymbol{c}_j = \left(\sum_{i=1}^{d}\alpha^{(t)}_i\boldsymbol{u}^{(t)T}_i\right)\left(\sum_{i=1}^{d}\frac{1}{\lambda^{(t)}_i}\boldsymbol{u}^{(t)}_i\boldsymbol{u}^{(t)T}_i\right)\left(\sum_{i=1}^{d}\alpha^{(t)}_i\boldsymbol{u}^{(t)}_i\right) \stackrel{(a)}{=} \sum_{i=1}^{d}\left(\alpha^{(t)}_i\right)^2\frac{1}{\lambda^{(t)}_i} \stackrel{(b)}{\geq} \left(\alpha^{(t)}_{i_0}\right)^2\frac{1}{\lambda^{(t)}_{i_0}}, \quad \text{(B.8)}$$
where equality (a) follows by orthonormality of the eigenvector basis, and (b) holds for any $i_0 \in \{1, \ldots, d\}$ due to positivity of the eigenvalues $\left(\lambda^{(t)}_i\right)^{-1}$. Therefore, to show that the denominator is bounded away from zero, it suffices to show that for every $t$, there exists some $i_0$ such that $\left(\alpha^{(t)}_{i_0}\right)^2\left(\lambda^{(t)}_{i_0}\right)^{-1}$ is bounded away from zero.
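The identity used in Eq. (B.8) can be sanity-checked numerically by comparing the quadratic form computed directly with its eigen-expansion; a small sketch with assumed names:

    import numpy as np

    def quad_form_via_eigs(M, c):
        """Compare c^T M^{-1} c with the eigen-expansion sum_i alpha_i^2 / lambda_i of Eq. (B.8)."""
        lams, W = np.linalg.eigh(M)          # M = W diag(lams) W^T, columns of W orthonormal
        alphas = W.T @ c                      # coefficients of c in the eigenvector basis
        direct = c @ np.linalg.solve(M, c)
        via_eigs = np.sum(alphas**2 / lams)
        return direct, via_eigs               # the two values agree up to numerical error

    # Example: M_t = lam*I + X^T X for random features X; c_j as in Eq. (B.6) with v = 2.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 4))
    M = 1.0 * np.eye(4) + X.T @ X
    c = np.array([1.0, 0.0, -2.0, 0.0])
    print(quad_form_via_eigs(M, c))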
To prove the previous statement, note that by definition of $M_t$, the eigenvalues $\left(\lambda^{(t)}_i\right)^{-1}$ are non-increasing in $t$. Below, the proof will show that for any eigenvalue $\left(\lambda^{(t)}_i\right)^{-1}$ such that $\lim_{t\to\infty}\left(\lambda^{(t)}_i\right)^{-1} = 0$, the first element of its corresponding eigenvector, $\left[\boldsymbol{u}^{(t)}_i\right]_1$, also converges to zero. Since the first element of $\boldsymbol{c}_j$ equals 1, Eq. (B.6) implies that there must exist some $i_0$ such that $\left[\boldsymbol{u}^{(t)}_{i_0}\right]_1 \not\longrightarrow 0$ and $\alpha^{(t)}_{i_0}$ is bounded away from 0; if this did not hold, then $\boldsymbol{c}_j$ could not have a value of 1 in its first element, contradicting its definition. These observations imply that for every $t$, there must be some $i_0$ such that as $t \longrightarrow \infty$, $\left(\lambda^{(t)}_{i_0}\right)^{-1} \not\longrightarrow 0$ and $\alpha^{(t)}_{i_0}$ is bounded away from zero.
Let $X_t$ denote the observation matrix after $t - 1$ observations: $X_t := \left[\boldsymbol{x}_1\ \cdots\ \boldsymbol{x}_{t-1}\right]^T$. Then, $M_t^{-1} = \left(X_t^T X_t + \lambda I\right)^{-1}$. The matrices $M_t^{-1}$ and $X_t^T X_t$ have the same eigenvectors. Meanwhile, for each eigenvalue $\left(\lambda^{(t)}_i\right)^{-1}$ of $M_t^{-1}$, $X_t^T X_t$ has an eigenvalue $\nu^{(t)}_i := \lambda^{(t)}_i - \lambda \geq 0$ corresponding to the same eigenvector. We aim to characterize the eigenvectors of $M_t^{-1}$ whose eigenvalues approach zero. Since these eigenvectors are identical to those of $X_t^T X_t$ whose eigenvalues approach infinity, the latter can be considered instead.
Without loss of generality, assume that all visits to the finitely-visited state-action pairs (including $\hat{s}_1$) occur in the first $T_0 < t - 1$ iterations, and index these finitely-visited state-action pairs from 1 to $q \geq 1$, so that the finitely-visited state-action pairs are $\{\hat{s}_1, \hat{s}_2, \cdots, \hat{s}_q\}$. Let $X_{1:T_0} \in \mathbb{R}^{T_0\times d}$ denote the matrix containing the first $T_0$ rows of $X_t$, while $X_{T_0+1:t} \in \mathbb{R}^{(t-1-T_0)\times d}$ denotes the remaining rows of $X_t$. With this notation,
$$X_t^T X_t = \sum_{\tau=1}^{t-1}\boldsymbol{x}_\tau\boldsymbol{x}_\tau^T = X_{1:T_0}^T X_{1:T_0} + X_{T_0+1:t}^T X_{T_0+1:t}.$$
Because the first $q$ state-action pairs, $\{\hat{s}_1, \hat{s}_2, \cdots, \hat{s}_q\}$, are unvisited after iteration $T_0$, the first $q$ elements of $\boldsymbol{x}_\tau$ are zero for all $\tau > T_0$. Therefore, $X_{T_0+1:t}^T X_{T_0+1:t}$ can be written in the following block matrix form:
$$X_{T_0+1:t}^T X_{T_0+1:t} = \begin{bmatrix} 0_{q\times q} & 0_{q\times(d-q)} \\ 0_{(d-q)\times q} & A_t \end{bmatrix},$$
where $0_{a\times b}$ denotes the all-zero matrix with dimensions $a\times b$. The matrix $A_t$ includes elements that are unbounded as $t \longrightarrow \infty$; in particular, the diagonal elements of $A_t$ approach infinity as $t \longrightarrow \infty$. The matrix $X_t^T X_t$ can be written in the following block matrix form:
$$X_t^T X_t = X_{1:T_0}^T X_{1:T_0} + X_{T_0+1:t}^T X_{T_0+1:t} = \begin{bmatrix} \left[X_{1:T_0}^T X_{1:T_0}\right]_{(1:q,\ 1:q)} & \left[X_{1:T_0}^T X_{1:T_0}\right]_{(1:q,\ q+1:d)} \\ \left[X_{1:T_0}^T X_{1:T_0}\right]_{(q+1:d,\ 1:q)} & \left[X_{1:T_0}^T X_{1:T_0}\right]_{(q+1:d,\ q+1:d)} + A_t \end{bmatrix} := \begin{bmatrix} B & C \\ C^T & D_t \end{bmatrix},$$
where $[N]_{(a:b,\ c:c')}$ denotes the submatrix of $N$ obtained by extracting rows $a$ through $b$ and columns $c$ through $c'$. Because the matrices $B$ and $C$ depend only upon $X_{1:T_0}$, they are fixed as $t$ increases, while the matrix $D_t$ contains values that grow toward infinity with increasing $t$. In particular, all elements along $D_t$'s diagonal are unbounded.
Intuitively, in the limit, $B$ and $C$ are close to zero compared to $D_t$, and $X_t^T X_t$ (when normalized) increasingly resembles a matrix in which only the bottom-right block is nonzero. This intuitive notion is formalized next.
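Before the formal argument, this intuition can also be observed numerically: as the bottom-right block grows, the eigenvectors paired with the growing eigenvalues place vanishing weight on the first $q$ coordinates. A rough sketch with assumed dimensions:

    import numpy as np

    q, d = 2, 5
    rng = np.random.default_rng(0)
    F = rng.standard_normal((3, d))
    fixed = F.T @ F                              # plays the role of the fixed block X_{1:T_0}^T X_{1:T_0}
    for scale in [1.0, 1e2, 1e4]:
        grow = np.zeros((d, d))
        grow[q:, q:] = scale * np.eye(d - q)     # growing bottom-right block A_t
        lams, W = np.linalg.eigh(fixed + grow)
        top = W[:, -(d - q):]                    # eigenvectors of the d - q largest eigenvalues
        print(scale, np.abs(top[:q]).max())      # weight on the first q coordinates shrinks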
Consider an eigenpair $\left(\nu^{(t)}_i, \boldsymbol{u}^{(t)}_i\right)$ of $X_t^T X_t$ such that $\lim_{t\to\infty}\nu^{(t)}_i = \infty$. The following argument shows that the first element of $\boldsymbol{u}^{(t)}_i$ must approach 0. Letting $\boldsymbol{u}^{(t)}_i = \left[\boldsymbol{w}^{(t)T}_i\ \boldsymbol{z}^{(t)T}_i\right]^T$, where $\boldsymbol{w}^{(t)}_i \in \mathbb{R}^q$ and $\boldsymbol{z}^{(t)}_i \in \mathbb{R}^{d-q}$:
$$\left(X_t^T X_t\right)\boldsymbol{u}^{(t)}_i = \begin{bmatrix} B & C \\ C^T & D_t\end{bmatrix}\begin{bmatrix}\boldsymbol{w}^{(t)}_i \\ \boldsymbol{z}^{(t)}_i\end{bmatrix} = \begin{bmatrix} B\boldsymbol{w}^{(t)}_i + C\boldsymbol{z}^{(t)}_i \\ C^T\boldsymbol{w}^{(t)}_i + D_t\boldsymbol{z}^{(t)}_i\end{bmatrix} = \nu^{(t)}_i\begin{bmatrix}\boldsymbol{w}^{(t)}_i \\ \boldsymbol{z}^{(t)}_i\end{bmatrix}.$$
Dividing both sides by $\nu^{(t)}_i$,
$$\frac{1}{\nu^{(t)}_i}X_t^T X_t\begin{bmatrix}\boldsymbol{w}^{(t)}_i \\ \boldsymbol{z}^{(t)}_i\end{bmatrix} = \begin{bmatrix}\frac{1}{\nu^{(t)}_i}\left(B\boldsymbol{w}^{(t)}_i + C\boldsymbol{z}^{(t)}_i\right) \\ \frac{1}{\nu^{(t)}_i}\left(C^T\boldsymbol{w}^{(t)}_i + D_t\boldsymbol{z}^{(t)}_i\right)\end{bmatrix} = \begin{bmatrix}\boldsymbol{w}^{(t)}_i \\ \boldsymbol{z}^{(t)}_i\end{bmatrix}.$$
In the upper matrix block, $\lim_{t\to\infty}\nu^{(t)}_i = \infty$, $B$ and $C$ are fixed as $t$ increases, and $\boldsymbol{w}^{(t)}_i$ and $\boldsymbol{z}^{(t)}_i$ have upper-bounded elements because $\boldsymbol{u}^{(t)}_i$ is a unit vector. Thus, $\lim_{t\to\infty}\boldsymbol{w}^{(t)}_i = \lim_{t\to\infty}\frac{1}{\nu^{(t)}_i}\left(B\boldsymbol{w}^{(t)}_i + C\boldsymbol{z}^{(t)}_i\right) = 0$. In particular, the first element of $\boldsymbol{w}^{(t)}_i$ converges to zero, implying that the same is true of $\boldsymbol{u}^{(t)}_i$.
As justified above, this result implies that for each iteration $t$, there exists an index $i_0 \in \{1, \ldots, d\}$ such that the right-hand side of Eq. (B.8) has a lower bound above zero. This completes the proof that the denominator in Eq. (B.5) does not decay to zero. As a result, there exists some $b > 0$ such that $P(\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)) \geq b > 0$.
In consequence, DPS is guaranteed to infinitely often sample pairs $(\tilde{\boldsymbol{p}}, \tilde{\boldsymbol{r}})$ such that $\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$; that is, DPS infinitely often samples policies that prioritize reaching $\hat{s}_1$ as quickly as possible. Such a policy always takes action $a$ in state $s$. Furthermore, because $s$ is visited infinitely often, either a) $p_0(s) > 0$, or b) the infinitely-visited state-action pairs include a path with a nonzero probability of reaching $s$. In case a), since the initial state distribution is fixed, the MDP will infinitely often begin in state $s$ under the policy $\pi = \pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$, so $\hat{s}_1$ will be visited infinitely often. In case b), due to Lemma 3, the transition dynamics parameters for the state-action pairs along the path to $s$ converge to their true values (intuitively, the algorithm knows how to reach $s$); because DPS selects $\pi_{vi}(\tilde{\boldsymbol{p}}, \boldsymbol{r}_1)$ infinitely often, it is therefore guaranteed to reach $\hat{s}_1$ infinitely often. This is a contradiction, proving that every state-action pair must be visited infinitely often.
The direct combination of Lemmas 3 and 6 proves asymptotic consistency of the transition dynamics model:
Proposition 3. Assume that DPS is executed in the preference-based RL setting, with transition dynamics modeled via a Dirichlet model, utilities modeled via either the linear or logistic link function, and utility posterior sampling distributions given in Eq. (4.13). Then, the sampled transition dynamics $\tilde{\boldsymbol{p}}_{t1}$ and $\tilde{\boldsymbol{p}}_{t2}$ converge in distribution to the true dynamics: $\tilde{\boldsymbol{p}}_{t1}, \tilde{\boldsymbol{p}}_{t2} \xrightarrow{D} \boldsymbol{p}$. This consistency result also holds when removing the $\beta_t(\delta)$ factors from the distributions in Eq. (4.13).