A.3 Gaussian Process Preference Model
counts (i.e., $\boldsymbol{x}$) and the Gaussian process prior on $\boldsymbol{r}$. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields $\boldsymbol{r} \mid F \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where the expressions for $\boldsymbol{\mu}$ and $\Sigma$ are given by Eqs. (A.2) and (A.3) above.
By substituting $\boldsymbol{r}_0$ for $F$, the conditional posterior density of $\boldsymbol{r}$ can be expressed in terms of $\boldsymbol{x}$, $\boldsymbol{r}_0$, $K_f$, and $\sigma_f$, that is, in terms of observed data and the Gaussian process prior parameters.
state-action visit counts, $\boldsymbol{x}_{k1}$: $u(\boldsymbol{x}_{k1}) = \boldsymbol{r}^\top \boldsymbol{x}_{k1}$. Thus, the full likelihood expression is:
$$P(\mathcal{D} \mid \boldsymbol{r}) = \prod_{k=1}^{N} g(z_k), \qquad (A.6)$$
$$z_k := \frac{y_k' \left[ u(\boldsymbol{x}_{k2}) - u(\boldsymbol{x}_{k1}) \right]}{c} = \frac{y_k' \, \boldsymbol{r}^\top (\boldsymbol{x}_{k2} - \boldsymbol{x}_{k1})}{c} = \frac{y_k' \, \boldsymbol{r}^\top \boldsymbol{x}_k}{c}.$$
Given the preference dataset $\mathcal{D}$, one can model the posterior probability of $\boldsymbol{r}$:
$$P(\boldsymbol{r} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{r}) \, P(\boldsymbol{r}),$$
where the expressions for the prior $P(\boldsymbol{r})$ and likelihood $P(\mathcal{D} \mid \boldsymbol{r})$ are given by Eqs. (A.5) and (A.6), respectively. This posterior can be estimated by the Laplace approximation, from which samples $\hat{\boldsymbol{r}}$ of the utilities $\boldsymbol{r}$ can easily be drawn:
$$\hat{\boldsymbol{r}} \sim \mathcal{N}\!\left(\hat{\boldsymbol{r}}_{\text{MAP}}, \, \alpha \Sigma_{\text{MAP}}\right), \text{ where:} \qquad (A.7)$$
$$\hat{\boldsymbol{r}}_{\text{MAP}} = \underset{\boldsymbol{r}}{\arg\min}\; S(\boldsymbol{r}), \qquad (A.8)$$
$$\Sigma_{\text{MAP}} = \left[ \nabla^2_{\boldsymbol{r}} S(\boldsymbol{r}) \Big|_{\hat{\boldsymbol{r}}_{\text{MAP}}} \right]^{-1}, \qquad (A.9)$$
and $S(\boldsymbol{r}) := \frac{1}{2} \boldsymbol{r}^\top \Sigma^{-1} \boldsymbol{r} - \sum_{k=1}^{N} \log g(z_k)$ is the negative log posterior, neglecting constant terms with respect to $\boldsymbol{r}$; lastly, $\alpha > 0$ is a tunable hyperparameter that influences the balance between exploration and exploitation. In order for the Laplace approximation to be valid, $S(\boldsymbol{r})$ must be convex in $\boldsymbol{r}$: this guarantees that the optimization problem in Eq. (A.8) is convex and that the covariance matrix defined by Eq. (A.9) is positive definite, and therefore a valid Gaussian covariance matrix. Convexity of $S(\boldsymbol{r})$ can be established by demonstrating that its Hessian matrix is positive definite.
It can be shown that for any $\boldsymbol{r}$, $\nabla^2_{\boldsymbol{r}} S(\boldsymbol{r}) = \Sigma^{-1} + \Lambda$, where:
$$[\Lambda]_{ij} := \frac{1}{c^2} \sum_{k=1}^{N} [\boldsymbol{x}_k]_i [\boldsymbol{x}_k]_j \left[ -\frac{g''(z_k)}{g(z_k)} + \left( \frac{g'(z_k)}{g(z_k)} \right)^2 \right], \qquad (A.10)$$
for $\boldsymbol{x}_k = \boldsymbol{x}_{k2} - \boldsymbol{x}_{k1}$. Because the prior covariance $\Sigma$ is positive definite, to show that $\nabla^2_{\boldsymbol{r}} S(\boldsymbol{r})$ is positive definite, it suffices to show that $\Lambda$ is positive semidefinite. From Eq. (A.10), one can see that:
$$\Lambda = \frac{1}{c^2} \sum_{k=1}^{N} \left[ -\frac{g''(z_k)}{g(z_k)} + \left( \frac{g'(z_k)}{g(z_k)} \right)^2 \right] \boldsymbol{x}_k \boldsymbol{x}_k^\top.$$
Clearly, $\boldsymbol{x}_k \boldsymbol{x}_k^\top$ is positive semidefinite, and thus, the following statement is a sufficient condition for convexity of $S(\boldsymbol{r})$:
$$-\frac{g''(z)}{g(z)} + \left( \frac{g'(z)}{g(z)} \right)^2 \geq 0 \quad \text{for all } z \in \mathbb{R}.$$
In particular, this condition is satisfied for the Gaussian link function, $g_{\text{Gaussian}}(\cdot) = \Phi(\cdot)$, where $\Phi$ is the standard Gaussian CDF, as well as for the sigmoidal link function, $g_{\log}(x) := \sigma(x) = \frac{1}{1 + \exp(-x)}$. In this work, the experiments utilize the sigmoidal link function.
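To see why the sigmoidal link satisfies this condition, note that $g_{\log}' = g_{\log}(1 - g_{\log})$ and $g_{\log}'' = g_{\log}(1 - g_{\log})(1 - 2 g_{\log})$, so that $-g_{\log}''(z)/g_{\log}(z) + \left(g_{\log}'(z)/g_{\log}(z)\right)^2 = g_{\log}(z)\,(1 - g_{\log}(z)) \ge 0$. The snippet below is a minimal illustrative sketch (not the thesis's implementation) of the Laplace approximation in Eqs. (A.7)-(A.9) under the sigmoidal link; the function name, the optimizer choice, and the label convention $y_k' \in \{-1, +1\}$ are assumptions made for the example.

```python
# Hypothetical sketch of Eqs. (A.7)-(A.9) with the sigmoidal link function.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_gp_preference(X, y, Sigma_prior, c=1.0, alpha=1.0, rng=None):
    """X: (N, d) rows x_k = x_{k2} - x_{k1}; y: (N,) labels in {-1, +1};
    Sigma_prior: (d, d) GP prior covariance. Returns (r_map, Sigma_map, r_sample)."""
    rng = np.random.default_rng() if rng is None else rng
    d = Sigma_prior.shape[0]
    Sigma_inv = np.linalg.inv(Sigma_prior)

    def neg_log_posterior(r):
        # S(r) = 1/2 r^T Sigma^{-1} r - sum_k log g(z_k),  with z_k = y_k x_k^T r / c
        z = y * (X @ r) / c
        return 0.5 * r @ Sigma_inv @ r + np.sum(np.logaddexp(0.0, -z))

    def grad(r):
        z = y * (X @ r) / c
        return Sigma_inv @ r - X.T @ ((1.0 - sigmoid(z)) * y) / c

    # MAP estimate (Eq. (A.8)); S(r) is convex, so any local minimum is global.
    r_map = minimize(neg_log_posterior, np.zeros(d), jac=grad, method="L-BFGS-B").x

    # Hessian at the MAP (Eq. (A.9)): Sigma^{-1} + (1/c^2) sum_k w_k x_k x_k^T,
    # with w_k = sigmoid(z_k) * (1 - sigmoid(z_k)), the nonnegative quantity above.
    s = sigmoid(y * (X @ r_map) / c)
    Sigma_map = np.linalg.inv(Sigma_inv + (X.T * (s * (1.0 - s))) @ X / c**2)

    # Draw a utility sample as in Eq. (A.7): r_hat ~ N(r_map, alpha * Sigma_map).
    r_sample = rng.multivariate_normal(r_map, alpha * Sigma_map)
    return r_map, Sigma_map, r_sample
```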
Bayesian Logistic Regression
Notably, the Bayesian logistic regression inference model discussed in Section 4.3 is a special case of the Gaussian process preference model, in which $c = 1$, $g$ is the sigmoidal link function, and the prior covariance matrix is diagonal, i.e., $\Sigma = \lambda I$; for instance, the latter condition occurs with the squared exponential kernel defined in Eq. (A.4) when its lengthscale $\ell$ is set to zero. In this thesis, a number of the experiments with the Gaussian process preference model fall under the special case of Bayesian logistic regression, and therefore, this model is briefly reviewed here.
In Bayesian logistic regression, the Gaussian prior over possible reward vectors $\boldsymbol{r} \in \mathbb{R}^d$ is $\boldsymbol{r} \sim \mathcal{N}(\boldsymbol{0}, \lambda I)$, where $\lambda > 0$. Setting the $k$th preference label $y_k$ equal to $\frac{1}{2}$ if $\boldsymbol{x}_{k2} \succ \boldsymbol{x}_{k1}$, while $y_k = -\frac{1}{2}$ if $\boldsymbol{x}_{k1} \succ \boldsymbol{x}_{k2}$, the logistic regression likelihood is:
$$P(\mathcal{D} \mid \boldsymbol{r}) = \prod_{k=1}^{N} P(y_k \mid \boldsymbol{r}, \boldsymbol{x}_k) = \prod_{k=1}^{N} \frac{1}{1 + \exp\!\left(-2 y_k \boldsymbol{x}_k^\top \boldsymbol{r}\right)}.$$
The experiments approximate the posterior, $P(\boldsymbol{r} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{r}) P(\boldsymbol{r})$, as Gaussian via the Laplace approximation:
$$P(\boldsymbol{r} \mid \mathcal{D}) \approx \mathcal{N}\!\left(\hat{\boldsymbol{r}}_{\text{MAP}}, \, \alpha \Sigma_{\text{MAP}}\right), \text{ where:}$$
$$\hat{\boldsymbol{r}}_{\text{MAP}} = \underset{\boldsymbol{r}}{\arg\min}\; S(\boldsymbol{r}), \qquad S(\boldsymbol{r}) := -\log P(\mathcal{D}, \boldsymbol{r}) = -\log P(\boldsymbol{r}) - \log P(\mathcal{D} \mid \boldsymbol{r}), \qquad (A.11)$$
$$\Sigma_{\text{MAP}} = \left[ \nabla^2_{\boldsymbol{r}} S(\boldsymbol{r}) \Big|_{\hat{\boldsymbol{r}}_{\text{MAP}}} \right]^{-1},$$
where the optimization in Eq. (A.11) is convex, and $\alpha > 0$ is a tunable hyperparameter that influences the balance between exploration and exploitation. Note that multiplying the covariance by a well-tuned $\alpha$ is more practical than using the $\beta_k(\delta)$ parameters considered in the asymptotic consistency analysis (Section 4.4), as the latter results in overly conservative covariance matrices in practice.
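As a usage sketch only (assuming the hypothetical `laplace_gp_preference` helper from the earlier block), this special case corresponds to passing a diagonal prior $\Sigma = \lambda I$, $c = 1$, and labels $y_k \in \{-\frac{1}{2}, +\frac{1}{2}\}$ rescaled by 2, so that the sigmoid argument matches $2 y_k \boldsymbol{x}_k^\top \boldsymbol{r}$; the synthetic data below are purely illustrative.

```python
# Hypothetical usage of laplace_gp_preference for the logistic regression special case.
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))            # rows are x_k = x_{k2} - x_{k1}
y_half = rng.choice([-0.5, 0.5], size=20)    # preference labels y_k in {-1/2, +1/2}
lam = 1.0                                    # prior scale: Sigma = lam * I
r_map, Sigma_map, r_sample = laplace_gp_preference(
    X_demo, 2.0 * y_half, Sigma_prior=lam * np.eye(3), c=1.0, alpha=1.0)
```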
Appendix B
PROOFS OF ASYMPTOTIC CONSISTENCY FOR DUELING POSTERIOR SAMPLING
This appendix proves the asymptotic consistency results stated in Section 4.4. The details are organized into three sections, which prove:
1. In the preference-based RL setting, samples from the model posterior over transition dynamics parameters converge in distribution to the true transition probabilities.
2. In both the preference-based generalized linear bandit and RL settings, samples from the utility posterior converge in distribution to the true utilities.
3. DPS's selected policies converge in distribution to the optimal policy in the preference-based RL setting. DPS's selected actions converge to the optimal action in the generalized linear bandit setting with a finite action space $\mathcal{A}$.
Please refer to Section 4.2 to review relevant notation, e.g. for the posterior samples drawn in each iteration. In addition, the following notation is used for the value function and for policies given by value iteration:
Definition 5 (Value function given transition dynamics, rewards, and a policy).
Define $V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ as the value function over a length-$h$ episode (i.e., the expected total reward in the episode) under transition dynamics $\boldsymbol{p} \in \mathbb{R}^{S^2 A}$, rewards $\boldsymbol{r} \in \mathbb{R}^{SA}$, and policy $\pi$:
$$V(\boldsymbol{p}, \boldsymbol{r}, \pi) = \sum_{s \in \mathcal{S}} p_0(s) \, \mathbb{E}\left[ \sum_{t=1}^{h} r(s_t, \pi(s_t, t)) \;\middle|\; s_1 = s, \ \boldsymbol{p}, \ \boldsymbol{r} \right].$$
Definition 6 (Optimal deterministic policy given transition dynamics and rewards).
Define $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r}) := \arg\max_\pi V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ as the optimal deterministic policy given transition dynamics $\boldsymbol{p} \in \mathbb{R}^{S^2 A}$ and rewards $\boldsymbol{r} \in \mathbb{R}^{SA}$ (breaking ties randomly if multiple deterministic policies achieve the maximum). Note that $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r})$ can be found via finite-horizon value iteration: defining $V_{\boldsymbol{r},t}(s)$ as in Eq. (3.6), set $V_{\boldsymbol{r},h+1}(s) := 0$ for each $s \in \mathcal{S}$ and use the Bellman equation to calculate $V_{\boldsymbol{r},t}(s)$ successively for $t \in \{h, h-1, \ldots, 1\}$ given $\boldsymbol{p}$ and $\boldsymbol{r}$:
$$\pi(s, t) = \underset{a \in \mathcal{A}}{\arg\max} \left[ r(s, a) + \sum_{s' \in \mathcal{S}} p(s_{t+1} = s' \mid s_t = s, a_t = a) \, V_{\boldsymbol{r},t+1}(s') \right],$$
$$V_{\boldsymbol{r},t}(s) = \sum_{a \in \mathcal{A}} \mathbb{1}[\pi(s, t) = a] \left[ r(s, a) + \sum_{s' \in \mathcal{S}} p(s_{t+1} = s' \mid s_t = s, a_t = a) \, V_{\boldsymbol{r},t+1}(s') \right].$$
As value iteration results in only deterministic policies, of which there are finitely many (more precisely, there are $A^{Sh}$), the maximum argument $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r}) := \arg\max_\pi V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ is taken over a finite policy class.
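As a concrete illustration of this backward recursion, the following is a minimal sketch (not the thesis's code) of finite-horizon value iteration; the array layout `p[s, a, s']` and `r[s, a]` and the function name are assumptions, and ties are broken deterministically by `np.argmax` rather than randomly as in Definition 6.

```python
# Hypothetical sketch of the finite-horizon value iteration in Definition 6.
import numpy as np

def value_iteration_policy(p, r, h):
    """p: (S, A, S) array with p[s, a, s2] = P(s_{t+1} = s2 | s_t = s, a_t = a);
    r: (S, A) rewards; h: episode length. Returns (pi, V), where pi[t - 1, s] is
    the action chosen at step t in state s and V[t - 1, s] = V_{r, t}(s)."""
    S, A, _ = p.shape
    V = np.zeros((h + 1, S))        # V[h] plays the role of V_{r, h+1}(s) := 0
    pi = np.zeros((h, S), dtype=int)
    for t in reversed(range(h)):    # sweeps backward over steps h, h-1, ..., 1
        # Q[s, a] = r(s, a) + sum_{s'} p(s' | s, a) * V_{r, t+1}(s')
        Q = r + p @ V[t + 1]
        pi[t] = np.argmax(Q, axis=1)
        V[t] = Q[np.arange(S), pi[t]]
    return pi, V
```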
Recall that $M_k := \lambda I + \sum_{i=1}^{k-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top$ (see Eq. (4.2)). For the linear link function, the posterior sampling distribution's covariance is given by $\Sigma^{(k)} = \beta_k(\delta)^2 M_k^{-1}$. For the logistic link function, the posterior covariance is given by $\Sigma^{(k)} = \beta_k(\delta)^2 (M_k')^{-1}$, where $M_k' = \lambda I + \sum_{i=1}^{k-1} \tilde{g}\!\left(2 y_i \boldsymbol{x}_i^\top \hat{\boldsymbol{r}}_i\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top$, the weight function $\tilde{g}(x) := \left( \frac{g_{\log}'(x)}{g_{\log}(x)} \right)^2 - \frac{g_{\log}''(x)}{g_{\log}(x)}$ comes from the Laplace approximation, and $g_{\log}$ is the sigmoid function.
As in Section 4.4, it is notationally convenient to define a matrix $\tilde{M}_k \in \mathbb{R}^{d \times d}$ such that:
$$\tilde{M}_k = \begin{cases} M_k = \lambda I + \sum_{i=1}^{k-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top & \text{for the linear link function, and} \\[4pt] M_k' = \lambda I + \sum_{i=1}^{k-1} \tilde{g}\!\left(2 y_i \boldsymbol{x}_i^\top \hat{\boldsymbol{r}}_i\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top & \text{for the logistic link function.} \end{cases} \qquad (B.1)$$
Then, under either link function, the posterior sampling distribution has covariance $\Sigma^{(k)} = \beta_k(\delta)^2 \tilde{M}_k^{-1}$.
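The following is a minimal sketch (under assumed names and array shapes, not the thesis's code) of how $\tilde{M}_k$ in Eq. (B.1) and the sampling covariance $\beta_k(\delta)^2 \tilde{M}_k^{-1}$ could be assembled from the first $k-1$ observations; for the sigmoid link, the weight $\tilde{g}(x)$ simplifies to $g_{\log}(x)\,(1 - g_{\log}(x))$.

```python
# Hypothetical sketch of Eq. (B.1); function and variable names are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_tilde(x):
    # (g'_log / g_log)^2 - g''_log / g_log evaluates to sigmoid(x) * (1 - sigmoid(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def sampling_covariance(X, y, r_hats, lam, beta_k, link="logistic"):
    """X: (k-1, d) rows x_i; y: (k-1,) labels; r_hats: (k-1, d) per-iteration
    estimates r_hat_i; lam: prior scale lambda; beta_k: the scalar beta_k(delta).
    Returns beta_k^2 * inverse of M_tilde_k."""
    d = X.shape[1]
    M_tilde = lam * np.eye(d)
    for i in range(X.shape[0]):
        w = 1.0 if link == "linear" else g_tilde(2.0 * y[i] * (X[i] @ r_hats[i]))
        M_tilde += w * np.outer(X[i], X[i])
    return beta_k**2 * np.linalg.inv(M_tilde)
```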
Finally, notation is defined for the eigenvectors and eigenvalues of the matrices $M_k$ and $\tilde{M}_k$:
Definition 7 (Eigenvalue notation). Let $\lambda_i^{(k)}$ refer to the $i$th-largest eigenvalue of $M_k$, and $\boldsymbol{v}_i^{(k)}$ denote its corresponding eigenvector. Similarly, let $\tilde{\lambda}_i^{(k)}$ refer to the $i$th-largest eigenvalue of $\tilde{M}_k$, and $\tilde{\boldsymbol{v}}_i^{(k)}$ denote its corresponding eigenvector. Note that $M_k^{-1}$ also has eigenvectors $\boldsymbol{v}_i^{(k)}$, with corresponding eigenvalues $\frac{1}{\lambda_i^{(k)}}$. Because $M_k$ is positive definite, the eigenvectors $\{\boldsymbol{v}_i^{(k)}\}$ form an orthonormal basis, and $\lambda_i^{(k)} > 0$ for all $i, k$. The equivalent statements also hold for $\tilde{M}_k$, which is also positive definite because $\tilde{g}(x) > 0$ for all possible inputs.
B.1 Facts about Convergence in Distribution
Before proceeding with the asymptotic consistency proofs, two facts about conver- gence in distribution are reviewed; these will be applied later.
Recall that for a random variable $X$ and a sequence of random variables $(X_n)$, $n \in \mathbb{N}$, $X_n \xrightarrow{\,D\,} X$ denotes that $X_n$ converges to $X$ in distribution, while $X_n \xrightarrow{\,p\,} X$ denotes that $X_n$ converges to $X$ in probability.
Fact 8 (Billingsley, 1968). For random variables $X, X_n \in \mathbb{R}^d$, where $n \in \mathbb{N}$, and any continuous function $f: \mathbb{R}^d \longrightarrow \mathbb{R}$, if $X_n \xrightarrow{\,D\,} X$, then $f(X_n) \xrightarrow{\,D\,} f(X)$.
Fact 9 (Billingsley, 1968). For random variables $X_n \in \mathbb{R}^d$, $n \in \mathbb{N}$, and a constant vector $\boldsymbol{b} \in \mathbb{R}^d$, $X_n \xrightarrow{\,D\,} \boldsymbol{b}$ is equivalent to $X_n \xrightarrow{\,p\,} \boldsymbol{b}$. Convergence in probability means that for any $\epsilon > 0$, $P\!\left(\|X_n - \boldsymbol{b}\|_2 \geq \epsilon\right) \longrightarrow 0$ as $n \longrightarrow \infty$.
B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference-