Chapter IV: Dueling Posterior Sampling for Preference-Based Bandits and
4.3 Posterior Modeling for Utility Inference and Credit Assignment
concepts of a policy and a trajectory both reduce to the selected action. Thirdly, each action is associated with a $d$-dimensional vector, rather than with just an index in $\{1, \ldots, A\}$, as in tabular RL. While in the RL case, the utility vector $r$ denotes the utility of each state-action pair, in the generalized linear bandit case, it specifies a linear weight on each action space dimension. With these modifications, DPS is outlined in Algorithms 8-10.
Let $X_i \in \mathbb{R}^{(i-1) \times d}$ be the observation matrix after $i-1$ preferences, in which the $k$th row contains observation $x_k = x_{k2} - x_{k1}$. Furthermore, define $\mathbf{y}_i \in \mathbb{R}^{i-1}$ as the vector of corresponding preference labels, with its $k$th element equal to $y_k \in \{-\frac{1}{2}, \frac{1}{2}\}$. The following two subsections discuss Bayesian linear regression and Bayesian logistic regression approaches to utility inference. In addition, Appendix A discusses Gaussian process methods for posterior inference of the utilities.
Remark 4. The posterior inference methods discussed in this section also apply to the multi-dueling setting (see Section 2.5), where in each iteration, more than 2 actions or policies can be sampled to yield multiple pairwise preferences. Posterior inference simply translates a preference dataset to a posterior distribution; this procedure is agnostic to the number of preference queries per iteration.
Utility Inference via Bayesian Linear Regression
Several variants of Bayesian linear regression could be performed to infer the utilities $r$. Firstly, in standard Bayesian linear regression, one defines a Gaussian prior over the reward vector $r \in \mathbb{R}^d$: $r \sim \mathcal{N}\left(0, \lambda_0^{-1} I\right)$. The likelihood of the data conditioned upon $r$ is also Gaussian:
$$p(\mathbf{y}_i \mid X_i, r; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{(i-1)/2}} \exp\left(-\frac{1}{2\sigma^2}\left\|\mathbf{y}_i - X_i r\right\|^2\right).$$
This conjugate Gaussian prior and likelihood pair results in the following closed-form Gaussian posterior:
$$r \mid X_i, \mathbf{y}_i, \sigma^2, \lambda_0 \sim \mathcal{N}\left(\left(X_i^T X_i + \sigma^2 \lambda_0 I\right)^{-1} X_i^T \mathbf{y}_i,\; \sigma^2 \left(X_i^T X_i + \sigma^2 \lambda_0 I\right)^{-1}\right).$$
Defining $\lambda := \sigma^2 \lambda_0$, this posterior can equivalently be written as:
$$r \mid \mathcal{D}_i, \sigma^2, \lambda \sim \mathcal{N}\left(\hat{r}_i,\; \sigma^2 M_i^{-1}\right), \qquad \text{where} \qquad (4.1)$$
$$M_i := \lambda I + \sum_{k=1}^{i-1} x_k x_k^T \;\;\text{for } i \geq 1, \qquad \text{and} \qquad \hat{r}_i := M_i^{-1} \sum_{k=1}^{i-1} y_k x_k. \qquad (4.2)$$
Note that $\hat{r}_i$ is the MAP estimate of $r$, as well as a ridge regression estimate of $r$.
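As a concrete illustration, the conjugate update in Eqs. (4.1)-(4.2) amounts to a few lines of linear algebra. The sketch below is our own minimal NumPy rendering (function and variable names are illustrative, not from the DPS implementation):

```python
import numpy as np

def linear_utility_posterior(X, y, sigma2=1.0, lam=1.0):
    """Conjugate Bayesian linear regression posterior over utilities r.

    X      : (n, d) matrix whose k-th row is x_k = x_{k2} - x_{k1}
    y      : (n,) preference labels in {-1/2, +1/2}
    sigma2 : Gaussian likelihood variance sigma^2
    lam    : regularization lambda = sigma^2 * lambda_0
    Returns the posterior mean r_hat (Eq. 4.2) and covariance
    sigma^2 * M_i^{-1} (Eq. 4.1).
    """
    d = X.shape[1]
    M = lam * np.eye(d) + X.T @ X            # M_i = lambda*I + sum_k x_k x_k^T
    r_hat = np.linalg.solve(M, X.T @ y)      # r_hat = M_i^{-1} sum_k y_k x_k
    return r_hat, sigma2 * np.linalg.inv(M)

# Two preference observations in d = 2
X = np.array([[1.0, -1.0], [0.5, 0.5]])
y = np.array([0.5, -0.5])
r_hat, cov = linear_utility_posterior(X, y)
# A posterior draw, as used by posterior sampling:
sample = np.random.default_rng(0).multivariate_normal(r_hat, cov)
```

Solving the linear system for the mean with `np.linalg.solve`, rather than forming $M_i^{-1}$ explicitly, is the standard numerically stable choice.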
In the linear bandit setting, several posterior sampling regret analyses (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) use a modification of this posterior, which scales the covariance matrix by a factor that increases logarithmically in $i$:
$$r \mid \mathcal{D}_i, \lambda \sim \mathcal{N}\left(\hat{r}_i,\; \beta_i(\delta)^2 M_i^{-1}\right), \qquad (4.3)$$
where $\hat{r}_i$ and $M_i^{-1}$ are defined in Eq. (4.2), and $\beta_i(\delta)$ is given by:
$$\beta_i(\delta) := R \sqrt{2 \log\left(\frac{\det(M_i)^{1/2} \lambda^{-d/2}}{\delta}\right)} + \sqrt{\lambda}\, S_r \;\leq\; R \sqrt{d \log\left(\frac{1 + \frac{i L^2}{\lambda}}{\delta}\right)} + \sqrt{\lambda}\, S_r, \qquad (4.4)$$
where $\delta \in (0,1)$ is a failure probability to be discussed shortly, and $L$ is an upper bound such that $\|x_k\|_2 \leq L$ for all $k$. Note that in the RL setting, $L \leq 2h$, since $\|x_k\|_2 = \|x_{k2} - x_{k1}\|_2 \leq \|x_{k2} - x_{k1}\|_1 \leq \|x_{k2}\|_1 + \|x_{k1}\|_1 = 2h$. Recall that $R \leq 1$ (see Section 3.3) due to the sub-Gaussianity of preference feedback, and that $\|r\|_2 \leq S_r$ (see Assumptions 1 and 6).
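For reference, the radius in Eq. (4.4) can be computed directly from $M_i$; below is a small helper we wrote for illustration (the defaults $R = S_r = 1$ reflect the bounds just recalled, and all names are our own):

```python
import numpy as np

def beta_linear(M, lam, delta, R=1.0, S_r=1.0):
    """Confidence radius beta_i(delta) of Eq. (4.4).

    M     : (d, d) matrix M_i = lam*I + sum_k x_k x_k^T
    lam   : regularization lambda > 0
    delta : failure probability in (0, 1)
    R     : sub-Gaussian noise parameter (R <= 1 for preferences)
    S_r   : upper bound on ||r||_2
    """
    d = M.shape[0]
    sign, logdet = np.linalg.slogdet(M)        # numerically stable log det(M_i)
    assert sign > 0.0, "M_i must be positive definite"
    log_term = 0.5 * logdet - 0.5 * d * np.log(lam) - np.log(delta)
    return R * np.sqrt(2.0 * log_term) + np.sqrt(lam) * S_r

# Covariance of the inflated distribution in Eq. (4.3):
M = 2.0 * np.eye(3)
beta = beta_linear(M, lam=1.0, delta=0.05)
cov_scaled = beta**2 * np.linalg.inv(M)
```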
In Agrawal and Goyal (2013) and Abeille and Lazaric (2017), the factor of $\beta_i(\delta)^2$ is leveraged to prove regret bounds for linear bandits with sub-Gaussian noise in the feedback. In the current preference-based problem, meanwhile, scaling by $\beta_i(\delta)^2$ will lead to an asymptotic consistency guarantee. This is shown in Section 4.4 by applying the following result from Abbasi-Yadkori, Pál, and Szepesvári (2011), which arises from a martingale concentration inequality. This proposition demonstrates that with high probability, $r$ lies within a confidence ellipsoid centered at $\hat{r}_i$:
Proposition 1 (Theorem 2 from Abbasi-Yadkori, Pál, and Szepesvári, 2011). Let $\{F_i\}_{i=0}^{\infty}$ be a filtration, and let $\{\eta_i\}_{i=1}^{\infty}$ be a real-valued stochastic process such that $\eta_i$ is both $F_i$-measurable and conditionally $R$-sub-Gaussian for some $R \geq 0$. Let $\{x_i\}$ be an $\mathbb{R}^d$-valued stochastic process such that $x_i$ is $F_{i-1}$-measurable. Define $y_i := x_i^T r + \eta_i$, and assume that $\|r\|_2 \leq S_r$ and $\|x_i\|_2 \leq L$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ and for all $i > 0$, $\|\hat{r}_i - r\|_{M_i} \leq \beta_i(\delta)$, where $\hat{r}_i$, $M_i$, and $\beta_i(\delta)$ are defined in Eqs. (4.2) and (4.4).
Importantly, this result does not require the noise in the labels $y_i$ to be Gaussian, but instead only assumes that the noise is sub-Gaussian. Furthermore, preference noise is sub-Gaussian, as discussed in Section 3.3.
Note that while the distribution in Eq. (4.1) is an exact, conjugate posterior given a Gaussian prior and likelihood, the scaled distribution in Eq. (4.3) is no longer an exact posterior, as it is not proportional to the product of a prior and likelihood. In fact, as discussed in Agrawal and Goyal (2013) and Abeille and Lazaric (2017), posterior sampling does not need to sample from an actual Bayesian posterior distribution;
rather, any distribution satisfying appropriate concentration and anti-concentration inequalities will achieve low regret.
Indeed, neither of the probability distributions in Eqs. (4.1) and (4.3) represents a true, exact posterior for modeling preference data, as binary preference feedback does not exhibit a Gaussian likelihood. Recall that $y_k \in \{-\frac{1}{2}, \frac{1}{2}\}$, such that $y_k = \frac{1}{2}$ if $x_{k2} \succ x_{k1}$ and $y_k = -\frac{1}{2}$ if $x_{k1} \succ x_{k2}$. The exact posterior takes the form:
$$P(r \mid \mathcal{D}_i) \;\propto\; p(r) \prod_{k=1}^{i-1} P(y_k \mid x_k, r) \;=\; p(r) \prod_{k=1}^{i-1} \left(2 y_k x_k^T r + \frac{1}{2}\right), \qquad (4.5)$$
where $p(r)$ is an appropriate prior for $r$ and $2 y_k x_k^T r \in (-0.5, 0.5)$ ensures that $P(y_k \mid x_k, r) \in [0, 1]$. To guarantee that $2 y_k x_k^T r \in (-0.5, 0.5)$, the prior $p(r)$ must have bounded support; possibilities include uniform and truncated Gaussian distributions.
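To make Eq. (4.5) concrete, the snippet below evaluates the unnormalized log posterior under a uniform prior on a small Euclidean ball, and pairs it with a bare-bones random-walk Metropolis sampler. The radius 0.4 and all names are our own illustrative choices, selected so that $2 y_k x_k^T r$ stays in $(-0.5, 0.5)$ whenever $\|x_k\|_2 \leq 1$:

```python
import numpy as np

def log_posterior(r, X, y, radius=0.4):
    """Unnormalized log of Eq. (4.5) with a uniform prior on ||r||_2 <= radius.

    The radius must be small enough that 2*y_k*x_k^T r lies in (-0.5, 0.5)
    for every observation, so each factor is a valid probability.
    """
    if np.linalg.norm(r) > radius:
        return -np.inf                        # outside the prior's support
    probs = 2.0 * y * (X @ r) + 0.5           # P(y_k | x_k, r) for each k
    if np.any(probs <= 0.0) or np.any(probs >= 1.0):
        return -np.inf
    return np.sum(np.log(probs))

def metropolis(X, y, n_steps=2000, step=0.05, seed=0):
    """Random-walk Metropolis sampler targeting Eq. (4.5)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(X.shape[1])
    lp = log_posterior(r, X, y)
    samples = []
    for _ in range(n_steps):
        prop = r + step * rng.standard_normal(r.shape)
        lp_prop = log_posterior(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject step
            r, lp = prop, lp_prop
        samples.append(r.copy())
    return np.array(samples)

# Example usage on two synthetic preferences in d = 2
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0.5, 0.5])
samples = metropolis(X, y)
```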
While the posterior in Eq. (4.5) lacks a closed-form representation and is therefore intractable, one can sample from it via approximate posterior inference techniques such as the Laplace approximation (Section 2.1) or Markov chain Monte Carlo sampling. The Laplace approximation is only valid when the Hessian matrix of the negative log posterior is positive definite. In this case, the negative log posterior is:
$$-\log P(r \mid \mathcal{D}_i) = -\log p(r) - \sum_{k=1}^{i-1} \log\left(2 y_k x_k^T r + \frac{1}{2}\right) + C,$$
where $C$ is a constant. Its Hessian matrix is:
$$H := \nabla_r^2\left[-\log P(r \mid \mathcal{D}_i)\right] = \nabla_r^2\left[-\log p(r)\right] + \sum_{k=1}^{i-1} \frac{x_k x_k^T}{\left(2 y_k x_k^T r + 0.5\right)^2}.$$
Clearly, the terms containing $x_k x_k^T$ are positive semidefinite. Thus, $H \succ 0$ holds, provided that the selected prior satisfies $\nabla_r^2\left[-\log p(r)\right] \succ 0$. Under such a prior, the Laplace approximation is given by:
$$P(r \mid \mathcal{D}_i) \approx \mathcal{N}\left(\hat{r}_i,\; H^{-1}\Big|_{r = \hat{r}_i}\right),$$
where $\hat{r}_i = \operatorname*{argmin}_r\left[-\log P(r \mid \mathcal{D}_i)\right]$ and $H = \nabla_r^2\left[-\log P(r \mid \mathcal{D}_i)\right]$.

Utility Inference via Bayesian Logistic Regression
One can also infer the utilities $r \in \mathbb{R}^d$ via Bayesian logistic regression, especially if a logistic link function is thought to govern user preferences. Typically, one sets a Gaussian prior over the utilities $r$: $r \sim \mathcal{N}(0, \lambda^{-1} I)$, where $\lambda > 0$. As in the linear case, the outcomes are $y_k \in \{-\frac{1}{2}, \frac{1}{2}\}$. Then, the logistic regression likelihood is:
$$P(\mathcal{D}_i \mid r) = \prod_{k=1}^{i-1} P(y_k \mid r, x_k) = \prod_{k=1}^{i-1} \frac{1}{1 + \exp\left(-2 y_k x_k^T r\right)}.$$
The posterior, $P(r \mid \mathcal{D}_i) \propto P(\mathcal{D}_i \mid r) p(r)$, can be approximated as Gaussian via the Laplace approximation:
$$P(r \mid \mathcal{D}_i) \approx \mathcal{N}\left(\hat{r}_i^{\text{MAP}},\; \Sigma_i^{\text{MAP}}\right), \qquad \text{where:} \qquad (4.6)$$
$$\hat{r}_i^{\text{MAP}} = \operatorname*{argmin}_{r} f(r), \qquad f(r) := -\log P(\mathcal{D}_i, r) = -\log p(r) - \log P(\mathcal{D}_i \mid r), \qquad (4.7)$$
$$\Sigma_i^{\text{MAP}} = \left(\nabla_r^2 f(r)\,\Big|_{\hat{r}_i^{\text{MAP}}}\right)^{-1}, \qquad (4.8)$$
and where the optimization problem in Eq. (4.7) is convex: similarly to the linear case, one can show that the Hessian matrix of $f(r)$, $\nabla_r^2 f(r)$, is positive definite. Note that the Laplace approximation's inverse posterior covariance, $(\Sigma_i^{\text{MAP}})^{-1}$, is given by:
$$(\Sigma_i^{\text{MAP}})^{-1} = \lambda I + \sum_{k=1}^{i-1} \tilde{\sigma}\left(2 y_k x_k^T \hat{r}_i^{\text{MAP}}\right) x_k x_k^T, \qquad (4.9)$$
where $\tilde{\sigma}(x) := \left(\frac{\sigma_{\log}'(x)}{\sigma_{\log}(x)}\right)^2 - \frac{\sigma_{\log}''(x)}{\sigma_{\log}(x)}$, and $\sigma_{\log}$ is the sigmoid function.
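As an illustration, the MAP problem in Eq. (4.7) can be solved with Newton's method, after which Eqs. (4.8)-(4.9) give the covariance in closed form; for the logistic link, $\tilde{\sigma}(z)$ simplifies to $\sigma_{\log}'(z) = \sigma_{\log}(z)(1 - \sigma_{\log}(z))$. The sketch below is our own minimal version (names are illustrative, not from a reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, lam=1.0, n_newton=25):
    """Laplace approximation (Eqs. 4.6-4.9) for Bayesian logistic preference regression.

    X : (n, d) rows x_k = x_{k2} - x_{k1};  y : (n,) labels in {-1/2, +1/2}.
    Prior: r ~ N(0, lam^{-1} I).  Likelihood: P(y_k | r, x_k) = sigmoid(2 y_k x_k^T r).
    """
    n, d = X.shape
    r = np.zeros(d)
    for _ in range(n_newton):                  # Newton's method on f(r), Eq. (4.7)
        z = 2.0 * y * (X @ r)
        p = sigmoid(z)
        grad = lam * r - X.T @ (2.0 * y * (1.0 - p))
        W = p * (1.0 - p)                      # sigma_tilde(z) = sigmoid'(z)
        H = lam * np.eye(d) + X.T @ (W[:, None] * X)
        r = r - np.linalg.solve(H, grad)
    cov = np.linalg.inv(H)                     # Sigma_i^MAP, Eqs. (4.8)-(4.9)
    return r, cov

# Example usage on three synthetic preferences in d = 2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.5, 0.5, -0.5])
r_map, cov = laplace_logistic(X, y)
```

Since $f(r)$ is strongly convex here, a handful of Newton steps suffices to drive the gradient to numerical zero.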
The posterior $P(r \mid \mathcal{D}_i)$ can also be estimated using other posterior approximation methods, for instance Markov chain Monte Carlo.
Filippi et al. (2010) prove that for logistic regression, the true utilities $r$ lie within a confidence ellipsoid centered at the estimate $\hat{r}_i$ with high probability, where $\hat{r}_i$ is the maximum quasi-likelihood estimator of $r$, projected onto the domain $\Theta$:
$$\hat{r}_i = \operatorname*{argmin}_{r \in \Theta} \left\|\sum_{k=1}^{i-1} \left[\sigma_{\log}(x_k^T r) - \left(y_k + \frac{1}{2}\right)\right] x_k\right\|_{M_i^{-1}}, \qquad (4.10)$$
where $M_i$ is as defined in Eq. (4.2) and $\sigma_{\log}$ is the sigmoidal function, $\sigma_{\log}(x) = (1 + e^{-x})^{-1}$.¹
Proposition 2 (Proposition 1 from Filippi et al., 2010). Let $\{F_i\}_{i=0}^{\infty}$ be a filtration, and let $\{\eta_i\}_{i=1}^{\infty}$ be a real-valued stochastic process such that $\eta_i$ is $F_i$-measurable and conditionally $R$-sub-Gaussian for some $R \geq 0$. Let $\{x_i\}$ be an $\mathbb{R}^d$-valued stochastic process such that $x_i$ is $F_{i-1}$-measurable. The observations $y_i \in \{-\frac{1}{2}, \frac{1}{2}\}$ are given by $y_i = \sigma_{\log}(x_i^T r) + \eta_i$, where $\sigma_{\log}$ is a sigmoidal function. Assume that $\|x_i\|_2 \leq L$, that $k'_{\min}$ is a positive lower bound on the slope of $\sigma_{\log}$, and that $\lambda > 0$ lower-bounds the minimum eigenvalue of $M_i$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ and for all $i > 0$, $\|\hat{r}_i - r\|_{M_i} \leq \beta_i(\delta)$, where:
$$\beta_i(\delta) = \frac{2R}{k'_{\min}} \sqrt{3 + 2 \log\left(1 + \frac{2L^2}{\lambda}\right)} \sqrt{2 d \log i} \sqrt{\log \frac{i}{\delta}}. \qquad (4.11)$$

¹The equation $\sum_{k=1}^{i-1} \left[\sigma_{\log}(x_k^T r) - \left(y_k + \frac{1}{2}\right)\right] x_k = 0$ is known as the likelihood equation, and its unique solution is the maximum likelihood estimate of the logistic regression problem. This fact is derived, for instance, on page 2 of Gourieroux and Monfort (1981). Because the typical form of the likelihood equation assumes that the labels belong to $\{0, 1\}$, Eq. (4.10) adds 0.5 to the labels in order to use this standard form, mirroring Filippi et al. (2010).
Recall from Eq. (4.2) that $\lambda > 0$ indeed lower-bounds $M_i$'s minimum eigenvalue.
Remark 5. In fact, the result in Filippi et al. (2010) applies not only to logistic regression on binary labels, but also to any generalized linear model. Generalized linear models are a well-known statistical framework in which outcomes are generated from distributions belonging to the exponential family, and are further described in McCullagh and Nelder (1989).
Proof. The stated result is in fact a slightly-adapted version of Proposition 1 from Filippi et al. (2010), which actually derives that $|\mu(x_i^T r) - \mu(x_i^T \hat{r}_i)| \leq \beta_i'(\delta) \|x_i\|_{M_i^{-1}}$ for all $i \geq 1$ and vectors $x_i$ with probability $1 - \delta$. The constant $\beta_i'(\delta)$ is related to $\beta_i(\delta)$ (defined above) by $\beta_i'(\delta) = \frac{R_{\max}}{R} L_\mu \beta_i(\delta)$, where $L_\mu$ is the Lipschitz constant of $\mu$ (denoted by $k_\mu$ in Filippi et al., 2010) and where each reward lies in $[0, R_{\max}]$ in the setting described in Filippi et al. (2010). To prove their result, Filippi et al. (2010) use that $|\mu(x_i^T r) - \mu(x_i^T \hat{r}_i)| \leq L_\mu |x_i^T r - x_i^T \hat{r}_i|$ by definition of Lipschitz continuity, and then show that $|x_i^T r - x_i^T \hat{r}_i| \leq \frac{1}{L_\mu} \beta_i'(\delta) \|x_i\|_{M_i^{-1}}$ to obtain the result. Therefore, as a byproduct, their analysis also proves that:
$$|x_i^T r - x_i^T \hat{r}_i| \;\leq\; \frac{1}{L_\mu} \beta_i'(\delta) \|x_i\|_{M_i^{-1}} \;\stackrel{(a)}{=}\; \frac{R_{\max}}{R} \beta_i(\delta) \|x_i\|_{M_i^{-1}},$$
where (a) comes from substituting the relationship between $\beta_i(\delta)$ and $\beta_i'(\delta)$.
Also, the analysis in Filippi et al. (2010) begins with quantities in terms of $R$ instead of $R_{\max}$, and then replaces $R$ by $R_{\max}$ by using that $R \leq R_{\max}$ (which holds in their setting). Therefore, their result is still valid when reverting from $R_{\max}$ to $R$, so that:
$$|x_i^T r - x_i^T \hat{r}_i| \leq \beta_i(\delta) \|x_i\|_{M_i^{-1}}.$$
To adapt this result to the desired form, it suffices to show that the following two statements are equivalent for any $r_1, r_2 \in \mathbb{R}^d$, positive definite $M \in \mathbb{R}^{d \times d}$, and $\beta > 0$: 1) $|x^T (r_1 - r_2)| \leq \beta \|x\|_{M^{-1}}$ for all $x \in \mathbb{R}^d$, and 2) $\|r_1 - r_2\|_M \leq \beta$.
Firstly, if 1) holds, then in particular, it holds when setting $x = M(r_1 - r_2)$. Substituting this definition of $x$ into 1) yields: $|(r_1 - r_2)^T M (r_1 - r_2)| \leq \beta \|M(r_1 - r_2)\|_{M^{-1}}$. Now, the left-hand side is $|(r_1 - r_2)^T M (r_1 - r_2)| = \|r_1 - r_2\|_M^2$, while on the right-hand side, $\|M(r_1 - r_2)\|_{M^{-1}} = \sqrt{(r_1 - r_2)^T M M^{-1} M (r_1 - r_2)} = \|r_1 - r_2\|_M$. Thus, the statement is equivalent to $\|r_1 - r_2\|_M^2 \leq \beta \|r_1 - r_2\|_M$, and therefore $\|r_1 - r_2\|_M \leq \beta$, as desired.
Secondly, if 2) holds, then for any $x \in \mathbb{R}^d$,
$$|x^T (r_1 - r_2)| = |x^T M^{-1} M (r_1 - r_2)| = \left|\langle x,\, M(r_1 - r_2) \rangle_{M^{-1}}\right| \stackrel{(a)}{\leq} \|x\|_{M^{-1}} \|M(r_1 - r_2)\|_{M^{-1}} = \|x\|_{M^{-1}} \sqrt{(r_1 - r_2)^T M M^{-1} M (r_1 - r_2)} = \|x\|_{M^{-1}} \|r_1 - r_2\|_M \stackrel{(b)}{\leq} \beta \|x\|_{M^{-1}},$$
where (a) applies the Cauchy-Schwarz inequality, while (b) uses 2).
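This equivalence is also easy to sanity-check numerically; the snippet below (purely our own illustration) verifies statement 1) on random directions given 2), with equality attained at the maximizing direction $x = M(r_1 - r_2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
M = A @ A.T + d * np.eye(d)          # random positive definite matrix
M_inv = np.linalg.inv(M)

r1, r2 = rng.standard_normal(d), rng.standard_normal(d)
diff = r1 - r2
norm_M = np.sqrt(diff @ M @ diff)    # ||r1 - r2||_M
beta = norm_M                        # tightest beta satisfying statement 2)

# Statement 1): |x^T (r1 - r2)| <= beta * ||x||_{M^{-1}} for all x.
for _ in range(1000):
    x = rng.standard_normal(d)
    assert abs(x @ diff) <= beta * np.sqrt(x @ M_inv @ x) + 1e-9

# Equality holds at the maximizing direction x* = M (r1 - r2):
x_star = M @ diff
lhs = abs(x_star @ diff)
rhs = beta * np.sqrt(x_star @ M_inv @ x_star)
assert np.isclose(lhs, rhs)
```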
This theoretical guarantee motivates the following modification to the Laplace-approximation posterior, which will be useful for guaranteeing asymptotic consistency:
$$P(r \mid \mathcal{D}_i) \approx \mathcal{N}\left(\hat{r}_i,\; \beta_i(\delta)^2 (M_i')^{-1}\right), \qquad (4.12)$$
where $\hat{r}_i$ and $\beta_i(\delta)$ are respectively defined in Eqs. (4.10) and (4.11), and $M_i'$ is given analogously to the inverse covariance $(\Sigma_i^{\text{MAP}})^{-1}$ of Eqs. (4.6) and (4.9), by simply replacing $\hat{r}_i^{\text{MAP}}$ with $\hat{r}_i$ in the definition in Eq. (4.9):
$$M_i' = \lambda I + \sum_{k=1}^{i-1} \tilde{\sigma}\left(2 y_k x_k^T \hat{r}_i\right) x_k x_k^T,$$
where $\tilde{\sigma}(x) := \left(\frac{\sigma_{\log}'(x)}{\sigma_{\log}(x)}\right)^2 - \frac{\sigma_{\log}''(x)}{\sigma_{\log}(x)}$, and $\sigma_{\log}$ is the sigmoid function.
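Given an estimate $\hat{r}_i$ from Eq. (4.10) (computed by any suitable solver) and a radius $\beta_i(\delta)$ from Eq. (4.11), drawing a utility sample from Eq. (4.12) is then straightforward. The following is a hedged sketch with our own names; for the logistic link, $\tilde{\sigma}(z)$ reduces to $\sigma_{\log}'(z)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_scaled_posterior(X, y, r_hat, lam, beta, rng):
    """Draw from N(r_hat, beta^2 * (M'_i)^{-1}) of Eq. (4.12).

    r_hat : estimate from Eq. (4.10), computed elsewhere
    M'_i  : lam*I + sum_k sigma_tilde(2 y_k x_k^T r_hat) x_k x_k^T,
            with sigma_tilde(z) = sigmoid'(z) for the logistic link.
    """
    z = 2.0 * y * (X @ r_hat)
    W = sigmoid(z) * (1.0 - sigmoid(z))            # sigma_tilde evaluated at r_hat
    M_prime = lam * np.eye(X.shape[1]) + X.T @ (W[:, None] * X)
    cov = beta**2 * np.linalg.inv(M_prime)
    return rng.multivariate_normal(r_hat, cov)

# Example draw with a placeholder estimate r_hat = 0 in d = 2
rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0.5, -0.5])
draw = sample_scaled_posterior(X, y, r_hat=np.zeros(2), lam=1.0, beta=2.0, rng=rng)
```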