

Chapter IV: Dueling Posterior Sampling for Preference-Based Bandits and

4.3 Posterior Modeling for Utility Inference and Credit Assignment

concepts of a policy and a trajectory both reduce to the selected action. Thirdly, each action is associated with a $d$-dimensional vector, rather than with just an index in $\{1, \ldots, A\}$, as in tabular RL. While in the RL case, the utility vector $\boldsymbol{r}$ denotes the utility of each state–action pair, in the generalized linear bandit case, it specifies a linear weight on each action space dimension. With these modifications, DPS is outlined in Algorithms 8–10.

Let $X_n \in \mathbb{R}^{(n-1) \times d}$ be the observation matrix after $n-1$ preferences, in which the $i$th row contains observation $\boldsymbol{x}_i = \boldsymbol{x}_{i2} - \boldsymbol{x}_{i1}$. Furthermore, define $\boldsymbol{y}_n \in \mathbb{R}^{n-1}$ as the vector of corresponding preference labels, with its $i$th element equal to $y_i \in \left\{-\tfrac{1}{2}, \tfrac{1}{2}\right\}$. The following two subsections discuss Bayesian linear regression and Bayesian logistic regression approaches to utility inference. In addition, Appendix A discusses Gaussian process methods for posterior inference of the utilities.

Remark 4. The posterior inference methods discussed in this section also apply to the multi-dueling setting (see Section 2.5), where in each iteration, more than 2 actions or policies can be sampled to yield multiple pairwise preferences. Posterior inference simply translates a preference dataset to a posterior distribution; this procedure is agnostic to the number of preference queries per iteration.

Utility Inference via Bayesian Linear Regression

Several different variants of Bayesian linear regression could be performed to infer the utilities, $\boldsymbol{r}$. Firstly, in standard Bayesian linear regression, one defines a Gaussian prior over the reward vector $\boldsymbol{r} \in \mathbb{R}^d$: $\boldsymbol{r} \sim \mathcal{N}\left(\boldsymbol{0}, \lambda_0^{-1} I\right)$. The likelihood of the data conditioned upon $\boldsymbol{r}$ is also Gaussian:

$$
p(\boldsymbol{y}_n \mid X_n, \boldsymbol{r}; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{n-1}{2}}} \exp\left( -\frac{1}{2\sigma^2} \lVert \boldsymbol{y}_n - X_n \boldsymbol{r} \rVert^2 \right).
$$

This conjugate Gaussian prior–likelihood pair results in the following closed-form Gaussian posterior:

𝒓|𝑋𝑛,π’šπ‘›, 𝜎2, πœ†0∼ N (𝑋𝑇

𝑛𝑋𝑛+𝜎2πœ†0𝐼)βˆ’1𝑋𝑇

π‘›π’šπ‘›, 𝜎2(𝑋𝑇

𝑛𝑋𝑛+𝜎2πœ†0𝐼)βˆ’1 . Definingπœ†:=𝜎2πœ†0, this posterior can equivalently be written as:

𝒓|D𝑛, 𝜎2, πœ† ∼ N

𝒓ˆ𝑛, 𝜎2π‘€βˆ’1

𝑛

, where (4.1)

𝑀𝑛 :=πœ† 𝐼+

π‘›βˆ’1

Γ•

𝑖=1

𝒙𝑖𝒙𝑇𝑖 forπœ† β‰₯ 1, and𝒓ˆ𝑛:= π‘€βˆ’1

𝑛 π‘›βˆ’1

Γ•

𝑖=1

𝑦𝑖𝒙𝑖. (4.2) Note that𝒓ˆ𝑛is the MAP estimate of 𝒓, as well as a ridge regression estimate of𝒓.
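As a concrete illustration, the posterior in Eqs. (4.1)–(4.2) takes only a few lines of NumPy. The sketch below uses synthetic duel data and illustrative hyperparameters (the function name and data are not from the source); it forms $M_n$ and $\hat{\boldsymbol{r}}_n$ and draws one posterior sample, as DPS would:

```python
import numpy as np

def linear_posterior(X, y, sigma2=1.0, lam=1.0):
    """Gaussian posterior N(r_hat_n, sigma2 * M_n^{-1}) of Eqs. (4.1)-(4.2).

    X: (n-1, d) matrix with rows x_i = x_{i2} - x_{i1};
    y: (n-1,) preference labels in {-1/2, +1/2}.
    """
    d = X.shape[1]
    M = lam * np.eye(d) + X.T @ X        # M_n = lambda*I + sum_i x_i x_i^T
    r_hat = np.linalg.solve(M, X.T @ y)  # r_hat_n = M_n^{-1} sum_i y_i x_i
    return r_hat, sigma2 * np.linalg.inv(M)

# Illustrative synthetic preferences generated from a hidden utility vector:
rng = np.random.default_rng(0)
r_true = np.array([1.0, -0.5])
X = rng.normal(size=(500, 2))
y = np.where(X @ r_true + 0.1 * rng.normal(size=500) > 0, 0.5, -0.5)
r_hat, cov = linear_posterior(X, y)
r_sample = rng.multivariate_normal(r_hat, cov)  # a DPS-style posterior draw
```

With enough preferences, the ridge estimate $\hat{\boldsymbol{r}}_n$ aligns with the direction of the hidden utility vector, although the $\pm\tfrac{1}{2}$ labels mean its scale is not identified.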

In the linear bandit setting, several posterior sampling regret analyses (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) use a modification of this posterior, which scales the covariance matrix by a factor that increases logarithmically in $n$:

𝒓|D𝑛, πœ†βˆΌ N Λ†

𝒓𝑛, 𝛽𝑛(𝛿)2π‘€βˆ’1

𝑛

, (4.3)

where𝒓ˆ𝑛andπ‘€βˆ’1

𝑛 are defined in Eq. (4.2), and𝛽𝑛(𝛿)is given by:

𝛽𝑛(𝛿) := 𝑅 s

2 log

det(𝑀𝑛)1/2πœ†βˆ’π‘‘/2 𝛿

+√

πœ†π‘†π‘Ÿ ≀ 𝑅 vu t

𝑑log 1+ 𝐿2𝑛

π‘‘πœ†

𝛿

! +√

πœ†π‘†π‘Ÿ, (4.4) where𝛿 ∈ (0,1) is a failure probability to be discussed shortly, and 𝐿 is an upper bound such that ||𝒙𝑛||2 ≀ 𝐿 for all 𝑛. Note that in the RL setting, 𝐿 ≀ 2β„Ž, since

||𝒙𝑛||2 = ||𝒙𝑛2 βˆ’ 𝒙𝑛1||2 ≀ ||𝒙𝑛2 βˆ’ 𝒙𝑛1||1 ≀ ||𝒙𝑛2||1 + ||𝒙𝑛1||1 = 2β„Ž. Recall that 𝑅 ≀ 1 (see Section 3.3) due to the sub-Gaussianity of preference feedback, and that

||𝒓||2 ≀ π‘†π‘Ÿ (see Assumptions 1 and 6).

In Agrawal and Goyal (2013) and Abeille and Lazaric (2017), the factor of $\beta_n(\delta)^2$ is leveraged to prove regret bounds for linear bandits with sub-Gaussian noise in the feedback. In the current preference-based problem, meanwhile, scaling by $\beta_n(\delta)^2$ will lead to an asymptotic consistency guarantee. This is shown in Section 4.4 by applying the following result from Abbasi-Yadkori, Pál, and Szepesvári (2011), which arises from a martingale concentration inequality. This proposition demonstrates that with high probability, $\boldsymbol{r}$ lies within a confidence ellipsoid centered at $\hat{\boldsymbol{r}}_n$:

Proposition 1 (Theorem 2 from Abbasi-Yadkori, Pál, and Szepesvári, 2011). Let $\{F_i\}_{i=0}^{\infty}$ be a filtration, and let $\{\eta_i\}_{i=1}^{\infty}$ be a real-valued stochastic process such that $\eta_i$ is both $F_i$-measurable and conditionally $R$-sub-Gaussian for some $R \ge 0$. Let $\{\boldsymbol{x}_i\}$ be an $\mathbb{R}^d$-valued stochastic process such that $\boldsymbol{x}_i$ is $F_{i-1}$-measurable. Define $y_i := \boldsymbol{x}_i^T \boldsymbol{r} + \eta_i$, and assume that $\lVert \boldsymbol{r} \rVert_2 \le S_r$ and $\lVert \boldsymbol{x}_i \rVert_2 \le L$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ and for all $i > 0$, $\lVert \hat{\boldsymbol{r}}_i - \boldsymbol{r} \rVert_{M_i} \le \beta_i(\delta)$, where $\hat{\boldsymbol{r}}_i$, $M_i$, and $\beta_i(\delta)$ are defined in Eqs. (4.2) and (4.4).

Importantly, this result does not require the noise in the labels $y_i$ to be Gaussian, but instead only assumes that the noise is sub-Gaussian. Furthermore, preference noise is sub-Gaussian, as discussed in Section 3.3.

Note that while the distribution in Eq. (4.1) is an exact, conjugate posterior given a Gaussian prior and likelihood, the scaled distribution in Eq. (4.3) is no longer an exact posterior, as it is not proportional to the product of a prior and likelihood. In fact, as discussed in Agrawal and Goyal (2013) and Abeille and Lazaric (2017), posterior sampling does not need to sample from an actual Bayesian posterior distribution; rather, any distribution satisfying appropriate concentration and anti-concentration inequalities will achieve low regret.

In fact, neither of the probability distributions in Eqs. (4.1) or (4.3) represents a true, exact posterior for modeling preference data, as binary preference feedback does not exhibit a Gaussian likelihood. Recall that $y_i \in \left\{-\tfrac{1}{2}, \tfrac{1}{2}\right\}$, such that $y_i = \tfrac{1}{2}$ if $\boldsymbol{x}_{i2} \succ \boldsymbol{x}_{i1}$ and $y_i = -\tfrac{1}{2}$ if $\boldsymbol{x}_{i1} \succ \boldsymbol{x}_{i2}$. The exact posterior takes the form:

$$
p(\boldsymbol{r} \mid \mathcal{D}_n) \propto p(\boldsymbol{r}) \prod_{i=1}^{n-1} P(y_i \mid \boldsymbol{x}_i, \boldsymbol{r}) = p(\boldsymbol{r}) \prod_{i=1}^{n-1} \left( 2 y_i \boldsymbol{r}^T \boldsymbol{x}_i + \frac{1}{2} \right), \tag{4.5}
$$

where $p(\boldsymbol{r})$ is an appropriate prior for $\boldsymbol{r}$, and $2 y_i \boldsymbol{r}^T \boldsymbol{x}_i \in (-0.5, 0.5)$ ensures that $P(y_i \mid \boldsymbol{x}_i, \boldsymbol{r}) \in [0, 1]$. To guarantee that $2 y_i \boldsymbol{r}^T \boldsymbol{x}_i \in (-0.5, 0.5)$, the prior $p(\boldsymbol{r})$ must have bounded support; possibilities include uniform and truncated Gaussian distributions.
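To make Eq. (4.5) concrete, the sketch below evaluates this posterior (up to normalization) under a uniform prior on a small ball, one of the bounded-support choices just mentioned, and draws samples from it with a random-walk Metropolis sampler. The ball radius, step size, and feature scaling are illustrative assumptions: features are assumed normalized so that $S \cdot \lVert \boldsymbol{x}_i \rVert_2 < \tfrac{1}{2}$, keeping every likelihood term in $(0, 1)$.

```python
import numpy as np

def log_posterior(r, X, y, S=0.2):
    """Unnormalized log of Eq. (4.5), with a uniform prior on ||r||_2 <= S."""
    if np.linalg.norm(r) > S:
        return -np.inf                    # outside the prior's support
    probs = 2.0 * y * (X @ r) + 0.5       # P(y_i | x_i, r)
    if np.any(probs <= 0.0):
        return -np.inf
    return np.sum(np.log(probs))

def metropolis_sample(X, y, n_steps=500, step=0.05, S=0.2, rng=None):
    """Random-walk Metropolis on the exact preference posterior."""
    if rng is None:
        rng = np.random.default_rng(0)
    r = np.zeros(X.shape[1])              # start at the prior's center
    lp = log_posterior(r, X, y, S)
    samples = []
    for _ in range(n_steps):
        prop = r + step * rng.normal(size=r.shape)
        lp_prop = log_posterior(prop, X, y, S)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept rule
            r, lp = prop, lp_prop
        samples.append(r.copy())
    return np.array(samples)
```

Because the prior has bounded support, every retained sample lies inside the ball $\lVert \boldsymbol{r} \rVert_2 \le S$.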

While the posterior in Eq. (4.5) lacks a closed-form representation and is therefore intractable, one can sample from it via approximate posterior inference techniques such as the Laplace approximation (Section 2.1) or Markov chain Monte Carlo sampling. The Laplace approximation is only valid when the Hessian matrix of the negative log posterior is positive definite. In this case, the negative log posterior is:

$$
-\log p(\boldsymbol{r} \mid \mathcal{D}_n) = -\log p(\boldsymbol{r}) - \sum_{i=1}^{n-1} \log\left( 2 y_i \boldsymbol{r}^T \boldsymbol{x}_i + \frac{1}{2} \right) + C,
$$

where $C$ is a constant. Its Hessian matrix is:

$$
H := \nabla_{\boldsymbol{r}}^2 \left[ -\log p(\boldsymbol{r} \mid \mathcal{D}_n) \right] = \nabla_{\boldsymbol{r}}^2 \left[ -\log p(\boldsymbol{r}) \right] + \sum_{i=1}^{n-1} \frac{\boldsymbol{x}_i \boldsymbol{x}_i^T}{\left( 2 y_i \boldsymbol{r}^T \boldsymbol{x}_i + 0.5 \right)^2}.
$$

Clearly, the terms containing $\boldsymbol{x}_i \boldsymbol{x}_i^T$ are positive semidefinite. Thus, $H \succ 0$ holds, provided that the selected prior satisfies $\nabla_{\boldsymbol{r}}^2 \left[ -\log p(\boldsymbol{r}) \right] \succ 0$. Under such a prior, the Laplace approximation is given by:

$$
p(\boldsymbol{r} \mid \mathcal{D}_n) \approx \mathcal{N}\left( \hat{\boldsymbol{r}}_n,\; H^{-1} \big|_{\boldsymbol{r} = \hat{\boldsymbol{r}}_n} \right),
$$

where $\hat{\boldsymbol{r}}_n = \operatorname{argmin}_{\boldsymbol{r}} \left[ -\log p(\boldsymbol{r} \mid \mathcal{D}_n) \right]$ and $H = \nabla_{\boldsymbol{r}}^2 \left[ -\log p(\boldsymbol{r} \mid \mathcal{D}_n) \right]$.

Utility Inference via Bayesian Logistic Regression

One can also infer the utilities $\boldsymbol{r} \in \mathbb{R}^d$ via Bayesian logistic regression, especially if a logistic link function is thought to govern user preferences. Typically, one sets a Gaussian prior over the utilities $\boldsymbol{r}$: $\boldsymbol{r} \sim \mathcal{N}\left(\boldsymbol{0}, \lambda^{-1} I\right)$, where $\lambda > 0$, so that the prior precision $\lambda I$ matches the first term of Eq. (4.9) below. As in the linear case, the outcomes are $y_i \in \left\{-\tfrac{1}{2}, \tfrac{1}{2}\right\}$. Then, the logistic regression likelihood is:

$$
p(\mathcal{D}_n \mid \boldsymbol{r}) = \prod_{i=1}^{n-1} p(y_i \mid \boldsymbol{r}, \boldsymbol{x}_i) = \prod_{i=1}^{n-1} \frac{1}{1 + \exp\left( -2 y_i \boldsymbol{x}_i^T \boldsymbol{r} \right)}.
$$

The posterior, $p(\boldsymbol{r} \mid \mathcal{D}_n) \propto p(\mathcal{D}_n \mid \boldsymbol{r})\, p(\boldsymbol{r})$, can be approximated as Gaussian via the Laplace approximation:

$$
p(\boldsymbol{r} \mid \mathcal{D}_n) \approx \mathcal{N}\left( \hat{\boldsymbol{r}}_n^{\text{MAP}},\; \Sigma_n^{\text{MAP}} \right), \text{ where:} \tag{4.6}
$$

$$
\hat{\boldsymbol{r}}_n^{\text{MAP}} = \underset{\boldsymbol{r}}{\operatorname{argmin}}\, f(\boldsymbol{r}), \quad f(\boldsymbol{r}) := -\log p(\mathcal{D}_n, \boldsymbol{r}) = -\log p(\boldsymbol{r}) - \log p(\mathcal{D}_n \mid \boldsymbol{r}), \tag{4.7}
$$

$$
\Sigma_n^{\text{MAP}} = \left( \nabla_{\boldsymbol{r}}^2 f(\boldsymbol{r}) \big|_{\hat{\boldsymbol{r}}_n^{\text{MAP}}} \right)^{-1}, \tag{4.8}
$$

and where the optimization problem in Eq. (4.7) is convex: similarly to the linear case, one can show that the Hessian matrix of $f(\boldsymbol{r})$, $\nabla_{\boldsymbol{r}}^2 f(\boldsymbol{r})$, is positive definite.

Note that the Laplace approximation's inverse posterior covariance, $\left( \Sigma_n^{\text{MAP}} \right)^{-1}$, is given by:

$$
\left( \Sigma_n^{\text{MAP}} \right)^{-1} = \lambda I + \sum_{i=1}^{n-1} \tilde{g}\left( 2 y_i \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_n^{\text{MAP}} \right) \boldsymbol{x}_i \boldsymbol{x}_i^T, \tag{4.9}
$$

where $\tilde{g}(x) := \left( \frac{g_{\log}'(x)}{g_{\log}(x)} \right)^2 - \frac{g_{\log}''(x)}{g_{\log}(x)}$, and $g_{\log}$ is the sigmoid function.

The posterior 𝑝(𝒓 | D𝑛) can also be estimated using other posterior approximation methods, for instance Markov chain Monte Carlo.

Filippi et al. (2010) prove that for logistic regression, the true utilities $\boldsymbol{r}$ lie within a confidence ellipsoid centered at the estimate $\hat{\boldsymbol{r}}_n$ with high probability, where $\hat{\boldsymbol{r}}_n$ is the maximum quasi-likelihood estimator of $\boldsymbol{r}$, projected onto the domain $\Theta$:

$$
\hat{\boldsymbol{r}}_n = \underset{\boldsymbol{r} \in \Theta}{\operatorname{argmin}} \left\lVert \sum_{i=1}^{n-1} \left[ g_{\log}\left( \boldsymbol{x}_i^T \boldsymbol{r} \right) \boldsymbol{x}_i - \left( y_i + \tfrac{1}{2} \right) \boldsymbol{x}_i \right] \right\rVert_{M_n^{-1}}, \tag{4.10}
$$

where $M_n$ is as defined in Eq. (4.2), and $g_{\log}$ is the sigmoidal function, $g_{\log}(x) = (1 + e^{-x})^{-1}$.¹

¹The equation $\sum_{i=1}^{n-1} \left[ g_{\log}(\boldsymbol{x}_i^T \boldsymbol{r}) \boldsymbol{x}_i - \left( y_i + \tfrac{1}{2} \right) \boldsymbol{x}_i \right] = \boldsymbol{0}$ is known as the likelihood equation, and its unique solution is the maximum likelihood estimate of the logistic regression problem. This fact is derived, for instance, on page 2 of Gourieroux and Monfort (1981). Because the typical form of the likelihood equation assumes that the labels belong to $\{0, 1\}$, Eq. (4.10) adds 0.5 to the labels in order to use this standard form, mirroring Filippi et al. (2010).

Proposition 2 (Proposition 1 from Filippi et al., 2010). Let $\{F_i\}_{i=0}^{\infty}$ be a filtration, and let $\{\eta_i\}_{i=1}^{\infty}$ be a real-valued stochastic process such that $\eta_i$ is $F_i$-measurable and conditionally $R$-sub-Gaussian for some $R \ge 0$. Let $\{\boldsymbol{x}_i\}$ be an $\mathbb{R}^d$-valued stochastic process such that $\boldsymbol{x}_i$ is $F_{i-1}$-measurable. The observations $y_i \in \left\{-\tfrac{1}{2}, \tfrac{1}{2}\right\}$ are given by $y_i = g_{\log}(\boldsymbol{x}_i^T \boldsymbol{r}) + \eta_i$, where $g_{\log}$ is a sigmoidal function. Assume that $\lVert \boldsymbol{x}_i \rVert_2 \le L$, that $g_{\min}'$ is a positive lower bound on the slope of $g_{\log}$, and that $\lambda > 0$ lower-bounds the minimum eigenvalue of $M_i$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ and for all $i > 0$, $\lVert \hat{\boldsymbol{r}}_i - \boldsymbol{r} \rVert_{M_i} \le \beta_i(\delta)$, where:

$$
\beta_i(\delta) = \frac{2R}{g_{\min}'} \sqrt{3 + 2 \log\left( 1 + \frac{2L^2}{\lambda} \right)} \, \sqrt{2 d \log i} \, \sqrt{\log \frac{d}{\delta}}. \tag{4.11}
$$

Recall from Eq. (4.2) that $\lambda > 0$ indeed lower-bounds $M_i$'s minimum eigenvalue.
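The bound in Eq. (4.11) is straightforward to evaluate; a small sketch with illustrative constants (in particular, the lower bound $g_{\min}'$ on the link slope is problem-dependent, and the value here is only a placeholder):

```python
import numpy as np

def beta_glm(i, d, R=1.0, g_min_prime=0.1, L=1.0, lam=1.0, delta=0.05):
    """Evaluate beta_i(delta) of Eq. (4.11); all constants illustrative."""
    return (2.0 * R / g_min_prime) \
        * np.sqrt(3.0 + 2.0 * np.log(1.0 + 2.0 * L ** 2 / lam)) \
        * np.sqrt(2.0 * d * np.log(i)) \
        * np.sqrt(np.log(d / delta))
```

As expected from the $\sqrt{\log i}$ factor, the confidence width grows slowly with the iteration count.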

Remark 5. In fact, the result in Filippi et al. (2010) applies not only to logistic regression on binary labels, but also to any generalized linear model. Generalized linear models are a well-known statistical framework in which outcomes are generated from distributions belonging to the exponential family, and are further described in McCullagh and Nelder (1989).

Proof. The stated result is in fact a slightly-adapted version of Proposition 1 from Filippi et al. (2010), which actually derives that $\left| g(\boldsymbol{x}_i^T \boldsymbol{r}) - g(\boldsymbol{x}_i^T \hat{\boldsymbol{r}}_i) \right| \le \beta_i'(\delta) \lVert \boldsymbol{x}_i \rVert_{M_i^{-1}}$ for all $i \ge 1$ and vectors $\boldsymbol{x}_i$, with probability $1 - \delta$. The constant $\beta_i'(\delta)$ is related to $\beta_i(\delta)$ (defined above) by $\beta_i'(\delta) = \frac{R_{\max}}{R} L_g \beta_i(\delta)$, where $L_g$ is the Lipschitz constant of $g$ (denoted by $k_\mu$ in Filippi et al., 2010), and where each reward lies in $[0, R_{\max}]$ in the setting described in Filippi et al. (2010). To prove their result, Filippi et al. (2010) use that $\left| g(\boldsymbol{x}_i^T \boldsymbol{r}) - g(\boldsymbol{x}_i^T \hat{\boldsymbol{r}}_i) \right| \le L_g \left| \boldsymbol{x}_i^T \boldsymbol{r} - \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_i \right|$ by definition of Lipschitz continuity, and then show that $\left| \boldsymbol{x}_i^T \boldsymbol{r} - \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_i \right| \le \frac{1}{L_g} \beta_i'(\delta) \lVert \boldsymbol{x}_i \rVert_{M_i^{-1}}$ to obtain the result.

Therefore, as a byproduct, their analysis also proves that:

$$
\left| \boldsymbol{x}_i^T \boldsymbol{r} - \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_i \right| \le \frac{1}{L_g} \beta_i'(\delta) \lVert \boldsymbol{x}_i \rVert_{M_i^{-1}} \overset{(a)}{=} \frac{R_{\max}}{R} \beta_i(\delta) \lVert \boldsymbol{x}_i \rVert_{M_i^{-1}},
$$

where (a) comes from substituting the relationship between $\beta_i(\delta)$ and $\beta_i'(\delta)$.

Also, the analysis in Filippi et al. (2010) begins with quantities in terms of $R$ instead of $R_{\max}$, and then replaces $R$ by $R_{\max}$ by using that $R \le R_{\max}$ (which holds in their setting). Therefore, their result is still valid when reverting from $R_{\max}$ to $R$, so that:

$$
\left| \boldsymbol{x}_i^T \boldsymbol{r} - \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_i \right| \le \beta_i(\delta) \lVert \boldsymbol{x}_i \rVert_{M_i^{-1}}.
$$

To adapt this result to the desired form, it suffices to show that the following two statements are equivalent for any $\boldsymbol{r}_1, \boldsymbol{r}_2 \in \mathbb{R}^d$, positive definite $M \in \mathbb{R}^{d \times d}$, and $\beta > 0$: 1) $\left| \boldsymbol{x}^T (\boldsymbol{r}_1 - \boldsymbol{r}_2) \right| \le \beta \lVert \boldsymbol{x} \rVert_{M^{-1}}$ for all $\boldsymbol{x} \in \mathbb{R}^d$, and 2) $\lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M \le \beta$.

Firstly, if 1) holds, then in particular, it holds when setting $\boldsymbol{x} = M(\boldsymbol{r}_1 - \boldsymbol{r}_2)$. Substituting this definition of $\boldsymbol{x}$ into 1) yields: $\left| (\boldsymbol{r}_1 - \boldsymbol{r}_2)^T M (\boldsymbol{r}_1 - \boldsymbol{r}_2) \right| \le \beta \lVert M(\boldsymbol{r}_1 - \boldsymbol{r}_2) \rVert_{M^{-1}}$. Now, the left-hand side is $\left| (\boldsymbol{r}_1 - \boldsymbol{r}_2)^T M (\boldsymbol{r}_1 - \boldsymbol{r}_2) \right| = \lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M^2$, while on the right-hand side, $\lVert M(\boldsymbol{r}_1 - \boldsymbol{r}_2) \rVert_{M^{-1}} = \sqrt{(\boldsymbol{r}_1 - \boldsymbol{r}_2)^T M M^{-1} M (\boldsymbol{r}_1 - \boldsymbol{r}_2)} = \lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M$. Thus, the statement is equivalent to $\lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M^2 \le \beta \lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M$, and therefore $\lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M \le \beta$, as desired.

Secondly, if 2) holds, then for any $\boldsymbol{x} \in \mathbb{R}^d$:

$$
\begin{aligned}
\left| \boldsymbol{x}^T (\boldsymbol{r}_1 - \boldsymbol{r}_2) \right| &= \left| \boldsymbol{x}^T M^{-1} M (\boldsymbol{r}_1 - \boldsymbol{r}_2) \right| = \left| \left\langle \boldsymbol{x},\, M(\boldsymbol{r}_1 - \boldsymbol{r}_2) \right\rangle_{M^{-1}} \right| \\
&\overset{(a)}{\le} \lVert \boldsymbol{x} \rVert_{M^{-1}} \lVert M(\boldsymbol{r}_1 - \boldsymbol{r}_2) \rVert_{M^{-1}} = \lVert \boldsymbol{x} \rVert_{M^{-1}} \sqrt{(\boldsymbol{r}_1 - \boldsymbol{r}_2)^T M M^{-1} M (\boldsymbol{r}_1 - \boldsymbol{r}_2)} \\
&= \lVert \boldsymbol{x} \rVert_{M^{-1}} \lVert \boldsymbol{r}_1 - \boldsymbol{r}_2 \rVert_M \overset{(b)}{\le} \beta \lVert \boldsymbol{x} \rVert_{M^{-1}},
\end{aligned}
$$

where (a) applies the Cauchy–Schwarz inequality, while (b) uses 2).
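The equivalence just proved can also be checked numerically; the sketch below (with randomly generated data) verifies the Cauchy–Schwarz direction for many test vectors and confirms that equality holds at the proof's choice $\boldsymbol{x} = M(\boldsymbol{r}_1 - \boldsymbol{r}_2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
M = A @ A.T + d * np.eye(d)            # a positive definite M
M_inv = np.linalg.inv(M)
r1, r2 = rng.normal(size=d), rng.normal(size=d)

diff = r1 - r2
norm_M = np.sqrt(diff @ M @ diff)      # ||r1 - r2||_M

# Statement 2) with beta = ||r1 - r2||_M implies statement 1) for any x:
for _ in range(100):
    x = rng.normal(size=d)
    lhs = abs(x @ diff)
    rhs = norm_M * np.sqrt(x @ M_inv @ x)   # beta * ||x||_{M^{-1}}
    assert lhs <= rhs + 1e-9

# Equality is attained at x = M (r1 - r2), the choice used in the proof:
x = M @ diff
assert abs(abs(x @ diff) - norm_M * np.sqrt(x @ M_inv @ x)) < 1e-6
```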

This theoretical guarantee motivates the following modification to the Laplace-approximation posterior, which will be useful for guaranteeing asymptotic consistency:

$$
p(\boldsymbol{r} \mid \mathcal{D}_n) \approx \mathcal{N}\left( \hat{\boldsymbol{r}}_n,\; \beta_n(\delta)^2 \left( M_n' \right)^{-1} \right), \tag{4.12}
$$

where $\hat{\boldsymbol{r}}_n$ and $\beta_n(\delta)$ are respectively defined in Eqs. (4.10) and (4.11), and $M_n'$ is given analogously to $\left( \Sigma_n^{\text{MAP}} \right)^{-1}$, defined in Eqs. (4.6) and (4.9), by simply replacing $\hat{\boldsymbol{r}}_n^{\text{MAP}}$ with $\hat{\boldsymbol{r}}_n$ in the definition in Eq. (4.9):

$$
M_n' = \lambda I + \sum_{i=1}^{n-1} \tilde{g}\left( 2 y_i \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_n \right) \boldsymbol{x}_i \boldsymbol{x}_i^T,
$$

where $\tilde{g}(x) := \left( \frac{g_{\log}'(x)}{g_{\log}(x)} \right)^2 - \frac{g_{\log}''(x)}{g_{\log}(x)}$, and $g_{\log}$ is the sigmoid function.