A.3 Gaussian Process Preference Model
counts (i.e., $\boldsymbol{x}$) and the Gaussian process prior on $\boldsymbol{r}$. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields $\boldsymbol{r} \mid F \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where the expressions for $\boldsymbol{\mu}$ and $\Sigma$ are given by Eqs. (A.2) and (A.3) above.
By substituting $\boldsymbol{r}_0$ for $F$, the conditional posterior density of $\boldsymbol{r}$ can be expressed in terms of $\boldsymbol{x}$, $\boldsymbol{r}_0$, $K_f$, and $\sigma_f$, that is, in terms of observed data and the Gaussian process prior parameters.
state-action visit counts, $\boldsymbol{x}_{k1}$: $u(\boldsymbol{x}_{k1}) = \boldsymbol{r}^\top \boldsymbol{x}_{k1}$. Thus, the full likelihood expression is:
$$P(\mathcal{D} \mid \boldsymbol{r}) = \prod_{k=1}^{N} g(z_k), \qquad (A.6)$$
$$z_k := \frac{y_k' \left[ u(\boldsymbol{x}_{k2}) - u(\boldsymbol{x}_{k1}) \right]}{c} = \frac{y_k' \, \boldsymbol{r}^\top (\boldsymbol{x}_{k2} - \boldsymbol{x}_{k1})}{c} = \frac{y_k' \, \boldsymbol{r}^\top \boldsymbol{x}_k}{c}.$$
Given the preference dataset $\mathcal{D}$, one can model the posterior probability of $\boldsymbol{r}$:
$$P(\boldsymbol{r} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{r}) \, P(\boldsymbol{r}),$$
where the expressions for the prior $P(\boldsymbol{r})$ and likelihood $P(\mathcal{D} \mid \boldsymbol{r})$ are given by Eqs. (A.5) and (A.6), respectively. This posterior can be estimated by the Laplace approximation, from which samples $\hat{\boldsymbol{r}}$ of the utilities $\boldsymbol{r}$ can easily be drawn:
$$\hat{\boldsymbol{r}} \sim \mathcal{N}\!\left(\hat{\boldsymbol{r}}_{\text{MAP}}, \, \alpha \Sigma_{\text{MAP}}\right), \text{ where:} \qquad (A.7)$$
$$\hat{\boldsymbol{r}}_{\text{MAP}} = \underset{\boldsymbol{r}}{\arg\min}\; S(\boldsymbol{r}), \qquad (A.8)$$
$$\Sigma_{\text{MAP}} = \left[ \nabla^2_{\boldsymbol{r}} S(\boldsymbol{r}) \Big|_{\hat{\boldsymbol{r}}_{\text{MAP}}} \right]^{-1}, \qquad (A.9)$$
and $S(\boldsymbol{r}) := \frac{1}{2} \boldsymbol{r}^\top \Sigma^{-1} \boldsymbol{r} - \sum_{k=1}^{N} \log g(z_k)$ is the negative log posterior, neglecting constant terms with respect to $\boldsymbol{r}$; lastly, $\alpha > 0$ is a tunable hyperparameter that influences the balance between exploration and exploitation. In order for the Laplace approximation to be valid, $S(\boldsymbol{r})$ must be convex in $\boldsymbol{r}$: this guarantees that the optimization problem in Eq. (A.8) is convex and that the covariance matrix defined by Eq. (A.9) is positive definite, and therefore a valid Gaussian covariance matrix. Convexity of $S(\boldsymbol{r})$ can be established by demonstrating that its Hessian matrix is positive definite.
It can be shown that for any $\boldsymbol{r}$, $\nabla^2_{\boldsymbol{r}} S(\boldsymbol{r}) = \Sigma^{-1} + \Lambda$, where:
$$[\Lambda]_{ij} := \frac{1}{c^2} \sum_{k=1}^{N} [\boldsymbol{x}_k]_i [\boldsymbol{x}_k]_j \left[ -\frac{g''(z_k)}{g(z_k)} + \left( \frac{g'(z_k)}{g(z_k)} \right)^2 \right], \qquad (A.10)$$
for $\boldsymbol{x}_k = \boldsymbol{x}_{k2} - \boldsymbol{x}_{k1}$. Because the prior covariance $\Sigma$ is positive definite, to show that $\nabla^2_{\boldsymbol{r}} S(\boldsymbol{r})$ is positive definite, it suffices to show that $\Lambda$ is positive semidefinite. From Eq. (A.10), one can see that:
$$\Lambda = \frac{1}{c^2} \sum_{k=1}^{N} \left[ -\frac{g''(z_k)}{g(z_k)} + \left( \frac{g'(z_k)}{g(z_k)} \right)^2 \right] \boldsymbol{x}_k \boldsymbol{x}_k^\top.$$
Clearly, $\boldsymbol{x}_k \boldsymbol{x}_k^\top$ is positive semidefinite, and thus, the following statement is a sufficient condition for convexity of $S(\boldsymbol{r})$:
$$-\frac{g''(z)}{g(z)} + \left( \frac{g'(z)}{g(z)} \right)^2 \geq 0 \quad \text{for all } z \in \mathbb{R}.$$
In particular, this condition is satisfied for the Gaussian link function, $g_{\text{Gaussian}}(\cdot) = \Phi(\cdot)$, where $\Phi$ is the standard Gaussian CDF, as well as for the sigmoidal link function, $g_{\log}(x) := \sigma(x) = \frac{1}{1 + \exp(-x)}$. In this work, the experiments utilize the sigmoidal link function.
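To see why the sigmoidal link satisfies this condition, note that $g_{\log}' = g_{\log}(1 - g_{\log})$ and $g_{\log}'' = g_{\log}(1 - g_{\log})(1 - 2 g_{\log})$, so that $-g_{\log}''(z)/g_{\log}(z) + \left(g_{\log}'(z)/g_{\log}(z)\right)^2 = g_{\log}(z)\,(1 - g_{\log}(z)) \ge 0$. The snippet below is a minimal illustrative sketch (not the thesis's implementation) of the Laplace approximation in Eqs. (A.7)-(A.9) under the sigmoidal link; the function name, the optimizer choice, and the label convention $y_k' \in \{-1, +1\}$ are assumptions made for the example.

```python
# Hypothetical sketch of Eqs. (A.7)-(A.9) with the sigmoidal link function.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_gp_preference(X, y, Sigma_prior, c=1.0, alpha=1.0, rng=None):
    """X: (N, d) rows x_k = x_{k2} - x_{k1}; y: (N,) labels in {-1, +1};
    Sigma_prior: (d, d) GP prior covariance. Returns (r_map, Sigma_map, r_sample)."""
    rng = np.random.default_rng() if rng is None else rng
    d = Sigma_prior.shape[0]
    Sigma_inv = np.linalg.inv(Sigma_prior)

    def neg_log_posterior(r):
        # S(r) = 1/2 r^T Sigma^{-1} r - sum_k log g(z_k),  with z_k = y_k x_k^T r / c
        z = y * (X @ r) / c
        return 0.5 * r @ Sigma_inv @ r + np.sum(np.logaddexp(0.0, -z))

    def grad(r):
        z = y * (X @ r) / c
        return Sigma_inv @ r - X.T @ ((1.0 - sigmoid(z)) * y) / c

    # MAP estimate (Eq. (A.8)); S(r) is convex, so any local minimum is global.
    r_map = minimize(neg_log_posterior, np.zeros(d), jac=grad, method="L-BFGS-B").x

    # Hessian at the MAP (Eq. (A.9)): Sigma^{-1} + (1/c^2) sum_k w_k x_k x_k^T,
    # with w_k = sigmoid(z_k) * (1 - sigmoid(z_k)), the nonnegative quantity above.
    s = sigmoid(y * (X @ r_map) / c)
    Sigma_map = np.linalg.inv(Sigma_inv + (X.T * (s * (1.0 - s))) @ X / c**2)

    # Draw a utility sample as in Eq. (A.7): r_hat ~ N(r_map, alpha * Sigma_map).
    r_sample = rng.multivariate_normal(r_map, alpha * Sigma_map)
    return r_map, Sigma_map, r_sample
```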
Bayesian Logistic Regression
Notably, the Bayesian logistic regression inference model discussed in Section 4.3 is a special case of the Gaussian process preference model, in which $c = 1$, $g$ is the sigmoidal link function, and the prior covariance matrix is diagonal, i.e., $\Sigma = \lambda I$; for instance, the latter condition occurs with the squared exponential kernel defined in Eq. (A.4) when its lengthscale $\ell$ is set to zero. In this thesis, a number of the experiments with the Gaussian process preference model fall under the special case of Bayesian logistic regression, and therefore, this model is briefly reviewed here.
In Bayesian logistic regression, the Gaussian prior over possible reward vectors $\boldsymbol{r} \in \mathbb{R}^d$ is $\boldsymbol{r} \sim \mathcal{N}(\boldsymbol{0}, \lambda I)$, where $\lambda > 0$. Setting the $k$th preference label $y_k$ equal to $\frac{1}{2}$ if $\boldsymbol{x}_{k2} \succ \boldsymbol{x}_{k1}$, while $y_k = -\frac{1}{2}$ if $\boldsymbol{x}_{k1} \succ \boldsymbol{x}_{k2}$, the logistic regression likelihood is:
$$P(\mathcal{D} \mid \boldsymbol{r}) = \prod_{k=1}^{N} P(y_k \mid \boldsymbol{r}, \boldsymbol{x}_k) = \prod_{k=1}^{N} \frac{1}{1 + \exp\!\left(-2 y_k \boldsymbol{x}_k^\top \boldsymbol{r}\right)}.$$
The experiments approximate the posterior, $P(\boldsymbol{r} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{r}) P(\boldsymbol{r})$, as Gaussian via the Laplace approximation:
$$P(\boldsymbol{r} \mid \mathcal{D}) \approx \mathcal{N}\!\left(\hat{\boldsymbol{r}}_{\text{MAP}}, \, \alpha \Sigma_{\text{MAP}}\right), \text{ where:}$$
$$\hat{\boldsymbol{r}}_{\text{MAP}} = \underset{\boldsymbol{r}}{\arg\min}\; S(\boldsymbol{r}), \qquad S(\boldsymbol{r}) := -\log P(\mathcal{D}, \boldsymbol{r}) = -\log P(\boldsymbol{r}) - \log P(\mathcal{D} \mid \boldsymbol{r}), \qquad (A.11)$$
$$\Sigma_{\text{MAP}} = \left[ \nabla^2_{\boldsymbol{r}} S(\boldsymbol{r}) \Big|_{\hat{\boldsymbol{r}}_{\text{MAP}}} \right]^{-1},$$
where the optimization in Eq. (A.11) is convex, and $\alpha > 0$ is a tunable hyperparameter that influences the balance between exploration and exploitation. Note that multiplying the covariance by a well-tuned $\alpha$ is more practical than using the $\beta_k(\delta)$ parameters considered in the asymptotic consistency analysis (Section 4.4), as the latter results in overly conservative covariance matrices in practice.
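As a usage sketch only (assuming the hypothetical `laplace_gp_preference` helper from the earlier block), this special case corresponds to passing a diagonal prior $\Sigma = \lambda I$, $c = 1$, and labels $y_k \in \{-\frac{1}{2}, +\frac{1}{2}\}$ rescaled by 2, so that the sigmoid argument matches $2 y_k \boldsymbol{x}_k^\top \boldsymbol{r}$; the synthetic data below are purely illustrative.

```python
# Hypothetical usage of laplace_gp_preference for the logistic regression special case.
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))            # rows are x_k = x_{k2} - x_{k1}
y_half = rng.choice([-0.5, 0.5], size=20)    # preference labels y_k in {-1/2, +1/2}
lam = 1.0                                    # prior scale: Sigma = lam * I
r_map, Sigma_map, r_sample = laplace_gp_preference(
    X_demo, 2.0 * y_half, Sigma_prior=lam * np.eye(3), c=1.0, alpha=1.0)
```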
Appendix B
PROOFS OF ASYMPTOTIC CONSISTENCY FOR DUELING POSTERIOR SAMPLING
This appendix proves the asymptotic consistency results stated in Section 4.4. The details are organized into three sections, which prove:
1. In the preference-based RL setting, samples from the model posterior over transition dynamics parameters converge in distribution to the true transition probabilities.
2. In both the preference-based generalized linear bandit and RL settings, samples from the utility posterior converge in distribution to the true utilities.
3. DPS's selected policies converge in distribution to the optimal policy in the preference-based RL setting. DPS's selected actions converge to the optimal action in the generalized linear bandit setting with a finite action space $\mathcal{A}$.
Please refer to Section 4.2 to review relevant notation, e.g. for the posterior samples drawn in each iteration. In addition, the following notation is used for the value function and for policies given by value iteration:
Definition 5 (Value function given transition dynamics, rewards, and a policy).
Define $V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ as the value function over a length-$h$ episode (i.e., the expected total reward in the episode) under transition dynamics $\boldsymbol{p} \in \mathbb{R}^{S^2 A}$, rewards $\boldsymbol{r} \in \mathbb{R}^{SA}$, and policy $\pi$:
$$V(\boldsymbol{p}, \boldsymbol{r}, \pi) = \sum_{s \in \mathcal{S}} p_0(s) \, \mathbb{E}\left[ \sum_{t=1}^{h} r(s_t, \pi(s_t, t)) \;\middle|\; s_1 = s, \ \boldsymbol{p}, \ \boldsymbol{r} \right].$$
Definition 6 (Optimal deterministic policy given transition dynamics and rewards).
Define $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r}) := \arg\max_\pi V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ as the optimal deterministic policy given transition dynamics $\boldsymbol{p} \in \mathbb{R}^{S^2 A}$ and rewards $\boldsymbol{r} \in \mathbb{R}^{SA}$ (breaking ties randomly if multiple deterministic policies achieve the maximum). Note that $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r})$ can be found via finite-horizon value iteration: defining $V_{\boldsymbol{r},t}(s)$ as in Eq. (3.6), set $V_{\boldsymbol{r},h+1}(s) := 0$ for each $s \in \mathcal{S}$ and use the Bellman equation to calculate $V_{\boldsymbol{r},t}(s)$ successively for $t \in \{h, h-1, \ldots, 1\}$ given $\boldsymbol{p}$ and $\boldsymbol{r}$:
$$\pi(s, t) = \underset{a \in \mathcal{A}}{\arg\max} \left[ r(s, a) + \sum_{s' \in \mathcal{S}} p(s_{t+1} = s' \mid s_t = s, a_t = a) \, V_{\boldsymbol{r},t+1}(s') \right],$$
$$V_{\boldsymbol{r},t}(s) = \sum_{a \in \mathcal{A}} \mathbb{1}[\pi(s, t) = a] \left[ r(s, a) + \sum_{s' \in \mathcal{S}} p(s_{t+1} = s' \mid s_t = s, a_t = a) \, V_{\boldsymbol{r},t+1}(s') \right].$$
As value iteration results in only deterministic policies, of which there are finitely many (more precisely, there are $A^{Sh}$), the maximum argument $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r}) := \arg\max_\pi V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ is taken over a finite policy class.
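As a concrete illustration of this backward recursion, the following is a minimal sketch (not the thesis's code) of finite-horizon value iteration; the array layout `p[s, a, s']` and `r[s, a]` and the function name are assumptions, and ties are broken deterministically by `np.argmax` rather than randomly as in Definition 6.

```python
# Hypothetical sketch of the finite-horizon value iteration in Definition 6.
import numpy as np

def value_iteration_policy(p, r, h):
    """p: (S, A, S) array with p[s, a, s2] = P(s_{t+1} = s2 | s_t = s, a_t = a);
    r: (S, A) rewards; h: episode length. Returns (pi, V), where pi[t - 1, s] is
    the action chosen at step t in state s and V[t - 1, s] = V_{r, t}(s)."""
    S, A, _ = p.shape
    V = np.zeros((h + 1, S))        # V[h] plays the role of V_{r, h+1}(s) := 0
    pi = np.zeros((h, S), dtype=int)
    for t in reversed(range(h)):    # sweeps backward over steps h, h-1, ..., 1
        # Q[s, a] = r(s, a) + sum_{s'} p(s' | s, a) * V_{r, t+1}(s')
        Q = r + p @ V[t + 1]
        pi[t] = np.argmax(Q, axis=1)
        V[t] = Q[np.arange(S), pi[t]]
    return pi, V
```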
Recall that $M_k := \lambda I + \sum_{i=1}^{k-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top$ (see Eq. (4.2)). For the linear link function, the posterior sampling distribution's covariance is given by $\Sigma^{(k)} = \beta_k(\delta)^2 M_k^{-1}$. For the logistic link function, the posterior covariance is given by $\Sigma^{(k)} = \beta_k(\delta)^2 (M_k')^{-1}$, where $M_k' = \lambda I + \sum_{i=1}^{k-1} \tilde{g}\!\left(2 y_i \boldsymbol{x}_i^\top \hat{\boldsymbol{r}}_i\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top$, the weight function $\tilde{g}(x) := \left( \frac{g_{\log}'(x)}{g_{\log}(x)} \right)^2 - \frac{g_{\log}''(x)}{g_{\log}(x)}$ comes from the Laplace approximation, and $g_{\log}$ is the sigmoid function.
As in Section 4.4, it is notationally convenient to define a matrix $\tilde{M}_k \in \mathbb{R}^{d \times d}$ such that:
$$\tilde{M}_k = \begin{cases} M_k = \lambda I + \sum_{i=1}^{k-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top & \text{for the linear link function, and} \\[4pt] M_k' = \lambda I + \sum_{i=1}^{k-1} \tilde{g}\!\left(2 y_i \boldsymbol{x}_i^\top \hat{\boldsymbol{r}}_i\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top & \text{for the logistic link function.} \end{cases} \qquad (B.1)$$
Then, under either link function, the posterior sampling distribution has covariance $\Sigma^{(k)} = \beta_k(\delta)^2 \tilde{M}_k^{-1}$.
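The following is a minimal sketch (under assumed names and array shapes, not the thesis's code) of how $\tilde{M}_k$ in Eq. (B.1) and the sampling covariance $\beta_k(\delta)^2 \tilde{M}_k^{-1}$ could be assembled from the first $k-1$ observations; for the sigmoid link, the weight $\tilde{g}(x)$ simplifies to $g_{\log}(x)\,(1 - g_{\log}(x))$.

```python
# Hypothetical sketch of Eq. (B.1); function and variable names are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_tilde(x):
    # (g'_log / g_log)^2 - g''_log / g_log evaluates to sigmoid(x) * (1 - sigmoid(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def sampling_covariance(X, y, r_hats, lam, beta_k, link="logistic"):
    """X: (k-1, d) rows x_i; y: (k-1,) labels; r_hats: (k-1, d) per-iteration
    estimates r_hat_i; lam: prior scale lambda; beta_k: the scalar beta_k(delta).
    Returns beta_k^2 * inverse of M_tilde_k."""
    d = X.shape[1]
    M_tilde = lam * np.eye(d)
    for i in range(X.shape[0]):
        w = 1.0 if link == "linear" else g_tilde(2.0 * y[i] * (X[i] @ r_hats[i]))
        M_tilde += w * np.outer(X[i], X[i])
    return beta_k**2 * np.linalg.inv(M_tilde)
```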
Finally, notation is defined for the eigenvectors and eigenvalues of the matrices $M_k$ and $\tilde{M}_k$:
Definition 7 (Eigenvalue notation). Let $\lambda_i^{(k)}$ refer to the $i$th-largest eigenvalue of $M_k$, and $\boldsymbol{v}_i^{(k)}$ denote its corresponding eigenvector. Similarly, let $\tilde{\lambda}_i^{(k)}$ refer to the $i$th-largest eigenvalue of $\tilde{M}_k$, and $\tilde{\boldsymbol{v}}_i^{(k)}$ denote its corresponding eigenvector. Note that $M_k^{-1}$ also has eigenvectors $\boldsymbol{v}_i^{(k)}$, with corresponding eigenvalues $\frac{1}{\lambda_i^{(k)}}$. Because $M_k$ is positive definite, the eigenvectors $\{\boldsymbol{v}_i^{(k)}\}$ form an orthonormal basis, and $\lambda_i^{(k)} > 0$ for all $i, k$. The equivalent statements also hold for $\tilde{M}_k$, which is also positive definite because $\tilde{g}(x) > 0$ for all possible inputs.
B.1 Facts about Convergence in Distribution
Before proceeding with the asymptotic consistency proofs, two facts about conver- gence in distribution are reviewed; these will be applied later.
Recall that for a random variable $X$ and a sequence of random variables $(X_n)$, $n \in \mathbb{N}$, $X_n \xrightarrow{\,D\,} X$ denotes that $X_n$ converges to $X$ in distribution, while $X_n \xrightarrow{\,p\,} X$ denotes that $X_n$ converges to $X$ in probability.
Fact 8 (Billingsley, 1968). For random variables $X, X_n \in \mathbb{R}^d$, where $n \in \mathbb{N}$, and any continuous function $f: \mathbb{R}^d \longrightarrow \mathbb{R}$, if $X_n \xrightarrow{\,D\,} X$, then $f(X_n) \xrightarrow{\,D\,} f(X)$.
Fact 9 (Billingsley, 1968). For random variables $X_n \in \mathbb{R}^d$, $n \in \mathbb{N}$, and a constant vector $\boldsymbol{b} \in \mathbb{R}^d$, $X_n \xrightarrow{\,D\,} \boldsymbol{b}$ is equivalent to $X_n \xrightarrow{\,p\,} \boldsymbol{b}$. Convergence in probability means that for any $\epsilon > 0$, $P\!\left(\|X_n - \boldsymbol{b}\|_2 \geq \epsilon\right) \longrightarrow 0$ as $n \longrightarrow \infty$.
B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference-