

A.3 Gaussian Process Preference Model

counts (i.e., $Z$) and the Gaussian process prior on $\boldsymbol{r}$. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields $\boldsymbol{r} \mid \boldsymbol{R} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where the expressions for $\boldsymbol{\mu}$ and $\Sigma$ are given by Eqs. (A.2) and (A.3) above.
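For reference, the generic form of this conditioning identity is shown below in block notation; Eqs. (A.2) and (A.3) instantiate it with the specific mean and covariance blocks determined by $Z$, $K_{\boldsymbol{r}}$, and $\boldsymbol{\mu}_{\boldsymbol{r}}$, so the symbols here are generic placeholders rather than the thesis's exact quantities:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{R} \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \boldsymbol{\mu}_{\boldsymbol{r}} \\ \boldsymbol{\mu}_{\boldsymbol{R}} \end{bmatrix}, \begin{bmatrix} K_{\boldsymbol{r}\boldsymbol{r}} & K_{\boldsymbol{r}\boldsymbol{R}} \\ K_{\boldsymbol{R}\boldsymbol{r}} & K_{\boldsymbol{R}\boldsymbol{R}} \end{bmatrix} \right) \;\Longrightarrow\; \boldsymbol{r} \mid \boldsymbol{R} \sim \mathcal{N}\!\left( \boldsymbol{\mu}_{\boldsymbol{r}} + K_{\boldsymbol{r}\boldsymbol{R}} K_{\boldsymbol{R}\boldsymbol{R}}^{-1} (\boldsymbol{R} - \boldsymbol{\mu}_{\boldsymbol{R}}), \; K_{\boldsymbol{r}\boldsymbol{r}} - K_{\boldsymbol{r}\boldsymbol{R}} K_{\boldsymbol{R}\boldsymbol{R}}^{-1} K_{\boldsymbol{R}\boldsymbol{r}} \right).$$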

By substituting $\boldsymbol{y}'$ for $\boldsymbol{R}$, the conditional posterior density of $\boldsymbol{r}$ can be expressed in terms of $Z$, $\boldsymbol{y}'$, $K_{\boldsymbol{r}}$, and $\boldsymbol{\mu}_{\boldsymbol{r}}$, that is, in terms of observed data and the Gaussian process prior parameters.

state-action visit counts, $\boldsymbol{x}_{i1}$: $u(\tau_{i1}) = \boldsymbol{r}^T \boldsymbol{x}_{i1}$. Thus, the full likelihood expression is:

$$P(\mathcal{D} \mid \boldsymbol{r}) = \prod_{i=1}^{N} g(z_i), \qquad \text{(A.6)}$$

$$z_i := \frac{y_i' \left( u(\tau_{i2}) - u(\tau_{i1}) \right)}{c} = \frac{y_i' \boldsymbol{r}^T (\boldsymbol{x}_{i2} - \boldsymbol{x}_{i1})}{c} = \frac{y_i' \boldsymbol{r}^T \boldsymbol{x}_i}{c}.$$

Given the preference dataset $\mathcal{D}$, one can model the posterior probability of $\boldsymbol{r}$:

$$p(\boldsymbol{r} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{r}) \, p(\boldsymbol{r}),$$

where the expressions for the prior $p(\boldsymbol{r})$ and likelihood $P(\mathcal{D} \mid \boldsymbol{r})$ are given by Eqs. (A.5) and (A.6), respectively. This posterior can be estimated via the Laplace approximation, from which samples $\tilde{\boldsymbol{r}}$ of the utilities $\boldsymbol{r}$ can easily be drawn:

$$\tilde{\boldsymbol{r}} \sim \mathcal{N}(\hat{\boldsymbol{r}}_{\text{MAP}}, \, \alpha \Sigma_{\text{MAP}}), \quad \text{where:} \qquad \text{(A.7)}$$

$$\hat{\boldsymbol{r}}_{\text{MAP}} = \arg\min_{\boldsymbol{r}} S(\boldsymbol{r}), \qquad \text{(A.8)}$$

$$\Sigma_{\text{MAP}} = \left[ \nabla_{\boldsymbol{r}}^2 S(\boldsymbol{r}) \big|_{\hat{\boldsymbol{r}}_{\text{MAP}}} \right]^{-1}, \qquad \text{(A.9)}$$

and $S(\boldsymbol{r}) := \frac{1}{2} \boldsymbol{r}^T \Sigma^{-1} \boldsymbol{r} - \sum_{i=1}^{N} \log g(z_i)$ is the negative log posterior, neglecting terms constant with respect to $\boldsymbol{r}$; lastly, $\alpha > 0$ is a tunable hyperparameter that influences the balance between exploration and exploitation. In order for the Laplace approximation to be valid, $S(\boldsymbol{r})$ must be convex in $\boldsymbol{r}$: this guarantees that the optimization problem in Eq. (A.8) is convex and that the covariance matrix defined by Eq. (A.9) is positive definite, and therefore a valid Gaussian covariance matrix. Convexity of $S(\boldsymbol{r})$ can be established by demonstrating that its Hessian matrix is positive definite.
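As a concrete illustration of Eqs. (A.7)-(A.9), the following is a minimal sketch of this Laplace approximation under the sigmoidal link function, using the Hessian structure given in Eq. (A.10) below. It is not the thesis implementation; the function and argument names (laplace_posterior, X, y, Sigma_prior, c, alpha) are illustrative assumptions, with X holding the difference vectors $\boldsymbol{x}_i$ and y the labels $y_i' \in \{-1, +1\}$.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_posterior(X, y, Sigma_prior, c=1.0, alpha=1.0):
    """Laplace approximation to the preference-model posterior (Eqs. A.7-A.9).

    X: (N, d) difference vectors x_i = x_{i2} - x_{i1}; y: (N,) labels in {-1, +1}.
    Sigma_prior: (d, d) prior covariance of the utilities r.
    c: preference-noise scale; alpha: exploration/exploitation hyperparameter.
    Uses the sigmoidal link g(z) = 1 / (1 + exp(-z)).
    """
    d = X.shape[1]
    Sigma_inv = np.linalg.inv(Sigma_prior)

    def S(r):
        # S(r) = 0.5 r^T Sigma^{-1} r - sum_i log g(z_i), with z_i = y_i x_i^T r / c
        # and -log g(z) = log(1 + exp(-z)).
        z = y * (X @ r) / c
        return 0.5 * r @ Sigma_inv @ r + np.sum(np.logaddexp(0.0, -z))

    def grad_S(r):
        z = y * (X @ r) / c
        g = 1.0 / (1.0 + np.exp(-z))
        return Sigma_inv @ r - X.T @ (y * (1.0 - g)) / c

    r_map = minimize(S, np.zeros(d), jac=grad_S, method="L-BFGS-B").x   # Eq. (A.8)

    # Hessian at the MAP: Sigma^{-1} + Lambda (Eq. A.10); for the sigmoid link,
    # -g''(z)/g(z) + (g'(z)/g(z))^2 = g(z)(1 - g(z)).
    z = y * (X @ r_map) / c
    g = 1.0 / (1.0 + np.exp(-z))
    Lambda = (X * (g * (1.0 - g))[:, None]).T @ X / c**2
    Sigma_map = np.linalg.inv(Sigma_inv + Lambda)                       # Eq. (A.9)

    return r_map, alpha * Sigma_map
```

Samples $\tilde{\boldsymbol{r}}$ of Eq. (A.7) can then be drawn with np.random.multivariate_normal(r_map, alpha * Sigma_map).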

It can be shown that for any $\boldsymbol{r}$, $\nabla_{\boldsymbol{r}}^2 S(\boldsymbol{r}) = \Sigma^{-1} + \Lambda$, where:

$$[\Lambda]_{mn} := \frac{1}{c^2} \sum_{i=1}^{N} [\boldsymbol{x}_i]_m [\boldsymbol{x}_i]_n \left[ -\frac{g''(z_i)}{g(z_i)} + \left( \frac{g'(z_i)}{g(z_i)} \right)^2 \right], \qquad \text{(A.10)}$$

for $\boldsymbol{x}_i = \boldsymbol{x}_{i2} - \boldsymbol{x}_{i1}$. Because the prior covariance $\Sigma$ is positive definite, to show that $\nabla_{\boldsymbol{r}}^2 S(\boldsymbol{r})$ is positive definite, it suffices to show that $\Lambda$ is positive semidefinite. From Eq. (A.10), one can see that:

$$\Lambda = \frac{1}{c^2} \sum_{i=1}^{N} \left[ -\frac{g''(z_i)}{g(z_i)} + \left( \frac{g'(z_i)}{g(z_i)} \right)^2 \right] \boldsymbol{x}_i \boldsymbol{x}_i^T.$$

Clearly, $\boldsymbol{x}_i \boldsymbol{x}_i^T$ is positive semidefinite, and thus, the following statement is a sufficient condition for convexity of $S(\boldsymbol{r})$:

$$-\frac{g''(z)}{g(z)} + \left( \frac{g'(z)}{g(z)} \right)^2 \geq 0 \quad \text{for all } z \in \mathbb{R}.$$

In particular, this condition is satisfied for the Gaussian link function, $g_{\text{Gaussian}}(\cdot) = \Phi(\cdot)$, where $\Phi$ is the standard Gaussian CDF, as well as for the sigmoidal link function, $g_{\log}(x) := \sigma(x) = \frac{1}{1 + \exp(-x)}$. In this work, the experiments utilize the sigmoidal link function.
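For example, the sigmoidal case can be verified directly: with $g = \sigma$, one has $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\sigma''(z) = \sigma(z)(1 - \sigma(z))(1 - 2\sigma(z))$, so that

$$-\frac{\sigma''(z)}{\sigma(z)} + \left( \frac{\sigma'(z)}{\sigma(z)} \right)^2 = -(1 - \sigma(z))(1 - 2\sigma(z)) + (1 - \sigma(z))^2 = \sigma(z)(1 - \sigma(z)) \geq 0.$$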

Bayesian Logistic Regression

Notably, the Bayesian logistic regression inference model discussed in Section 4.3 is a special case of the Gaussian process preference model, in which $c = 1$, $g$ is the sigmoidal link function, and the prior covariance matrix is diagonal, i.e., $\Sigma = \lambda I$; for instance, the latter condition occurs with the squared exponential kernel defined in Eq. (A.4) when its lengthscale $l$ is set to zero. In this thesis, a number of the experiments with the Gaussian process preference model fall under the special case of Bayesian logistic regression, and therefore, this model is briefly reviewed here.

In Bayesian logistic regression, the Gaussian prior over possible reward vectors $\boldsymbol{r} \in \mathbb{R}^d$ is $\boldsymbol{r} \sim \mathcal{N}(0, \lambda I)$, where $\lambda > 0$. Setting the $i$th preference label $y_i$ equal to $\frac{1}{2}$ if $\boldsymbol{x}_{i2} \succ \boldsymbol{x}_{i1}$, while $y_i = -\frac{1}{2}$ if $\boldsymbol{x}_{i1} \succ \boldsymbol{x}_{i2}$, the logistic regression likelihood is:

$$p(\mathcal{D} \mid \boldsymbol{r}) = \prod_{i=1}^{N} p(y_i \mid \boldsymbol{r}, \boldsymbol{x}_i) = \prod_{i=1}^{N} \frac{1}{1 + \exp(-2 y_i \boldsymbol{x}_i^T \boldsymbol{r})}.$$

The experiments approximate the posterior, $p(\boldsymbol{r} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{r}) \, p(\boldsymbol{r})$, as Gaussian via the Laplace approximation:

$$p(\boldsymbol{r} \mid \mathcal{D}) \approx \mathcal{N}(\hat{\boldsymbol{r}}_{\text{MAP}}, \, \alpha \Sigma_{\text{MAP}}), \quad \text{where:}$$

$$\hat{\boldsymbol{r}}_{\text{MAP}} = \arg\min_{\boldsymbol{r}} f(\boldsymbol{r}), \quad f(\boldsymbol{r}) := -\log p(\mathcal{D}, \boldsymbol{r}) = -\log p(\boldsymbol{r}) - \log p(\mathcal{D} \mid \boldsymbol{r}), \qquad \text{(A.11)}$$

$$\Sigma_{\text{MAP}} = \left[ \nabla_{\boldsymbol{r}}^2 f(\boldsymbol{r}) \big|_{\hat{\boldsymbol{r}}_{\text{MAP}}} \right]^{-1},$$

where the optimization in Eq. (A.11) is convex, and $\alpha > 0$ is a tunable hyperparameter that influences the balance between exploration and exploitation. Note that multiplying the covariance by a well-tuned $\alpha$ is more practical than using the $\beta_i(\delta)$ parameters considered in the asymptotic consistency analysis (Section 4.4), as the latter results in overly conservative covariance matrices in practice.
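To make the special case concrete, here is a minimal self-contained sketch of this Bayesian logistic regression posterior under the Laplace approximation; it is not the thesis code, and the names (logistic_laplace, X, y, lam, alpha) are illustrative. The factor $2 y_i$ and the prior $\mathcal{N}(0, \lambda I)$ match the likelihood and prior above.

```python
import numpy as np
from scipy.optimize import minimize

def logistic_laplace(X, y, lam=1.0, alpha=1.0):
    """Laplace posterior for Bayesian logistic regression with prior r ~ N(0, lam * I).

    X: (N, d) difference vectors x_i; y: (N,) preference labels in {-1/2, +1/2}.
    Equivalent to the Gaussian process preference model with c = 1 and Sigma = lam * I.
    """
    d = X.shape[1]

    def f(r):
        # f(r) = -log p(r) - log p(D | r), dropping terms constant in r (Eq. A.11).
        z = 2.0 * y * (X @ r)
        return 0.5 * r @ r / lam + np.sum(np.logaddexp(0.0, -z))

    r_map = minimize(f, np.zeros(d), method="BFGS").x

    # Hessian of f at the MAP; the weights reduce to sigma(z)(1 - sigma(z)) for labels +-1/2.
    z = 2.0 * y * (X @ r_map)
    s = 1.0 / (1.0 + np.exp(-z))
    w = s * (1.0 - s) * (2.0 * y) ** 2
    Sigma_map = np.linalg.inv(np.eye(d) / lam + (X * w[:, None]).T @ X)
    return r_map, alpha * Sigma_map
```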

Appendix B

PROOFS OF ASYMPTOTIC CONSISTENCY FOR DUELING POSTERIOR SAMPLING

This appendix proves the asymptotic consistency results stated in Section 4.4. The details are organized into three sections, which prove:

1. In the preference-based RL setting, samples from the model posterior over transition dynamics parameters converge in distribution to the true transition probabilities.

2. In both the preference-based generalized linear bandit and RL settings, samples from the utility posterior converge in distribution to the true utilities.

3. DPS's selected policies converge in distribution to the optimal policy in the preference-based RL setting. DPS's selected actions converge to the optimal action in the generalized linear bandit setting with a finite action space $\mathcal{A}$.

Please refer to Section 4.2 to review relevant notation, e.g. for the posterior samples drawn in each iteration. In addition, the following notation is used for the value function and for policies given by value iteration:

Definition 5 (Value function given transition dynamics, rewards, and a policy).

Define $V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ as the value function over a length-$h$ episode (i.e., the expected total reward in the episode) under transition dynamics $\boldsymbol{p} \in \mathbb{R}^{S^2 A}$, rewards $\boldsymbol{r} \in \mathbb{R}^{SA}$, and policy $\pi$:

$$V(\boldsymbol{p}, \boldsymbol{r}, \pi) = \sum_{s \in \mathcal{S}} p_0(s) \, \mathbb{E}\left[ \sum_{t=1}^{h} r(s_t, \pi(s_t, t)) \;\middle|\; s_1 = s, \, \boldsymbol{p} = \boldsymbol{p}, \, \boldsymbol{r} = \boldsymbol{r} \right].$$

Definition 6 (Optimal deterministic policy given transition dynamics and rewards). Define $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r}) := \arg\max_{\pi} V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ as the optimal deterministic policy given transition dynamics $\boldsymbol{p} \in \mathbb{R}^{S^2 A}$ and rewards $\boldsymbol{r} \in \mathbb{R}^{SA}$ (breaking ties randomly if multiple deterministic policies achieve the maximum). Note that $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r})$ can be found via finite-horizon value iteration: defining $V_{\pi, t}(s)$ as in Eq. (3.6), set $V_{\pi, h+1}(s) := 0$ for each $s \in \mathcal{S}$ and use the Bellman equation to calculate $V_{\pi, t}(s)$ successively for $t \in \{h, h-1, \ldots, 1\}$ given $\boldsymbol{p}$ and $\boldsymbol{r}$:

$$\pi(s, t) = \arg\max_{a \in \mathcal{A}} \left[ r(s, a) + \sum_{s' \in \mathcal{S}} P(s_{t+1} = s' \mid s_t = s, a_t = a) \, V_{\pi, t+1}(s') \right],$$

$$V_{\pi, t}(s) = \sum_{a \in \mathcal{A}} \mathbb{I}[\pi(s, t) = a] \left[ r(s, a) + \sum_{s' \in \mathcal{S}} P(s_{t+1} = s' \mid s_t = s, a_t = a) \, V_{\pi, t+1}(s') \right].$$

As value iteration results in only deterministic policies, of which there are finitely many (more precisely, there are $A^{Sh}$), the maximum argument $\pi_{vi}(\boldsymbol{p}, \boldsymbol{r}) := \arg\max_{\pi} V(\boldsymbol{p}, \boldsymbol{r}, \pi)$ is taken over a finite policy class.
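A minimal sketch of this backward-induction procedure, assuming a tabular representation with a dynamics array P of shape (S, A, S), rewards r of shape (S, A), and horizon h (all names are illustrative, not the thesis code):

```python
import numpy as np

def finite_horizon_value_iteration(P, r, h):
    """Optimal deterministic policy pi_vi(p, r) via backward induction (Definition 6).

    P: (S, A, S) array with P[s, a, s2] = P(s_{t+1} = s2 | s_t = s, a_t = a).
    r: (S, A) reward array; h: episode horizon.
    Returns pi (shape (h, S): action per timestep and state) and V (shape (h, S)).
    """
    S, A, _ = P.shape
    V_next = np.zeros(S)                   # V_{pi, h+1}(s) := 0
    pi = np.zeros((h, S), dtype=int)
    V = np.zeros((h, S))
    for t in range(h - 1, -1, -1):         # backward over timesteps h, h-1, ..., 1 (0-indexed)
        Q = r + P @ V_next                 # Q[s, a] = r(s, a) + sum_{s2} P(s2 | s, a) V_{pi, t+1}(s2)
        pi[t] = np.argmax(Q, axis=1)
        V[t] = Q[np.arange(S), pi[t]]
        V_next = V[t]
    return pi, V

def episode_value(P, r, p0, h):
    """V(p, r, pi) for the value-iteration policy: expected total episode reward."""
    _, V = finite_horizon_value_iteration(P, r, h)
    return p0 @ V[0]                       # sum_s p0(s) V_{pi, 1}(s)
```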

Recall that $M_n := \lambda I + \sum_{i=1}^{n-1} \boldsymbol{x}_i \boldsymbol{x}_i^T$ (see Eq. (4.2)). For the linear link function, the posterior sampling distribution's covariance is given by $\Sigma^{(n)} = \beta_n(\delta)^2 M_n^{-1}$. For the logistic link function, the posterior covariance is given by $\Sigma^{(n)} = \beta_n(\delta)^2 (M_n')^{-1}$, where $M_n' = \lambda I + \sum_{i=1}^{n-1} \tilde{g}(2 y_i \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_n) \boldsymbol{x}_i \boldsymbol{x}_i^T$, $\tilde{g}(x) := \left( \frac{g_{\log}'(x)}{g_{\log}(x)} \right)^2 - \frac{g_{\log}''(x)}{g_{\log}(x)}$ comes from the Laplace approximation, and $g_{\log}$ is the sigmoid function.

As in Section 4.4, it is notationally convenient to define a matrix $\tilde{M}_n \in \mathbb{R}^{d \times d}$ such that:

$$\tilde{M}_n = \begin{cases} M_n = \lambda I + \sum_{i=1}^{n-1} \boldsymbol{x}_i \boldsymbol{x}_i^T & \text{for the linear link function, and} \\[4pt] M_n' = \lambda I + \sum_{i=1}^{n-1} \tilde{g}(2 y_i \boldsymbol{x}_i^T \hat{\boldsymbol{r}}_n) \boldsymbol{x}_i \boldsymbol{x}_i^T & \text{for the logistic link function.} \end{cases} \qquad \text{(B.1)}$$

Then, under either link function, the posterior sampling distribution has covariance $\Sigma^{(n)} = \beta_n(\delta)^2 \tilde{M}_n^{-1}$.
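Computationally, Eq. (B.1) and the resulting sampling covariance amount to a few lines of linear algebra. The sketch below assumes arrays X (difference vectors $\boldsymbol{x}_i$) and y (labels), a current estimate r_hat for $\hat{\boldsymbol{r}}_n$, and scalars lam and beta for $\lambda$ and $\beta_n(\delta)$; all names are illustrative.

```python
import numpy as np

def sampling_covariance(X, y, r_hat, lam, beta, link="linear"):
    """Posterior sampling covariance Sigma^(n) = beta^2 * inv(M_tilde_n), Eq. (B.1)."""
    d = X.shape[1]
    if link == "linear":
        M = lam * np.eye(d) + X.T @ X                          # M_n
    else:  # logistic link
        s = 1.0 / (1.0 + np.exp(-2.0 * y * (X @ r_hat)))       # g_log(2 y_i x_i^T r_hat)
        g_tilde = s * (1.0 - s)                                 # (g'/g)^2 - g''/g for the sigmoid
        M = lam * np.eye(d) + (X * g_tilde[:, None]).T @ X     # M_n'
    return beta**2 * np.linalg.inv(M)
```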

Finally, notation is defined for the eigenvectors and eigenvalues of the matrices $M_i$ and $\tilde{M}_i$:

Definition 7 (Eigenvalue notation). Let $\nu_j^{(i)}$ refer to the $j$th-largest eigenvalue of $M_i$, and $\boldsymbol{u}_j^{(i)}$ denote its corresponding eigenvector. Similarly, let $\lambda_j^{(i)}$ refer to the $j$th-largest eigenvalue of $\tilde{M}_i$, and $\boldsymbol{v}_j^{(i)}$ denote its corresponding eigenvector. Note that $M_i^{-1}$ also has eigenvectors $\boldsymbol{u}_j^{(i)}$, with corresponding eigenvalues $\frac{1}{\nu_j^{(i)}}$. Because $M_i$ is positive definite, the eigenvectors $\{\boldsymbol{u}_j^{(i)}\}$ form an orthonormal basis, and $\nu_j^{(i)} > 0$ for all $i, j$. The equivalent statements also hold for $\tilde{M}_i$, which is also positive definite because $\tilde{g}(x) > 0$ for all possible inputs.

B.1 Facts about Convergence in Distribution

Before proceeding with the asymptotic consistency proofs, two facts about convergence in distribution are reviewed; these will be applied later.

Recall that for a random variable $X$ and a sequence of random variables $(X_n)$, $n \in \mathbb{N}$, $X_n \xrightarrow{D} X$ denotes that $X_n$ converges to $X$ in distribution, while $X_n \xrightarrow{P} X$ denotes that $X_n$ converges to $X$ in probability.

Fact 8 (Billingsley, 1968). For random variables $\boldsymbol{x}, \boldsymbol{x}_n \in \mathbb{R}^d$, where $n \in \mathbb{N}$, and any continuous function $g: \mathbb{R}^d \to \mathbb{R}$, if $\boldsymbol{x}_n \xrightarrow{D} \boldsymbol{x}$, then $g(\boldsymbol{x}_n) \xrightarrow{D} g(\boldsymbol{x})$.

Fact 9 (Billingsley, 1968). For random variables $\boldsymbol{x}_n \in \mathbb{R}^d$, $n \in \mathbb{N}$, and a constant vector $\boldsymbol{c} \in \mathbb{R}^d$, $\boldsymbol{x}_n \xrightarrow{D} \boldsymbol{c}$ is equivalent to $\boldsymbol{x}_n \xrightarrow{P} \boldsymbol{c}$. Convergence in probability means that for any $\varepsilon > 0$, $P(\|\boldsymbol{x}_n - \boldsymbol{c}\|_2 \geq \varepsilon) \to 0$ as $n \to \infty$.

B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference-Based RL Setting
