Chapter V: Mixed-Initiative Learning for Exoskeleton Gait Optimization
5.3 The CoSpar Algorithm for Preference-Based Learning
This work utilizes a pre-computed gait library, in which gaits are specified by parameters that include (among others) step dimensions (step length, width, height), step duration, and pelvis roll and pitch.
Algorithm 13 CoSpar
1: Input: $\mathcal{A}$ = action set, $n$ = number of actions to select in each iteration, $b$ = buffer size, $(\Sigma^{pr}, c)$ = utility prior parameters, $\beta$ = coactive feedback weight
2: $\mathcal{D} = \emptyset$    ▷ Initialize preference dataset
3: Initialize prior over $\mathcal{A}$: $(\mu_0, \Sigma_0) = (\mathbf{0}, \Sigma^{pr})$
4: for $i = 1, 2, \ldots, T$ do
5:    for $j = 1, \ldots, n$ do
6:       Sample utility function $f_j$ from $\mathcal{N}(\mu_{i-1}, \Sigma_{i-1})$
7:       Select action $a_j^{(i)} = \operatorname{argmax}_{a \in \mathcal{A}} f_j(a)$
8:    end for
9:    Execute $n$ actions $\{a_1^{(i)}, \ldots, a_n^{(i)}\}$
10:   Observe pairwise preference feedback matrix $R \in \{0, 1, \emptyset\}^{n \times (n+b)}$
11:   for $j = 1, \ldots, n$; $k = 1, \ldots, n+b$ do
12:      if $R_{jk} \neq \emptyset$ then
13:         Append preference to dataset $\mathcal{D}$
14:      end if
15:   end for
16:   for $j = 1, \ldots, n$ do
17:      Obtain coactive feedback $\bar{a}_j^{(i)} \in \mathcal{A} \cup \{\emptyset\}$    ▷ $\emptyset$ = no coactive feedback given
18:      if $\bar{a}_j^{(i)} \neq \emptyset$ then
19:         Add to $\mathcal{D}$: $\bar{a}_j^{(i)}$ preferred to $a_j^{(i)}$, with weight $\beta$
20:      end if
21:   end for
22:   Update Bayesian posterior over $\mathcal{D}$ to obtain $(\mu_i, \Sigma_i)$
23: end for
This section presents CoSpar, a mixed-initiative learning framework (Lester, Stone, and Stelling, 1999) which extends the SelfSparring algorithm to incorporate coactive feedback. Similarly to SelfSparring, CoSpar maintains a Bayesian preference relation function over the possible actions, which is fitted to observed preference feedback. CoSpar updates this model with user feedback and uses it to select actions for new trials and to elicit feedback. We first define the Bayesian preference model, and then detail the steps of Algorithm 13.
Bayesian Modeling of Utilities from Preference Data
We adopt the preference-based Gaussian process model of Chu and Ghahramani (2005b). Gaussian process modeling is beneficial, as it enables us to model a Bayesian posterior over a class of smooth, non-parametric functions.
Let $\mathcal{A} \subset \mathbb{R}^d$ be the finite set of available actions, with cardinality $A = |\mathcal{A}|$. At any point in time, CoSpar has collected a preference feedback dataset $\mathcal{D} = \{a_{k1} \succ a_{k2} \mid k = 1, \ldots, N\}$ consisting of $N$ preferences, where $a_{k1} \succ a_{k2}$ indicates that the user prefers action $a_{k1} \in \mathcal{A}$ to action $a_{k2} \in \mathcal{A}$ in the $k$th preference.
Furthermore, we assume that each action $a \in \mathcal{A}$ has a latent, underlying utility to the user, $f(a)$. For finite action spaces, the utilities can be written in vector form: $f := [f(a_1), f(a_2), \ldots, f(a_A)]^T$. Given preference data $\mathcal{D}$, we are interested in the posterior probability of $f$:
\[
P(f \mid \mathcal{D}) \propto P(\mathcal{D} \mid f)\, P(f). \tag{5.4}
\]
We define a Gaussian prior over $f$:
\[
P(f) = \frac{1}{(2\pi)^{A/2} |\Sigma^{pr}|^{1/2}} \exp\left(-\frac{1}{2} f^T [\Sigma^{pr}]^{-1} f\right),
\]
where $\Sigma^{pr} \in \mathbb{R}^{A \times A}$ is the prior covariance matrix, such that $[\Sigma^{pr}]_{ij} = \mathcal{K}(a_i, a_j)$ for a kernel function $\mathcal{K}$, for instance the squared exponential kernel given in Eq. (A.4).
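As a purely illustrative sketch (not the thesis code), the prior covariance over a small one-dimensional action grid could be assembled from a squared exponential kernel as follows; the grid, length-scale, and variance are assumed values chosen for the example:

```python
import numpy as np

def squared_exponential_kernel(a, b, lengthscale=0.15, variance=1.0):
    """Squared exponential kernel K(a, b) (cf. Eq. (A.4))."""
    sq_dist = np.sum((a - b) ** 2)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Finite action set: a grid of 5 one-dimensional actions.
actions = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
A = len(actions)

# Prior covariance matrix [Sigma_pr]_{ij} = K(a_i, a_j).
Sigma_pr = np.array([[squared_exponential_kernel(actions[i], actions[j])
                      for j in range(A)] for i in range(A)])

print(Sigma_pr.shape)                     # (5, 5)
print(np.allclose(Sigma_pr, Sigma_pr.T))  # True: kernel matrices are symmetric
```

Any positive-definite kernel could be substituted here; the squared exponential encodes the smoothness assumption on utilities that motivates the Gaussian process prior.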
For computing the likelihood $P(\mathcal{D} \mid f)$, we assume that the user's preference feedback may be corrupted by noise:
\[
P(a_{k1} \succ a_{k2} \mid f) = g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right), \tag{5.5}
\]
where $g(\cdot) \in [0,1]$ is a monotonically-increasing link function, and $c > 0$ is a hyperparameter indicating the degree of noise in the preferences. Note that the likelihood in Eq. (5.5) generalizes the one given in Chu and Ghahramani (2005b), which corresponds specifically to a Gaussian noise model as described in Section 2.2. The likelihood from Chu and Ghahramani (2005b) is obtained by setting $c = \sqrt{2}\,\sigma$ and $g = \Phi$ in Eq. (5.5), where $\Phi$ is the standard Gaussian cumulative distribution function.
Thus, the full expression for the likelihood is:
\[
P(\mathcal{D} \mid f) = \prod_{k=1}^{N} g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right). \tag{5.6}
\]
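To make Eq. (5.6) concrete, a minimal sketch of the preference likelihood under the Gaussian link $g = \Phi$; the utilities and preference indices below are made-up illustrative values:

```python
import math

def Phi(x):
    """Standard Gaussian CDF, the link g used by Chu and Ghahramani."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def preference_likelihood(f, prefs, c=1.0):
    """P(D | f) from Eq. (5.6): a product over the N preferences of
    g((f(a_k1) - f(a_k2)) / c), here with the Gaussian link g = Phi."""
    lik = 1.0
    for k1, k2 in prefs:        # k1 preferred to k2
        lik *= Phi((f[k1] - f[k2]) / c)
    return lik

# Illustrative latent utilities over four actions (made-up values).
f = [0.0, 0.5, 1.5, -0.3]
# Dataset D: action 2 preferred to action 1, and 1 preferred to 3.
D = [(2, 1), (1, 3)]

print(preference_likelihood(f, D, c=1.0))   # ≈ 0.66
# A noisier user (larger c) pushes each factor toward 1/2.
print(preference_likelihood(f, D, c=10.0))  # closer to 0.25
```

Preferences consistent with $f$ yield factors above $1/2$; as $c \to \infty$ every comparison becomes a coin flip, matching the interpretation of $c$ as a noise level.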
The posterior $P(f \mid \mathcal{D})$ can be estimated via the Laplace approximation as a multivariate Gaussian distribution; see Section 2.1 and Chu and Ghahramani (2005b) for background on the Laplace approximation. The next subsection discusses mathematical details of the Laplace approximation for the specific posterior in Eq. (5.4), and derives a condition on the link function $g$ that is necessary and sufficient in order for the Laplace approximation to exist.
Finally, in formulating the posterior, preferences can be weighted relative to one another if some are thought to be noisier than others. This is accomplished by changing $c$ to $c_k$ in Eq. (5.6) to model differing values of the preference noise parameter among the data points, and is analogous to weighted Gaussian process regression (Hong et al., 2017).
The Laplace Approximation
The Laplace approximation yields a Gaussian distribution $\mathcal{N}(\hat{f}, \hat{\Sigma})$ centered at the MAP estimate $\hat{f}$:
\[
\hat{f} = \operatorname{argmin}_f \left[-\log P(f \mid \mathcal{D})\right] = \operatorname{argmin}_f S(f), \quad \text{where:}
\]
\[
S(f) := \frac{1}{2} f^T [\Sigma^{pr}]^{-1} f - \sum_{k=1}^{N} \log g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right).
\]
Note that $S(f)$ simply drops the constant terms from $-\log P(f \mid \mathcal{D})$ that do not depend on $f$. The Laplace approximation's posterior covariance $\hat{\Sigma}$ is the inverse of the Hessian matrix of $S(f)$, given by:
\[
\hat{\Sigma} = \left[\nabla_f^2 S(f)\right]^{-1} = \left([\Sigma^{pr}]^{-1} + \Lambda\right)^{-1}, \quad \text{where } \Lambda := \nabla_f^2 \left(-\sum_{k=1}^{N} \log g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)\right),
\]
\[
[\Lambda]_{ij} = \frac{1}{c^2} \sum_{k=1}^{N} s_k(i)\, s_k(j) \left[-\frac{g''\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)} + \left(\frac{g'\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}\right)^2\right],
\]
and
\[
s_k(i) = \begin{cases} 1, & i = k1, \\ -1, & i = k2, \\ 0, & \text{otherwise.} \end{cases}
\]
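The construction above can be sketched numerically; the following is an illustrative implementation (not the thesis code), using the sigmoidal link and an off-the-shelf optimizer for the MAP step, with a made-up prior and preference set:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def S(f, Sigma_pr_inv, prefs, c):
    """The MAP objective S(f) above, with the sigmoidal link."""
    val = 0.5 * f @ Sigma_pr_inv @ f
    for k1, k2 in prefs:
        val -= np.log(sigmoid((f[k1] - f[k2]) / c))
    return val

def laplace_posterior(Sigma_pr, prefs, c=1.0):
    """Return the Laplace approximation N(f_hat, Sigma_hat)."""
    A = Sigma_pr.shape[0]
    Sigma_pr_inv = np.linalg.inv(Sigma_pr)
    f_hat = minimize(S, np.zeros(A), args=(Sigma_pr_inv, prefs, c)).x
    # Assemble Lambda term by term; for the sigmoid link, the bracketed
    # quantity -g''(z)/g(z) + (g'(z)/g(z))^2 simplifies to g(z)(1 - g(z)).
    Lam = np.zeros((A, A))
    for k1, k2 in prefs:
        z = (f_hat[k1] - f_hat[k2]) / c
        m = sigmoid(z) * (1.0 - sigmoid(z)) / c**2
        Lam[[k1, k2], [k1, k2]] += m      # diagonal entries: +m
        Lam[[k1, k2], [k2, k1]] -= m      # off-diagonal entries: -m
    Sigma_hat = np.linalg.inv(Sigma_pr_inv + Lam)
    return f_hat, Sigma_hat

Sigma_pr = np.eye(3)            # trivial prior, purely for illustration
prefs = [(0, 1), (1, 2)]        # observed: action 0 > 1 and 1 > 2
f_hat, Sigma_hat = laplace_posterior(Sigma_pr, prefs)
print(f_hat[0] > f_hat[1] > f_hat[2])   # True: MAP respects the ordering
```

Each preference contributes the four-nonzero-element matrix of Eq. (5.7) to $\Lambda$, which is why the two fancy-indexed updates suffice.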
Thus, each term of $[\Lambda]_{ij}$ is a matrix $M$ with only four nonzero elements, of the form:
\[
\begin{cases} [M]_{k1,k1} = [M]_{k2,k2} = m, \\ [M]_{k1,k2} = [M]_{k2,k1} = -m, \\ [M]_{ij} = 0, \quad \text{otherwise,} \end{cases} \tag{5.7}
\]
where
\[
m = \frac{1}{c^2} \left[-\frac{g''\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)} + \left(\frac{g'\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}\right)^2\right].
\]
It can be shown that the matrix in Eq. (5.7) is positive semidefinite if and only if $m \geq 0$. Therefore, it suffices to show that:
\[
-\frac{g''\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)} + \left(\frac{g'\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}\right)^2 \geq 0.
\]
Since this condition must hold for all input arguments of $g(\cdot)$, we arrive at the following final necessary and sufficient convexity condition for validity of the Laplace approximation:
\[
-\frac{g''(x)}{g(x)} + \left(\frac{g'(x)}{g(x)}\right)^2 \geq 0, \quad \forall x \in \mathbb{R}. \tag{5.8}
\]
Thus, in order to show that the Laplace approximation is valid for some candidate link function $g$, one must simply calculate its derivatives and show that they satisfy the convexity condition in Eq. (5.8). Both the Gaussian link function $g = \Phi$ and the sigmoidal link function $g_{\log}(x) = (1 + e^{-x})^{-1}$ satisfy Eq. (5.8).
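For the sigmoidal link this can be spot-checked numerically (an illustration, not a proof): since $g' = g(1-g)$ and $g'' = g(1-g)(1-2g)$, the left-hand side of Eq. (5.8) reduces in closed form to $g(x)(1-g(x)) \geq 0$, which the finite-difference check below confirms:

```python
import numpy as np

def g(x):
    """Sigmoidal link g_log(x) = (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + np.exp(-x))

def condition_lhs(x, h=1e-5):
    """Left-hand side of Eq. (5.8), -g''(x)/g(x) + (g'(x)/g(x))^2,
    evaluated via central finite differences."""
    g1 = (g(x + h) - g(x - h)) / (2 * h)
    g2 = (g(x + h) - 2 * g(x) + g(x - h)) / h**2
    return -g2 / g(x) + (g1 / g(x)) ** 2

xs = np.linspace(-5.0, 5.0, 101)
vals = condition_lhs(xs)
# Matches the closed form g(x)(1 - g(x)), which is strictly positive,
# so the sigmoid satisfies the convexity condition everywhere tested.
print(np.all(vals >= 0))                                  # True
print(np.allclose(vals, g(xs) * (1 - g(xs)), atol=1e-4))  # True
```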
The CoSpar Learning Algorithm
The tuple $(\Sigma^{pr}, c)$ contains the prior parameters of the Bayesian preference model, as defined above. These parameters are, respectively, the covariance matrix of the Gaussian process prior and a hyperparameter quantifying the degree of noise in the user's preferences. From these parameters, one obtains the prior mean and covariance, $(\mu_0, \Sigma_0)$ (Line 3 in Alg. 13). In each iteration $i$, CoSpar updates the utility model (Line 22) via the Laplace approximation to the posterior in Eq. (5.4) to obtain $\mathcal{N}(\mu_i, \Sigma_i)$.
To select actions in the $i$th iteration (Lines 5-8), the algorithm first draws $n$ samples from the posterior, $\mathcal{N}(\mu_{i-1}, \Sigma_{i-1})$. Each of these is a utility function $f_j$, $j \in \{1, \ldots, n\}$, which assigns a utility value to each action in $\mathcal{A}$. The corresponding selected action is simply the one maximizing $f_j$ (Line 7): $a_j^{(i)} = \operatorname{argmax}_{a \in \mathcal{A}} f_j(a)$ for $j \in \{1, \ldots, n\}$. The $n$ actions are executed (Line 9), and the user provides pairwise preference feedback between pairs of these actions (the user can also state "no preference").
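This selection step is standard Thompson sampling over the finite action set; a minimal sketch, with an assumed placeholder posterior (the mean and covariance below are not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_actions(mu, Sigma, n):
    """Lines 5-8 of Algorithm 13: draw n utility functions from the
    posterior N(mu, Sigma) and select the maximizing action for each."""
    selected = []
    for _ in range(n):
        f_j = rng.multivariate_normal(mu, Sigma)   # sampled utility f_j
        selected.append(int(np.argmax(f_j)))       # a_j = argmax_a f_j(a)
    return selected

# Placeholder posterior over four actions; action 2 has the highest mean.
mu = np.array([0.0, 0.2, 1.0, -0.5])
Sigma = 0.05 * np.eye(4)
picks = select_actions(mu, Sigma, n=200)
# With low posterior variance, action 2 is chosen most of the time, while
# the posterior sampling still occasionally explores runner-up actions.
print(picks.count(2) > 150)   # True
```

Because actions are drawn by maximizing posterior samples rather than the posterior mean, exploration falls out of the posterior uncertainty automatically.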
We extend SelfSparring (Sui, Zhuang, et al., 2017) to extract more preference comparisons from the available trials by assuming that, additionally, the user can remember the $b$ actions preceding the current $n$ actions:

Assumption 7 (Recall buffer). The user remembers the $b$ trials preceding the current iteration, and can therefore give preferences (or state "no preference") between any pair of actions among the $n$ trials in the current iteration and the $b$ previous trials.
The user thus provides preferences between any combination of actions within the current $n$ trials and the previous $b$ trials. For instance, with $n = 1$, $b > 0$, one can interpret $b$ as a buffer of previous trials that the user remembers, such that each new sample is compared against all actions in the buffer. For $n = b = 1$, the user can report preferences between any pair of two consecutive trials, i.e., the user is asked, "Did you like this trial more or less than the previous trial?" For $n = 1$, $b = 2$, the user would additionally be asked, "Did you like this trial more or less than the second-to-last trial?" Compared to $b = 0$, a positive buffer size tends to extract more information from the available trials.

We expect that setting $n = 1$ while increasing $b$ to as many trials as the user can accurately remember would minimize the trials required to reach a preferred gait. In Line 10, the pairwise preferences from iteration $i$ form a matrix $R \in \{0, 1, \emptyset\}^{n \times (n+b)}$; the values 0 and 1 express preference information, while $\emptyset$ denotes the lack of a preference between the actions concerned.
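As an illustration of Lines 11-15, one way to turn the feedback matrix $R$ into dataset entries; the encoding (1 = current action preferred, 0 = compared action preferred, `None` standing in for $\emptyset$) is an assumption consistent with the text, not the thesis implementation:

```python
def extract_preferences(R, current, buffered):
    """Scan R (Lines 11-15 of Algorithm 13) and return (winner, loser)
    action pairs. Rows index the n current actions; columns index those
    same n actions followed by the b buffered previous actions."""
    compared = current + buffered          # column k refers to compared[k]
    prefs = []
    for j, row in enumerate(R):
        for k, r in enumerate(row):
            if r is None or current[j] == compared[k]:
                continue                   # no preference, or self-comparison
            if r == 1:                     # current action preferred
                prefs.append((current[j], compared[k]))
            else:                          # compared action preferred
                prefs.append((compared[k], current[j]))
    return prefs

# n = 1 new action (id 5) with a b = 2 buffer (ids 3 and 4):
# the user prefers 5 to 3, but liked 4 more than 5.
R = [[None, 1, 0]]
print(extract_preferences(R, current=[5], buffered=[3, 4]))  # [(5, 3), (4, 5)]
```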
Finally, the user can suggest improvements in the form of coactive feedback (Line 17). For example, the user could request a longer or shorter step length. In Line 17, $\emptyset$ indicates that no coactive feedback was provided. Otherwise, the user's suggestion is appended to the dataset $\mathcal{D}$ as preferred to the most recently executed action. In learning the model posterior, one can assign the coactive preferences a smaller weight relative to the pairwise preferences via the input parameter $\beta > 0$.
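Down-weighting a coactive preference by $\beta$ amounts to multiplying its log-likelihood term in Eq. (5.6) by $\beta$, analogous to the per-datum noise weighting discussed earlier; a sketch, where the weighting scheme (pairwise preferences at weight 1, coactive entries at weight $\beta < 1$) is an illustrative assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_log_likelihood(f, prefs, weights, c=1.0):
    """Sum over preferences of w_k * log g((f[k1] - f[k2]) / c): a
    coactive entry with weight beta < 1 influences the posterior less
    than an ordinary pairwise preference with weight 1."""
    return sum(w * math.log(sigmoid((f[k1] - f[k2]) / c))
               for (k1, k2), w in zip(prefs, weights))

f = [0.0, 0.8, 0.3]     # illustrative utilities over three actions
beta = 0.5              # coactive feedback weight
# One pairwise preference (action 1 over 0) and one coactive suggestion
# (suggested action 2 preferred to the executed action 0).
prefs = [(1, 0), (2, 0)]
ll_weighted = weighted_log_likelihood(f, prefs, [1.0, beta])
ll_unweighted = weighted_log_likelihood(f, prefs, [1.0, 1.0])
print(ll_weighted > ll_unweighted)   # True: the coactive term is shrunk
```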