
Chapter V: Mixed-Initiative Learning for Exoskeleton Gait Optimization

5.3 The CoSpar Algorithm for Preference-Based Learning

This work utilizes a pre-computed gait library, in which gaits are specified by parameters that include (among others) step dimensions (step length, width, height), step duration, and pelvis roll and pitch.
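For concreteness, one entry of such a gait library might be represented as follows; this is an illustrative sketch only, and the field names and values are placeholders rather than the thesis's actual encoding.

# Illustrative only: one point in a pre-computed gait library. Field names
# and values are placeholders; the thesis does not specify this encoding.
gait_parameters = {
    "step_length_m": 0.15,   # forward step length
    "step_width_m": 0.25,    # lateral distance between the feet
    "step_height_m": 0.05,   # peak swing-foot clearance
    "step_duration_s": 0.9,  # duration of one step
    "pelvis_roll_rad": 0.03,
    "pelvis_pitch_rad": 0.06,
}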

Algorithm 13 CoSpar

1: Input: A = action set, 𝑛 = number of actions to select in each iteration, 𝑏 = buffer size, (Σpr, 𝑐) = utility prior parameters, 𝛽 = coactive feedback weight
2: D = ∅   ⊲ Initialize preference dataset
3: Initialize prior over A: (𝝁0, Σ0) = (0, Σpr)
4: for 𝑖 = 1, 2, . . . , 𝑁 do
5:     for 𝑗 = 1, . . . , 𝑛 do
6:         Sample utility function 𝑓𝑗 from N(𝝁𝑖−1, Σ𝑖−1)
7:         Select action 𝒙𝑗(𝑖) = argmax𝒙∈A 𝑓𝑗(𝒙)
8:     end for
9:     Execute 𝑛 actions {𝒙1(𝑖), . . . , 𝒙𝑛(𝑖)}
10:    Observe pairwise preference feedback matrix 𝑅 ∈ {0, 1, ∅}𝑛×(𝑛+𝑏)
11:    for 𝑗 = 1, . . . , 𝑛; 𝑘 = 1, . . . , 𝑛 + 𝑏 do
12:        if 𝑅𝑗𝑘 ≠ ∅ then
13:            Append preference to dataset D
14:        end if
15:    end for
16:    for 𝑗 = 1, . . . , 𝑛 do
17:        Obtain coactive feedback 𝒙′𝑗(𝑖) ∈ A ∪ {∅}   ⊲ ∅ = no coactive feedback given
18:        if 𝒙′𝑗(𝑖) ≠ ∅ then
19:            Add to D: 𝒙′𝑗(𝑖) preferred to 𝒙𝑗(𝑖), with weight 𝛽
20:        end if
21:    end for
22:    Update Bayesian posterior over D to obtain (𝝁𝑖, Σ𝑖)
23: end for

CoSpar extends the SelfSparring algorithm (Sui, Zhuang, et al., 2017) to incorporate coactive feedback. Similarly to SelfSparring, CoSpar maintains a Bayesian preference relation function over the possible actions, which is fitted to observed preference feedback. CoSpar updates this model with user feedback and uses it to select actions for new trials and to elicit feedback. We first define the Bayesian preference model, and then detail the steps of Algorithm 13.

Bayesian Modeling of Utilities from Preference Data

We adopt the preference-based Gaussian process model of Chu and Ghahramani (2005b). Gaussian process modeling is beneficial, as it enables us to model a Bayesian posterior over a class of smooth, non-parametric functions.

Let A ⊂ R𝑑 be the finite set of available actions, with cardinality 𝐴 = |A|. At any point in time, CoSpar has collected a preference feedback dataset D = {𝒙𝑘1 ≻ 𝒙𝑘2 | 𝑘 = 1, . . . , 𝑁} consisting of 𝑁 preferences, where 𝒙𝑘1 ≻ 𝒙𝑘2 indicates that the user prefers action 𝒙𝑘1 ∈ A to action 𝒙𝑘2 ∈ A in the 𝑘th preference.
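As a minimal sketch (our own representation, not one prescribed by the thesis), the finite action set and the preference dataset D might be stored as follows; the later code sketches in this section reuse these structures.

import numpy as np

# Finite action set A ⊂ R^d, one action per row (here d = 2 and A = |A| = 5).
actions = np.array([[0.10, 0.20],
                    [0.30, 0.20],
                    [0.50, 0.40],
                    [0.70, 0.40],
                    [0.90, 0.60]])

# Preference dataset D: each pair (k1, k2) records that the user preferred
# action k1 over action k2 (both are row indices into `actions`).
preferences = [(2, 0), (2, 4), (3, 1)]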

Furthermore, we assume that each action 𝒙 ∈ A has a latent, underlying utility to the user, 𝑓(𝒙). For finite action spaces, the utilities can be written in vector form: 𝒇 := [𝑓(𝒙1), 𝑓(𝒙2), . . . , 𝑓(𝒙𝐴)]𝑇. Given preference data D, we are interested in the posterior probability of 𝒇:

๐‘ƒ(๐’‡ | D) โˆ๐‘ƒ(D | ๐’‡)๐‘ƒ(๐’‡). (5.4) We define a Gaussian prior over ๐’‡:

๐‘ƒ(๐’‡) = 1

(2๐œ‹)๐ด/2|ฮฃpr|1/2exp

โˆ’1

2 ๐’‡๐‘‡[ฮฃpr]โˆ’1๐’‡

,

whereฮฃpr โˆˆR๐ดร—๐ดis the prior covariance matrix, such that [ฮฃpr]๐‘— ๐‘˜ =K (๐’™๐‘—,๐’™๐‘˜)for a kernel functionK, for instance the squared exponential kernel given in Eq. (A.4).
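As an illustration of how Σpr could be assembled from a squared exponential kernel (the length scale, signal variance, and jitter below are placeholder hyperparameters of our choosing):

import numpy as np

def squared_exponential(x1, x2, lengthscale=0.3, signal_var=1.0):
    # K(x1, x2) = signal_var * exp(-||x1 - x2||^2 / (2 * lengthscale^2))
    return signal_var * np.exp(-0.5 * np.sum((x1 - x2) ** 2) / lengthscale ** 2)

def prior_covariance(actions, jitter=1e-6):
    # [Sigma_pr]_{jk} = K(x_j, x_k); a small diagonal jitter keeps the
    # matrix numerically positive definite and invertible.
    A = actions.shape[0]
    Sigma_pr = np.array([[squared_exponential(actions[j], actions[k])
                          for k in range(A)] for j in range(A)])
    return Sigma_pr + jitter * np.eye(A)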

For computing the likelihood ๐‘ƒ(D | ๐’‡), we assume that the userโ€™s preference feedback may be corrupted by noise:

๐‘ƒ(๐’™๐‘˜1 ๐’™๐‘˜2 | ๐’‡) =๐‘”

๐‘“(๐’™๐‘˜1) โˆ’ ๐‘“(๐’™๐‘˜2) ๐‘

, (5.5)

where ๐‘”(ยท) โˆˆ [0,1] is a monotonically-increasing link function, and ๐‘ > 0 is a hyperparameter indicating the degree of noise in the preferences. Note that the likelihood in Eq. (5.5) generalizes the one given in Chu and Ghahramani (2005b), which corresponds specifically to a Gaussian noise model as described in Section 2.2.

The likelihood from Chu and Ghahramani (2005b) is obtained by setting 𝑐 = √2𝜎 and 𝑔 = Φ in Eq. (5.5), where Φ is the standard Gaussian cumulative distribution function.

Thus, the full expression for the likelihood is:

\[
P(\mathcal{D} \mid \boldsymbol{f}) = \prod_{k=1}^{N} g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right). \tag{5.6}
\]
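In code, the log of this likelihood might look as follows; this is a sketch assuming the sigmoidal link 𝑔log discussed below (the Gaussian link would substitute scipy.stats.norm.cdf for sigmoid).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(f, preferences, c=1.0, g=sigmoid):
    # log P(D | f) = sum_k log g((f(x_k1) - f(x_k2)) / c)
    return sum(np.log(g((f[k1] - f[k2]) / c)) for k1, k2 in preferences)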

The posterior 𝑃(𝒇 | D) can be estimated via the Laplace approximation as a multivariate Gaussian distribution; see Section 2.1 and Chu and Ghahramani (2005b) for background on the Laplace approximation. The next subsection discusses mathematical details of the Laplace approximation for the specific posterior in Eq. (5.4), and derives a condition on the link function 𝑔 that is necessary and sufficient for the Laplace approximation to exist.

Finally, in formulating the posterior, preferences can be weighted relative to one another if some are thought to be noisier than others. This is accomplished by changing 𝑐 to 𝑐𝑘 in Eq. (5.6) to model differing values of the preference noise parameter among the data points, and is analogous to weighted Gaussian process regression (Hong et al., 2017).

The Laplace Approximation

The Laplace approximation yields a Gaussian distribution N(𝒇̂, Σ̂) centered at the MAP estimate 𝒇̂:

\[
\hat{\boldsymbol{f}} = \operatorname*{argmin}_{\boldsymbol{f}} \left[ -\log P(\boldsymbol{f} \mid \mathcal{D}) \right] = \operatorname*{argmin}_{\boldsymbol{f}} S(\boldsymbol{f}), \quad \text{where:}
\]

\[
S(\boldsymbol{f}) := \frac{1}{2}\, \boldsymbol{f}^{T} [\Sigma_{\mathrm{pr}}]^{-1} \boldsymbol{f} - \sum_{k=1}^{N} \log g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right).
\]
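A sketch of the MAP computation, minimizing 𝑆(𝒇) with a generic quasi-Newton optimizer; the optimizer choice and the sigmoidal link are our assumptions here, as the thesis does not prescribe them.

import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def S(f, Sigma_pr_inv, preferences, c=1.0, g=sigmoid):
    # Negative log posterior up to constants: prior quadratic term minus
    # the log likelihood of the observed preferences.
    prior_term = 0.5 * f @ Sigma_pr_inv @ f
    diffs = np.array([f[k1] - f[k2] for k1, k2 in preferences])
    return prior_term - np.sum(np.log(g(diffs / c)))

def map_estimate(Sigma_pr, preferences, c=1.0):
    # f_hat = argmin_f S(f), starting from the prior mean f = 0.
    Sigma_pr_inv = np.linalg.inv(Sigma_pr)
    f0 = np.zeros(Sigma_pr.shape[0])
    return minimize(S, f0, args=(Sigma_pr_inv, preferences, c)).x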

Note that 𝑆(𝒇) simply drops the constant terms from −log 𝑃(𝒇 | D) that do not depend on 𝒇. The Laplace approximation's posterior covariance Σ̂ is the inverse of the Hessian matrix of 𝑆(𝒇), given by:

\[
\hat{\Sigma} = \left[ \nabla^{2}_{\boldsymbol{f}} S(\boldsymbol{f}) \right]^{-1} = \left( [\Sigma_{\mathrm{pr}}]^{-1} + \Lambda \right)^{-1}, \quad \text{where } \Lambda := \nabla^{2}_{\boldsymbol{f}} \left\{ -\sum_{k=1}^{N} \log g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right) \right\},
\]

\[
[\Lambda]_{jl} = \frac{1}{c^{2}} \sum_{k=1}^{N} s_{k}(j)\, s_{k}(l) \left[ -\frac{g''\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)}{g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)} + \left( \frac{g'\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)}{g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)} \right)^{2} \right],
\]

\[
\text{and} \quad s_{k}(j) =
\begin{cases}
1, & j = k_{1}, \\
-1, & j = k_{2}, \\
0, & \text{otherwise}.
\end{cases}
\]
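A sketch of assembling Λ and the posterior covariance Σ̂ from these formulas, again assuming the sigmoidal link (for which 𝑔′ = 𝑔(1 − 𝑔) and 𝑔″ = 𝑔(1 − 𝑔)(1 − 2𝑔)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_covariance(f_hat, Sigma_pr, preferences, c=1.0):
    A = Sigma_pr.shape[0]
    Lam = np.zeros((A, A))
    for k1, k2 in preferences:
        z = (f_hat[k1] - f_hat[k2]) / c
        g = sigmoid(z)
        g1 = g * (1.0 - g)          # g'(z) for the sigmoidal link
        g2 = g1 * (1.0 - 2.0 * g)   # g''(z) for the sigmoidal link
        a = (-g2 / g + (g1 / g) ** 2) / c ** 2
        # The factor s_k(j) s_k(l) places +a on the (k1, k1) and (k2, k2)
        # diagonal entries and -a on the (k1, k2) and (k2, k1) entries.
        Lam[k1, k1] += a
        Lam[k2, k2] += a
        Lam[k1, k2] -= a
        Lam[k2, k1] -= a
    return np.linalg.inv(np.linalg.inv(Sigma_pr) + Lam)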

Each term in the sum defining Λ is thus a matrix 𝑀 with only four nonzero elements, of the form:

\[
\begin{cases}
[M]_{k_{1} k_{1}} = [M]_{k_{2} k_{2}} = a, \\
[M]_{k_{1} k_{2}} = [M]_{k_{2} k_{1}} = -a, \\
[M]_{jl} = 0, \quad \text{otherwise},
\end{cases}
\tag{5.7}
\]

where
\[
a = \frac{1}{c^{2}} \left[ -\frac{g''\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)}{g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)} + \left( \frac{g'\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)}{g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)} \right)^{2} \right].
\]

It can be shown that the matrix in Eq. (5.7) is positive semidefinite if and only if 𝑎 ≥ 0. Therefore, it suffices to show that:

\[
-\frac{g''\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)}{g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)} + \left( \frac{g'\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)}{g\!\left( \frac{f(\boldsymbol{x}_{k1}) - f(\boldsymbol{x}_{k2})}{c} \right)} \right)^{2} \geq 0.
\]

Since this condition must hold for all input arguments of๐‘”(ยท), we arrive at the fol- lowing final necessary and sufficient convexity condition for validity of the Laplace approximation:

โˆ’๐‘”00(๐‘ฅ) ๐‘”(๐‘ฅ) +

๐‘”0(๐‘ฅ) ๐‘”(๐‘ฅ)

2

โ‰ฅ 0, โˆ€๐‘ฅ โˆˆR. (5.8)

Thus, in order to show that the Laplace approximation is valid for some candidate link function 𝑔, one must simply calculate its derivatives and show that they satisfy the convexity condition in Eq. (5.8). Both the Gaussian link function 𝑔 = Φ and the sigmoidal link function 𝑔log(𝑥) = (1 + 𝑒⁻ˣ)⁻¹ satisfy Eq. (5.8).
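As a worked check (our own derivation) that the sigmoidal link satisfies Eq. (5.8), using the standard logistic identities 𝑔′ = 𝑔(1 − 𝑔) and 𝑔″ = 𝑔(1 − 𝑔)(1 − 2𝑔):

\begin{align*}
-\frac{g''(x)}{g(x)} + \left( \frac{g'(x)}{g(x)} \right)^{2}
  &= -(1 - g(x))(1 - 2g(x)) + (1 - g(x))^{2} \\
  &= (1 - g(x)) \big[ -(1 - 2g(x)) + (1 - g(x)) \big] \\
  &= g(x)\,(1 - g(x)) \;\geq\; 0 \qquad \forall x \in \mathbb{R}.
\end{align*}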

The CoSpar Learning Algorithm

The tuple (Σpr, 𝑐) contains the prior parameters of the Bayesian preference model, as defined above. These parameters are, respectively, the covariance matrix of the Gaussian process prior and a hyperparameter quantifying the degree of noise in the user's preferences. From these parameters, one obtains the prior mean and covariance, (𝝁0, Σ0) (Line 3 in Alg. 13). In each iteration 𝑖, CoSpar updates the utility model (Line 22) via the Laplace approximation to the posterior in Eq. (5.4) to obtain N(𝝁𝑖, Σ𝑖).

To select actions in the 𝑖th iteration (Lines 5-8), the algorithm first draws 𝑛 samples from the posterior, N(𝝁𝑖−1, Σ𝑖−1). Each of these is a utility function 𝑓𝑗, 𝑗 ∈ {1, . . . , 𝑛}, which assigns a utility value to each action in A. The corresponding selected action is simply the one maximizing 𝑓𝑗 (Line 7): 𝒙𝑗(𝑖) = argmax𝒙∈A 𝑓𝑗(𝒙) for 𝑗 ∈ {1, . . . , 𝑛}. The 𝑛 actions are executed (Line 9), and the user provides pairwise preference feedback between pairs of these actions (the user can also state "no preference").
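A sketch of this Thompson-sampling-style selection step over a finite action set (the function name and interface are ours):

import numpy as np

def select_actions(mu, Sigma, n, rng=None):
    # Draw n utility functions from the posterior N(mu, Sigma) and select
    # the maximizing action for each draw (Lines 5-8 of Alg. 13).
    rng = rng or np.random.default_rng()
    samples = rng.multivariate_normal(mu, Sigma, size=n)  # each row is one f_j
    return [int(np.argmax(f_j)) for f_j in samples]       # indices into A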

We extend SelfSparring (Sui, Zhuang, et al., 2017) to extract more preference comparisons from the available trials by assuming that the user can additionally remember the 𝑏 actions preceding the current 𝑛 actions:

Assumption 7 (Recall buffer). The user remembers the 𝑏 trials preceding the current iteration, and can therefore give preferences (or state "no preference") between any pair of actions among the 𝑛 trials in the current iteration and the 𝑏 previous trials.

The user thus provides preferences between any combination of actions within the current 𝑛 trials and the previous 𝑏 trials. For instance, with 𝑛 = 1, 𝑏 > 0, one can interpret 𝑏 as a buffer of previous trials that the user remembers, such that each new sample is compared against all actions in the buffer. For 𝑛 = 𝑏 = 1, the user can report a preference between any two consecutive trials, i.e., the user is asked, "Did you like this trial more or less than the previous trial?" For 𝑛 = 1, 𝑏 = 2, the user would additionally be asked, "Did you like this trial more or less than the second-to-last trial?" Compared to 𝑏 = 0, a positive buffer size tends to extract more information from the available trials.

We expect that setting 𝑛 = 1 while increasing 𝑏 to as many trials as the user can accurately remember would minimize the trials required to reach a preferred gait. In Line 10, the pairwise preferences from iteration 𝑖 form a matrix 𝑅 ∈ {0, 1, ∅}𝑛×(𝑛+𝑏); the values 0 and 1 express preference information, while ∅ denotes the lack of a preference between the actions concerned.
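A sketch of unpacking such a feedback matrix into preference pairs, where we represent ∅ as None and adopt the (assumed) convention that 𝑅𝑗𝑘 = 1 means the 𝑗th new action was preferred:

def preferences_from_feedback(R, executed, buffer):
    # `executed`: indices of the n actions from the current iteration;
    # `buffer`: indices of the b remembered previous trials.
    # R[j][k] compares executed[j] against (executed + buffer)[k]:
    # 1 = new action preferred, 0 = other action preferred, None = no preference.
    candidates = list(executed) + list(buffer)
    pairs = []
    for j, x_j in enumerate(executed):
        for k, x_k in enumerate(candidates):
            if x_k == x_j or R[j][k] is None:
                continue
            pairs.append((x_j, x_k) if R[j][k] == 1 else (x_k, x_j))
    return pairs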

Finally, the user can suggest improvements in the form of coactive feedback (Line 17). For example, the user could request a longer or shorter step length. In Line 17, ∅ indicates that no coactive feedback was provided. Otherwise, the user's suggestion is appended to the dataset D as preferred to the most recently executed action. In learning the model posterior, one can assign the coactive preferences a smaller weight relative to the pairwise preferences via the input parameter 𝛽 > 0.
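One way (an assumption on our part) to realize the weight 𝛽 is through the per-preference noise parameters 𝑐𝑘 mentioned earlier, inflating the noise on coactive preferences so that they influence the posterior less:

def add_coactive_preference(dataset, noise_params, suggested, executed,
                            beta=0.5, c=1.0):
    # Record the user's suggested action as preferred to the executed one,
    # with noise c_k = c / beta (beta < 1 gives a noisier, lower-weight label).
    dataset.append((suggested, executed))
    noise_params.append(c / beta)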
