Chapter V: Mixed-Initiative Learning for Exoskeleton Gait Optimization
5.3 The CoSpar Algorithm for Preference-Based Learning
This work utilizes a pre-computed gait library, in which gaits are specified by parameters that include (among others) step dimensions (step length, width, height), step duration, and pelvis roll and pitch.
Algorithm 13 CoSpar
1: Input: $\mathcal{A}$ = action set, $n$ = number of actions to select in each iteration, $b$ = buffer size, $(\Sigma^{pr}, c)$ = utility prior parameters, $\beta$ = coactive feedback weight
2: $\mathcal{D} = \emptyset$    ▷ Initialize preference dataset
3: Initialize prior over $\mathcal{A}$: $(\mu_0, \Sigma_0) = (\mathbf{0}, \Sigma^{pr})$
4: for $i = 1, 2, \ldots, T$ do
5:    for $j = 1, \ldots, n$ do
6:       Sample utility function $f_j$ from $\mathcal{N}(\mu_{i-1}, \Sigma_{i-1})$
7:       Select action $a_j^{(i)} = \operatorname{argmax}_{a \in \mathcal{A}} f_j(a)$
8:    end for
9:    Execute $n$ actions $\{a_1^{(i)}, \ldots, a_n^{(i)}\}$
10:   Observe pairwise preference feedback matrix $R \in \{0, 1, \emptyset\}^{n \times (n+b)}$
11:   for $j = 1, \ldots, n$; $k = 1, \ldots, n+b$ do
12:      if $R_{jk} \neq \emptyset$ then
13:         Append preference to dataset $\mathcal{D}$
14:      end if
15:   end for
16:   for $j = 1, \ldots, n$ do
17:      Obtain coactive feedback $\bar{a}_j^{(i)} \in \mathcal{A} \cup \{\emptyset\}$    ▷ $\emptyset$ = no coactive feedback given
18:      if $\bar{a}_j^{(i)} \neq \emptyset$ then
19:         Add to $\mathcal{D}$: $\bar{a}_j^{(i)}$ preferred to $a_j^{(i)}$, with weight $\beta$
20:      end if
21:   end for
22:   Update Bayesian posterior over $\mathcal{D}$ to obtain $(\mu_i, \Sigma_i)$
23: end for
This section presents CoSpar, a mixed-initiative learning framework (Lester, Stone, and Stelling, 1999) which extends the SelfSparring algorithm to incorporate coactive feedback. Similarly to SelfSparring, CoSpar maintains a Bayesian preference relation function over the possible actions, which is fitted to observed preference feedback. CoSpar updates this model with user feedback and uses it to select actions for new trials and to elicit feedback. We first define the Bayesian preference model, and then detail the steps of Algorithm 13.
Bayesian Modeling of Utilities from Preference Data
We adopt the preference-based Gaussian process model of Chu and Ghahramani (2005b). Gaussian process modeling is beneficial, as it enables us to model a Bayesian posterior over a class of smooth, non-parametric functions.
Let $\mathcal{A} \subset \mathbb{R}^d$ be the finite set of available actions, with cardinality $A = |\mathcal{A}|$. At any point in time, CoSpar has collected a preference feedback dataset $\mathcal{D} = \{a_{k1} \succ a_{k2} \mid k = 1, \ldots, N\}$ consisting of $N$ preferences, where $a_{k1} \succ a_{k2}$ indicates that the user prefers action $a_{k1} \in \mathcal{A}$ to action $a_{k2} \in \mathcal{A}$ in the $k$th preference.
Furthermore, we assume that each action $a \in \mathcal{A}$ has a latent, underlying utility to the user, $f(a)$. For finite action spaces, the utilities can be written in vector form: $f := [f(a_1), f(a_2), \ldots, f(a_A)]^T$. Given preference data $\mathcal{D}$, we are interested in the posterior probability of $f$:
\[
P(f \mid \mathcal{D}) \propto P(\mathcal{D} \mid f)\, P(f). \tag{5.4}
\]
We define a Gaussian prior over $f$:
\[
P(f) = \frac{1}{(2\pi)^{A/2} |\Sigma^{pr}|^{1/2}} \exp\left(-\frac{1}{2} f^T [\Sigma^{pr}]^{-1} f\right),
\]
where $\Sigma^{pr} \in \mathbb{R}^{A \times A}$ is the prior covariance matrix, such that $[\Sigma^{pr}]_{ij} = \mathcal{K}(a_i, a_j)$ for a kernel function $\mathcal{K}$, for instance the squared exponential kernel given in Eq. (A.4).
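As a purely illustrative sketch (not the thesis code), the prior covariance over a small one-dimensional action grid could be assembled from a squared exponential kernel as follows; the grid, length-scale, and variance are assumed values chosen for the example:

```python
import numpy as np

def squared_exponential_kernel(a, b, lengthscale=0.15, variance=1.0):
    """Squared exponential kernel K(a, b) (cf. Eq. (A.4))."""
    sq_dist = np.sum((a - b) ** 2)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Finite action set: a grid of 5 one-dimensional actions.
actions = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
A = len(actions)

# Prior covariance matrix [Sigma_pr]_{ij} = K(a_i, a_j).
Sigma_pr = np.array([[squared_exponential_kernel(actions[i], actions[j])
                      for j in range(A)] for i in range(A)])

print(Sigma_pr.shape)                     # (5, 5)
print(np.allclose(Sigma_pr, Sigma_pr.T))  # True: kernel matrices are symmetric
```

Any positive-definite kernel could be substituted here; the squared exponential encodes the smoothness assumption on utilities that motivates the Gaussian process prior.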
For computing the likelihood $P(\mathcal{D} \mid f)$, we assume that the user's preference feedback may be corrupted by noise:
\[
P(a_{k1} \succ a_{k2} \mid f) = g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right), \tag{5.5}
\]
where $g(\cdot) \in [0,1]$ is a monotonically-increasing link function, and $c > 0$ is a hyperparameter indicating the degree of noise in the preferences. Note that the likelihood in Eq. (5.5) generalizes the one given in Chu and Ghahramani (2005b), which corresponds specifically to a Gaussian noise model as described in Section 2.2. The likelihood from Chu and Ghahramani (2005b) is obtained by setting $c = \sqrt{2}\,\sigma$ and $g = \Phi$ in Eq. (5.5), where $\Phi$ is the standard Gaussian cumulative distribution function.
Thus, the full expression for the likelihood is:
\[
P(\mathcal{D} \mid f) = \prod_{k=1}^{N} g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right). \tag{5.6}
\]
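To make Eq. (5.6) concrete, a minimal sketch of the preference likelihood under the Gaussian link $g = \Phi$; the utilities and preference indices below are made-up illustrative values:

```python
import math

def Phi(x):
    """Standard Gaussian CDF, the link g used by Chu and Ghahramani."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def preference_likelihood(f, prefs, c=1.0):
    """P(D | f) from Eq. (5.6): a product over the N preferences of
    g((f(a_k1) - f(a_k2)) / c), here with the Gaussian link g = Phi."""
    lik = 1.0
    for k1, k2 in prefs:        # k1 preferred to k2
        lik *= Phi((f[k1] - f[k2]) / c)
    return lik

# Illustrative latent utilities over four actions (made-up values).
f = [0.0, 0.5, 1.5, -0.3]
# Dataset D: action 2 preferred to action 1, and 1 preferred to 3.
D = [(2, 1), (1, 3)]

print(preference_likelihood(f, D, c=1.0))   # ≈ 0.66
# A noisier user (larger c) pushes each factor toward 1/2.
print(preference_likelihood(f, D, c=10.0))  # closer to 0.25
```

Preferences consistent with $f$ yield factors above $1/2$; as $c \to \infty$ every comparison becomes a coin flip, matching the interpretation of $c$ as a noise level.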
The posterior $P(f \mid \mathcal{D})$ can be estimated via the Laplace approximation as a multivariate Gaussian distribution; see Section 2.1 and Chu and Ghahramani (2005b) for background on the Laplace approximation. The next subsection discusses mathematical details of the Laplace approximation for the specific posterior in Eq. (5.4), and derives a condition on the link function $g$ that is necessary and sufficient in order for the Laplace approximation to exist.
Finally, in formulating the posterior, preferences can be weighted relative to one another if some are thought to be noisier than others. This is accomplished by changing $c$ to $c_k$ in Eq. (5.6) to model differing values of the preference noise parameter among the data points, and is analogous to weighted Gaussian process regression (Hong et al., 2017).
The Laplace Approximation
The Laplace approximation yields a Gaussian distribution $\mathcal{N}(\hat{f}, \hat{\Sigma})$ centered at the MAP estimate $\hat{f}$:
\[
\hat{f} = \operatorname{argmin}_f \left[-\log P(f \mid \mathcal{D})\right] = \operatorname{argmin}_f S(f), \quad \text{where:}
\]
\[
S(f) := \frac{1}{2} f^T [\Sigma^{pr}]^{-1} f - \sum_{k=1}^{N} \log g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right).
\]
Note that $S(f)$ simply drops the constant terms from $-\log P(f \mid \mathcal{D})$ that do not depend on $f$. The Laplace approximation's posterior covariance $\hat{\Sigma}$ is the inverse of the Hessian matrix of $S(f)$, given by:
\[
\hat{\Sigma} = \left[\nabla_f^2 S(f)\right]^{-1} = \left([\Sigma^{pr}]^{-1} + \Lambda\right)^{-1}, \quad \text{where } \Lambda := \nabla_f^2 \left(-\sum_{k=1}^{N} \log g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)\right),
\]
\[
[\Lambda]_{ij} = \frac{1}{c^2} \sum_{k=1}^{N} s_k(i)\, s_k(j) \left[-\frac{g''\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)} + \left(\frac{g'\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}\right)^2\right],
\]
and
\[
s_k(i) = \begin{cases} 1, & i = k1, \\ -1, & i = k2, \\ 0, & \text{otherwise.} \end{cases}
\]
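The construction above can be sketched numerically; the following is an illustrative implementation (not the thesis code), using the sigmoidal link and an off-the-shelf optimizer for the MAP step, with a made-up prior and preference set:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def S(f, Sigma_pr_inv, prefs, c):
    """The MAP objective S(f) above, with the sigmoidal link."""
    val = 0.5 * f @ Sigma_pr_inv @ f
    for k1, k2 in prefs:
        val -= np.log(sigmoid((f[k1] - f[k2]) / c))
    return val

def laplace_posterior(Sigma_pr, prefs, c=1.0):
    """Return the Laplace approximation N(f_hat, Sigma_hat)."""
    A = Sigma_pr.shape[0]
    Sigma_pr_inv = np.linalg.inv(Sigma_pr)
    f_hat = minimize(S, np.zeros(A), args=(Sigma_pr_inv, prefs, c)).x
    # Assemble Lambda term by term; for the sigmoid link, the bracketed
    # quantity -g''(z)/g(z) + (g'(z)/g(z))^2 simplifies to g(z)(1 - g(z)).
    Lam = np.zeros((A, A))
    for k1, k2 in prefs:
        z = (f_hat[k1] - f_hat[k2]) / c
        m = sigmoid(z) * (1.0 - sigmoid(z)) / c**2
        Lam[[k1, k2], [k1, k2]] += m      # diagonal entries: +m
        Lam[[k1, k2], [k2, k1]] -= m      # off-diagonal entries: -m
    Sigma_hat = np.linalg.inv(Sigma_pr_inv + Lam)
    return f_hat, Sigma_hat

Sigma_pr = np.eye(3)            # trivial prior, purely for illustration
prefs = [(0, 1), (1, 2)]        # observed: action 0 > 1 and 1 > 2
f_hat, Sigma_hat = laplace_posterior(Sigma_pr, prefs)
print(f_hat[0] > f_hat[1] > f_hat[2])   # True: MAP respects the ordering
```

Each preference contributes the four-nonzero-element matrix of Eq. (5.7) to $\Lambda$, which is why the two fancy-indexed updates suffice.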
Thus, each term of $[\Lambda]_{ij}$ is a matrix $M$ with only four nonzero elements, of the form:
\[
\begin{cases} [M]_{k1,k1} = [M]_{k2,k2} = m, \\ [M]_{k1,k2} = [M]_{k2,k1} = -m, \\ [M]_{ij} = 0, \quad \text{otherwise,} \end{cases} \tag{5.7}
\]
where
\[
m = \frac{1}{c^2} \left[-\frac{g''\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)} + \left(\frac{g'\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}\right)^2\right].
\]
It can be shown that the matrix in Eq. (5.7) is positive semidefinite if and only if $m \geq 0$. Therefore, it suffices to show that:
\[
-\frac{g''\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)} + \left(\frac{g'\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}{g\!\left(\frac{f(a_{k1}) - f(a_{k2})}{c}\right)}\right)^2 \geq 0.
\]
Since this condition must hold for all input arguments of $g(\cdot)$, we arrive at the following final necessary and sufficient convexity condition for validity of the Laplace approximation:
\[
-\frac{g''(x)}{g(x)} + \left(\frac{g'(x)}{g(x)}\right)^2 \geq 0, \quad \forall x \in \mathbb{R}. \tag{5.8}
\]
Thus, in order to show that the Laplace approximation is valid for some candidate link function $g$, one must simply calculate its derivatives and show that they satisfy the convexity condition in Eq. (5.8). Both the Gaussian link function $g = \Phi$ and the sigmoidal link function $g_{\log}(x) = (1 + e^{-x})^{-1}$ satisfy Eq. (5.8).
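For the sigmoidal link this can be spot-checked numerically (an illustration, not a proof): since $g' = g(1-g)$ and $g'' = g(1-g)(1-2g)$, the left-hand side of Eq. (5.8) reduces in closed form to $g(x)(1-g(x)) \geq 0$, which the finite-difference check below confirms:

```python
import numpy as np

def g(x):
    """Sigmoidal link g_log(x) = (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + np.exp(-x))

def condition_lhs(x, h=1e-5):
    """Left-hand side of Eq. (5.8), -g''(x)/g(x) + (g'(x)/g(x))^2,
    evaluated via central finite differences."""
    g1 = (g(x + h) - g(x - h)) / (2 * h)
    g2 = (g(x + h) - 2 * g(x) + g(x - h)) / h**2
    return -g2 / g(x) + (g1 / g(x)) ** 2

xs = np.linspace(-5.0, 5.0, 101)
vals = condition_lhs(xs)
# Matches the closed form g(x)(1 - g(x)), which is strictly positive,
# so the sigmoid satisfies the convexity condition everywhere tested.
print(np.all(vals >= 0))                                  # True
print(np.allclose(vals, g(xs) * (1 - g(xs)), atol=1e-4))  # True
```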
The CoSpar Learning Algorithm
The tuple $(\Sigma^{pr}, c)$ contains the prior parameters of the Bayesian preference model, as defined above. These parameters are, respectively, the covariance matrix of the Gaussian process prior and a hyperparameter quantifying the degree of noise in the user's preferences. From these parameters, one obtains the prior mean and covariance, $(\mu_0, \Sigma_0)$ (Line 3 in Alg. 13). In each iteration $i$, CoSpar updates the utility model (Line 22) via the Laplace approximation to the posterior in Eq. (5.4) to obtain $\mathcal{N}(\mu_i, \Sigma_i)$.
To select actions in the $i$th iteration (Lines 5-8), the algorithm first draws $n$ samples from the posterior, $\mathcal{N}(\mu_{i-1}, \Sigma_{i-1})$. Each of these is a utility function $f_j$, $j \in \{1, \ldots, n\}$, which assigns a utility value to each action in $\mathcal{A}$. The corresponding selected action is simply the one maximizing $f_j$ (Line 7): $a_j^{(i)} = \operatorname{argmax}_{a \in \mathcal{A}} f_j(a)$ for $j \in \{1, \ldots, n\}$. The $n$ actions are executed (Line 9), and the user provides pairwise preference feedback between pairs of these actions (the user can also state "no preference").
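This selection step is standard Thompson sampling over the finite action set; a minimal sketch, with an assumed placeholder posterior (the mean and covariance below are not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_actions(mu, Sigma, n):
    """Lines 5-8 of Algorithm 13: draw n utility functions from the
    posterior N(mu, Sigma) and select the maximizing action for each."""
    selected = []
    for _ in range(n):
        f_j = rng.multivariate_normal(mu, Sigma)   # sampled utility f_j
        selected.append(int(np.argmax(f_j)))       # a_j = argmax_a f_j(a)
    return selected

# Placeholder posterior over four actions; action 2 has the highest mean.
mu = np.array([0.0, 0.2, 1.0, -0.5])
Sigma = 0.05 * np.eye(4)
picks = select_actions(mu, Sigma, n=200)
# With low posterior variance, action 2 is chosen most of the time, while
# the posterior sampling still occasionally explores runner-up actions.
print(picks.count(2) > 150)   # True
```

Because actions are drawn by maximizing posterior samples rather than the posterior mean, exploration falls out of the posterior uncertainty automatically.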
We extend SelfSparring (Sui, Zhuang, et al., 2017) to extract more preference comparisons from the available trials by assuming that, additionally, the user can remember the $b$ actions preceding the current $n$ actions:

Assumption 7 (Recall buffer). The user remembers the $b$ trials preceding the current iteration, and can therefore give preferences (or state "no preference") between any pair of actions among the $n$ trials in the current iteration and the $b$ previous trials.
The user thus provides preferences between any combination of actions within the current $n$ trials and the previous $b$ trials. For instance, with $n = 1$, $b > 0$, one can interpret $b$ as a buffer of previous trials that the user remembers, such that each new sample is compared against all actions in the buffer. For $n = b = 1$, the user can report preferences between any pair of two consecutive trials, i.e., the user is asked, "Did you like this trial more or less than the previous trial?" For $n = 1$, $b = 2$, the user would additionally be asked, "Did you like this trial more or less than the second-to-last trial?" Compared to $b = 0$, a positive buffer size tends to extract more information from the available trials.

We expect that setting $n = 1$ while increasing $b$ to as many trials as the user can accurately remember would minimize the trials required to reach a preferred gait. In Line 10, the pairwise preferences from iteration $i$ form a matrix $R \in \{0, 1, \emptyset\}^{n \times (n+b)}$; the values 0 and 1 express preference information, while $\emptyset$ denotes the lack of a preference between the actions concerned.
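As an illustration of Lines 11-15, one way to turn the feedback matrix $R$ into dataset entries; the encoding (1 = current action preferred, 0 = compared action preferred, `None` standing in for $\emptyset$) is an assumption consistent with the text, not the thesis implementation:

```python
def extract_preferences(R, current, buffered):
    """Scan R (Lines 11-15 of Algorithm 13) and return (winner, loser)
    action pairs. Rows index the n current actions; columns index those
    same n actions followed by the b buffered previous actions."""
    compared = current + buffered          # column k refers to compared[k]
    prefs = []
    for j, row in enumerate(R):
        for k, r in enumerate(row):
            if r is None or current[j] == compared[k]:
                continue                   # no preference, or self-comparison
            if r == 1:                     # current action preferred
                prefs.append((current[j], compared[k]))
            else:                          # compared action preferred
                prefs.append((compared[k], current[j]))
    return prefs

# n = 1 new action (id 5) with a b = 2 buffer (ids 3 and 4):
# the user prefers 5 to 3, but liked 4 more than 5.
R = [[None, 1, 0]]
print(extract_preferences(R, current=[5], buffered=[3, 4]))  # [(5, 3), (4, 5)]
```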
Finally, the user can suggest improvements in the form of coactive feedback (Line 17). For example, the user could request a longer or shorter step length. In Line 17, $\emptyset$ indicates that no coactive feedback was provided. Otherwise, the user's suggestion is appended to the dataset $\mathcal{D}$ as preferred to the most recently executed action. In learning the model posterior, one can assign the coactive preferences a smaller weight relative to the pairwise preferences via the input parameter $\beta > 0$.
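Down-weighting a coactive preference by $\beta$ amounts to multiplying its log-likelihood term in Eq. (5.6) by $\beta$, analogous to the per-datum noise weighting discussed earlier; a sketch, where the weighting scheme (pairwise preferences at weight 1, coactive entries at weight $\beta < 1$) is an illustrative assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_log_likelihood(f, prefs, weights, c=1.0):
    """Sum over preferences of w_k * log g((f[k1] - f[k2]) / c): a
    coactive entry with weight beta < 1 influences the posterior less
    than an ordinary pairwise preference with weight 1."""
    return sum(w * math.log(sigmoid((f[k1] - f[k2]) / c))
               for (k1, k2), w in zip(prefs, weights))

f = [0.0, 0.8, 0.3]     # illustrative utilities over three actions
beta = 0.5              # coactive feedback weight
# One pairwise preference (action 1 over 0) and one coactive suggestion
# (suggested action 2 preferred to the executed action 0).
prefs = [(1, 0), (2, 0)]
ll_weighted = weighted_log_likelihood(f, prefs, [1.0, beta])
ll_unweighted = weighted_log_likelihood(f, prefs, [1.0, 1.0])
print(ll_weighted > ll_unweighted)   # True: the coactive term is shrunk
```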