
A.2 Gaussian Process Regression

Credit assignment via Gaussian processes (Rasmussen and Williams, 2006) extends the linear credit assignment model in A.1 to larger numbers of features $d$ by generalizing across similar features. For instance, in the RL setting, one could learn over larger state and action spaces by generalizing across nearby states and actions. This section and the following one consider two Gaussian process-based credit assignment approaches.

To perform credit assignment via Gaussian process regression, one can assign binary labels to each observation based on whether it is preferred or dominated. A Gaussian process prior is placed upon the underlying utilities $\boldsymbol{r}$; for instance, in the RL setting, this prior is placed on the utilities of the individual state-action pairs. Using the fact that a bandit action or RL trajectory's total utility is a sum over the utilities in each dimension (e.g., over each state-action pair in RL), this section shows how to perform inference over sums of Gaussian process variables to infer the component utilities in $\boldsymbol{r}$ from the total utilities. As the total utility of each bandit action or RL trajectory is not observed in practice, the obtained binary preference labels are substituted as approximations in their place.

To avoid having to constantly distinguish between the bandit and RL settings, the rest of this section adapts all notation and terminology for preference-based RL. For instance, the dimensions of $\boldsymbol{r}$ are referred to as utilities of state-action pairs, while in the bandit setting, they are utility weights corresponding to each dimension of the action space. Similarly, in the RL setting, the observations $\boldsymbol{x}_{i1}, \boldsymbol{x}_{i2}$ correspond to trajectory features, while in the bandit setting, these are actions. However, the derived posterior update equations (Eqs. (A.2) and (A.3)) apply as-is to the generalized linear dueling bandit setting, and the two cases are mathematically identical with respect to the methods introduced in this section.

Let $\{\tilde{s}_1, \ldots, \tilde{s}_d\}$ denote the $d = SA$ state-action pairs. In this section, the data matrix $X \in \mathbb{R}^{2N \times d}$ holds all state-action visitation vectors $\boldsymbol{x}_{k1}, \boldsymbol{x}_{k2}$, for DPS iterations $k \in \{1, \ldots, N\}$. (This contrasts with the other credit assignment methods, which learn from their differences, $\boldsymbol{x}_{k2} - \boldsymbol{x}_{k1}$.) Let $\boldsymbol{z}_i^T$ be the $i$th row of $X$, such that $X = [\boldsymbol{z}_1, \ldots, \boldsymbol{z}_{2N}]^T$, and $\boldsymbol{z}_i = \boldsymbol{x}_{kj}$ for some DPS iteration $k$ and $j \in \{1, 2\}$; that is, $\boldsymbol{z}_i$ contains the state-action visit counts for the $i$th trajectory rollout. In particular, the $ij$th matrix element $z_{ij} = [X]_{ij}$ is the number of times that the $i$th observed trajectory $\boldsymbol{z}_i$ visits state-action $\tilde{s}_j$.

The label vector is $\boldsymbol{y}_0 \in \mathbb{R}^{2N}$, where the $i$th element $y_{0,i}$ is the preference label corresponding to the $i$th observed trajectory. For instance, if $\boldsymbol{x}_{i2} \succ \boldsymbol{x}_{i1}$, then $\boldsymbol{x}_{i2}$ receives a label of $\frac{1}{2}$, while $\boldsymbol{x}_{i1}$ is labelled $-\frac{1}{2}$. As before, $r(\tilde{s})$ denotes the underlying utility of state-action pair $\tilde{s}$, with $u(\tau)$ being trajectory $\tau$'s total utility along the state-action pairs it encounters.$^1$ To infer $\boldsymbol{r}$, each total utility $u(\tau_i)$ is approximated with its preference label $y_{0,i}$.

A Gaussian process prior is placed upon the rewards $\boldsymbol{r}$: $\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r)$, where $\boldsymbol{\mu}_r \in \mathbb{R}^d$ is the prior mean and $K_r \in \mathbb{R}^{d \times d}$ is the prior covariance matrix, such that $[K_r]_{ij}$ models the prior covariance between $r(\tilde{s}_i)$ and $r(\tilde{s}_j)$. The total utility of trajectory $\tau_i$, denoted $u(\tau_i)$, is modeled as a sum over the latent state-action utilities: $u(\tau_i) = \sum_{j=1}^{d} z_{ij} r(\tilde{s}_j)$. Let $R_i$ be a noisy version of $u(\tau_i)$: $R_i = u(\tau_i) + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ is i.i.d. noise. Then, given rewards $\boldsymbol{r}$:

$$R_i = \sum_{j=1}^{d} z_{ij}\, r(\tilde{s}_j) + \varepsilon_i.$$

Because any linear combination of jointly Gaussian variables is Gaussian, $R_i$ is a Gaussian process over the values $\{z_{i1}, \ldots, z_{id}\}$. Let $\boldsymbol{R} \in \mathbb{R}^{2N}$ be the vector with $i$th element equal to $R_i$. This section will calculate the relevant expectations and covariances to show that $\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r)$ and $\boldsymbol{R}$ have the following jointly-Gaussian distribution:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{R} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_r \\ X\boldsymbol{\mu}_r \end{bmatrix},\ \begin{bmatrix} K_r & K_r X^T \\ X K_r^T & X K_r X^T + \sigma_\varepsilon^2 I \end{bmatrix} \right). \tag{A.1}$$

The standard approach for obtaining a conditional distribution from a joint Gaussian distribution (Rasmussen and Williams, 2006) yields $\boldsymbol{r} \mid \boldsymbol{R} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where:

๐ = ๐๐’“ +๐พ๐‘Ÿ๐‘๐‘‡[๐‘ ๐พ๐‘Ÿ๐‘๐‘‡ +๐œŽ2

๐œ€๐ผ]โˆ’1(๐‘นโˆ’๐‘๐๐’“) (A.2) ฮฃ =๐พ๐‘Ÿ โˆ’๐พ๐‘Ÿ๐‘๐‘‡[๐‘ ๐พ๐‘Ÿ๐‘๐‘‡ +๐œŽ2

๐œ€๐ผ]โˆ’1๐‘ ๐พ๐‘‡

๐‘Ÿ . (A.3)

In practice, the variable $\boldsymbol{R}$ is not observed. Instead, $\boldsymbol{R}$ is approximated with the observed preference labels $\boldsymbol{y}_0$, $\boldsymbol{R} \approx \boldsymbol{y}_0$, to perform credit assignment inference.
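As a minimal sketch of this inference step, the following NumPy function evaluates the posterior mean and covariance in Eqs. (A.2) and (A.3), with the preference labels $\boldsymbol{y}_0$ substituted for the unobserved $\boldsymbol{R}$. The function name is hypothetical, and the use of a linear solve in place of an explicit matrix inverse is an implementation choice for numerical stability rather than something prescribed by the text.

```python
import numpy as np

def gp_credit_assignment_posterior(X, y_0, mu_r, K_r, sigma_eps):
    """Posterior over state-action utilities r given approximate total
    utilities (Eqs. (A.2)-(A.3)), with y_0 standing in for the unobserved R.

    X: (2N, d) state-action visitation matrix.
    y_0: (2N,) preference labels used in place of R.
    mu_r: (d,) prior mean of r.
    K_r: (d, d) prior covariance of r.
    sigma_eps: observation noise standard deviation.
    """
    K_R = X @ K_r @ X.T + sigma_eps**2 * np.eye(X.shape[0])   # X K_r X^T + sigma^2 I
    K_rR = K_r @ X.T                                          # Cov(r, R) = K_r X^T
    # Solve linear systems rather than forming an explicit inverse.
    alpha = np.linalg.solve(K_R, y_0 - X @ mu_r)
    mu_post = mu_r + K_rR @ alpha                             # Eq. (A.2)
    Sigma_post = K_r - K_rR @ np.linalg.solve(K_R, K_rR.T)    # Eq. (A.3)
    return mu_post, Sigma_post
```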

Next, this section derives the posterior inference equations (A.2) and (A.3) used in Gaussian process regression credit assignment. The state-action rewards $\boldsymbol{r}$ are inferred given noisy observations $\boldsymbol{R}$ of the trajectories' total utilities via the following four steps, corresponding to the next four subsections:

A) Model the state-action utilities $r(\tilde{s})$ as a Gaussian process over state-action pairs $\tilde{s}$.

B) Model the trajectory utilities $\boldsymbol{R}$ as a Gaussian process that results from summing the state-action utilities $r(\tilde{s})$.

C) Using the two Gaussian processes defined in A) and B), obtain the covariance matrix between the values of $\{r(\tilde{s}_j) \mid j = 1, \ldots, d\}$ and $\{R_i \mid i = 1, \ldots, 2N\}$.

D) Write the joint Gaussian distribution in Eq. (A.1) between the values of $\{r(\tilde{s}_j) \mid j = 1, \ldots, d\}$ and $\{R_i \mid i = 1, \ldots, 2N\}$, and obtain the posterior distribution of $\boldsymbol{r}$ over all state-action pairs given $\boldsymbol{R}$ (Eqs. (A.2) and (A.3)).

$^1$The concept of a trajectory's total utility is analogous to a $d$-dimensional action's utility in the bandit setting, $\boldsymbol{r}^T\boldsymbol{x}$ for an action $\boldsymbol{x} \in \mathcal{A}$. A state-action utility $r(\tilde{s})$ is equal to a particular component of $\boldsymbol{r}$: $\boldsymbol{r}^T\boldsymbol{e}_j$ for some $j$, where $\boldsymbol{e}_j$ is a vector with a 1 in the $j$th component and zeros elsewhere. A state-action utility $r(\tilde{s})$ corresponds to the utility weight of an action-space dimension in the bandit setting, which is also $\boldsymbol{r}^T\boldsymbol{e}_j$ (for some $j$).

The State-Action Utility Gaussian Process

The state-action utilities $\boldsymbol{r}$ are modeled as a Gaussian process over $\tilde{s}$, with mean $\mathbb{E}[r(\tilde{s})] = \mu_r(\tilde{s})$ and covariance kernel $\mathrm{Cov}(r(\tilde{s}_i), r(\tilde{s}_j)) = \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)$ for all state-action pairs $\tilde{s}_i, \tilde{s}_j$. For instance, $\mathcal{K}_r$ could be the squared exponential kernel:

K๐‘Ÿ(๐‘ หœ๐‘–,๐‘ หœ๐‘—) =๐œŽ2

๐‘“exp โˆ’1 2

||๐‘“(๐‘ หœ๐‘–) โˆ’ ๐‘“(๐‘ หœ๐‘—) ||

๐‘™

2! +๐œŽ2

๐‘›๐›ฟ๐‘– ๐‘—, (A.4) where๐œŽ2

๐‘“ is the signal variance,๐‘™ is the kernel lengthscale,๐œŽ2

๐‘› is the noise variance, ๐›ฟ๐‘– ๐‘— is the Kronecker delta function, and ๐‘“ : {1, . . . , ๐‘†} ร— {1, . . . , ๐ด} โˆ’โ†’R๐‘š maps each state-action pair to an ๐‘š-dimensional representation that encodes proximity between the state-action pairs. For instance, in the Mountain Car problem, each state-action pair could be represented by a position and velocity (encoding the state) and a one-dimensional action, so that๐‘š=3. Thus,

$$r(\tilde{s}_i) \sim \mathcal{GP}\left(\mu_r(\tilde{s}_i),\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)\right).$$

Define $\boldsymbol{\mu}_r \in \mathbb{R}^d$ such that the $i$th element is $[\boldsymbol{\mu}_r]_i = \mu_r(\tilde{s}_i)$, the prior mean of state-action $\tilde{s}_i$'s utility. Let $K_r \in \mathbb{R}^{d \times d}$ be the covariance matrix over state-action utilities, such that $[K_r]_{ij} = \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)$. Therefore, the reward vector $\boldsymbol{r}$ is also a Gaussian process:

$$\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r).$$
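For illustration, a short sketch of how $\boldsymbol{\mu}_r$ and $K_r$ might be assembled from the squared exponential kernel of Eq. (A.4). The feature-array layout, the zero prior mean, and the specific hyperparameter values are assumptions made only for this example.

```python
import numpy as np

def squared_exponential_kernel(features, sigma_f, lengthscale, sigma_n):
    """Prior covariance K_r over state-action utilities, as in Eq. (A.4).

    features: (d, m) array; row j holds the m-dimensional representation
        f(s_j) of state-action pair j (e.g., position, velocity, action).
    """
    diffs = features[:, None, :] - features[None, :, :]        # (d, d, m)
    sq_dists = np.sum(diffs**2, axis=-1)                       # ||f(s_i) - f(s_j)||^2
    K = sigma_f**2 * np.exp(-0.5 * sq_dists / lengthscale**2)
    K += sigma_n**2 * np.eye(features.shape[0])                # Kronecker-delta noise term
    return K

# Example: Mountain Car-style features (position, velocity, action), m = 3.
features = np.array([[-0.5, 0.00, 0],
                     [-0.5, 0.00, 1],
                     [ 0.1, 0.02, 2]], dtype=float)
K_r = squared_exponential_kernel(features, sigma_f=1.0, lengthscale=1.0, sigma_n=1e-3)
mu_r = np.zeros(len(features))   # zero prior mean is an assumption, not specified here
```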

The Trajectory Utility Gaussian Process

By assumption, the trajectory utilities $\boldsymbol{R} \in \mathbb{R}^{2N}$ are sums of the latent state-action utilities via the following relationship between $\boldsymbol{R}$ and $\boldsymbol{r}$:

๐‘…(๐’›๐‘–):= ๐‘…๐‘– =

๐‘‘

ร•

๐‘—=1

๐‘ง๐‘– ๐‘—๐‘Ÿ(๐‘ หœ๐‘—) +๐œ€๐‘–,

where ๐œ€๐‘– are i.i.d. noise variables distributed according to N (0, ๐œŽ2

๐œ€). Note that ๐‘…(๐’›๐‘–) is a Gaussian process over ๐’›๐‘– โˆˆ R๐‘‘ because {๐‘Ÿ(๐‘ หœ๐‘—),โˆ€๐‘—} are jointly normally distributed by definition of a Gaussian process, and any linear combination of jointly Gaussian variables has a univariate normal distribution. Next, the expectation and covariance of๐‘นis calculated. The expectation of the๐‘–thelement๐‘…๐‘– = ๐‘…(๐’›๐‘–) can be expressed as:

$$\mathbb{E}[R_i] = \mathbb{E}\left[\sum_{j=1}^{d} z_{ij}\, r(\tilde{s}_j) + \varepsilon_i\right] = \sum_{j=1}^{d} z_{ij}\, \mathbb{E}[r(\tilde{s}_j)] = \sum_{j=1}^{d} z_{ij}\, \mu_r(\tilde{s}_j).$$

The expectation over $\boldsymbol{R}$ can thus be written as $\mathbb{E}[\boldsymbol{R}(X)] = X\boldsymbol{\mu}_r$. Next, the covariance matrix of $\boldsymbol{R}$ is computed. The $ij$th element of this matrix is the covariance of $R(\boldsymbol{z}_i)$ and $R(\boldsymbol{z}_j)$:

$$\begin{aligned}
\mathrm{Cov}(R(\boldsymbol{z}_i), R(\boldsymbol{z}_j)) &= \mathbb{E}[R(\boldsymbol{z}_i) R(\boldsymbol{z}_j)] - \mathbb{E}[R(\boldsymbol{z}_i)]\,\mathbb{E}[R(\boldsymbol{z}_j)] \\
&= \mathbb{E}\left[\left(\sum_{k=1}^{d} z_{ik}\, r(\tilde{s}_k) + \varepsilon_i\right)\left(\sum_{m=1}^{d} z_{jm}\, r(\tilde{s}_m) + \varepsilon_j\right)\right] - \left(\sum_{k=1}^{d} z_{ik}\, \mu_r(\tilde{s}_k)\right)\left(\sum_{m=1}^{d} z_{jm}\, \mu_r(\tilde{s}_m)\right) \\
&= \sum_{k=1}^{d}\sum_{m=1}^{d} z_{ik} z_{jm}\, \mathbb{E}[r(\tilde{s}_k)\, r(\tilde{s}_m)] + \mathbb{E}[\varepsilon_i \varepsilon_j] - \sum_{k=1}^{d}\sum_{m=1}^{d} z_{ik} z_{jm}\, \mu_r(\tilde{s}_k)\, \mu_r(\tilde{s}_m) \\
&= \sum_{k=1}^{d}\sum_{m=1}^{d} \left\{ z_{ik} z_{jm} \left[\mathrm{Cov}(r(\tilde{s}_k), r(\tilde{s}_m)) + \mu_r(\tilde{s}_k)\, \mu_r(\tilde{s}_m)\right] - z_{ik} z_{jm}\, \mu_r(\tilde{s}_k)\, \mu_r(\tilde{s}_m) \right\} + \sigma_\varepsilon^2\, \mathbb{I}[i = j] \\
&= \sum_{k=1}^{d}\sum_{m=1}^{d} z_{ik} z_{jm}\, \mathrm{Cov}(r(\tilde{s}_k), r(\tilde{s}_m)) + \sigma_\varepsilon^2\, \mathbb{I}[i = j] \\
&= \sum_{k=1}^{d}\sum_{m=1}^{d} z_{ik} z_{jm}\, \mathcal{K}_r(\tilde{s}_k, \tilde{s}_m) + \sigma_\varepsilon^2\, \mathbb{I}[i = j] = \boldsymbol{z}_i^T K_r \boldsymbol{z}_j + \sigma_\varepsilon^2\, \mathbb{I}[i = j].
\end{aligned}$$

One can then write the covariance matrix of $\boldsymbol{R}$ as $K_R$, where:

$$[K_R]_{ij} := \mathrm{Cov}(R(\boldsymbol{z}_i), R(\boldsymbol{z}_j)) = \boldsymbol{z}_i^T K_r \boldsymbol{z}_j + \sigma_\varepsilon^2\, \mathbb{I}[i = j].$$

From here, it can be seen that $K_R = X K_r X^T + \sigma_\varepsilon^2 I$:

$$X K_r X^T = \begin{bmatrix} \boldsymbol{z}_1^T \\ \boldsymbol{z}_2^T \\ \vdots \\ \boldsymbol{z}_{2N}^T \end{bmatrix} K_r \begin{bmatrix} \boldsymbol{z}_1 & \boldsymbol{z}_2 & \cdots & \boldsymbol{z}_{2N} \end{bmatrix} = \begin{bmatrix} \boldsymbol{z}_1^T K_r \boldsymbol{z}_1 & \cdots & \boldsymbol{z}_1^T K_r \boldsymbol{z}_{2N} \\ \vdots & \ddots & \vdots \\ \boldsymbol{z}_{2N}^T K_r \boldsymbol{z}_1 & \cdots & \boldsymbol{z}_{2N}^T K_r \boldsymbol{z}_{2N} \end{bmatrix} = K_R - \sigma_\varepsilon^2 I .$$

Covariance between State-Action and Trajectory Utilities

This subsection considers the covariance between $\boldsymbol{r}$ and $\boldsymbol{R}$, denoted $K_{r,R}$:

$$[K_{r,R}]_{ij} = \mathrm{Cov}([\boldsymbol{r}]_i, [\boldsymbol{R}]_j) = \mathrm{Cov}(r(\tilde{s}_i), R(\boldsymbol{z}_j)).$$

This covariance matrix can be expressed in terms of $X$, $K_r$, and $\boldsymbol{\mu}_r$:

$$\begin{aligned}
[K_{r,R}]_{ij} = \mathrm{Cov}(r(\tilde{s}_i), R(\boldsymbol{z}_j)) &= \mathrm{Cov}\left(r(\tilde{s}_i),\ \sum_{k=1}^{d} z_{jk}\, r(\tilde{s}_k) + \varepsilon_j\right) \\
&= \mathbb{E}\left[r(\tilde{s}_i) \sum_{k=1}^{d} z_{jk}\, r(\tilde{s}_k) + \varepsilon_j\, r(\tilde{s}_i)\right] - \mathbb{E}[r(\tilde{s}_i)]\, \mathbb{E}\left[\sum_{k=1}^{d} z_{jk}\, r(\tilde{s}_k) + \varepsilon_j\right] \\
&= \sum_{k=1}^{d} z_{jk}\, \mathbb{E}[r(\tilde{s}_i)\, r(\tilde{s}_k)] - \mu_r(\tilde{s}_i)\, \boldsymbol{z}_j^T \boldsymbol{\mu}_r \\
&= \sum_{k=1}^{d} z_{jk} \left\{\mathrm{Cov}(r(\tilde{s}_i), r(\tilde{s}_k)) + \mathbb{E}[r(\tilde{s}_i)]\, \mathbb{E}[r(\tilde{s}_k)]\right\} - \mu_r(\tilde{s}_i)\, \boldsymbol{z}_j^T \boldsymbol{\mu}_r \\
&= \sum_{k=1}^{d} z_{jk} \left[\mathcal{K}_r(\tilde{s}_i, \tilde{s}_k) + \mu_r(\tilde{s}_i)\, \mu_r(\tilde{s}_k)\right] - \mu_r(\tilde{s}_i)\, \boldsymbol{z}_j^T \boldsymbol{\mu}_r \\
&= \sum_{k=1}^{d} z_{jk}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_k) + \mu_r(\tilde{s}_i)\, \boldsymbol{z}_j^T \boldsymbol{\mu}_r - \mu_r(\tilde{s}_i)\, \boldsymbol{z}_j^T \boldsymbol{\mu}_r = \sum_{k=1}^{d} z_{jk}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_k) = \boldsymbol{z}_j^T [K_r]_{i,:}^T,
\end{aligned}$$

where $[K_r]_{i,:}^T$ is the column vector obtained by transposing the $i$th row of $K_r$. It is evident that $K_{r,R} = K_r X^T$.

Posterior Inference over State-Action Utilities

Merging the previous three subsections' results, one obtains the following joint probability density between $\boldsymbol{r}$ and $\boldsymbol{R}$:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{R} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_r \\ X\boldsymbol{\mu}_r \end{bmatrix},\ \begin{bmatrix} K_r & K_r X^T \\ X K_r^T & X K_r X^T + \sigma_\varepsilon^2 I \end{bmatrix} \right).$$

This relationship expresses all components of the joint Gaussian density in terms of $X$, $K_r$, and $\boldsymbol{\mu}_r$, or in other words, in terms of the observed state-action visitation counts (i.e., $X$) and the Gaussian process prior on $\boldsymbol{r}$. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields $\boldsymbol{r} \mid \boldsymbol{R} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where the expressions for $\boldsymbol{\mu}$ and $\Sigma$ are given by Eqs. (A.2) and (A.3) above.

By substituting $\boldsymbol{y}_0$ for $\boldsymbol{R}$, the conditional posterior density of $\boldsymbol{r}$ can be expressed in terms of $X$, $\boldsymbol{y}_0$, $K_r$, and $\boldsymbol{\mu}_r$, that is, in terms of observed data and the Gaussian process prior parameters.
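Putting the pieces together, a hypothetical end-to-end example using the helper functions sketched earlier in this section (squared_exponential_kernel, build_preference_data, and gp_credit_assignment_posterior, all illustrative names) on a toy problem with made-up features, visitation counts, and preferences:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: d = 6 state-action pairs, each with a scalar feature f(s).
features = rng.normal(size=(6, 1))
K_r = squared_exponential_kernel(features, sigma_f=1.0, lengthscale=0.5, sigma_n=1e-3)
mu_r = np.zeros(6)

# Three preference queries over pairs of rollouts (visit counts made up).
pairs = [(rng.integers(0, 4, 6), rng.integers(0, 4, 6)) for _ in range(3)]
prefs = [1, 0, 1]
X, y_0 = build_preference_data(pairs, prefs)

# Posterior over state-action utilities with y_0 substituted for R.
mu_post, Sigma_post = gp_credit_assignment_posterior(X, y_0, mu_r, K_r, sigma_eps=0.5)
print(mu_post)
```

The printed vector plays the role of the inferred state-action utilities $\boldsymbol{r}$ described above, obtained from preference labels alone.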
