A.2 Gaussian Process Regression
Credit assignment via Gaussian processes (Rasmussen and Williams, 2006) extends the linear credit assignment model in A.1 to larger numbers of features $d$ by generalizing across similar features. For instance, in the RL setting, one could learn over larger state and action spaces by generalizing across nearby states and actions. This section and the following one consider two Gaussian process-based credit assignment approaches.
To perform credit assignment via Gaussian process regression, one can assign binary labels to each observation based on whether it is preferred or dominated. A Gaussian process prior is placed upon the underlying utilities $\boldsymbol{r}$; for instance, in the RL setting, this prior is placed on the utilities of the individual state-action pairs. Using the fact that a bandit action or RL trajectory's total utility is a sum over the utilities in each dimension (e.g., over each state-action pair in RL), this section shows how to perform inference over sums of Gaussian process variables to infer the component utilities in $\boldsymbol{r}$ from the total utilities. As the total utility of each bandit action or RL trajectory is not observed in practice, the obtained binary preference labels are substituted as approximations in their place.
To avoid having to constantly distinguish between the bandit and RL settings, the rest of this section adapts all notation and terminology for preference-based RL. For instance, the dimensions of $\boldsymbol{r}$ are referred to as utilities of state-action pairs, while in the bandit setting, they are utility weights corresponding to each dimension of the action space. Similarly, in the RL setting, the observations $\boldsymbol{x}_{i1}, \boldsymbol{x}_{i2}$ correspond to trajectory features, while in the bandit setting, these are actions. However, the derived posterior update equations (Eqs. (A.2) and (A.3)) apply as-is to the generalized linear dueling bandit setting, and the two cases are mathematically identical with respect to the methods introduced in this section.
Let $\{\tilde{s}_1, \ldots, \tilde{s}_d\}$ denote the $d = SA$ state-action pairs. In this section, the data matrix $X \in \mathbb{R}^{2N \times d}$ holds all state-action visitation vectors $\boldsymbol{x}_{i1}, \boldsymbol{x}_{i2}$, for DPS iterations $i \in \{1, \ldots, N\}$. (This contrasts with the other credit assignment methods, which learn from their differences, $\boldsymbol{x}_{i2} - \boldsymbol{x}_{i1}$.) Let $\boldsymbol{x}_k^T$ be the $k$th row of $X$, such that $X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{2N}]^T$, and $\boldsymbol{x}_k = \boldsymbol{x}_{ij}$ for some DPS iteration $i$ and $j \in \{1, 2\}$; that is, $\boldsymbol{x}_k$ contains the state-action visit counts for the $k$th trajectory rollout. In particular, the $kj$th matrix element $z_{kj} = [X]_{kj}$ is the number of times that the $k$th observed trajectory $\tau_k$ visits state-action pair $\tilde{s}_j$.
The label vector is $\boldsymbol{y}' \in \mathbb{R}^{2N}$, where the $k$th element $y'_k$ is the preference label corresponding to the $k$th observed trajectory. For instance, if $\boldsymbol{x}_{i2} \succ \boldsymbol{x}_{i1}$, then $\boldsymbol{x}_{i2}$ receives a label of $\frac{1}{2}$, while $\boldsymbol{x}_{i1}$ is labelled $-\frac{1}{2}$. As before, $r(\tilde{s})$ denotes the underlying utility of state-action pair $\tilde{s}$, with $u(\tau)$ being trajectory $\tau$'s total utility along the state-action pairs it encounters.$^1$ To infer $\boldsymbol{r}$, each total utility $u(\tau_k)$ is approximated with its preference label $y'_k$.
A Gaussian process prior is placed upon the rewards $\boldsymbol{r}$: $\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r)$, where $\boldsymbol{\mu}_r \in \mathbb{R}^d$ is the prior mean and $K_r \in \mathbb{R}^{d \times d}$ is the prior covariance matrix, such that $[K_r]_{ij}$ models the prior covariance between $r(\tilde{s}_i)$ and $r(\tilde{s}_j)$. The total utility of trajectory $\tau_k$, denoted $u(\tau_k)$, is modeled as a sum over the latent state-action utilities: $u(\tau_k) = \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j)$. Let $f_k$ be a noisy version of $u(\tau_k)$: $f_k = u(\tau_k) + \epsilon_k$, where $\epsilon_k \sim \mathcal{N}(0, \sigma_n^2)$ is i.i.d. noise. Then, given rewards $\boldsymbol{r}$:

$$f_k = \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j) + \epsilon_k.$$
Because any linear combination of jointly Gaussian variables is Gaussian, $f_k$ is a Gaussian process over the values $\{z_{k1}, \ldots, z_{kd}\}$. Let $\boldsymbol{f} \in \mathbb{R}^{2N}$ be the vector with $k$th element equal to $f_k$. This section will calculate the relevant expectations and covariances to show that $\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r)$ and $\boldsymbol{f}$ have the following jointly-Gaussian distribution:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{f} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_r \\ X\boldsymbol{\mu}_r \end{bmatrix}, \begin{bmatrix} K_r & K_r X^T \\ X K_r & X K_r X^T + \sigma_n^2 I \end{bmatrix} \right). \tag{A.1}$$
The standard approach for obtaining a conditional distribution from a joint Gaussian distribution (Rasmussen and Williams, 2006) yields $\boldsymbol{r} \mid \boldsymbol{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where:

$$\boldsymbol{\mu} = \boldsymbol{\mu}_r + K_r X^T [X K_r X^T + \sigma_n^2 I]^{-1} (\boldsymbol{f} - X \boldsymbol{\mu}_r) \tag{A.2}$$

$$\Sigma = K_r - K_r X^T [X K_r X^T + \sigma_n^2 I]^{-1} X K_r. \tag{A.3}$$
In practice, the variable $\boldsymbol{f}$ is not observed. Instead, $\boldsymbol{f}$ is approximated with the observed preference labels $\boldsymbol{y}'$, $\boldsymbol{f} \approx \boldsymbol{y}'$, to perform credit assignment inference.
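To make the update concrete, the posterior computation of Eqs. (A.2) and (A.3) with labels substituted for $\boldsymbol{f}$ can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the thesis's implementation: the function name `gp_credit_assignment`, the toy visitation matrix, and the hyperparameter values are all assumptions chosen for the example.

```python
import numpy as np

def gp_credit_assignment(X, y_labels, K_r, mu_r, sigma_n):
    """Posterior over state-action utilities r given preference labels.

    Implements Eqs. (A.2) and (A.3), substituting the binary preference
    labels y' for the unobserved trajectory utilities f.
    X: (2N, d) state-action visitation counts; K_r: (d, d) prior
    covariance; mu_r: (d,) prior mean; sigma_n: noise std. dev.
    """
    K_f = X @ K_r @ X.T + sigma_n**2 * np.eye(X.shape[0])
    gain = K_r @ X.T @ np.linalg.inv(K_f)      # K_r X^T [X K_r X^T + s^2 I]^{-1}
    mu = mu_r + gain @ (y_labels - X @ mu_r)   # Eq. (A.2)
    Sigma = K_r - gain @ X @ K_r               # Eq. (A.3)
    return mu, Sigma

# Toy example: d = 3 state-action pairs, N = 1 preference (2 trajectories).
X = np.array([[2.0, 0.0, 1.0],    # visit counts of trajectory 1
              [0.0, 2.0, 1.0]])   # visit counts of trajectory 2
y = np.array([-0.5, 0.5])         # trajectory 2 is preferred
mu, Sigma = gp_credit_assignment(X, y, np.eye(3), np.zeros(3), sigma_n=0.1)
```

With this toy data, the posterior mean credits the state-action pairs visited only by the preferred trajectory with positive utility, those visited only by the dominated trajectory with negative utility, and the shared pair with approximately zero.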
Next, this section derives the posterior inference equations (A.2) and (A.3) used in Gaussian process regression credit assignment. The state-action rewards $\boldsymbol{r}$ are inferred given noisy observations $\boldsymbol{f}$ of the trajectories' total utilities via the following four steps, corresponding to the next four subsections:
$^1$The concept of a trajectory's total utility is analogous to a $d$-dimensional action's utility in the bandit setting, $\boldsymbol{a}^T \boldsymbol{r}$ for an action $\boldsymbol{a} \in \mathcal{A}$. A state-action utility $r(\tilde{s})$ is equal to a particular component of $\boldsymbol{r}$: $\boldsymbol{e}_i^T \boldsymbol{r}$ for some $i$, where $\boldsymbol{e}_i$ is a vector with 1 in the $i$th component and zeros elsewhere. A state-action utility $r(\tilde{s})$ corresponds to the utility weight of an action space dimension in the bandit setting, which is also $\boldsymbol{e}_i^T \boldsymbol{r}$ (for some $i$).
A) Model the state-action utilities $r(\tilde{s})$ as a Gaussian process over state-action pairs $\tilde{s}$.

B) Model the trajectory utilities $\boldsymbol{f}$ as a Gaussian process that results from summing the state-action utilities $r(\tilde{s})$.

C) Using the two Gaussian processes defined in A) and B), obtain the covariance matrix between the values of $\{r(\tilde{s}_i) \mid i = 1, \ldots, d\}$ and $\{f_k \mid k = 1, \ldots, 2N\}$.

D) Write the joint Gaussian distribution in Eq. (A.1) between the values of $\{r(\tilde{s}_i) \mid i = 1, \ldots, d\}$ and $\{f_k \mid k = 1, \ldots, 2N\}$, and obtain the posterior distribution of $\boldsymbol{r}$ over all state-action pairs given $\boldsymbol{f}$ (Eqs. (A.2) and (A.3)).
The State-Action Utility Gaussian Process
The state-action utilities $\boldsymbol{r}$ are modeled as a Gaussian process over $\tilde{s}$, with mean $\mathbb{E}[r(\tilde{s})] = \mu_r(\tilde{s})$ and covariance kernel $\mathrm{Cov}(r(\tilde{s}_i), r(\tilde{s}_j)) = \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)$ for all state-action pairs $\tilde{s}_i, \tilde{s}_j$. For instance, $\mathcal{K}_r$ could be the squared exponential kernel:

$$\mathcal{K}_r(\tilde{s}_i, \tilde{s}_j) = \sigma_f^2 \exp\left( -\frac{1}{2} \left( \frac{\|\phi(\tilde{s}_i) - \phi(\tilde{s}_j)\|}{l} \right)^2 \right) + \sigma_\epsilon^2 \delta_{ij}, \tag{A.4}$$

where $\sigma_f^2$ is the signal variance, $l$ is the kernel lengthscale, $\sigma_\epsilon^2$ is the noise variance, $\delta_{ij}$ is the Kronecker delta function, and $\phi: \{1, \ldots, S\} \times \{1, \ldots, A\} \longrightarrow \mathbb{R}^m$ maps each state-action pair to an $m$-dimensional representation that encodes proximity between the state-action pairs. For instance, in the Mountain Car problem, each state-action pair could be represented by a position and velocity (encoding the state) and a one-dimensional action, so that $m = 3$. Thus,

$$r(\tilde{s}_i) \sim \mathcal{GP}(\mu_r(\tilde{s}_i), \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)).$$
Define $\boldsymbol{\mu}_r \in \mathbb{R}^d$ such that the $i$th element is $[\boldsymbol{\mu}_r]_i = \mu_r(\tilde{s}_i)$, the prior mean of state-action $\tilde{s}_i$'s utility. Let $K_r \in \mathbb{R}^{d \times d}$ be the covariance matrix over state-action utilities, such that $[K_r]_{ij} = \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)$. Therefore, the reward vector $\boldsymbol{r}$ is also a Gaussian process:

$$\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r).$$
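As a small illustration of this prior construction, the squared exponential kernel of Eq. (A.4) can be evaluated over a handful of hypothetical Mountain Car-style feature vectors (position, velocity, action). The feature values and all hyperparameters below are arbitrary choices for the sketch, not values used in the thesis.

```python
import numpy as np

def sq_exp_kernel(phi_i, phi_j, i, j, sigma_f=1.0, length=0.5, sigma_eps=1e-3):
    """Squared exponential kernel (Eq. (A.4)) between feature vectors
    phi(s_i) and phi(s_j), with a Kronecker-delta noise term."""
    sq_dist = np.sum((phi_i - phi_j) ** 2)
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2) \
        + sigma_eps**2 * (1.0 if i == j else 0.0)

# Hypothetical m = 3 features per state-action pair:
# (position, velocity, action), as in the Mountain Car example.
phi = np.array([[-0.5, 0.00, -1.0],
                [-0.4, 0.01, -1.0],
                [ 0.3, 0.05,  1.0]])
d = phi.shape[0]

# Prior covariance matrix with [K_r]_ij = K_r(s_i, s_j).
K_r = np.array([[sq_exp_kernel(phi[i], phi[j], i, j)
                 for j in range(d)] for i in range(d)])
```

Nearby state-action pairs (the first two rows) receive high prior covariance, so evidence about one generalizes to the other, while distant pairs are nearly uncorrelated; this is precisely how the Gaussian process extends credit assignment to large state-action spaces.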
The Trajectory Utility Gaussian Process
By assumption, the trajectory utilities $\boldsymbol{f} \in \mathbb{R}^{2N}$ are sums of the latent state-action utilities via the following relationship between $\boldsymbol{f}$ and $\boldsymbol{r}$:

$$f(\boldsymbol{x}_k) := f_k = \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j) + \epsilon_k,$$

where $\epsilon_k$ are i.i.d. noise variables distributed according to $\mathcal{N}(0, \sigma_n^2)$. Note that $f(\boldsymbol{x}_k)$ is a Gaussian process over $\boldsymbol{x}_k \in \mathbb{R}^d$ because $\{r(\tilde{s}_j), \forall j\}$ are jointly normally distributed by definition of a Gaussian process, and any linear combination of jointly Gaussian variables has a univariate normal distribution. Next, the expectation and covariance of $\boldsymbol{f}$ are calculated. The expectation of the $k$th element $f_k = f(\boldsymbol{x}_k)$ can be expressed as:
$$\mathbb{E}[f_k] = \mathbb{E}\left[ \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j) + \epsilon_k \right] = \sum_{j=1}^{d} z_{kj}\, \mathbb{E}[r(\tilde{s}_j)] = \sum_{j=1}^{d} z_{kj}\, \mu_r(\tilde{s}_j).$$
The expectation over $\boldsymbol{f}$ can thus be written as $\mathbb{E}[\boldsymbol{f}(X)] = X \boldsymbol{\mu}_r$. Next, the covariance matrix of $\boldsymbol{f}$ is computed. The $ij$th element of this matrix is the covariance of $f(\boldsymbol{x}_i)$ and $f(\boldsymbol{x}_j)$:
$$\begin{aligned}
\mathrm{Cov}(f(\boldsymbol{x}_i), f(\boldsymbol{x}_j)) &= \mathbb{E}[f(\boldsymbol{x}_i) f(\boldsymbol{x}_j)] - \mathbb{E}[f(\boldsymbol{x}_i)]\, \mathbb{E}[f(\boldsymbol{x}_j)] \\
&= \mathbb{E}\left[ \left( \sum_{k=1}^{d} z_{ik}\, r(\tilde{s}_k) + \epsilon_i \right) \left( \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j \right) \right] - \left( \sum_{k=1}^{d} z_{ik}\, \mu_r(\tilde{s}_k) \right) \left( \sum_{l=1}^{d} z_{jl}\, \mu_r(\tilde{s}_l) \right) \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mathbb{E}[r(\tilde{s}_k) r(\tilde{s}_l)] + \mathbb{E}[\epsilon_i \epsilon_j] - \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mu_r(\tilde{s}_k) \mu_r(\tilde{s}_l) \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl} \left[ \mathrm{Cov}(r(\tilde{s}_k), r(\tilde{s}_l)) + \mu_r(\tilde{s}_k) \mu_r(\tilde{s}_l) \right] - z_{ik} z_{jl}\, \mu_r(\tilde{s}_k) \mu_r(\tilde{s}_l) + \sigma_n^2\, \mathbb{1}[i = j] \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mathrm{Cov}(r(\tilde{s}_k), r(\tilde{s}_l)) + \sigma_n^2\, \mathbb{1}[i = j] \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mathcal{K}_r(\tilde{s}_k, \tilde{s}_l) + \sigma_n^2\, \mathbb{1}[i = j] = \boldsymbol{x}_i^T K_r \boldsymbol{x}_j + \sigma_n^2\, \mathbb{1}[i = j].
\end{aligned}$$
One can then write the covariance matrix of $\boldsymbol{f}$ as $K_f$, where:

$$[K_f]_{ij} := \mathrm{Cov}(f(\boldsymbol{x}_i), f(\boldsymbol{x}_j)) = \boldsymbol{x}_i^T K_r \boldsymbol{x}_j + \sigma_n^2\, \mathbb{1}[i = j].$$
From here, it can be seen that $K_f = X K_r X^T + \sigma_n^2 I$:

$$X K_r X^T = \begin{bmatrix} \boldsymbol{x}_1^T \\ \boldsymbol{x}_2^T \\ \vdots \\ \boldsymbol{x}_{2N}^T \end{bmatrix} K_r \begin{bmatrix} \boldsymbol{x}_1 & \boldsymbol{x}_2 & \ldots & \boldsymbol{x}_{2N} \end{bmatrix} = \begin{bmatrix} \boldsymbol{x}_1^T K_r \boldsymbol{x}_1 & \ldots & \boldsymbol{x}_1^T K_r \boldsymbol{x}_{2N} \\ \vdots & \ddots & \vdots \\ \boldsymbol{x}_{2N}^T K_r \boldsymbol{x}_1 & \ldots & \boldsymbol{x}_{2N}^T K_r \boldsymbol{x}_{2N} \end{bmatrix} = K_f - \sigma_n^2 I.$$
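This identity is easy to confirm numerically by comparing the matrix product against the entrywise definition $[K_f]_{ij} = \boldsymbol{x}_i^T K_r \boldsymbol{x}_j + \sigma_n^2\, \mathbb{1}[i=j]$. The visitation counts and prior covariance below are arbitrary toy values chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
two_N, d, sigma_n = 4, 6, 0.2

# Toy visitation counts X and a valid (PSD) prior covariance K_r.
X = rng.integers(0, 3, size=(two_N, d)).astype(float)
A = rng.standard_normal((d, d))
K_r = A @ A.T + 1e-6 * np.eye(d)

# Entrywise definition: [K_f]_ij = x_i^T K_r x_j + sigma_n^2 * 1[i == j].
K_f_entrywise = np.array(
    [[X[i] @ K_r @ X[j] + (sigma_n**2 if i == j else 0.0)
      for j in range(two_N)] for i in range(two_N)])

# Matrix form derived above: K_f = X K_r X^T + sigma_n^2 I.
K_f_matrix = X @ K_r @ X.T + sigma_n**2 * np.eye(two_N)
```

The two constructions agree entry for entry, matching the derivation.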
Covariance between State-Action and Trajectory Utilities
This subsection considers the covariance between $\boldsymbol{r}$ and $\boldsymbol{f}$, denoted $K_{r,f}$:

$$[K_{r,f}]_{ij} = \mathrm{Cov}([\boldsymbol{r}]_i, [\boldsymbol{f}]_j) = \mathrm{Cov}(r(\tilde{s}_i), f(\boldsymbol{x}_j)).$$

This covariance matrix can be expressed in terms of $X$, $K_r$, and $\boldsymbol{\mu}_r$:

$$\begin{aligned}
[K_{r,f}]_{ij} = \mathrm{Cov}(r(\tilde{s}_i), f(\boldsymbol{x}_j)) &= \mathrm{Cov}\left( r(\tilde{s}_i),\ \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j \right) \\
&= \mathbb{E}\left[ r(\tilde{s}_i) \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j r(\tilde{s}_i) \right] - \mathbb{E}[r(\tilde{s}_i)]\, \mathbb{E}\left[ \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j \right] \\
&= \sum_{l=1}^{d} z_{jl}\, \mathbb{E}[r(\tilde{s}_i) r(\tilde{s}_l)] - [\mu_r(\tilde{s}_i)][\boldsymbol{x}_j^T \boldsymbol{\mu}_r] \\
&= \sum_{l=1}^{d} z_{jl} \left\{ \mathrm{Cov}(r(\tilde{s}_i), r(\tilde{s}_l)) + \mathbb{E}[r(\tilde{s}_i)]\, \mathbb{E}[r(\tilde{s}_l)] \right\} - \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r \\
&= \sum_{l=1}^{d} z_{jl} \left[ \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l) + \mu_r(\tilde{s}_i) \mu_r(\tilde{s}_l) \right] - \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r \\
&= \sum_{l=1}^{d} z_{jl}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l) + \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r - \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r = \sum_{l=1}^{d} z_{jl}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l) = \boldsymbol{x}_j^T [K_r]_{i,:}^T,
\end{aligned}$$

where $[K_r]_{i,:}^T$ is the column vector obtained by transposing the $i$th row of $K_r$. It is evident that $K_{r,f} = K_r X^T$.
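The conclusion $K_{r,f} = K_r X^T$ can likewise be confirmed by building the cross-covariance entrywise from $\sum_l z_{jl}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l)$; the dimensions and random toy values below are again arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
two_N, d = 4, 5

X = rng.integers(0, 3, size=(two_N, d)).astype(float)  # visitation counts z_jl
A = rng.standard_normal((d, d))
K_r = A @ A.T                                          # PSD prior covariance

# Entrywise: [K_rf]_ij = sum_l z_jl * K_r(s_i, s_l) = x_j^T (i-th row of K_r)^T.
K_rf_entrywise = np.array(
    [[sum(X[j, l] * K_r[i, l] for l in range(d)) for j in range(two_N)]
     for i in range(d)])

# Matrix form: K_rf = K_r X^T, of shape (d, 2N).
K_rf_matrix = K_r @ X.T
```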
Posterior Inference over State-Action Utilities
Merging the previous three subsections' results, one obtains the following joint probability density between $\boldsymbol{r}$ and $\boldsymbol{f}$:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{f} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_r \\ X\boldsymbol{\mu}_r \end{bmatrix}, \begin{bmatrix} K_r & K_r X^T \\ X K_r & X K_r X^T + \sigma_n^2 I \end{bmatrix} \right).$$
This relationship expresses all components of the joint Gaussian density in terms of $X$, $K_r$, and $\boldsymbol{\mu}_r$, or in other words, in terms of the observed state-action visitation counts (i.e., $X$) and the Gaussian process prior on $\boldsymbol{r}$. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields $\boldsymbol{r} \mid \boldsymbol{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where the expressions for $\boldsymbol{\mu}$ and $\Sigma$ are given by Eqs. (A.2) and (A.3) above. By substituting $\boldsymbol{y}'$ for $\boldsymbol{f}$, the conditional posterior density of $\boldsymbol{r}$ can be expressed in terms of $X$, $\boldsymbol{y}'$, $K_r$, and $\boldsymbol{\mu}_r$, that is, in terms of observed data and the Gaussian process prior parameters.