A.2 Gaussian Process Regression
Credit assignment via Gaussian processes (Rasmussen and Williams, 2006) extends the linear credit assignment model in A.1 to larger numbers of features $d$ by generalizing across similar features. For instance, in the RL setting, one could learn over larger state and action spaces by generalizing across nearby states and actions. This section and the following one consider two Gaussian process-based credit assignment approaches.
To perform credit assignment via Gaussian process regression, one can assign binary labels to each observation based on whether it is preferred or dominated. A Gaussian process prior is placed upon the underlying utilities $\boldsymbol{r}$; for instance, in the RL setting, this prior is placed on the utilities of the individual state-action pairs. Using the fact that a bandit action or RL trajectory's total utility is a sum over the utilities in each dimension (e.g., over each state-action pair in RL), this section shows how to perform inference over sums of Gaussian process variables to infer the component utilities in $\boldsymbol{r}$ from the total utilities. As the total utility of each bandit action or RL trajectory is not observed in practice, the obtained binary preference labels are substituted as approximations in their place.
To avoid having to constantly distinguish between the bandit and RL settings, the rest of this section adapts all notation and terminology for preference-based RL. For instance, the dimensions of $\boldsymbol{r}$ are referred to as utilities of state-action pairs, while in the bandit setting, they are utility weights corresponding to each dimension of the action space. Similarly, in the RL setting, the observations $\boldsymbol{x}_{i1}, \boldsymbol{x}_{i2}$ correspond to trajectory features, while in the bandit setting, these are actions. However, the derived posterior update equations (Eqs. (A.2) and (A.3)) apply as-is to the generalized linear dueling bandit setting, and the two cases are mathematically identical with respect to the methods introduced in this section.
Let $\{\tilde{s}_1, \ldots, \tilde{s}_d\}$ denote the $d = SA$ state-action pairs. In this section, the data matrix $X \in \mathbb{R}^{2N \times d}$ holds all state-action visitation vectors $\boldsymbol{x}_{i1}, \boldsymbol{x}_{i2}$, for DPS iterations $i \in \{1, \ldots, N\}$. (This contrasts with the other credit assignment methods, which learn from their differences, $\boldsymbol{x}_{i2} - \boldsymbol{x}_{i1}$.) Let $\boldsymbol{x}_k^T$ be the $k$th row of $X$, such that $X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{2N}]^T$, and $\boldsymbol{x}_k = \boldsymbol{x}_{ij}$ for some DPS iteration $i$ and $j \in \{1, 2\}$; that is, $\boldsymbol{x}_k$ contains the state-action visit counts for the $k$th trajectory rollout. In particular, the $kj$th matrix element $z_{kj} = [X]_{kj}$ is the number of times that the $k$th observed trajectory $\tau_k$ visits state-action pair $\tilde{s}_j$.
The label vector is $\boldsymbol{y}' \in \mathbb{R}^{2N}$, where the $k$th element $y'_k$ is the preference label corresponding to the $k$th observed trajectory. For instance, if $\boldsymbol{x}_{i2} \succ \boldsymbol{x}_{i1}$, then $\boldsymbol{x}_{i2}$ receives a label of $\frac{1}{2}$, while $\boldsymbol{x}_{i1}$ is labelled $-\frac{1}{2}$. As before, $r(\tilde{s})$ denotes the underlying utility of state-action pair $\tilde{s}$, with $u(\tau)$ being trajectory $\tau$'s total utility along the state-action pairs it encounters.$^1$ To infer $\boldsymbol{r}$, each total utility $u(\tau_k)$ is approximated with its preference label $y'_k$.
A Gaussian process prior is placed upon the rewards $\boldsymbol{r}$: $\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r)$, where $\boldsymbol{\mu}_r \in \mathbb{R}^d$ is the prior mean and $K_r \in \mathbb{R}^{d \times d}$ is the prior covariance matrix, such that $[K_r]_{ij}$ models the prior covariance between $r(\tilde{s}_i)$ and $r(\tilde{s}_j)$. The total utility of trajectory $\tau_k$, denoted $u(\tau_k)$, is modeled as a sum over the latent state-action utilities: $u(\tau_k) = \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j)$. Let $f_k$ be a noisy version of $u(\tau_k)$: $f_k = u(\tau_k) + \epsilon_k$, where $\epsilon_k \sim \mathcal{N}(0, \sigma_n^2)$ is i.i.d. noise. Then, given rewards $\boldsymbol{r}$:

$$f_k = \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j) + \epsilon_k.$$
Because any linear combination of jointly Gaussian variables is Gaussian, $f_k$ is a Gaussian process over the values $\{z_{k1}, \ldots, z_{kd}\}$. Let $\boldsymbol{f} \in \mathbb{R}^{2N}$ be the vector with $k$th element equal to $f_k$. This section will calculate the relevant expectations and covariances to show that $\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r)$ and $\boldsymbol{f}$ have the following jointly-Gaussian distribution:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{f} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_r \\ X\boldsymbol{\mu}_r \end{bmatrix}, \begin{bmatrix} K_r & K_r X^T \\ X K_r & X K_r X^T + \sigma_n^2 I \end{bmatrix} \right). \tag{A.1}$$
The standard approach for obtaining a conditional distribution from a joint Gaussian distribution (Rasmussen and Williams, 2006) yields $\boldsymbol{r} \mid \boldsymbol{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where:

$$\boldsymbol{\mu} = \boldsymbol{\mu}_r + K_r X^T [X K_r X^T + \sigma_n^2 I]^{-1} (\boldsymbol{f} - X \boldsymbol{\mu}_r) \tag{A.2}$$

$$\Sigma = K_r - K_r X^T [X K_r X^T + \sigma_n^2 I]^{-1} X K_r. \tag{A.3}$$
In practice, the variable $\boldsymbol{f}$ is not observed. Instead, $\boldsymbol{f}$ is approximated with the observed preference labels $\boldsymbol{y}'$, $\boldsymbol{f} \approx \boldsymbol{y}'$, to perform credit assignment inference.
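To make the update concrete, the posterior computation of Eqs. (A.2) and (A.3) with labels substituted for $\boldsymbol{f}$ can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the thesis's implementation: the function name `gp_credit_assignment`, the toy visitation matrix, and the hyperparameter values are all assumptions chosen for the example.

```python
import numpy as np

def gp_credit_assignment(X, y_labels, K_r, mu_r, sigma_n):
    """Posterior over state-action utilities r given preference labels.

    Implements Eqs. (A.2) and (A.3), substituting the binary preference
    labels y' for the unobserved trajectory utilities f.
    X: (2N, d) state-action visitation counts; K_r: (d, d) prior
    covariance; mu_r: (d,) prior mean; sigma_n: noise std. dev.
    """
    K_f = X @ K_r @ X.T + sigma_n**2 * np.eye(X.shape[0])
    gain = K_r @ X.T @ np.linalg.inv(K_f)      # K_r X^T [X K_r X^T + s^2 I]^{-1}
    mu = mu_r + gain @ (y_labels - X @ mu_r)   # Eq. (A.2)
    Sigma = K_r - gain @ X @ K_r               # Eq. (A.3)
    return mu, Sigma

# Toy example: d = 3 state-action pairs, N = 1 preference (2 trajectories).
X = np.array([[2.0, 0.0, 1.0],    # visit counts of trajectory 1
              [0.0, 2.0, 1.0]])   # visit counts of trajectory 2
y = np.array([-0.5, 0.5])         # trajectory 2 is preferred
mu, Sigma = gp_credit_assignment(X, y, np.eye(3), np.zeros(3), sigma_n=0.1)
```

With this toy data, the posterior mean credits the state-action pairs visited only by the preferred trajectory with positive utility, those visited only by the dominated trajectory with negative utility, and the shared pair with approximately zero.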
Next, this section derives the posterior inference equations (A.2) and (A.3) used in Gaussian process regression credit assignment. The state-action rewards $\boldsymbol{r}$ are inferred given noisy observations $\boldsymbol{f}$ of the trajectories' total utilities via the following four steps, corresponding to the next four subsections:
$^1$The concept of a trajectory's total utility is analogous to a $d$-dimensional action's utility in the bandit setting, $\boldsymbol{a}^T \boldsymbol{r}$ for an action $\boldsymbol{a} \in \mathcal{A}$. A state-action utility $r(\tilde{s})$ is equal to a particular component of $\boldsymbol{r}$: $\boldsymbol{e}_i^T \boldsymbol{r}$ for some $i$, where $\boldsymbol{e}_i$ is a vector with 1 in the $i$th component and zeros elsewhere. A state-action utility $r(\tilde{s})$ corresponds to the utility weight of an action space dimension in the bandit setting, which is also $\boldsymbol{e}_i^T \boldsymbol{r}$ (for some $i$).
A) Model the state-action utilities $r(\tilde{s})$ as a Gaussian process over state-action pairs $\tilde{s}$.

B) Model the trajectory utilities $\boldsymbol{f}$ as a Gaussian process that results from summing the state-action utilities $r(\tilde{s})$.

C) Using the two Gaussian processes defined in A) and B), obtain the covariance matrix between the values of $\{r(\tilde{s}_i) \mid i = 1, \ldots, d\}$ and $\{f_k \mid k = 1, \ldots, 2N\}$.

D) Write the joint Gaussian distribution in Eq. (A.1) between the values of $\{r(\tilde{s}_i) \mid i = 1, \ldots, d\}$ and $\{f_k \mid k = 1, \ldots, 2N\}$, and obtain the posterior distribution of $\boldsymbol{r}$ over all state-action pairs given $\boldsymbol{f}$ (Eqs. (A.2) and (A.3)).
The State-Action Utility Gaussian Process
The state-action utilities $\boldsymbol{r}$ are modeled as a Gaussian process over $\tilde{s}$, with mean $\mathbb{E}[r(\tilde{s})] = \mu_r(\tilde{s})$ and covariance kernel $\mathrm{Cov}(r(\tilde{s}_i), r(\tilde{s}_j)) = \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)$ for all state-action pairs $\tilde{s}_i, \tilde{s}_j$. For instance, $\mathcal{K}_r$ could be the squared exponential kernel:

$$\mathcal{K}_r(\tilde{s}_i, \tilde{s}_j) = \sigma_f^2 \exp\left( -\frac{1}{2} \left( \frac{\|\phi(\tilde{s}_i) - \phi(\tilde{s}_j)\|}{l} \right)^2 \right) + \sigma_\epsilon^2 \delta_{ij}, \tag{A.4}$$

where $\sigma_f^2$ is the signal variance, $l$ is the kernel lengthscale, $\sigma_\epsilon^2$ is the noise variance, $\delta_{ij}$ is the Kronecker delta function, and $\phi: \{1, \ldots, S\} \times \{1, \ldots, A\} \longrightarrow \mathbb{R}^m$ maps each state-action pair to an $m$-dimensional representation that encodes proximity between the state-action pairs. For instance, in the Mountain Car problem, each state-action pair could be represented by a position and velocity (encoding the state) and a one-dimensional action, so that $m = 3$. Thus,

$$r(\tilde{s}_i) \sim \mathcal{GP}(\mu_r(\tilde{s}_i), \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)).$$
Define $\boldsymbol{\mu}_r \in \mathbb{R}^d$ such that the $i$th element is $[\boldsymbol{\mu}_r]_i = \mu_r(\tilde{s}_i)$, the prior mean of state-action $\tilde{s}_i$'s utility. Let $K_r \in \mathbb{R}^{d \times d}$ be the covariance matrix over state-action utilities, such that $[K_r]_{ij} = \mathcal{K}_r(\tilde{s}_i, \tilde{s}_j)$. Therefore, the reward vector $\boldsymbol{r}$ is also a Gaussian process:

$$\boldsymbol{r} \sim \mathcal{GP}(\boldsymbol{\mu}_r, K_r).$$
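As a small illustration of this prior construction, the squared exponential kernel of Eq. (A.4) can be evaluated over a handful of hypothetical Mountain Car-style feature vectors (position, velocity, action). The feature values and all hyperparameters below are arbitrary choices for the sketch, not values used in the thesis.

```python
import numpy as np

def sq_exp_kernel(phi_i, phi_j, i, j, sigma_f=1.0, length=0.5, sigma_eps=1e-3):
    """Squared exponential kernel (Eq. (A.4)) between feature vectors
    phi(s_i) and phi(s_j), with a Kronecker-delta noise term."""
    sq_dist = np.sum((phi_i - phi_j) ** 2)
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2) \
        + sigma_eps**2 * (1.0 if i == j else 0.0)

# Hypothetical m = 3 features per state-action pair:
# (position, velocity, action), as in the Mountain Car example.
phi = np.array([[-0.5, 0.00, -1.0],
                [-0.4, 0.01, -1.0],
                [ 0.3, 0.05,  1.0]])
d = phi.shape[0]

# Prior covariance matrix with [K_r]_ij = K_r(s_i, s_j).
K_r = np.array([[sq_exp_kernel(phi[i], phi[j], i, j)
                 for j in range(d)] for i in range(d)])
```

Nearby state-action pairs (the first two rows) receive high prior covariance, so evidence about one generalizes to the other, while distant pairs are nearly uncorrelated; this is precisely how the Gaussian process extends credit assignment to large state-action spaces.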
The Trajectory Utility Gaussian Process
By assumption, the trajectory utilities $\boldsymbol{f} \in \mathbb{R}^{2N}$ are sums of the latent state-action utilities via the following relationship between $\boldsymbol{f}$ and $\boldsymbol{r}$:

$$f(\boldsymbol{x}_k) := f_k = \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j) + \epsilon_k,$$

where $\epsilon_k$ are i.i.d. noise variables distributed according to $\mathcal{N}(0, \sigma_n^2)$. Note that $f(\boldsymbol{x}_k)$ is a Gaussian process over $\boldsymbol{x}_k \in \mathbb{R}^d$ because $\{r(\tilde{s}_j), \forall j\}$ are jointly normally distributed by definition of a Gaussian process, and any linear combination of jointly Gaussian variables has a univariate normal distribution. Next, the expectation and covariance of $\boldsymbol{f}$ are calculated. The expectation of the $k$th element $f_k = f(\boldsymbol{x}_k)$ can be expressed as:
$$\mathbb{E}[f_k] = \mathbb{E}\left[ \sum_{j=1}^{d} z_{kj}\, r(\tilde{s}_j) + \epsilon_k \right] = \sum_{j=1}^{d} z_{kj}\, \mathbb{E}[r(\tilde{s}_j)] = \sum_{j=1}^{d} z_{kj}\, \mu_r(\tilde{s}_j).$$
The expectation over $\boldsymbol{f}$ can thus be written as $\mathbb{E}[\boldsymbol{f}(X)] = X \boldsymbol{\mu}_r$. Next, the covariance matrix of $\boldsymbol{f}$ is computed. The $ij$th element of this matrix is the covariance of $f(\boldsymbol{x}_i)$ and $f(\boldsymbol{x}_j)$:
$$\begin{aligned}
\mathrm{Cov}(f(\boldsymbol{x}_i), f(\boldsymbol{x}_j)) &= \mathbb{E}[f(\boldsymbol{x}_i) f(\boldsymbol{x}_j)] - \mathbb{E}[f(\boldsymbol{x}_i)]\, \mathbb{E}[f(\boldsymbol{x}_j)] \\
&= \mathbb{E}\left[ \left( \sum_{k=1}^{d} z_{ik}\, r(\tilde{s}_k) + \epsilon_i \right) \left( \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j \right) \right] - \left( \sum_{k=1}^{d} z_{ik}\, \mu_r(\tilde{s}_k) \right) \left( \sum_{l=1}^{d} z_{jl}\, \mu_r(\tilde{s}_l) \right) \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mathbb{E}[r(\tilde{s}_k) r(\tilde{s}_l)] + \mathbb{E}[\epsilon_i \epsilon_j] - \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mu_r(\tilde{s}_k) \mu_r(\tilde{s}_l) \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl} \left[ \mathrm{Cov}(r(\tilde{s}_k), r(\tilde{s}_l)) + \mu_r(\tilde{s}_k) \mu_r(\tilde{s}_l) \right] - z_{ik} z_{jl}\, \mu_r(\tilde{s}_k) \mu_r(\tilde{s}_l) + \sigma_n^2\, \mathbb{1}[i = j] \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mathrm{Cov}(r(\tilde{s}_k), r(\tilde{s}_l)) + \sigma_n^2\, \mathbb{1}[i = j] \\
&= \sum_{k=1}^{d} \sum_{l=1}^{d} z_{ik} z_{jl}\, \mathcal{K}_r(\tilde{s}_k, \tilde{s}_l) + \sigma_n^2\, \mathbb{1}[i = j] = \boldsymbol{x}_i^T K_r \boldsymbol{x}_j + \sigma_n^2\, \mathbb{1}[i = j].
\end{aligned}$$
One can then write the covariance matrix of $\boldsymbol{f}$ as $K_f$, where:

$$[K_f]_{ij} := \mathrm{Cov}(f(\boldsymbol{x}_i), f(\boldsymbol{x}_j)) = \boldsymbol{x}_i^T K_r \boldsymbol{x}_j + \sigma_n^2\, \mathbb{1}[i = j].$$
From here, it can be seen that $K_f = X K_r X^T + \sigma_n^2 I$:

$$X K_r X^T = \begin{bmatrix} \boldsymbol{x}_1^T \\ \boldsymbol{x}_2^T \\ \vdots \\ \boldsymbol{x}_{2N}^T \end{bmatrix} K_r \begin{bmatrix} \boldsymbol{x}_1 & \boldsymbol{x}_2 & \ldots & \boldsymbol{x}_{2N} \end{bmatrix} = \begin{bmatrix} \boldsymbol{x}_1^T K_r \boldsymbol{x}_1 & \ldots & \boldsymbol{x}_1^T K_r \boldsymbol{x}_{2N} \\ \vdots & \ddots & \vdots \\ \boldsymbol{x}_{2N}^T K_r \boldsymbol{x}_1 & \ldots & \boldsymbol{x}_{2N}^T K_r \boldsymbol{x}_{2N} \end{bmatrix} = K_f - \sigma_n^2 I.$$
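This identity is easy to confirm numerically by comparing the matrix product against the entrywise definition $[K_f]_{ij} = \boldsymbol{x}_i^T K_r \boldsymbol{x}_j + \sigma_n^2\, \mathbb{1}[i=j]$. The visitation counts and prior covariance below are arbitrary toy values chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
two_N, d, sigma_n = 4, 6, 0.2

# Toy visitation counts X and a valid (PSD) prior covariance K_r.
X = rng.integers(0, 3, size=(two_N, d)).astype(float)
A = rng.standard_normal((d, d))
K_r = A @ A.T + 1e-6 * np.eye(d)

# Entrywise definition: [K_f]_ij = x_i^T K_r x_j + sigma_n^2 * 1[i == j].
K_f_entrywise = np.array(
    [[X[i] @ K_r @ X[j] + (sigma_n**2 if i == j else 0.0)
      for j in range(two_N)] for i in range(two_N)])

# Matrix form derived above: K_f = X K_r X^T + sigma_n^2 I.
K_f_matrix = X @ K_r @ X.T + sigma_n**2 * np.eye(two_N)
```

The two constructions agree entry for entry, matching the derivation.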
Covariance between State-Action and Trajectory Utilities
This subsection considers the covariance between $\boldsymbol{r}$ and $\boldsymbol{f}$, denoted $K_{r,f}$:

$$[K_{r,f}]_{ij} = \mathrm{Cov}([\boldsymbol{r}]_i, [\boldsymbol{f}]_j) = \mathrm{Cov}(r(\tilde{s}_i), f(\boldsymbol{x}_j)).$$

This covariance matrix can be expressed in terms of $X$, $K_r$, and $\boldsymbol{\mu}_r$:

$$\begin{aligned}
[K_{r,f}]_{ij} = \mathrm{Cov}(r(\tilde{s}_i), f(\boldsymbol{x}_j)) &= \mathrm{Cov}\left( r(\tilde{s}_i),\ \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j \right) \\
&= \mathbb{E}\left[ r(\tilde{s}_i) \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j r(\tilde{s}_i) \right] - \mathbb{E}[r(\tilde{s}_i)]\, \mathbb{E}\left[ \sum_{l=1}^{d} z_{jl}\, r(\tilde{s}_l) + \epsilon_j \right] \\
&= \sum_{l=1}^{d} z_{jl}\, \mathbb{E}[r(\tilde{s}_i) r(\tilde{s}_l)] - [\mu_r(\tilde{s}_i)][\boldsymbol{x}_j^T \boldsymbol{\mu}_r] \\
&= \sum_{l=1}^{d} z_{jl} \left\{ \mathrm{Cov}(r(\tilde{s}_i), r(\tilde{s}_l)) + \mathbb{E}[r(\tilde{s}_i)]\, \mathbb{E}[r(\tilde{s}_l)] \right\} - \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r \\
&= \sum_{l=1}^{d} z_{jl} \left[ \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l) + \mu_r(\tilde{s}_i) \mu_r(\tilde{s}_l) \right] - \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r \\
&= \sum_{l=1}^{d} z_{jl}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l) + \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r - \mu_r(\tilde{s}_i)\, \boldsymbol{x}_j^T \boldsymbol{\mu}_r = \sum_{l=1}^{d} z_{jl}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l) = \boldsymbol{x}_j^T [K_r]_{i,:}^T,
\end{aligned}$$

where $[K_r]_{i,:}^T$ is the column vector obtained by transposing the $i$th row of $K_r$. It is evident that $K_{r,f} = K_r X^T$.
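The conclusion $K_{r,f} = K_r X^T$ can likewise be confirmed by building the cross-covariance entrywise from $\sum_l z_{jl}\, \mathcal{K}_r(\tilde{s}_i, \tilde{s}_l)$; the dimensions and random toy values below are again arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
two_N, d = 4, 5

X = rng.integers(0, 3, size=(two_N, d)).astype(float)  # visitation counts z_jl
A = rng.standard_normal((d, d))
K_r = A @ A.T                                          # PSD prior covariance

# Entrywise: [K_rf]_ij = sum_l z_jl * K_r(s_i, s_l) = x_j^T (i-th row of K_r)^T.
K_rf_entrywise = np.array(
    [[sum(X[j, l] * K_r[i, l] for l in range(d)) for j in range(two_N)]
     for i in range(d)])

# Matrix form: K_rf = K_r X^T, of shape (d, 2N).
K_rf_matrix = K_r @ X.T
```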
Posterior Inference over State-Action Utilities
Merging the previous three subsections' results, one obtains the following joint probability density between $\boldsymbol{r}$ and $\boldsymbol{f}$:

$$\begin{bmatrix} \boldsymbol{r} \\ \boldsymbol{f} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_r \\ X\boldsymbol{\mu}_r \end{bmatrix}, \begin{bmatrix} K_r & K_r X^T \\ X K_r & X K_r X^T + \sigma_n^2 I \end{bmatrix} \right).$$
This relationship expresses all components of the joint Gaussian density in terms of $X$, $K_r$, and $\boldsymbol{\mu}_r$, or in other words, in terms of the observed state-action visitation counts (i.e., $X$) and the Gaussian process prior on $\boldsymbol{r}$. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields $\boldsymbol{r} \mid \boldsymbol{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where the expressions for $\boldsymbol{\mu}$ and $\Sigma$ are given by Eqs. (A.2) and (A.3) above. By substituting $\boldsymbol{y}'$ for $\boldsymbol{f}$, the conditional posterior density of $\boldsymbol{r}$ can be expressed in terms of $X$, $\boldsymbol{y}'$, $K_r$, and $\boldsymbol{\mu}_r$, that is, in terms of observed data and the Gaussian process prior parameters.