Chapter V: Improving Sequential Latent Variable Models with Autoregressive
6.2 Background
We consider Markov decision processes (MDPs), where $\mathbf{s}_t \in \mathcal{S}$ and $\mathbf{a}_t \in \mathcal{A}$ are the state and action at time $t$, resulting in reward $r_t = r(\mathbf{s}_t, \mathbf{a}_t)$. Environment state transitions are given by $\mathbf{s}_{t+1} \sim p_\text{env}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$, and the agent is defined by a parametric distribution, $p_\theta(\mathbf{a}_t | \mathbf{s}_t)$, with parameters $\theta$. The discounted sum of rewards is denoted as $\mathcal{R}(\tau) = \sum_t \gamma^t r_t$, where $\gamma \in (0, 1]$ is the discount factor, and $\tau = (\mathbf{s}_1, \mathbf{a}_1, \dots)$ is a trajectory. The distribution over trajectories is

\[ p(\tau) = \rho(\mathbf{s}_1) \prod_{t=1}^{T} p_\text{env}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \, p_\theta(\mathbf{a}_t | \mathbf{s}_t), \tag{6.1} \]

where the initial state is drawn from the distribution $\rho(\mathbf{s}_1)$. The standard RL objective consists of maximizing the expected discounted return, $\mathbb{E}_{p(\tau)}[\mathcal{R}(\tau)]$. For convenience of presentation, we use the undiscounted setting ($\gamma = 1$), though the formulation can be applied with any valid $\gamma$.
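As a small illustration of the return defined above, the following sketch computes $\sum_t \gamma^t r_t$ for a finite list of rewards. The function name and the reward values are hypothetical, and the exponent is assumed to start at $t = 1$, following the trajectory indexing $(\mathbf{s}_1, \mathbf{a}_1, \dots)$; other conventions start the exponent at 0.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_t gamma**t * r_t with t = 1, 2, ... (indexing assumed from the text)."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_1, r_2, r_3

# gamma = 1 is the undiscounted setting used for presentation in the text.
undiscounted = discounted_return(rewards)
discounted = discounted_return(rewards, gamma=0.5)
```

With $\gamma = 1$ the result is simply the sum of the rewards, which is the quantity used in the remainder of the section.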
KL-Regularized Reinforcement Learning
Various works have formulated RL, planning, and control problems in terms of probabilistic inference (Dayan and Hinton, 1997; Attias, 2003; Verma and Rao, 2006; Toussaint and Storkey, 2006; Todorov, 2008; Botvinick and Toussaint, 2012;
Levine, 2018). These approaches consider the agent-environment interaction as a graphical model, then convert reward maximization into maximum marginal likelihood estimation, learning and inferring a policy that results in maximal reward.
This conversion is accomplished by introducing one or more binary observed variables (Cooper, 1988), denoted as $\mathcal{O}$, with

\[ p(\mathcal{O} = 1 | \tau) \propto \exp\left( \mathcal{R}(\tau) / \alpha \right), \]

where $\alpha$ is a temperature hyper-parameter. These new variables are often referred to as "optimality" variables (Levine, 2018). We would like to infer latent variables, $\tau$, and learn parameters, $\theta$, that yield the maximum log-likelihood of optimality, i.e., $\log p(\mathcal{O} = 1)$. Evaluating this likelihood requires marginalizing the joint distribution, $p(\mathcal{O} = 1) = \int p(\tau, \mathcal{O} = 1) \, d\tau$. This involves averaging over all trajectories, which is intractable in high-dimensional spaces. Instead, we can use variational inference to lower bound this objective, introducing a structured approximate posterior distribution:

\[ \pi(\tau | \mathcal{O}) = \prod_{t=1}^{T} p_\text{env}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \, \pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}). \tag{6.2} \]

This provides the following lower bound on the objective, $\log p(\mathcal{O} = 1)$:
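\[ \log \int p(\mathcal{O} = 1 | \tau) \, p(\tau) \, d\tau \geq \int \pi(\tau | \mathcal{O}) \left[ \log p(\mathcal{O} = 1 | \tau) + \log \frac{p(\tau)}{\pi(\tau | \mathcal{O})} \right] d\tau \tag{6.3} \]

\[ = \mathbb{E}_\pi \left[ \mathcal{R}(\tau) / \alpha \right] - D_\text{KL}\left( \pi(\tau | \mathcal{O}) \,\|\, p(\tau) \right). \tag{6.4} \]

Equivalently, we can multiply by $\alpha$, defining the variational RL objective as

\[ \mathcal{J}(\pi, \theta) \equiv \mathbb{E}_\pi \left[ \mathcal{R}(\tau) \right] - \alpha D_\text{KL}\left( \pi(\tau | \mathcal{O}) \,\|\, p(\tau) \right). \tag{6.5} \]

This objective consists of the expected return (i.e., the standard RL objective) and a KL divergence between $\pi(\tau | \mathcal{O})$ and $p(\tau)$. In terms of states and actions, this objective is written as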
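\[ \mathcal{J}(\pi, \theta) = \mathbb{E}_{\substack{\mathbf{s}_t, r_t \sim p_\text{env} \\ \mathbf{a}_t \sim \pi}} \left[ \sum_{t=1}^{T} \left( r_t - \alpha \log \frac{\pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O})}{p_\theta(\mathbf{a}_t | \mathbf{s}_t)} \right) \right]. \tag{6.6} \]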
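The lower bound of Eqs. 6.3 and 6.4 can be checked numerically on a space small enough to marginalize exactly. The sketch below is only illustrative: the four "trajectories", their returns, and the prior probabilities are hypothetical, and we assume the proportionality constant in $p(\mathcal{O} = 1 | \tau)$ is fixed by subtracting the maximum return, which shifts both sides of the bound by the same constant.

```python
import math

alpha = 0.5                      # temperature (hypothetical value)
returns = [1.0, 2.0, 0.5, 3.0]   # R(tau) for four hypothetical trajectories
prior = [0.4, 0.3, 0.2, 0.1]     # p(tau)

# Assumption: p(O=1|tau) = exp((R(tau) - R_max) / alpha), so that it is a valid probability.
r_max = max(returns)
log_lik = [(r - r_max) / alpha for r in returns]


def log_evidence():
    """Exact log p(O=1) = log sum_tau p(O=1|tau) p(tau)."""
    return math.log(sum(math.exp(l) * p for l, p in zip(log_lik, prior)))


def lower_bound(q):
    """E_q[log p(O=1|tau)] - KL(q || p), the right-hand side of Eq. 6.3."""
    expected_log_lik = sum(qi * l for qi, l in zip(q, log_lik) if qi > 0)
    kl = sum(qi * math.log(qi / p) for qi, p in zip(q, prior) if qi > 0)
    return expected_log_lik - kl


# The exact posterior, proportional to p(O=1|tau) p(tau), makes the bound tight.
weights = [math.exp(l) * p for l, p in zip(log_lik, prior)]
posterior = [w / sum(weights) for w in weights]

bound_at_prior = lower_bound(prior)
bound_at_posterior = lower_bound(posterior)
```

Choosing the approximate posterior equal to the prior gives a valid but loose bound, while the exact posterior attains equality. Variational RL searches between these extremes within a restricted family of policies.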
At a given timestep, $t$, one can optimize this objective by estimating the future terms in the summation using a "soft" action-value ($Q_\pi$) network (Haarnoja, H. Tang, et al., 2017) or model (Piché et al., 2019). For instance, sampling $\mathbf{s}_t \sim p_\text{env}$, slightly abusing notation, we can write the objective at time $t$ as

\[ \mathcal{J}(\pi, \theta) = \mathbb{E}_\pi \left[ Q_\pi(\mathbf{s}_t, \mathbf{a}_t) \right] - \alpha D_\text{KL}\left( \pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}) \,\|\, p_\theta(\mathbf{a}_t | \mathbf{s}_t) \right). \tag{6.7} \]

Policy optimization in the KL-regularized setting corresponds to maximizing $\mathcal{J}$ w.r.t. $\pi$. We often consider parametric policies, in which $\pi$ is defined by distribution parameters, $\boldsymbol{\lambda}$, e.g., Gaussian mean, $\boldsymbol{\mu}$, and variance, $\boldsymbol{\sigma}^2$. In this case, policy optimization corresponds to maximizing:

\[ \boldsymbol{\lambda} \leftarrow \arg\max_{\boldsymbol{\lambda}} \mathcal{J}(\pi, \theta). \tag{6.8} \]

Optionally, we can then also learn the policy prior parameters, $\theta$ (Abdolmaleki et al., 2018).
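To make Eqs. 6.7 and 6.8 concrete, the following sketch maximizes the objective by gradient ascent on $\boldsymbol{\lambda} = (\mu, \log \sigma)$ for a one-dimensional Gaussian policy. It assumes a hypothetical quadratic action-value function, $Q(a) = -(a - a^*)^2$, and a Gaussian prior, for which both terms of the objective have closed forms; all constants are illustrative and none come from the experiments.

```python
import math

ALPHA = 0.2                        # temperature (hypothetical)
A_STAR = 0.8                       # maximizer of the toy Q-function (hypothetical)
PRIOR_MU, PRIOR_SIGMA = 0.0, 1.0   # Gaussian prior p_theta(a|s) (hypothetical)


def objective(mu, log_sigma):
    """Closed-form J for pi = N(mu, sigma^2) under Q(a) = -(a - A_STAR)^2."""
    var = math.exp(2.0 * log_sigma)
    expected_q = -((mu - A_STAR) ** 2 + var)
    kl = (math.log(PRIOR_SIGMA) - log_sigma
          + (var + (mu - PRIOR_MU) ** 2) / (2.0 * PRIOR_SIGMA ** 2) - 0.5)
    return expected_q - ALPHA * kl


def gradient(mu, log_sigma):
    """Analytic gradient of J with respect to lambda = (mu, log sigma)."""
    var = math.exp(2.0 * log_sigma)
    d_mu = -2.0 * (mu - A_STAR) - ALPHA * (mu - PRIOR_MU) / PRIOR_SIGMA ** 2
    d_log_sigma = -2.0 * var - ALPHA * (var / PRIOR_SIGMA ** 2 - 1.0)
    return d_mu, d_log_sigma


def optimize(mu=0.0, log_sigma=0.0, step=0.1, iters=500):
    """Plain gradient ascent on the distribution parameters (Eq. 6.8)."""
    for _ in range(iters):
        d_mu, d_log_sigma = gradient(mu, log_sigma)
        mu += step * d_mu
        log_sigma += step * d_log_sigma
    return mu, log_sigma


mu_hat, log_sigma_hat = optimize()

# Stationary point obtained by setting the gradient to zero.
precision = 2.0 + ALPHA / PRIOR_SIGMA ** 2
mu_closed = (2.0 * A_STAR + ALPHA * PRIOR_MU / PRIOR_SIGMA ** 2) / precision
var_closed = ALPHA / precision
```

In this toy case the optimal mean is pulled from $a^*$ toward the prior mean, and the optimal variance shrinks as $\alpha$ decreases, which is the expected effect of the KL term.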
[Figure 6.1. Panel (a): the objective $\mathcal{J}$ plotted over $\tanh(\mu_1)$ (horizontal axis) and $\tanh(\mu_3)$ (vertical axis), with markers for the direct policy network, the iterative policy network, and the optimal estimate. Panel (b): diagrams of direct and iterative amortization, relating $Q_\pi(\mathbf{s}_t, \mathbf{a}_t)$, $\pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O})$, $p_\theta(\mathbf{a}_t | \mathbf{s}_t)$, and $p_\text{env}(\mathbf{s}_t | \mathbf{s}_{t-1}, \mathbf{a}_{t-1})$ to the objective terms $\mathbb{E}_\pi[Q_\pi(\mathbf{s}_t, \mathbf{a}_t)]$ and $\alpha \mathbb{E}_\pi[\log \pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}) / p_\theta(\mathbf{a}_t | \mathbf{s}_t)]$.]
Figure 6.1: Amortized Policy Optimization. (a) 2D visualization of policy optimization. A direct amortized policy network fails to output an optimal estimate, resulting in an amortization gap in performance. An iterative amortized policy network finds an improved estimate. (b) Diagrams of direct and iterative amortized policy optimization. Larger circles denote distributions, and smaller red circles denote terms in the objective, $\mathcal{J}$. Dashed arrows denote amortized optimizers.
Iterative amortization uses gradient-based feedback during optimization, whereas direct amortization does not.
Entropy & KL Regularized Policy Networks Perform Direct Amortization
Policy-based approaches to RL typically do not directly optimize the action distribution parameters, e.g., through gradient-based optimization. Instead, the action distribution parameters are output by a function approximator (deep network), $f_\phi$, which is trained using deterministic (Silver et al., 2014; Lillicrap et al., 2016) or stochastic gradients (Williams, 1992; Heess et al., 2015). When combined with entropy or KL regularization, this policy network is a form of amortized optimization (Gershman and Goodman, 2014), learning to estimate policies. Again, denoting the action distribution parameters, e.g., mean and variance, as $\boldsymbol{\lambda}$, for a given state, $\mathbf{s}$, we can express this direct mapping as

\[ \boldsymbol{\lambda} \leftarrow f_\phi(\mathbf{s}), \quad \text{(direct amortization)} \tag{6.9} \]

denoting the corresponding policy as $\pi_\phi(\mathbf{a} | \mathbf{s}, \mathcal{O}; \boldsymbol{\lambda})$. Thus, $f_\phi$ attempts to learn to optimize Eq. 6.8. This setup is shown in Figure 6.1(b). Without entropy or KL regularization, i.e., $\pi_\phi(\mathbf{a} | \mathbf{s}) = p_\theta(\mathbf{a} | \mathbf{s})$, we can instead interpret the network as directly integrating the LHS of Eq. 6.3, which is less efficient and more challenging. Regularization smooths the optimization landscape, yielding more stable improvement and higher asymptotic performance (Ahmed et al., 2019).
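The direct mapping of Eq. 6.9 can be sketched with a deliberately small example. Here the "network" is a hypothetical linear map from a scalar state to the policy mean, the variance is held fixed, and the objective is a toy quadratic with a KL-style penalty toward a zero-mean prior. The states stand in for a batch of visited states, and none of the constants come from the experiments.

```python
ALPHA = 0.2                            # temperature (hypothetical)
ETA = 0.1                              # step size for the network parameters (hypothetical)
STATES = [-1.0, -0.5, 0.0, 0.5, 1.0]   # stand-in for a batch of visited states


def grad_mu(s, mu):
    """dJ/dmu for the toy objective J(mu; s) = -(mu - s**2)**2 - (ALPHA / 2) * mu**2."""
    return -2.0 * (mu - s * s) - ALPHA * mu


def policy_network(phi, s):
    """Direct amortization: lambda <- f_phi(s); here lambda is only the mean."""
    w, b = phi
    return w * s + b


def train(steps=200):
    """Gradient ascent on phi, averaging the objective gradient over the batch of states."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grads = [grad_mu(s, policy_network((w, b), s)) for s in STATES]
        # Chain rule: dJ/dw = dJ/dmu * s and dJ/db = dJ/dmu.
        grad_w = sum(g * s for g, s in zip(grads, STATES)) / len(STATES)
        grad_b = sum(grads) / len(STATES)
        w += ETA * grad_w
        b += ETA * grad_b
    return w, b


phi = train()
```

The network only receives gradients through the training update, after its estimate has been produced. Because the per-state optimum in this toy problem is quadratic in the state, a linear map cannot match it at every state, however long it is trained.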
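Algorithm 3 Direct Amortization
  Initialize $\phi$
  for each environment step do
    $\boldsymbol{\lambda} \leftarrow f_\phi(\mathbf{s}_t)$
    $\mathbf{a}_t \sim \pi_\phi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}; \boldsymbol{\lambda})$
    $\mathbf{s}_{t+1} \sim p_\text{env}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$
  end for
  for each training step do
    $\phi \leftarrow \phi + \eta \nabla_\phi \mathcal{J}$
  end for

Algorithm 4 Iterative Amortization
  Initialize $\phi$
  for each environment step do
    Initialize $\boldsymbol{\lambda}$
    for each policy optimization iteration do
      $\boldsymbol{\lambda} \leftarrow f_\phi(\mathbf{s}_t, \boldsymbol{\lambda}, \nabla_{\boldsymbol{\lambda}} \mathcal{J})$
    end for
    $\mathbf{a}_t \sim \pi_\phi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}; \boldsymbol{\lambda})$
    $\mathbf{s}_{t+1} \sim p_\text{env}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$
  end for
  for each training step do
    $\phi \leftarrow \phi + \eta \nabla_\phi \mathcal{J}$
  end for

Viewing policy networks as a form of direct amortized variational optimizer (Eq. 6.9) allows us to see that they are similar to "encoder" networks in variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014). However, there are several drawbacks to direct amortization.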
Amortization Gap. Direct amortization results in suboptimal approximate posterior estimates, with the resulting gap in the variational bound referred to as the amortization gap (Cremer, Li, and Duvenaud, 2018). Thus, in the RL setting, an amortized policy, $\pi_\phi$, results in worse performance than the optimal policy within the parametric policy class, denoted as $\widehat{\pi}$. The amortization gap is the gap in the following inequality:

\[ \mathcal{J}(\pi_\phi, \theta) \leq \mathcal{J}(\widehat{\pi}, \theta). \]

Because $\mathcal{J}$ is a variational bound on the RL objective, i.e., expected return, a looser bound, due to amortization, prevents one from more completely optimizing this objective.
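The inequality above can be illustrated with a toy objective in which the optimum within the policy class is known in closed form. In the sketch below, the quadratic objective, the linear "direct" estimate, and its parameter values are all hypothetical; only the policy mean is considered, with the variance held fixed.

```python
ALPHA = 0.2  # temperature (hypothetical)


def objective(s, mu):
    """Toy J(mu; s): a quadratic Q-function peaked at s**2 plus a KL-style penalty toward 0."""
    return -(mu - s * s) ** 2 - 0.5 * ALPHA * mu ** 2


def optimal_mean(s):
    """Maximizer of the toy objective, playing the role of the optimal estimate pi-hat."""
    return 2.0 * s * s / (2.0 + ALPHA)


def direct_mean(s, w=0.0, b=0.45):
    """Hypothetical linear policy network f_phi(s) = w * s + b."""
    return w * s + b


states = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Amortization gap per state: J(pi-hat) - J(pi_phi), which is never negative.
gaps = [objective(s, optimal_mean(s)) - objective(s, direct_mean(s)) for s in states]
```

The gap is zero only where the network output coincides with the per-state optimum. In this example it is largest at the states whose optimum lies furthest from the constant output of the linear map.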
This is shown in Figure 6.1(a), where $\mathcal{J}$ is plotted over two dimensions of the policy mean at a particular state in the MuJoCo environment Hopper-v2 (additional 2D plots are shown in Appendix Figure B.3). The estimate of a direct amortized policy is suboptimal, far from the optimal estimate. While the difference in the objective is relatively small, suboptimal estimates prevent sampling and exploring high-value regions of the action-space. That is, suboptimal estimates have only a minor impact on evaluation performance (see Appendix B.4) but hinder effective data collection.
Single Estimate. Direct amortization is limited to a single, static estimate. In other words, if there are multiple high-value regions of the action-space, a uni-modal (e.g., Gaussian) direct amortized policy is restricted to only one region, thereby limiting exploration. Note that this is an additional restriction beyond simply considering uni-modal distributions, as a generic optimization procedure may arrive at multiple uni-modal estimates depending on initialization and stochastic sampling (see Section 6.3). While multi-modal distributions reduce the severity of this restriction (Y. Tang and Agrawal, 2018; Haarnoja, Hartikainen, et al., 2018), the other limitations of direct amortization still persist.
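The dependence of a generic optimizer on its initialization can be seen in a small sketch. We assume a hypothetical bimodal action-value function over a scalar action and, for simplicity, ascend it directly with respect to the policy mean, holding the variance fixed and omitting the regularizer. This is a simplification for illustration, not the objective of Eq. 6.7.

```python
import math


def q_value(a):
    """Hypothetical bimodal action-value function with peaks near a = -1 and a = +1."""
    return math.exp(-4.0 * (a + 1.0) ** 2) + 0.9 * math.exp(-4.0 * (a - 1.0) ** 2)


def q_grad(a):
    """Analytic derivative of q_value with respect to the action."""
    return (-8.0 * (a + 1.0) * math.exp(-4.0 * (a + 1.0) ** 2)
            - 7.2 * (a - 1.0) * math.exp(-4.0 * (a - 1.0) ** 2))


def ascend(mu, step=0.05, iters=300):
    """Generic gradient ascent on the policy mean from a given initialization."""
    for _ in range(iters):
        mu += step * q_grad(mu)
    return mu


# Two initializations on either side of the valley reach two different modes.
estimates = [ascend(mu0) for mu0 in (-0.5, 0.5)]
```

A deterministic direct mapping, by contrast, returns the same estimate every time it is evaluated at a given state, so it can cover at most one of these regions.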
Inability to Generalize Across Objectives. Direct amortization is a feedforward procedure, receiving gradients from the objective only after estimation. This is in contrast to other forms of optimization, which receive gradients (feedback) during estimation. Thus, unlike other optimizers, direct amortization is incapable of generalizing to new objectives, e.g., if $Q_\pi(\mathbf{s}, \mathbf{a})$ or $p_\theta(\mathbf{a} | \mathbf{s})$ change, which is a desirable capability for adapting to new tasks or environments.
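The difference between receiving feedback after and during estimation can be sketched as follows. The direct estimate is a fixed function of the state, whereas the feedback-driven update takes the current gradient as input. Here that update is a plain gradient step with a hand-set gain, used as a simplified stand-in for a learned optimizer; the objective, targets, and parameter values are all hypothetical.

```python
ALPHA = 0.2  # temperature (hypothetical)


def make_grad(target):
    """Gradient of the toy objective J(mu) = -(mu - target)**2 - (ALPHA / 2) * mu**2."""
    def grad(mu):
        return -2.0 * (mu - target) - ALPHA * mu
    return grad


def optimum(target):
    """Closed-form maximizer of the toy objective."""
    return 2.0 * target / (2.0 + ALPHA)


def direct_estimate(state, phi=(0.0, 0.36)):
    """Direct amortization: a fixed map from the state only; it never sees the objective."""
    w, b = phi
    return w * state + b


def iterative_estimate(grad_fn, gain=0.3, iters=50, mu=0.0):
    """Feedback during estimation: mu <- mu + gain * dJ/dmu (hand-set gain, not learned)."""
    for _ in range(iters):
        mu = mu + gain * grad_fn(mu)
    return mu


state = 0.0
old_task = make_grad(target=0.4)    # objective the direct map was roughly tuned for
new_task = make_grad(target=-0.7)   # changed objective, e.g. a new Q-function

direct_on_new = direct_estimate(state)           # unchanged by the new objective
iterative_on_old = iterative_estimate(old_task)
iterative_on_new = iterative_estimate(new_task)
```

When the objective changes, the direct estimate stays where it was until the map is retrained, while the gradient-driven update moves to the new optimum without any change to its own parameters.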
To improve upon this scheme and overcome these drawbacks, in Section 6.3, we turn to iterative amortization (Chapter 3), retaining the efficiency of amortization while employing a more flexible iterative estimation procedure.
Related Work
Previous works have investigated methods for improving policy optimization. QT-Opt (Kalashnikov et al., 2018) uses the cross-entropy method (CEM) (Rubinstein and Kroese, 2013), an iterative derivative-free optimizer, to optimize a $Q$-value estimator for robotic grasping. CEM and related methods are also used in model-based RL for performing model-predictive control (Nagabandi et al., 2018; Chua et al., 2018; Piché et al., 2019; Hafner et al., 2019). Gradient-based policy optimization (Henaff, Whitney, and LeCun, 2017; Srinivas et al., 2018; Bharadhwaj, Xie, and Shkurti, 2020), in contrast, is less common; however, gradient-based optimization can also be combined with CEM (Amos and Yarats, 2020). Most policy-based methods use direct amortization, either using a feedforward (Haarnoja, Zhou, Abbeel, et al., 2018) or recurrent (Guez et al., 2019) network. Similar approaches have also been applied to model-based value estimates (Byravan et al., 2020; Clavera, Y. Fu, and Abbeel, 2020; Amos, Stanton, et al., 2020), as well as combining direct amortization with model predictive control (Lee, Saigol, and Theodorou, 2019) and planning (Rivière et al., 2020). A separate line of work has explored improving the policy distribution, using normalizing flows (Haarnoja, Hartikainen, et al., 2018; Y. Tang and Agrawal, 2018) and latent variables (Tirumala et al., 2019). In principle, iterative amortization can perform policy optimization in each of these settings.
Iterative amortized policy optimization is conceptually similar to negative feedback control (Astrom and Murray, 2008), using errors to update policy estimates. However, while conventional feedback control methods are often restricted in their applicability, e.g., linear systems and quadratic cost, iterative amortization is generally applicable to any differentiable control objective. This is analogous to the generalization of Kalman filtering (Kalman, 1960) to amortized variational filtering (Chapter 4) for state estimation.
6.3 Iterative Amortized Policy Optimization