Background - Improving Sequential Latent Variable Models with Autoregressive

Chapter V: Improving Sequential Latent Variable Models with Autoregressive

6.2 Background

We consider Markov decision processes (MDPs), where s𝑡 ∈ S and a𝑡 ∈ A are the state and action at time 𝑡, resulting in reward 𝑟_𝑡 = 𝑟(s𝑡, a𝑡). Environment state transitions are given by s𝑡+1 ∼ 𝑝_env(s𝑡+1|s𝑡, a𝑡), and the agent is defined by a parametric distribution, 𝑝_𝜃(a𝑡|s𝑡), with parameters 𝜃. The discounted sum of rewards is denoted as R(𝜏) = Σ_𝑡 𝛾^𝑡 𝑟_𝑡, where 𝛾 ∈ (0, 1] is the discount factor, and 𝜏 = (s₁, a₁, . . .) is a trajectory. The distribution over trajectories is

$$p(\tau) = \rho(\mathbf{s}_1) \prod_{t=1}^{T} p_{\text{env}}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \, p_\theta(\mathbf{a}_t | \mathbf{s}_t), \tag{6.1}$$

where the initial state is drawn from the distribution 𝜌(s₁). The standard RL objective consists of maximizing the expected discounted return, E_𝑝(𝜏)[R(𝜏)]. For convenience of presentation, we use the undiscounted setting (𝛾 = 1), though the formulation can be applied with any valid 𝛾.
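As a rough illustration of the trajectory distribution and return defined above, the following sketch samples a trajectory by alternating policy and environment draws and then sums the rewards. The toy policy, dynamics, and reward function are hypothetical placeholders, not part of the original text, and the exponent follows the text's 𝛾^𝑡 with 𝑡 starting at 1 (many implementations use 𝛾^(𝑡−1) instead).

```python
import random


def discounted_return(rewards, gamma=1.0):
    """R(tau) = sum_t gamma^t * r_t, indexing t from 1 as in the text."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def sample_trajectory(policy, env_step, reward_fn, s1, T, rng):
    """Ancestral sampling from p(tau): a_t ~ p_theta(.|s_t), s_{t+1} ~ p_env(.|s_t, a_t)."""
    s = s1
    states, actions, rewards = [], [], []
    for _ in range(T):
        a = policy(s, rng)
        states.append(s)
        actions.append(a)
        rewards.append(reward_fn(s, a))
        s = env_step(s, a, rng)
    return states, actions, rewards


# Hypothetical toy problem: 1-D state, Gaussian policy, noisy additive dynamics.
def toy_policy(s, rng):
    return rng.gauss(-0.5 * s, 0.1)


def toy_env_step(s, a, rng):
    return s + a + rng.gauss(0.0, 0.01)


def toy_reward(s, a):
    return -(s ** 2) - 0.1 * (a ** 2)


rng = random.Random(0)
states, actions, rewards = sample_trajectory(
    toy_policy, toy_env_step, toy_reward, s1=1.0, T=5, rng=rng
)
print(discounted_return(rewards, gamma=1.0))
```

Averaging `discounted_return` over many sampled trajectories would give a Monte Carlo estimate of the expected return.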

KL-Regularized Reinforcement Learning

Various works have formulated RL, planning, and control problems in terms of probabilistic inference (Dayan and Hinton, 1997; Attias, 2003; Verma and Rao, 2006; Toussaint and Storkey, 2006; Todorov, 2008; Botvinick and Toussaint, 2012; Levine, 2018). These approaches consider the agent-environment interaction as a graphical model, then convert reward maximization into maximum marginal likelihood estimation, learning and inferring a policy that results in maximal reward.

This conversion is accomplished by introducing one or more binary observed variables (Cooper, 1988), denoted as O, with

$$p(\mathcal{O} = 1 | \tau) \propto \exp\left( \mathcal{R}(\tau) / \alpha \right),$$

where 𝛼 is a temperature hyper-parameter. These new variables are often referred to as “optimality” variables (Levine, 2018). We would like to infer latent variables, 𝜏, and learn parameters, 𝜃, that yield the maximum log-likelihood of optimality, i.e., log 𝑝(O = 1). Evaluating this likelihood requires marginalizing the joint distribution, 𝑝(O = 1) = ∫ 𝑝(𝜏, O = 1) 𝑑𝜏. This involves averaging over all trajectories, which is intractable in high-dimensional spaces. Instead, we can use variational inference to lower bound this objective, introducing a structured approximate posterior distribution:

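$$\pi(\tau | \mathcal{O}) = \prod_{t=1}^{T} p_{\text{env}}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \, \pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}). \tag{6.2}$$

This provides the following lower bound on the objective, log 𝑝(O = 1):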

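\begin{align}
\log \int p(\mathcal{O} = 1 | \tau) \, p(\tau) \, d\tau &\geq \int \pi(\tau | \mathcal{O}) \left[ \log p(\mathcal{O} = 1 | \tau) + \log \frac{p(\tau)}{\pi(\tau | \mathcal{O})} \right] d\tau \tag{6.3} \\
&= \mathbb{E}_{\pi}\left[ \mathcal{R}(\tau) / \alpha \right] - D_{\mathrm{KL}}(\pi(\tau | \mathcal{O}) \,\|\, p(\tau)). \tag{6.4}
\end{align}

Equivalently, we can multiply by 𝛼, defining the variational RL objective as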

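$$\mathcal{J}(\pi, \theta) \equiv \mathbb{E}_{\pi}\left[ \mathcal{R}(\tau) \right] - \alpha D_{\mathrm{KL}}(\pi(\tau | \mathcal{O}) \,\|\, p(\tau)). \tag{6.5}$$

This objective consists of the expected return (i.e., the standard RL objective) and a KL divergence between 𝜋(𝜏|O) and 𝑝(𝜏). In terms of states and actions, this objective is written as

$$\mathcal{J}(\pi, \theta) = \mathbb{E}_{\substack{\mathbf{s}_t, r_t \sim p_{\text{env}} \\ \mathbf{a}_t \sim \pi}} \left[ \sum_{t=1}^{T} r_t - \alpha \log \frac{\pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O})}{p_\theta(\mathbf{a}_t | \mathbf{s}_t)} \right]. \tag{6.6}$$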
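The step from Eq. 6.5 to Eq. 6.6 can be spelled out as follows. This is a sketch that assumes the two trajectory distributions share the same initial-state distribution and the same environment dynamics, so that those factors cancel inside the log-ratio:

```latex
\begin{align*}
D_{\mathrm{KL}}(\pi(\tau | \mathcal{O}) \,\|\, p(\tau))
  &= \mathbb{E}_{\pi}\left[ \log \frac{\pi(\tau | \mathcal{O})}{p(\tau)} \right] \\
  &= \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} \log
     \frac{p_{\text{env}}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \, \pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O})}
          {p_{\text{env}}(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \, p_\theta(\mathbf{a}_t | \mathbf{s}_t)} \right] \\
  &= \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} \log
     \frac{\pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O})}{p_\theta(\mathbf{a}_t | \mathbf{s}_t)} \right].
\end{align*}
```

Substituting this expression into Eq. 6.5, together with the undiscounted return R(𝜏) = Σ_𝑡 𝑟_𝑡, yields Eq. 6.6, where the sum over 𝑡 covers both the reward and the log-ratio term.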

At a given timestep, 𝑡, one can optimize this objective by estimating the future terms in the summation using a “soft” action-value (𝑄_𝜋) network (Haarnoja, H. Tang, et al., 2017) or model (Piché et al., 2019). For instance, sampling s𝑡 ∼ 𝑝_env, slightly abusing notation, we can write the objective at time 𝑡 as

$$\mathcal{J}(\pi, \theta) = \mathbb{E}_{\pi}\left[ Q_\pi(\mathbf{s}_t, \mathbf{a}_t) \right] - \alpha D_{\mathrm{KL}}(\pi(\mathbf{a}_t | \mathbf{s}_t, \mathcal{O}) \,\|\, p_\theta(\mathbf{a}_t | \mathbf{s}_t)). \tag{6.7}$$

Policy optimization in the KL-regularized setting corresponds to maximizing J w.r.t. 𝜋. We often consider parametric policies, in which 𝜋 is defined by distribution parameters, λ, e.g., Gaussian mean, µ, and variance, σ². In this case, policy optimization corresponds to maximizing:

$$\boldsymbol{\lambda} \leftarrow \arg\max_{\boldsymbol{\lambda}} \mathcal{J}(\pi, \theta). \tag{6.8}$$

Optionally, we can then also learn the policy prior parameters, 𝜃 (Abdolmaleki et al., 2018).
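To make Eqs. 6.7 and 6.8 concrete, the sketch below maximizes the objective over λ = (µ, log σ) by plain gradient ascent at a single state. It assumes a hypothetical quadratic action-value, 𝑄(a) = −(a − a*)², a Gaussian policy, and a standard normal prior, chosen only because both terms of the objective are then available in closed form. It is one generic way of carrying out Eq. 6.8, not the procedure proposed in this chapter.

```python
import math

ALPHA = 0.5   # temperature alpha (illustrative value)
A_STAR = 1.0  # maximizer of the hypothetical quadratic Q-function


def objective(mu, log_sigma, alpha=ALPHA, a_star=A_STAR):
    """J = E_pi[Q] - alpha * KL(pi || prior), with pi = N(mu, sigma^2),
    prior = N(0, 1), and Q(a) = -(a - a_star)^2 (all assumptions)."""
    var = math.exp(2.0 * log_sigma)
    expected_q = -((mu - a_star) ** 2 + var)
    kl = 0.5 * (var + mu ** 2 - 1.0) - log_sigma
    return expected_q - alpha * kl


def gradient(mu, log_sigma, alpha=ALPHA, a_star=A_STAR):
    """Analytic gradient of the objective w.r.t. (mu, log_sigma)."""
    var = math.exp(2.0 * log_sigma)
    d_mu = -2.0 * (mu - a_star) - alpha * mu
    d_log_sigma = -2.0 * var - alpha * (var - 1.0)
    return d_mu, d_log_sigma


def optimize(mu=0.0, log_sigma=0.0, step=0.1, iters=500):
    """lambda <- argmax_lambda J, approximated by gradient ascent."""
    for _ in range(iters):
        d_mu, d_ls = gradient(mu, log_sigma)
        mu, log_sigma = mu + step * d_mu, log_sigma + step * d_ls
    return mu, log_sigma


mu_opt, log_sigma_opt = optimize()
print(mu_opt, math.exp(2.0 * log_sigma_opt))
```

Under these assumptions the stationary point is available analytically, µ = 2a*/(2 + 𝛼) and σ² = 𝛼/(2 + 𝛼), which shows how the KL term pulls the mean toward the prior and how the temperature sets the policy variance.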

[Figure 6.1. Panel (a): 2D plot of the objective over the axes tanh(µ1) and tanh(µ3), with a legend for the Direct Policy Network, Iterative Policy Network, and Optimal Estimate. Panel (b): diagrams labeled Direct and Iterative, containing the nodes 𝑄_𝜋(s𝑡, a𝑡), 𝜋(a𝑡|s𝑡, O), 𝑝_𝜃(a𝑡|s𝑡), and 𝑝_env(s𝑡|s𝑡−1, a𝑡−1), along with the objective terms E_𝜋[𝑄_𝜋(s𝑡, a𝑡)] and the 𝛼-weighted expected log-ratio of 𝜋(a𝑡|s𝑡, O) to 𝑝_𝜃(a𝑡|s𝑡).]

Figure 6.1: Amortized Policy Optimization. (a) 2D visualization of policy optimization. A direct amortized policy network fails to output an optimal estimate, resulting in an amortization gap in performance. An iterative amortized policy network finds an improved estimate. (b) Diagrams of direct and iterative amortized policy optimization. Larger circles denote distributions, and smaller red circles denote terms in the objective, J. Dashed arrows denote amortized optimizers. Iterative amortization uses gradient-based feedback during optimization, whereas direct amortization does not.

Entropy & KL Regularized Policy Networks Perform Direct Amortization

Policy-based approaches to RL typically do not directly optimize the action distribution parameters, e.g., through gradient-based optimization. Instead, the action distribution parameters are output by a function approximator (deep network), 𝑓_𝜙, which is trained using deterministic (Silver et al., 2014; Lillicrap et al., 2016) or stochastic gradients (Williams, 1992; Heess et al., 2015). When combined with entropy or KL regularization, this policy network is a form of amortized optimization (Gershman and Goodman, 2014), learning to estimate policies. Again, denoting the action distribution parameters, e.g., mean and variance, as λ, for a given state, s, we can express this direct mapping as

$$\boldsymbol{\lambda} \leftarrow f_\phi(\mathbf{s}), \qquad \text{(direct amortization)} \tag{6.9}$$

denoting the corresponding policy as 𝜋_𝜙(a|s, O; λ). Thus, 𝑓_𝜙 attempts to learn to optimize Eq. 6.8. This setup is shown in Figure 6.1 (Right). Without entropy or KL regularization, i.e., 𝜋_𝜙(a|s) = 𝑝_𝜃(a|s), we can instead interpret the network as directly integrating the LHS of Eq. 6.3, which is less efficient and more challenging. Regularization smooths the optimization landscape, yielding more stable improvement and higher asymptotic performance (Ahmed et al., 2019).

Viewing policy networks as a form of direct amortized variational optimizer (Eq. 6.9) allows us to see that they are similar to “encoder” networks in variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014).

Algorithm 3 Direct Amortization
    Initialize 𝜙
    for each environment step do
        λ ← 𝑓_𝜙(s𝑡)
        a𝑡 ∼ 𝜋_𝜙(a𝑡|s𝑡, O; λ)
        s𝑡+1 ∼ 𝑝_env(s𝑡+1|s𝑡, a𝑡)
    end for
    for each training step do
        𝜙 ← 𝜙 + 𝜂∇_𝜙 J
    end for

Algorithm 4 Iterative Amortization
    Initialize 𝜙
    for each environment step do
        Initialize λ
        for each policy optimization iteration do
            λ ← 𝑓_𝜙(s𝑡, λ, ∇_λ J)
        end for
        a𝑡 ∼ 𝜋_𝜙(a𝑡|s𝑡, O; λ)
        s𝑡+1 ∼ 𝑝_env(s𝑡+1|s𝑡, a𝑡)
    end for
    for each training step do
        𝜙 ← 𝜙 + 𝜂∇_𝜙 J
    end for
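The difference between the estimation steps of Algorithms 3 and 4 can be sketched in a few lines. This is only an illustration of the dataflow: the toy objective is hypothetical, 𝑓_𝜙 is reduced to a linear map in the direct case and to a single gain on the gradient in the iterative case, and the parameter values are hand-picked rather than trained.

```python
def objective_grad(lam, s):
    """Gradient w.r.t. lam of a hypothetical objective J(lam; s) = -(lam - 2*s)**2."""
    return -2.0 * (lam - 2.0 * s)


def direct_estimate(s, phi):
    """Algorithm 3 estimate: lam <- f_phi(s), here a linear map phi = (w, b).
    The objective is never queried during estimation."""
    w, b = phi
    return w * s + b


def iterative_estimate(s, phi, n_iters=5, lam_init=0.0):
    """Algorithm 4 estimate: lam <- f_phi(s, lam, grad J), repeated.
    In this simplification f_phi is a single gain phi applied to the gradient,
    so the state enters only through the gradient."""
    lam = lam_init
    for _ in range(n_iters):
        lam = lam + phi * objective_grad(lam, s)
    return lam


lam_direct = direct_estimate(2.0, phi=(2.0, 0.0))
lam_iterative = iterative_estimate(2.0, phi=0.25)
print(lam_direct, lam_iterative)
```

The direct estimate is produced in a single feedforward pass, whereas the iterative estimate is refined over several passes, each conditioned on the current gradient of the objective.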

However, there are several drawbacks to direct amortization.

Amortization Gap. Direct amortization results in suboptimal approximate posterior estimates, with the resulting gap in the variational bound referred to as the amortization gap (Cremer, Li, and Duvenaud, 2018). Thus, in the RL setting, an amortized policy, 𝜋_𝜙, results in worse performance than the optimal policy within the parametric policy class, denoted as π̂. The amortization gap is the gap in the following inequality:

$$\mathcal{J}(\pi_\phi, \theta) \leq \mathcal{J}(\hat{\pi}, \theta).$$

Because J is a variational bound on the RL objective, i.e., expected return, a looser bound, due to amortization, prevents one from more completely optimizing this objective.

This is shown in Figure 6.1 (Left),¹ where J is plotted over two dimensions of the policy mean at a particular state in the MuJoCo environment Hopper-v2. The estimate of a direct amortized policy is suboptimal, far from the optimal estimate. While the relative difference in the objective is small, suboptimal estimates prevent sampling and exploring high-value regions of the action-space. That is, suboptimal estimates have only a minor impact on evaluation performance (see Appendix B.4) but hinder effective data collection.

¹ Additional 2D plots are shown in Appendix Figure B.3.
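The amortization gap inequality can be checked numerically on a small example. The sketch below assumes a hypothetical objective whose per-state optimum is λ*(s) = s², and a direct amortizer restricted to linear maps; even the best such map (fitted in closed form by least squares) leaves a non-negative gap at every state.

```python
def optimal_param(s):
    """Hypothetical per-state optimum lam*(s) of the toy objective."""
    return s * s


def objective(lam, s):
    """Toy stand-in for J: maximized (with value 0) at lam = optimal_param(s)."""
    return -(lam - optimal_param(s)) ** 2


def fit_linear_amortizer(states):
    """Best linear map lam = w*s + b in the average-J sense, i.e. least squares."""
    n = len(states)
    mean_s = sum(states) / n
    mean_t = sum(optimal_param(s) for s in states) / n
    cov = sum((s - mean_s) * (optimal_param(s) - mean_t) for s in states)
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    return w, mean_t - w * mean_s


states = [-1.0, -0.5, 0.0, 0.5, 1.0]
w, b = fit_linear_amortizer(states)

# Gap at each state: J(optimal estimate) - J(amortized estimate) >= 0.
gaps = [objective(optimal_param(s), s) - objective(w * s + b, s) for s in states]
print(gaps)
```

A more expressive amortizer would shrink the gap in this toy setting, but the gap is zero only if the mapping reproduces the per-state optimum exactly.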

Single Estimate. Direct amortization is limited to a single, static estimate. In other words, if there are multiple high-value regions of the action-space, a uni-modal (e.g., Gaussian) direct amortized policy is restricted to only one region, thereby limiting exploration. Note that this is an additional restriction beyond simply considering uni-modal distributions, as a generic optimization procedure may arrive at multiple uni-modal estimates depending on initialization and stochastic sampling (see Section 6.3). While multi-modal distributions reduce the severity of this restriction (Y. Tang and Agrawal, 2018; Haarnoja, Hartikainen, et al., 2018), the other limitations of direct amortization still persist.
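The dependence on initialization can be illustrated with a generic gradient-based optimizer. The sketch below assumes a hypothetical bimodal action-value with maxima at a = −1 and a = +1, and optimizes only the policy mean, ignoring the KL term for simplicity. Different initializations settle in different modes, whereas a deterministic direct mapping would return the same estimate every time it is evaluated at this state.

```python
def q_value(a):
    """Hypothetical bimodal action-value with maxima at a = -1 and a = +1."""
    return -(a * a - 1.0) ** 2


def q_grad(a):
    """Derivative of q_value with respect to the action."""
    return -4.0 * a * (a * a - 1.0)


def gradient_ascent(a_init, step=0.05, iters=200):
    """Plain gradient ascent on the policy mean, started from a_init."""
    a = a_init
    for _ in range(iters):
        a = a + step * q_grad(a)
    return a


inits = [-1.2, -0.3, 0.4, 1.3]
estimates = [gradient_ascent(a0) for a0 in inits]
print(estimates)
```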

Inability to Generalize Across Objectives. Direct amortization is a feedforward procedure, receiving gradients from the objective only after estimation. This is in contrast to other forms of optimization, which receive gradients (feedback) during estimation. Thus, unlike other optimizers, direct amortization is incapable of generalizing to new objectives, e.g., if 𝑄_𝜋(s, a) or 𝑝_𝜃(a|s) change, which is a desirable capability for adapting to new tasks or environments.
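A small sketch of this distinction, under hypothetical assumptions: two quadratic objectives stand in for a change in 𝑄_𝜋 or 𝑝_𝜃, the direct map is represented by a frozen output that was fitted to the first objective, and the feedback optimizer is generic gradient ascent that uses finite-difference gradients of whichever objective it is handed.

```python
def numerical_grad(objective, lam, h=1e-4):
    """Central finite-difference estimate of d objective / d lam."""
    return (objective(lam + h) - objective(lam - h)) / (2.0 * h)


def feedback_optimizer(objective, lam=0.0, step=0.25, iters=50):
    """Generic optimizer that receives gradients (feedback) during estimation."""
    for _ in range(iters):
        lam = lam + step * numerical_grad(objective, lam)
    return lam


def frozen_direct_estimate():
    """Stand-in for a direct map fitted to objective_a. It never queries the
    objective at estimation time, so its output cannot change with it."""
    return 1.0


def objective_a(lam):
    return -(lam - 1.0) ** 2


def objective_b(lam):
    return -(lam + 2.0) ** 2


for objective in (objective_a, objective_b):
    print(objective.__name__, frozen_direct_estimate(), feedback_optimizer(objective))
```

The frozen estimate remains at the optimum of the first objective after the swap, while the feedback optimizer follows the new gradients toward the optimum of the second.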

To improve upon this scheme and overcome these drawbacks, in Section 6.3, we turn to iterative amortization (Chapter 3), retaining the efficiency of amortization while employing a more flexible iterative estimation procedure.

Related Work

Previous works have investigated methods for improving policy optimization. QT-Opt (Kalashnikov et al., 2018) uses the cross-entropy method (CEM) (Rubinstein and Kroese, 2013), an iterative derivative-free optimizer, to optimize a 𝑄-value estimator for robotic grasping. CEM and related methods are also used in model-based RL for performing model-predictive control (Nagabandi et al., 2018; Chua et al., 2018; Piché et al., 2019; Hafner et al., 2019). Gradient-based policy optimization (Henaff, Whitney, and LeCun, 2017; Srinivas et al., 2018; Bharadhwaj, Xie, and Shkurti, 2020), in contrast, is less common; however, gradient-based optimization can also be combined with CEM (Amos and Yarats, 2020). Most policy-based methods use direct amortization, using either a feedforward (Haarnoja, Zhou, Abbeel, et al., 2018) or recurrent (Guez et al., 2019) network. Similar approaches have also been applied to model-based value estimates (Byravan et al., 2020; Clavera, Y. Fu, and Abbeel, 2020; Amos, Stanton, et al., 2020), and direct amortization has been combined with model predictive control (Lee, Saigol, and Theodorou, 2019) and planning (Rivière et al., 2020). A separate line of work has explored improving the policy distribution, using normalizing flows (Haarnoja, Hartikainen, et al., 2018; Y. Tang and Agrawal, 2018) and latent variables (Tirumala et al., 2019). In principle, iterative amortization can perform policy optimization in each of these settings.

Iterative amortized policy optimization is conceptually similar to negative feedback control (Astrom and Murray, 2008), using errors to update policy estimates. However, while conventional feedback control methods are often restricted in their applicability, e.g., linear systems and quadratic cost, iterative amortization is generally applicable to any differentiable control objective. This is analogous to the generalization of Kalman filtering (Kalman, 1960) to amortized variational filtering (Chapter 4) for state estimation.

6.3 Iterative Amortized Policy Optimization

From the document Learned Feedback & Feedforward Perception & Control (pages 114-119)