
6.3 Iterative Amortized Policy Optimization

using normalizing flows (Haarnoja, Hartikainen, et al., 2018; Y. Tang and Agrawal, 2018) and latent variables (Tirumala et al., 2019). In principle, iterative amortization can perform policy optimization in each of these settings.

Iterative amortized policy optimization is conceptually similar to negative feedback control (Astrom and Murray, 2008), using errors to update policy estimates. However, while conventional feedback control methods are often restricted in their applicability, e.g., linear systems and quadratic cost, iterative amortization is generally applicable to any differentiable control objective. This is analogous to the generalization of Kalman filtering (Kalman, 1960) to amortized variational filtering (Chapter 4) for state estimation.


Benefits of Iterative Amortization

Reduced Amortization Gap. Iterative amortized optimizers are more flexible than their direct counterparts, incorporating feedback from the objective during policy optimization (Algorithm 4), rather than only after optimization (Algorithm 3). Increased flexibility improves the accuracy of optimization, thereby tightening the variational bound (Chapters 3 & 4). We see this flexibility in Figure 6.1 (Left), where an iterative amortized policy network iteratively refines the policy estimate (•), quickly arriving near the optimal estimate.
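To make the feedback loop concrete, here is a minimal PyTorch sketch of what one per-state refinement loop could look like; the update network `f_phi`, its inputs, and the `objective` estimate are hypothetical placeholders, not the exact architecture used in this chapter, and the training of `f_phi` is not shown.

```python
import torch

def iterative_amortized_policy_opt(f_phi, objective, mu, log_sigma, n_iters=5):
    """Sketch of per-state iterative amortized policy optimization.

    f_phi:     hypothetical learned update network, mapping the current
               Gaussian policy parameters and the objective's gradients
               to refined parameters (the feedback step).
    objective: differentiable estimate of the per-state objective, e.g.,
               an expected Q-value plus an entropy or KL term.
    """
    for _ in range(n_iters):
        mu = mu.detach().requires_grad_(True)
        log_sigma = log_sigma.detach().requires_grad_(True)
        J = objective(mu, log_sigma)
        grad_mu, grad_log_sigma = torch.autograd.grad(J.sum(), (mu, log_sigma))
        # Feedback during optimization: refine the estimate using the gradients.
        mu, log_sigma = f_phi(mu, log_sigma, grad_mu, grad_log_sigma)
    return mu, log_sigma
```

By contrast, a direct amortized policy network maps the state to the distribution parameters in a single forward pass, receiving feedback from the objective only through its parameter updates, i.e., after optimization.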

Multiple Estimates. Iterative amortization, by using stochastic gradients and random initialization, can traverse the optimization landscape. As with any iterative optimization scheme, this allows iterative amortization to obtain multiple valid estimates, referred to as "multi-stability" in the perception literature (Greff et al., 2019). We illustrate this capability across two action dimensions in Figure 6.2 for a state in the Ant-v2 MuJoCo environment. Over multiple policy optimization runs, iterative amortization finds multiple modes, sampling from two high-value regions of the action space. This provides increased flexibility in action exploration, despite only using a uni-modal policy distribution.
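This multi-stability is a generic property of stochastic, iteratively refined estimates. The toy example below illustrates it with plain gradient ascent standing in for the amortized optimizer, on a hypothetical one-dimensional value function with two modes; different random initializations settle in different modes.

```python
import torch

def bimodal_value(a):
    # Toy stand-in for Q(s, a) over one action dimension, with two
    # high-value modes at a = -0.7 and a = +0.7.
    return torch.exp(-20.0 * (a - 0.7) ** 2) + torch.exp(-20.0 * (a + 0.7) ** 2)

found = set()
for run in range(8):
    a = (0.5 * torch.randn(1)).requires_grad_(True)  # random initialization
    opt = torch.optim.Adam([a], lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        (-bimodal_value(a)).sum().backward()         # ascend the value estimate
        opt.step()
    found.add(round(a.item(), 1))

print(found)  # typically {-0.7, 0.7}: different runs recover different modes
```

With iterative amortization, the learned update network plays the role of the gradient-based optimizer here, but the same multi-stability arises from random initialization and stochastic gradients.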

Generalization Across Objectives. Iterative amortization uses the gradients of the objective during optimization, i.e., feedback, allowing it to potentially generalize to new or updated objectives. We see this in Figure 6.1 (Left), where iterative amortization, despite being trained with a different value estimator, is capable of generalizing to this new objective. We demonstrate this capability further in Section 6.4. This opens the possibility of accurately and efficiently performing policy optimization in new settings, e.g., a rapidly changing model or new tasks.

[Figure 6.2 plots: (Left) policy mean estimates in the $\tanh(\mu_2)$–$\tanh(\mu_6)$ plane over optimization runs 1–4, colored by the objective value $J$; (Right) the corresponding densities $\pi_\phi(\mathbf{a}|\mathbf{s}, \mathcal{O})$ over action dimensions 2 and 6.]

Figure 6.2: Estimating Multiple Policy Modes. Unlike direct amortization, which is restricted to a single estimate, the stochasticity of iterative optimization allows iterative amortization to effectively sample from multiple high-value modes in the action space. This capability is shown for a particular state in Ant-v2, with multiple optimization runs across two action dimensions (Left). Each colored square denotes an initialization. The optimizer finds both modes, with the assigned density plotted on the Right. This capability provides increased flexibility in action exploration.

Consideration: Mitigating Value Overestimation

Why are more powerful policy optimizers typically not used in practice? As we now describe, part of the issue stems from value overestimation. Model-free approaches generally estimate $Q^\pi$ using function approximation and temporal difference learning. However, this has the pitfall of value overestimation, i.e., positive bias in the estimate, $\widehat{Q}^\pi$ (Thrun and Schwartz, 1993). This issue is tied to uncertainty in the value estimate, though it is distinct from optimism under uncertainty. If the policy can exploit regions of high uncertainty, the resulting target values will introduce positive bias into the estimate. More flexible policy optimizers exacerbate the problem, exploiting this uncertainty to a greater degree. Further, a rapidly changing policy increases the difficulty of value estimation (Rajeswaran, Mordatch, and V. Kumar, 2020).
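To build intuition for this bias, consider a toy numerical illustration (a deliberate simplification with an arbitrary noise scale, not the temporal difference setup itself): when actions are selected by maximizing noisy value estimates, the selected values are positively biased even though every true value is zero.

```python
import torch

torch.manual_seed(0)

# True Q-values for 10 candidate actions are all zero; the critic only
# provides noisy estimates of them (noise scale chosen arbitrarily).
n_actions, n_trials = 10, 10000
true_q = torch.zeros(n_actions)
noisy_q = true_q + 0.5 * torch.randn(n_trials, n_actions)

# Selecting the action with the highest *estimated* value inherits the
# noise: the resulting targets are positively biased relative to the
# true values, and the bias grows with the estimates' uncertainty.
greedy_value = noisy_q.max(dim=1).values.mean()
print(float(greedy_value))  # roughly +0.77, despite all true values being 0
```

A more flexible policy optimizer is better at locating such spuriously high estimates, which is precisely the concern for iterative amortization discussed below.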

Various techniques have been proposed for mitigating value overestimation in deep RL. The most prominent technique, double deep $Q$-network (Van Hasselt, Guez, and Silver, 2016), maintains two $Q$-value estimates (Van Hasselt, 2010), attempting to decouple policy optimization from value estimation. Fujimoto, Hoof, and Meger, 2018 apply and improve upon this technique for actor-critic settings, estimating the target $Q$-value as the minimum of two $Q$-networks, $Q_{\psi_1}$ and $Q_{\psi_2}$:

$$\widehat{Q}^\pi(\mathbf{s}, \mathbf{a}) = \min_{i=1,2} Q_{\psi'_i}(\mathbf{s}, \mathbf{a}),$$

where $\psi'_i$ denotes the "target" network parameters. As noted by Fujimoto, Hoof, and Meger, 2018, this not only counteracts value overestimation, but also penalizes high-variance value estimates, because the minimum decreases with the variance of the estimate. Ciosek et al., 2019 noted that, for a bootstrapped ensemble of two $Q$-networks, the minimum operation can be interpreted as estimating

𝑄bπœ‹(s,a) =πœ‡π‘„(s,a) βˆ’π›½πœŽπ‘„(s,a),

0.0 0.5 1.0 1.5 2.0 2.5 3.0 Million Steps

βˆ’50 0 50 100 150

bQΟ€βˆ’QΟ€

direct,Ξ²= 1 iterative,Ξ²= 1 iterative,Ξ²= 2.5

(a)

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Million Steps 100

101 102

DKL(Ο€new||Ο€old)

(b)

Figure 6.3: Mitigating Value Overestimation. Using the same value estimation setup (𝛽=1), shown onAnt-v2, iterative policy optimization results in (a) higher value overestimation bias and (b) a more rapidly changing policy as compared with direct policy optimization. Increasing 𝛽 helps to mitigate these issues by further penalizing variance in the value estimate.

with mean and standard deviation πœ‡π‘„(s,a) ≑ 1

2 Γ•

𝑖=1,2

π‘„πœ“0 𝑖

(s,a),

πœŽπ‘„(s,a) ≑

"

1 2

Γ•

𝑖=1,2

π‘„πœ“0

𝑖

(s,a) βˆ’πœ‡π‘„(s,a)2#1/2

,

and 𝛽 = 1. Thus, to further penalize high-variance value estimates, preventing value overestimation, we can increase 𝛽. For large 𝛽, however, value estimates become overly pessimistic, negatively impacting training. Thus, 𝛽reduces target value variance at the cost of increased bias.

Due to the flexibility of iterative amortization, the default $\beta = 1$ results in increased value bias (Figure 6.3a) and a more rapidly changing policy (Figure 6.3b) as compared with direct amortization. Further penalizing high-variance target values with $\beta = 2.5$ reduces value overestimation and improves policy stability. For details, see Marino, Piché, et al., 2020. Recent techniques for mitigating overestimation have been proposed, such as adjusting the temperature, $\alpha$ (Fox, 2019). In offline RL, this issue has been tackled through the action prior (Fujimoto, Meger, and Precup, 2019; A. Kumar, J. Fu, et al., 2019; Wu, Tucker, and Nachum, 2019) or by altering $Q$-network training (Agarwal, Schuurmans, and Norouzi, 2019; A. Kumar, Zhou, et al., 2020). While such techniques could be used here, increasing $\beta$ provides a simple solution with no additional computational overhead.
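As a concrete sketch of this target computation, assuming the two target networks' outputs `q1` and `q2` have already been computed for a batch of state-action pairs, the snippet below forms $\mu_Q - \beta \sigma_Q$ and checks that $\beta = 1$ recovers the minimum of the two estimates.

```python
import torch

def pessimistic_target(q1, q2, beta=1.0):
    """Sketch: combine two target Q-network outputs into a single target value.

    For two estimates, mu - sigma equals their minimum, so beta = 1.0
    recovers the clipped double-Q target; larger beta penalizes
    high-variance estimates more heavily, at the cost of added bias.
    """
    q = torch.stack([q1, q2], dim=0)
    mu = q.mean(dim=0)
    sigma = ((q - mu) ** 2).mean(dim=0).sqrt()  # population std of the two estimates
    return mu - beta * sigma

q1 = torch.tensor([1.0, 3.0, -2.0])  # placeholder Q-values for a batch of 3
q2 = torch.tensor([2.0, 1.0, -2.5])
assert torch.allclose(pessimistic_target(q1, q2, beta=1.0), torch.min(q1, q2))
target = pessimistic_target(q1, q2, beta=2.5)  # the more pessimistic setting used here
```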

[Figure 6.4 plots: (a) action dimension 5 of the policy distribution and (b) the objective improvement $\Delta J$, over 25 environment time steps, across policy optimization iterations 1–5.]

Figure 6.4: Policy Optimization. Visualization over time steps of (a) one dimension of the policy distribution and (b) the improvement in the objective, $\Delta J$, across policy optimization iterations.

6.4 Experiments