
Chapter VII: Sequential Autoregressive Flow-Based Policies

7.3 Experiments

We incorporate the proposed autoregressive feedforward component into soft actor-critic (SAC), yielding what we refer to as autoregressive SAC (ARSAC). The affine transform parameters in ARSAC are conditioned on a varying number of previous actions, which we denote with a suffix, e.g., ARSAC-5 conditions on the 5 previous actions. Unless otherwise stated, we adopt the default hyperparameters from Haarnoja, Zhou, Abbeel, et al., 2018, as well as the automatic entropy-tuning scheme proposed in Haarnoja, Zhou, Hartikainen, et al., 2018. We evaluate ARSAC across a range of locomotion tasks in environments from the dm_control suite (Tassa et al., 2020).
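To make the policy parameterization concrete, the sketch below illustrates one way such an autoregressive affine transform could be attached to a SAC-style Gaussian base distribution in PyTorch. All names (ARSACPolicy, base_net, ar_net) and details such as layer sizes and the handling of the action window are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class ARSACPolicy(nn.Module):
    """Minimal sketch of an ARSAC-style policy: a state-conditioned Gaussian
    base distribution pi(u|s) followed by an affine transform whose shift and
    scale are predicted from a window of previous actions."""

    def __init__(self, state_dim, action_dim, window=5, base_hidden=256):
        super().__init__()
        # Base distribution network, pi(u|s): outputs mean and log-std.
        self.base_net = nn.Sequential(
            nn.Linear(state_dim, base_hidden), nn.ReLU(),
            nn.Linear(base_hidden, base_hidden), nn.ReLU(),
            nn.Linear(base_hidden, 2 * action_dim),
        )
        # Autoregressive feedforward component: maps the flattened window of
        # previous actions to the affine shift (delta) and log-scale (log beta).
        self.ar_net = nn.Sequential(
            nn.Linear(window * action_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * action_dim),
        )

    def forward(self, state, prev_actions):
        # prev_actions: (batch, window, action_dim)
        mu, log_sigma = self.base_net(state).chunk(2, dim=-1)
        u = mu + log_sigma.exp() * torch.randn_like(mu)       # sample u ~ pi(u|s)
        delta, log_beta = self.ar_net(prev_actions.flatten(1)).chunk(2, dim=-1)
        pre_tanh = delta + log_beta.exp() * u                 # autoregressive affine step
        return torch.tanh(pre_tanh)                           # squashed action
```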

[Figure 7.2 plots cumulative reward versus millions of environment steps for SAC and ARSAC-5 on: (a) hopper hop, (b) hopper stand, (c) walker walk, (d) walker run, (e) swimmer swimmer6, (f) cheetah run, (g) quadruped walk, (h) quadruped run, (i) humanoid stand, (j) humanoid walk, (k) humanoid run.]

Figure 7.2: Performance Comparison, 2×256 (Default) Policy Network. Comparison between SAC and ARSAC with an action window of 5 time steps across dm_control suite environments. Each curve shows the mean and standard deviation across 5 random seeds.

Performance Comparison

In Figure 7.2, we compare the cumulative reward of SAC and ARSAC-5 throughout training, with 5 random seeds for each setup on each environment. On most environments, ARSAC-5 performs roughly as well as or better than SAC, with improved sample efficiency and/or final performance. However, ARSAC-5 struggles on the more difficult tasks, humanoid walk and humanoid run. In Figure 7.3, we present results on these environments with varying autoregressive window sizes. Changing the window size improves performance, but it is still not enough to bridge the performance gap on humanoid run.

We hypothesize that the autoregressive transform will have a larger impact when the (state-dependent) base distribution is constrained in some way, either computationally or through delays. To probe this, we decrease the size of the base distribution network, restricting it to a single hidden layer with 32 units (the default is 2 hidden layers of 256 units each). The results, on a subset of environments, are shown in Figure 7.4, where we see that the performance gap does indeed widen slightly.
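For concreteness, the snippet below contrasts the default 2×256 base network with the constrained 1×32 variant. The dimensions and the mean/log-std output parameterization are placeholders for illustration, mirroring the sketch above rather than the actual implementation.

```python
import torch.nn as nn

state_dim, action_dim = 24, 6  # placeholder dimensions for a walker-like task

# Default base distribution network: 2 hidden layers of 256 units each.
default_base = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2 * action_dim),  # mean and log-std per action dimension
)

# Constrained base network used in Figure 7.4: a single hidden layer of 32 units.
small_base = nn.Sequential(
    nn.Linear(state_dim, 32), nn.ReLU(),
    nn.Linear(32, 2 * action_dim),
)
```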

[Figure 7.3 plots cumulative reward versus millions of environment steps for SAC, ARSAC-3, ARSAC-5, and ARSAC-10 on: (a) humanoid walk, (b) humanoid run.]

Figure 7.3: Humanoid Performance Comparison, 2×256 (Default) Policy Network. Comparison between SAC and ARSAC with varying action window sizes on humanoid tasks from the dm_control suite. Each curve shows the mean and standard deviation across 5 random seeds.

[Figure 7.4 plots cumulative reward versus millions of environment steps for SAC and ARSAC-5 on: (a) hopper stand, (b) walker walk, (c) walker run, (d) cheetah run, (e) quadruped walk, (f) quadruped run.]

Figure 7.4: Performance Comparison, 1×32 Policy Network. Comparison between SAC and ARSAC with an action window of 5 time steps across dm_control suite environments. Each curve shows the mean and standard deviation across 5 random seeds.

Policy Visualization & Feedforward Distillation

We now analyze the policy distributions learned by ARSAC. On the left side of Figure 7.5, we plot the base distribution, 𝜋(u|s), the autoregressive affine transform, δ𝜃 ± β𝜃, and the actions (before applying the tanh transform). For visualization purposes, we select only the first action dimension. We see that the autoregressive flow contains relatively little temporal structure at the end of training.

This is not entirely surprising, as nothing in the objective explicitly requires the policy to be distilled into the autoregressive flow. Thus, while the flow is beneficial for training, it may not automatically provide the dynamical “motor primitives” that partly motivate this approach.

To explore this capability, we attempt to distill the policy entirely into the affine transform, arriving at a purely feedforward policy.

[Figure 7.5 plots, for the first action dimension, the base distribution 𝜋(u|s), the autoregressive affine transform δ𝜃 ± β𝜃, and the pre-tanh actions tanh⁻¹(a) over an episode segment, before (left) and after (right) distillation, on: (a, b) walker run, (c, d) cheetah run, (e, f) quadruped run, (g, h) humanoid run.]

Figure 7.5: Policy Visualization & Distillation. Left: Visualization of the base distribution and autoregressive flow for various action dimensions across various environments. Right: Visualization after policy distillation.

We do so by restricting the base distribution, progressively penalizing the L2 norm of the mean (µ𝜙) and log-standard deviation (log σ𝜙), thereby pushing the base distribution toward a standard Gaussian, N(u; 0, I). The results are shown on the right side of Figure 7.5, where we see that the flow now contains nearly all of the temporal structure. In other words, these policies are effectively autoregressive Gaussian densities:

𝜋𝜃(a𝑡 | a<𝑡) = N(a𝑡; δ𝜃(a<𝑡), diag(β²𝜃(a<𝑡))),     (7.4)

almost entirely independent of the state, s. Comparing the left and right columns, we see that the distilled feedforward policies are more temporally consistent overall, a consequence of the limited temporal window of the affine flow. Thus, the autoregressive nature of the policy provides an inductive bias toward policies with more regular, rhythmic structure. As will be described in Chapter 8, this may hold some connection to central pattern generator circuits in biological neural systems, which produce rhythmic patterns in the absence of any state input (Marder and Bucher, 2001).
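As a rough illustration of the distillation procedure described above, the sketch below shows one way a progressively weighted L2 penalty on µ𝜙 and log σ𝜙 could be added to the policy loss. The function name, the mean reduction, and the weighting schedule are assumptions for illustration, not the actual implementation.

```python
import torch

def distillation_penalty(mu, log_sigma, weight):
    """L2 penalty on the base distribution's mean and log-std, pushing
    pi(u|s) toward a standard Gaussian N(u; 0, I)."""
    return weight * (mu.pow(2).mean() + log_sigma.pow(2).mean())

# Hypothetical usage inside the policy update, with `weight` increased
# progressively over training (the schedule is not specified here):
#   policy_loss = sac_policy_loss + distillation_penalty(mu, log_sigma, weight)
```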