Chapter VII: Sequential Autoregressive Flow-Based Policies
7.3 Experiments
We incorporate the proposed autoregressive feedforward component within soft actor-critic (SAC), which we refer to as autoregressive SAC (ARSAC). The affine transform parameters in ARSAC are conditioned on a varying number of previous actions, which we denote as, e.g., ARSAC-5 for 5 previous actions. Unless otherwise stated, we adopt the default hyperparameters from Haarnoja, Zhou, Abbeel, et al., 2018, as well as the automatic entropy-tuning scheme proposed in Haarnoja, Zhou, Hartikainen, et al., 2018. We evaluate ARSAC across a range of locomotion tasks in environments from the dm_control suite (Tassa et al., 2020).
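To make the setup concrete, the sketch below shows one possible form of the autoregressive affine component on top of the SAC base policy, with the shift and scale produced by a small network conditioned on a flattened window of previous actions. The class, argument names, and sizes (AutoregressiveAffine, window, hidden) are illustrative assumptions rather than the exact implementation.

```python
# Hypothetical sketch of the autoregressive affine transform used by ARSAC.
import torch
import torch.nn as nn

class AutoregressiveAffine(nn.Module):
    """Maps a base sample u_t to an action, conditioned on the last `window` actions."""
    def __init__(self, action_dim, window=5, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),  # outputs shift (delta) and log-scale (log beta)
        )

    def forward(self, u, prev_actions):
        # prev_actions: (batch, window * action_dim), the flattened action history a_{<t}
        delta, log_beta = self.net(prev_actions).chunk(2, dim=-1)
        pre_tanh = delta + log_beta.exp() * u   # affine transform of the base sample
        log_det = log_beta.sum(dim=-1)          # log|det J| of the affine part (tanh correction as in SAC)
        return torch.tanh(pre_tanh), log_det
```

During training, u is drawn from the state-dependent base distribution π(u|s), so the transform modulates, rather than replaces, the usual SAC policy.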
[Figure 7.2: training curves (cumulative reward vs. million steps) comparing SAC and ARSAC-5 on (a) hopper hop, (b) hopper stand, (c) walker walk, (d) walker run, (e) swimmer swimmer6, (f) cheetah run, (g) quadruped walk, (h) quadruped run, (i) humanoid stand, (j) humanoid walk, and (k) humanoid run.]
Figure 7.2: Performance Comparison, 2×256 (Default) Policy Network. Comparison between SAC and ARSAC with an action window of 5 time steps across dm_control suite environments. Each curve shows the mean and standard deviation across 5 random seeds.
Performance Comparison
In Figure 7.2, we compare the cumulative reward of SAC and ARSAC-5 throughout training, with 5 random seeds for each setup on each environment. On most environments, ARSAC-5 performs roughly as well as or better than SAC, with improved sample efficiency and/or final performance. However, ARSAC-5 struggles on the more difficult tasks, humanoid walk and humanoid run. In Figure 7.3, we present performance results on these latter environments with varying autoregressive window sizes. We see that adjusting the window size improves performance; however, this is still not enough to bridge the performance gap on humanoid run.
We hypothesize that the autoregressive transform will have a larger impact when the (state-dependent) base distribution is constrained in some way, either computationally or through delays. To probe this aspect, we decrease the size of the base distribution network, restricting it to a single hidden layer with 32 units (the default is 2 hidden layers, each with 256 units). The results, on a subset of environments, are shown in Figure 7.4, where we see that, indeed, the performance gap in favor of ARSAC increases slightly.
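For reference, this change amounts to shrinking the state-dependent trunk of the base policy network; the hypothetical helper below is shown only to pin down the two configurations (the function name and defaults are ours).

```python
# Illustrative only: base-distribution trunk sizes used in Figures 7.2 and 7.4.
import torch.nn as nn

def base_policy_trunk(state_dim, hidden_sizes=(256, 256)):
    layers, in_dim = [], state_dim
    for h in hidden_sizes:
        layers += [nn.Linear(in_dim, h), nn.ReLU()]
        in_dim = h
    return nn.Sequential(*layers)

# default:  base_policy_trunk(state_dim, (256, 256))  # 2 x 256 (Figure 7.2)
# reduced:  base_policy_trunk(state_dim, (32,))       # 1 x 32  (Figure 7.4)
```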
[Figure 7.3: training curves (cumulative reward vs. million steps) for SAC, ARSAC-3, ARSAC-5, and ARSAC-10 on (a) humanoid walk and (b) humanoid run.]
Figure 7.3: Humanoid Performance Comparison, 2×256 (Default) Policy Network. Comparison between SAC and ARSAC with varying action window sizes on humanoid tasks from the dm_control suite. Each curve shows the mean and standard deviation across 5 random seeds.
[Figure 7.4: training curves (cumulative reward vs. million steps) comparing SAC and ARSAC-5 on (a) hopper stand, (b) walker walk, (c) walker run, (d) cheetah run, (e) quadruped walk, and (f) quadruped run.]
Figure 7.4: Performance Comparison, 1×32 Policy Network. Comparison between SAC and ARSAC with an action window of 5 time steps across dm_control suite environments. Each curve shows the mean and standard deviation across 5 random seeds.
Policy Visualization & Feedforward Distillation
We now analyze the policy distributions learned by ARSAC. On the left side of Figure 7.5, we plot the base distribution, π(u|s), the autoregressive affine transform, δ_θ ± β_θ, and the actions (before applying the tanh transform). For visualization purposes, we arbitrarily select only the first action dimension. We see that the autoregressive flow contains relatively little temporal structure at the end of training.
This is not entirely surprising, as nothing in the objective explicitly requires the policy to be distilled into the autoregressive flow. Thus, while the flow is beneficial for training, it does not automatically provide the dynamical "motor primitives" that were part of the motivation for this approach.
To explore this capability, we attempt to distill the policy entirely into the affine transform, arriving at a purely feedforward policy. We do so by restricting the base distribution, progressively penalizing the L2 norm of the mean (μ_φ) and log-standard deviation (log σ_φ), thereby bringing the base distribution toward a standard Gaussian, N(u; 0, I).
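A minimal sketch of this penalty, assuming a simple quadratic form with a weight that is annealed upward over training (the exact coefficient schedule is a design choice not specified here):

```python
# Assumed form of the distillation penalty on the base-distribution parameters.
def distillation_penalty(mu, log_sigma, weight):
    """mu, log_sigma: tensors output by the base policy; weight: annealed scalar."""
    return weight * (mu.pow(2).mean() + log_sigma.pow(2).mean())

# added to the usual SAC policy objective:
# policy_loss = sac_policy_loss + distillation_penalty(mu, log_sigma, weight)
```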
[Figure 7.5: each panel plots the base distribution π(u|s), the affine transform band δ_θ ± β_θ, and tanh⁻¹(a) over environment steps for action dimension 0 of (a) walker run, (c) cheetah run, (e) quadruped run, and (g) humanoid run, with the corresponding distilled policies in (b), (d), (f), and (h).]
Figure 7.5: Policy Visualization & Distillation. Left: Visualization of the base distribution and autoregressive flow for various action dimensions across various environments. Right: Visualization after policy distillation.
The results are shown on the right side of Figure 7.5, where we see that the flow now contains nearly all of the temporal structure. In other words, these policies are effectively autoregressive Gaussian densities:
\[
\pi_\theta(\mathbf{a}_t \mid \mathbf{a}_{<t}) = \mathcal{N}\!\big(\mathbf{a}_t;\, \boldsymbol{\delta}_\theta(\mathbf{a}_{<t}),\, \mathrm{diag}(\boldsymbol{\beta}_\theta^2(\mathbf{a}_{<t}))\big), \tag{7.4}
\]
almost entirely independent of the state, s. Comparing the left and right columns, we see that the feedforward policies are more consistent overall, a result of the limited temporal window of the affine flow. Thus, the autoregressive nature of the policy provides an inductive bias, resulting in policies with more regular rhythmic structure. As will be described in Chapter 8, this may hold some connections with central pattern generator circuits in biological neural systems, which produce rhythmic patterns in the absence of any state input (Marder and Bucher, 2001).
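As an illustration of Eq. 7.4, the sketch below rolls out such a distilled policy using the hypothetical AutoregressiveAffine module from the earlier sketch: the base sample is drawn from a standard Gaussian and each action depends only on the recent action history. This is an assumed sampling procedure, not the evaluation code used for the figures.

```python
# Illustrative rollout of a distilled, state-independent ARSAC policy (Eq. 7.4).
import torch

def rollout_distilled(affine, action_dim, window=5, steps=250):
    history = torch.zeros(1, window * action_dim)  # initial (zero) action history
    actions = []
    for _ in range(steps):
        u = torch.randn(1, action_dim)             # base sample u ~ N(0, I)
        a, _ = affine(u, history)                  # a_t = tanh(delta(a_<t) + beta(a_<t) * u)
        history = torch.cat([history[:, action_dim:], a], dim=-1)
        actions.append(a)
    return torch.cat(actions)                      # (steps, action_dim)
```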