Next, the environment and the task will be defined before the basic algorithm is discussed, followed by an explanation of the cluster propagation method. The remainder of this chapter will discuss various model-free Deep RL algorithms and the use of the Proximal Policy Optimization algorithm as a starting point.
Model-Free Algorithms
Advantage Actor Critic Model
The actor part of actor-critic models corresponds to the part of the algorithm that aims to maximize this modified policy gradient objective. In effect, this replaces the log of the policy in the modified policy gradient objective with the ratio of the probability of choosing the action in the trajectory under the current policy to the probability of choosing it under the policy that generated the trajectory (referred to here simply as the ratio, or the ratio of probabilities).
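In the standard PPO formulation, this ratio and the resulting clipped surrogate objective can be written as follows (notation assumed here; the thesis may use slightly different symbols):
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\]
where $\hat{A}_t$ is the estimated advantage at time step $t$ and $\epsilon$ is the clipping parameter.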
Clustering in Deep RL
In the context of the task described above, this means that once the agent has learned to solve one maze, it can solve another much more easily, since it already knows how to move; it only needs to learn the higher-level policy that tells it in which direction to move. An initial part of the network (with its recurrent structure) will be used to classify each state into a group. This clustering will then be combined with the input observation and passed through the rest of the network (a common A2C architecture) to make the policy and value predictions.
In all these environments, each episode consists of 1000 transitions, where each transition refers to the process by which the agent delivers an action to the environment, the environment updates its state according to the laws of physics, and returns the observations and rewards to the agent. The observations for this task are the angular velocity of the pole and the velocity of the cart. However, in this thesis, instead of using these observations, the image of the environment is passed directly to the agent.
This compounds the difficulty of the problem while bringing it closer to a real-world task.
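A minimal sketch of this interaction loop, assuming a Gym-style swing-up cart-pole environment with an image render mode; the interface names and the agent.act method are illustrative assumptions, not the exact setup used in this thesis:

def run_episode(env, agent, episode_length=1000):
    # Sketch only: assumes a Gym-style API and an agent that maps images to actions.
    env.reset()
    frame = env.render()                      # image observation instead of the state vector
    episode_return = 0.0
    for t in range(episode_length):
        action = agent.act(frame)             # choose an action from the image
        _, reward, terminated, truncated, _ = env.step(action)
        frame = env.render()                  # next image observation
        episode_return += reward              # sum of rewards over the episode is its return
        if terminated or truncated:
            break
    return episode_return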
Task
Intuition for the Optimal Policy
The goal of the cart-pole task is to swing the pole up above the cart and then balance it there for as long as possible. To swing the pole up, the cart must be given a strong impulse in one direction, say to the right. Once the pole has swung up over the cart, the agent must give an impulse in the opposite direction to swing the pole further toward the vertical and bring the cart to rest.
Once the pole is almost vertical, the agent must apply small impulses to keep the pole in balance. For example, if the pole is vertical but appears to be tilting to the left, the cart will need to be moved slightly to the left to swing the pole back to vertical. The task finally ends after 1000 transitions have been made, and the sum of the rewards for all transitions is considered the return of the episode.
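Purely as an illustration of this intuition (and not the learned policy), the behaviour could be caricatured as a simple heuristic controller; all thresholds and sign conventions below are made up:

def heuristic_action(pole_angle, pole_angular_velocity):
    # Illustrative only: angle is measured from vertical, negative = leaning left,
    # negative action = push cart left. Thresholds are arbitrary made-up values.
    UPRIGHT_THRESHOLD = 0.2  # radians
    if abs(pole_angle) > UPRIGHT_THRESHOLD:
        # Swing-up phase: push hard in the direction of the pole's rotation
        # (a crude stand-in for a proper energy-pumping rule).
        return 1.0 if pole_angular_velocity >= 0 else -1.0
    # Balancing phase: nudge the cart toward the side the pole is falling to,
    # so the cart moves back under the pole.
    return -0.1 if pole_angle < 0 else 0.1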
Computing Environment
Hardware Specification
Software Specification
The observation available to the RL agent at each time step is the image of the environment at that time step. Due to the high dimensionality of images, a neural network trying to learn a policy with an image as input requires a very large amount of training data. This is because the effective feedback given to the neural network for each input image is the value of the state represented by the image and the goodness of the chosen action, both of which have high variance since they are estimated from a single trajectory.
However, if a method were available to quickly learn feature extractors that preserve all the semantic information in the image, and to pass only this low-dimensional semantic vector to the RL agent, then we could rest assured that all relevant information reaches the agent, and the agent would not have to spend any of its capacity learning image feature extractors. Now, if an autoencoder is trained to encode the image observations, and the encoded semantic vector is used as input to the RL agent's network, then the RL agent does not need to learn any image feature extractors at all. The network architecture is the same as in [2] and the hyperparameters are available in the appendix.
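A minimal sketch of this idea, where beta_vae.encode and agent_network are hypothetical names (the actual encoder architecture follows [2]):

import torch

def act_on_image(frame, beta_vae, agent_network):
    # Sketch only: `frame` is assumed to be an image tensor of shape (C, H, W).
    with torch.no_grad():
        # The encoder outputs the parameters of the latent Gaussian; the mean is
        # used as the low-dimensional semantic vector.
        latent_mean, latent_logvar = beta_vae.encode(frame.unsqueeze(0))
    # Only the latent vector reaches the RL agent's network.
    return agent_network(latent_mean)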
Furthermore, since it is entirely possible that, as the RL agent learns, it sees new parts of the observation space that the β-VAE was not initially trained on, the β-VAE is first trained for one epoch on the sampled trajectories during each iteration of the PPO algorithm.
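Schematically, one training iteration then looks as follows; every helper below is a hypothetical placeholder rather than the thesis implementation:

def training_iteration(env, agent, beta_vae,
                       collect_trajectories, train_vae_one_epoch, ppo_update):
    # All callables are hypothetical placeholders illustrating the order of steps.
    trajectories = collect_trajectories(env, agent)      # sample with the current policy
    train_vae_one_epoch(beta_vae, trajectories.frames)   # 1 epoch on the newly seen frames
    latents = beta_vae.encode(trajectories.frames)       # re-encode the image observations
    ppo_update(agent, latents, trajectories)             # usual PPO update on the latents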
PPO-GRU
Network Architecture and Loss Function
The first two output tensors correspond to the actor part of the A2C, which is supposed to predict a probability distribution. The last output tensor corresponds to the critic part of the A2C and predicts the value of the current state. As briefly discussed in Section 2.2, the main components of the algorithm are an initial clustering, computed by a recurrent neural network over the observations at each time step, followed by merging this clustering with the observations before feeding the result to the rest of the network (the usual A2C architecture).
In the A2C model, we would like the clusters to inform the rest of the agent which regions of the state space require similar actions and should have similar values. Since temporal information needs to be included, the natural choice is to use the output of the underlying GRU as input to the cluster generation method. The output of this GRU clustering is a vector of probabilities, where each index of the vector corresponds to a cluster and the value at that index is the probability that the current state belongs to that cluster.
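A minimal PyTorch-style sketch of this clustering head and how it feeds the rest of the network is given below; the layer sizes, module names, and the exact way the cluster probabilities are merged with the observations are assumptions for illustration, not the architecture specified in the appendix.

import torch
import torch.nn as nn

class ClusterAugmentedA2C(nn.Module):
    # Sketch of the cluster-augmented A2C network; dimensions are illustrative.
    def __init__(self, obs_dim=32, hidden_dim=64, num_clusters=4, action_dim=1):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.cluster_head = nn.Linear(hidden_dim, num_clusters)
        self.body = nn.Sequential(nn.Linear(obs_dim + num_clusters, hidden_dim), nn.Tanh())
        self.policy_mean = nn.Linear(hidden_dim, action_dim)    # actor output 1
        self.policy_logstd = nn.Linear(hidden_dim, action_dim)  # actor output 2
        self.value = nn.Linear(hidden_dim, 1)                   # critic output

    def forward(self, obs_seq, hidden=None):
        gru_out, hidden = self.gru(obs_seq, hidden)
        # Probability vector: entry i is the probability the state belongs to cluster i.
        cluster_probs = torch.softmax(self.cluster_head(gru_out), dim=-1)
        # Merge the clustering with the observations and feed the usual A2C heads.
        x = self.body(torch.cat([obs_seq, cluster_probs], dim=-1))
        return self.policy_mean(x), self.policy_logstd(x), self.value(x), cluster_probs, hidden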
Now that we have generated our clustering, the next step is to use this clustering to inform the rest of the network in its prediction of policy and value.
Informing Value Prediction
Informing Policy Prediction
Although this sum is known to be a Gaussian distribution, the variance of the summed distribution would be the sum of the variances of the individual distributions. In Deep RL, the variance of the predicted policy is an important scaling factor during learning and can be interpreted as the confidence the algorithm has in its policy prediction. Thus, the fact that the variance of the summed policy is greater than the variances of the individual distributions from which it is composed is equivalent to the model having had more confidence in predicting the cluster mean policy and the offset distribution than in the final predicted policy.
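Concretely, for two independent Gaussian random variables the variances add (a standard fact, with symbols chosen here for illustration):
\[
X \sim \mathcal{N}(\mu_1, \sigma_1^2),\quad Y \sim \mathcal{N}(\mu_2, \sigma_2^2),\quad X \perp Y
\;\;\Longrightarrow\;\;
X + Y \sim \mathcal{N}\!\left(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2\right),
\]
so the summed policy always has a variance at least as large as either of its components.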
This offset parameter can then be added to the predicted policy in the usual way and can be interpreted as a shift of the mean of the policy distribution. In this way, the policy and value predicted by the A2C network can be thought of as the policy and value conditioned on the cluster. This architecture also allows the gradients from the PPO loss $L^{CLIP}$ and from $L_{critic}$ to inform the clustering itself.
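One possible reading of this mean-shift scheme, with all symbols assumed here for illustration rather than taken from the thesis:
\[
\pi(\,\cdot \mid s) = \mathcal{N}\!\left(\mu_{\text{cluster}}(s) + \Delta\mu(s),\; \sigma^2(s)\right),
\]
where $\mu_{\text{cluster}}(s)$ is the cluster-informed mean, $\Delta\mu(s)$ is the predicted offset, and $\sigma^2(s)$ is the single predicted variance, so no variances are summed.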
However, this may not be the best way to inform cluster formation, and an explicit loss function for cluster formation could yield better results.
Augmented Loss Function
Both the baseline and augmented methods were implemented as described in Chapters 4 and 5, respectively. The computational environment is described in Section 3.3, and the other hyperparameters that were used are listed in the appendix. The goal of each RL agent is to maximize the total reward received during an episode.
In Figure 6.1 we show a graph of the baseline (in red) and the augmented network (in blue) during training. The best return for the baseline and the augmented method, together with the corresponding training episode, are listed in Table 6.1. From both the graph and the table, it can be seen that the clustering-augmented neural network achieved a higher maximum return at an earlier training episode.
A video of the policy learned by the augmented network was generated, and every 20th frame was extracted and placed in Figure 6.2.
Cluster Formation
Just by inspection, one can see that the first cluster (cluster 0 in the figure) seems to represent the case where the pole is above the cart and the agent tries to balance it there. This is because the velocity of the cart is almost always zero, and the angular velocity of the pole is mostly zero, except possibly when the pole becomes unbalanced. The argument can be made that this cluster could also represent the case where the pole is below the cart and the agent is trying to swing it up.
However, since we know that the policy cannot balance the pole above the cart for long, the total number of states assigned to this cluster seems to indicate that the cluster represents the balancing case rather than the swing-up case. The second cluster (cluster 1 in the figure) seems to correspond to the situation where the cart is moving and the pole is swinging upwards. The third cluster (cluster 2 in the figure) appears to represent a situation where the pole is rapidly rotating in circles around the cart: although the impulses applied by the agent to the cart eventually brought the cart close to rest, they were so strong that the pole simply kept spinning instead of slowing down and stopping near the top.
A major contributor to this sample inefficiency is that, when training a neural network with the backpropagation algorithm, there is no guarantee that the entire region of the input space that could benefit from a training example actually benefits from training on that example. In Chapter 6, the baseline and the augmented algorithms were compared, and it was shown that the augmented algorithm succeeded in producing an agent that achieved a higher episodic return at an earlier episode than the baseline agent. However, the clustering results also showed that the current cluster formation scheme is not sufficient to reliably form diverse clusters.
Hyperparameters for β-Variational Autoencoder
Hyperparameters for the Baseline
PPO specific hyperparameters
Neural Network hyperparameters
Hyperparameters for Augmented PPO
Augmented PPO specific hyperparameters
Neural Network hyperparameters
Supplemental Files
Sample Inefficiency caused by using Neural Networks
Cart-pole Environment
Neural Network for Baseline
Neural Network with Cluster Formation
Avg. Return Comparison
Frames from rollout of a good policy
Good Clustering
Somewhat Good Clustering
Bad Clustering