
5. ABSTRACTION IN HIERARCHICAL SYSTEMS

5.6 Illustrative Example

We now provide a complete description of the simple gridworld task in Figure 5.2(a) and some experimental results to illustrate the utility of relativized options and our hierarchical decomposition. The agent’s goal is to collect all the objects in the various rooms by occupying the same square as the object. Each of the rooms is a 10 by 10 grid with certain obstacles in it. The actions available to the agent are {N, S, E, W}, with a 0.1 probability of failing, i.e., going randomly in a direction other than the intended one. This probability of failing is referred to as the slip.
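As a concrete reference, the slip dynamics can be sketched as follows. The helper names (slip_action, move, blocked) and the coordinate convention are our own illustrative choices, not taken from the original implementation.

```python
import random

ACTIONS = ["N", "S", "E", "W"]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIP = 0.1  # probability that the chosen action fails

def slip_action(intended):
    """With probability SLIP, replace the intended action by a random other one."""
    if random.random() < SLIP:
        return random.choice([a for a in ACTIONS if a != intended])
    return intended

def move(x, y, intended, blocked):
    """Apply one (possibly slipped) move; blocked(x, y) flags obstacles and walls."""
    dx, dy = MOVES[slip_action(intended)]
    nx, ny = x + dx, y + dy
    return (x, y) if blocked(nx, ny) else (nx, ny)
```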

The state is described by the following features: the room number the agent is in, with 0 denoting the corridor; the x and y co-ordinates within the room or corridor, with respect to the reference direction indicated in the figure; and boolean variables have_i, i = 1, ..., 5, indicating possession of the object in room i. Thus the state with the agent in the cell marked A in the figure and having already gathered the objects in rooms 2 and 4 is represented by ⟨3, 6, 8, 0, 1, 0, 1, 0⟩. The goal is any state of the form ⟨·, ·, ·, 1, 1, 1, 1, 1⟩ and the agent receives a reward of +1 on reaching a goal state.
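The state encoding and goal test can be written down directly from the feature list above; the tuple layout and names below are chosen here for illustration.

```python
from typing import NamedTuple, Tuple

class State(NamedTuple):
    room: int              # 0 denotes the corridor
    x: int
    y: int
    have: Tuple[int, ...]  # have_i, i = 1, ..., 5

def is_goal(s: State) -> bool:
    # goal: every object collected, regardless of room and position
    return all(s.have)

def reward(s_next: State) -> float:
    # +1 on reaching a goal state, 0 otherwise
    return 1.0 if is_goal(s_next) else 0.0

# the example from the text: agent at cell A in room 3, objects of rooms 2 and 4 gathered
example = State(room=3, x=6, y=8, have=(0, 1, 0, 1, 0))
```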

We compared the performance of an agent that employs relativized options with that of an agent that uses multiple regular options. The “relativized” agent employs a single relativized option, O_r, whose policy can be suitably lifted to apply in each of the 5 rooms. The relativized option MDP corresponds to a single room and is shown in Figure 5.2(b). The state space of the option MDP is defined by 3 features: the x and y co-ordinates and a binary feature have, which is true if the agent has gathered the object in the room. There is an additional absorbing state–action pair (τ, α); otherwise the action set remains the same. The stopping criterion β is 1 at τ and zero elsewhere. The initiation set consists of all states of the form ⟨i, ∗⟩, with i ≠ 0.
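A schematic rendering of these components is given below. The class layout and the projection helper are ours, not the thesis implementation; the per-room co-ordinate transforms that form part of the homomorphism are omitted for brevity.

```python
TAU = "tau"  # absorbing state of the option MDP

class RelativizedOption:
    """Schematic relativized option over option states (x, y, have), plus tau."""

    def initiation(self, s):
        # invokable from any state of the form <i, *> with i != 0, i.e. inside a room
        return s.room != 0

    def beta(self, option_state):
        # stopping criterion: 1 at the absorbing state tau, 0 elsewhere
        return 1.0 if option_state == TAU else 0.0

    def project(self, s):
        # option-MDP image of an environment state: in-room position and the have
        # bit for that room's object (room-specific co-ordinate transforms omitted)
        return (s.x, s.y, s.have[s.room - 1])
```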

There is a reward of +1 on transiting to τ from any state of the form ⟨∗, 1⟩, i.e., on exiting the room with the object. One can see that lifting a policy defined in the option MDP yields different policy fragments depending on the room in which the option is invoked. For example, a policy in the option MDP that picks E in all states would lift to yield a policy fragment that picks W in rooms 3 and 4, picks N in room 5, and picks E in rooms 1 and 2.
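One way to realise this lifting is through a per-room action correspondence. Only the images of E are given in the text; the remaining entries below are an illustrative, consistent completion (identity for rooms 1 and 2, a reflection for rooms 3 and 4, a rotation for room 5), not the exact maps from the original.

```python
ACTION_MAP = {
    1: {"N": "N", "S": "S", "E": "E", "W": "W"},  # identity
    2: {"N": "N", "S": "S", "E": "E", "W": "W"},  # identity
    3: {"N": "S", "S": "N", "E": "W", "W": "E"},  # reflection: E lifts to W
    4: {"N": "S", "S": "N", "E": "W", "W": "E"},  # reflection: E lifts to W
    5: {"N": "W", "S": "E", "E": "N", "W": "S"},  # rotation: E lifts to N
}

def lifted_action(option_policy, room, option_state):
    """Lift the option-MDP policy to the room in which the option was invoked."""
    return ACTION_MAP[room][option_policy(option_state)]
```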

The “regular” agent employs 5 regular options, O_1, ..., O_5, one for each room. Each of these options employs the same state space and stopping criterion as the relativized option. The initiation set for option O_i consists of states of the form ⟨i, ∗⟩. There is a reward of +1 on exiting the room with the object. Both agents employ SMDP Q-learning (Bradtke and Duff, 1995) at the higher level and Q-learning (Watkins, 1989) at the option level.
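At the higher level both agents back up option values with the standard SMDP Q-learning rule; a minimal sketch follows, with bookkeeping names of our own choosing.

```python
def smdp_q_update(Q, s, o, r_bar, k, s_next, admissible, alpha=0.05, gamma=0.9):
    """One SMDP Q-learning backup for a higher-level state-option pair.

    r_bar is the (within-option discounted) reward accumulated over the k steps
    the option ran; admissible(s_next) lists the options/actions available in
    s_next; Q is a dict keyed by (state, option).
    """
    best_next = max(Q.get((s_next, a), 0.0) for a in admissible(s_next))
    current = Q.get((s, o), 0.0)
    Q[(s, o)] = current + alpha * (r_bar + gamma ** k * best_next - current)
```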

In both cases the root task, O_0, is described as follows: the state set of M_0 is described by the room number the agent is in, the various have_i features, and, if the agent is in the central corridor, the x and y co-ordinates of the agent; the admissible actions are the primitive actions in the corridor and the corresponding options in the room doorways; the transition and reward functions are those induced by the original task and the option policies. The initiation set is the set of states in the corridor with all have_i features set to false. The termination condition is 1 for states in the corridor with all have_i features set to true and 0 elsewhere.
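The root task's initiation, termination and admissible-action sets translate directly into the following sketch, which assumes the State tuple and option objects introduced earlier.

```python
def root_initiation(s):
    # states in the corridor with all have_i features false
    return s.room == 0 and not any(s.have)

def root_termination(s):
    # 1 for corridor states with every have_i feature true, 0 elsewhere
    return 1.0 if s.room == 0 and all(s.have) else 0.0

def root_admissible(s, options):
    # primitive actions in the corridor; the corresponding option(s) where their
    # initiation sets hold, i.e. once the agent steps into a room doorway
    if s.room == 0:
        return ["N", "S", "E", "W"]
    return [o for o in options if o.initiation(s)]
```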

We also compared the performance of an agent that employs only the four primitive actions. All the agents used a discount rate of 0.9, a learning rate of 0.05 and ε-greedy exploration, with an ε of 0.1. The results shown are averaged over 100 independent runs. The trials were terminated either on completion of the task or after 3000 steps. Figure 5.4(a) shows the asymptotic performance of the agents. This is a hard problem for the primitive action agent: it takes around 30,000 iterations before it learns a reasonable policy and another 15,000 before it even approaches optimality. This is often the case when employing RL on even moderately large problems and is one of the chief reasons for choosing a hierarchical approach. Since we are more interested in comparing the performance of the option agents, we do not present further results for the primitive action agent. In fact, in some of the later tasks, the primitive action agent does not learn to solve the task in any reasonable amount of time.
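For reference, the exploration scheme and trial cap used in these experiments can be written as follows (Q is the tabular value function of the relevant level; the function names are ours).

```python
import random

EPSILON = 0.1     # exploration rate used by all the agents
MAX_STEPS = 3000  # a trial ends on task completion or after 3000 steps

def epsilon_greedy(Q, s, admissible):
    """Pick a random admissible action with probability EPSILON, else a greedy one."""
    acts = admissible(s)
    if random.random() < EPSILON:
        return random.choice(acts)
    return max(acts, key=lambda a: Q.get((s, a), 0.0))
```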

Figure 5.4(a) also demonstrates that the option agents perform similarly in the long run, with no significant difference in performance. This indicates that there is no loss in performance due to the abstractions we employ here, which is not surprising since the homomorphism conditions are met exactly in this domain.

Figure 5.4(b) shows the initial performance of the option agents. As expected, the relativized agent significantly outperforms the regular agent in the early trials¹. Figure 5.5 graphs the rate at which the agents improved over their initial performance. The relativized agent achieved similar levels of improvement in performance significantly earlier than the regular agent. For example, the relativized agent achieved a 60% improvement in initial performance in 40 trials, while the regular agent needed 110 trials. These results demonstrate that employing relativized options significantly speeds up initial learning performance, and, if the homomorphism conditions hold exactly, there is no loss in asymptotic performance.

Employing a hierarchical approach results in a huge improvement in performance over the primitive action agent. While there is a significant improvement in performance when employing relativized options, it is not comparable to the initial improvement over primitive actions. One might ask whether this improvement is worth the additional expense of relativizing the options. Our answer to this is twofold. First, the relative magnitudes of improvement are an artifact of this problem domain. In more complex domains with more redundancy, a greater improvement in performance is to be expected. In many cases employing some form of hierarchy is the only feasible approach, and in such cases we can obtain further improvement in performance for some additional cost by relativization.

¹ All the significance tests were two-sample t-tests with a p-value of 0.01.

[Figure 5.4, two panels plotting Average Steps per Trial against Number of Trials: panel (a) compares the Options and Primitive Actions agents; panel (b) compares the Relativized Option and Regular Options agents.]

Figure 5.4. (a) Comparison of asymptotic performance of various learning agents on the task shown in Figure 5.2. See text for description of the agents. (b) Comparison of initial performance of the regular and relativized agents on the same task.

Second, using relativized options opens up the possibility of training an agent to perform a sub-task in some prototypical environment. Once the agent acquires a reasonable policy in this training task, it is able to generalize to all instances of the task. This is particularly useful if training experience is expensive, for example in the case of real robotic systems.