6. OPTION SCHEMAS AND DEIXIS
6.6 Deictic Representation
6.6.2 Experimental Illustration in a Complex Game
weight vector that corresponds to this cross-product projection. The update for this component will depend on the features indexed by some $J_i$ and $J_j$.
3. Dependent pointer: For each $J_i \in D_i$ and $J_j \in D_j$, $\rho_{J_i} \times \rho_{J_j}$ satisfies Equation 4.2, as does $\rho_{J_i}$. But $\rho_{J_j}$ does not satisfy Equation 4.2 for at least some value of $J_j$. This means pointer $i$ is an independent pointer, while $j$ is a dependent one.
There is a separate component of the weight vector that corresponds to pointer $j$, but whose update depends on the features indexed by both $J_i$ and $J_j$. The weight vector is chosen such that there is one component for each independent pointer, one for each dependent pointer, and one for each set of mutually dependent pointers. Let the resulting number of components be $L$. A modified version of the update rule in Equation 6.2 is used to update each component $l$ of the weight vector independently of the updates for the other components:
\[
w_n^l(h_i, \psi(s)) \;=\; \frac{\widetilde{P}_O^l\bigl(f_i(s),\, g_{s,i}(a),\, f_i(s')\bigr)\; w_{n-1}^l(h_i, \psi(s))}{K} \qquad (6.3)
\]
where $\widetilde{P}_O^l(s, a, s') = \max\bigl(\nu,\, P_O^l(s, a, s')\bigr)$ and
\[
K \;=\; \sum_{h_{i'} \in H} \widetilde{P}_O^l\bigl(f_{i'}(s),\, g_{s,i'}(a),\, f_{i'}(s')\bigr)\; w_{n-1}^l(h_{i'}, \psi(s))
\]
is the normalizing factor. $P_O^l(s, a, s')$ is a "projection" of $P_O(s, a, s')$ computed as follows. Let $J$ be the set of features that is required in the computation of $w_n^l(h_i, \psi(s))$; this is determined as described above for the various cases. Then
\[
P_O^l(s, a, s') \;=\; \prod_{j \in J} \mathrm{Prob}\bigl(s'_j \,\big|\, \mathrm{Parents}(s'_j, a)\bigr).
\]
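To make the factored update concrete, the following is a minimal Python sketch of one application of Equation 6.3 for a single component $l$. The helper names (`h.f`, `h.g`, `cond_prob`, `nu`) and the flat dictionary of weights are our own illustrative assumptions, and the dependence of the weights on the image state $\psi(s)$ is suppressed; this is a sketch of the idea, not the implementation used in the experiments.

```python
def update_component(weights_l, candidates, s, a, s_next, features_J,
                     cond_prob, nu=1e-4):
    """One application of the component-wise update (Equation 6.3) -- a sketch.

    weights_l  : dict mapping each candidate transformation h to its weight
    candidates : candidate transformations; h.f projects states and h.g
                 projects actions into the option MDP
    features_J : the image-MDP features J needed for this component
    cond_prob  : cond_prob(j, s_img, a_img, s_next_img), a stand-in for
                 Prob(s'_j | Parents(s'_j, a)) in the option MDP's factored model
    nu         : small floor that keeps weights from being driven to zero
    """
    unnormalized = {}
    for h in candidates:
        s_img, a_img, s_next_img = h.f(s), h.g(s, a), h.f(s_next)
        # Projected probability P_O^l: product over only the features in J.
        p = 1.0
        for j in features_J:
            p *= cond_prob(j, s_img, a_img, s_next_img)
        unnormalized[h] = max(nu, p) * weights_l[h]
    K = sum(unnormalized.values())  # normalizing factor
    return {h: w / K for h, w in unnormalized.items()}
```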
Figure 6.12. A modified game domain with interacting adversaries and stochastic actions. The task is to collect the black diamond. The adversaries are of three types: benign (shaded), retrievers (white) and delayers (black). See text for more explanation.
The task of the agent is to collect the diamond in the room and exit it. The agent collects a diamond by occupying the same square as the diamond.
The room also has 8 autonomous adversaries. The adversaries may be of three types: benign, delayer or retriever. The behaviors of the benign and delayer adversaries are as described earlier. The retriever behaves like the benign adversary till the agent picks up the diamond. Once the agent picks up the diamond, the retriever's behavior switches to that of the delayer. The main difference is that once the retriever occupies the same square as the agent, the diamond is returned to its original position and the retriever reverts to benign behavior. The retriever also returns to benign behavior if the agent is "captured" by the delayer.
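Read operationally, the retriever is a small mode-switching machine. The sketch below captures the switching rules just described; the class, method, and argument names are purely illustrative and do not come from the simulator used in the experiments.

```python
class Retriever:
    """Mode-switching rules for a retriever adversary, as described above.
    Illustrative only: names and the boolean interface are our own."""

    def __init__(self):
        self.mode = "benign"

    def update(self, agent_has_diamond, retriever_caught_agent, delayer_caught_agent):
        """Returns True if the diamond should be sent back to its original square."""
        if self.mode == "benign" and agent_has_diamond:
            # Switch to delayer-like chasing once the agent picks up the diamond.
            self.mode = "delayer"
        if self.mode == "delayer" and (retriever_caught_agent or delayer_caught_agent):
            # Revert to benign behavior; only a catch by the retriever itself
            # returns the diamond to its original position.
            self.mode = "benign"
            return retriever_caught_agent
        return False
```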
The complete state of the world is described by (1) the position of the agent—
the number of the room it is currently in (the corridor being 0), and the x and y coordinates in the room; (2) the position of each of the adversaries, given by the x and y coordinates in the room; and (3) a boolean variable have indicating possession of the diamond. The other parameters of the task are as defined earlier. The agent is not aware of the identity of the delayer or the retriever. The shaded squares in Figure 6.12 are obstacles. The delayers are shown in black, the retrievers in white and the benign ones shaded.

Figure 6.13. The option MDP corresponding to the sub-task get-object-and-leave-room for the domain in Figure 6.12. There is just one delayer and one retriever in this image MDP.
The option MDP (Figure 6.13) is a symmetrical room with just two adversaries: a delayer and a retriever with fixed chase and hold parameters. The features describing the state space of the option MDP consist of the x and y coordinates of the agent and of the adversaries, relative to the room, and a boolean variable indicating possession of the diamond. The rooms in the world do not match the option MDP exactly, and no adversary in the world has the same chase and hold parameters as the adversaries here. The root task is the same one described earlier in Section 6.6.
The deictic agent has access to 4 pointers: a self pointer that projects the agent's location onto the image MDP, a have pointer that projects the have feature onto the image MDP, a delayer pointer that projects one of the adversaries onto the delayer in the image MDP, and a retriever pointer that projects one of the adversaries onto the retriever in the image MDP. The self pointer is an independent pointer. The delayer pointer is dependent on the self pointer, and the retriever pointer is dependent on all the other pointers. Note that the self and have pointers are fixed projections and have very restricted domains. The sets $D_{delayer}$ and $D_{retriever}$ are given by the 8 pairs of features describing the adversary coordinates. $D_{self}$ is a singleton consisting of the agent's x and y coordinates with respect to the room, and $D_{have}$ is also a singleton consisting of the have feature.
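For concreteness, the pointer domains and the dependency structure just described can be tabulated as follows. The feature names and the dictionary layout are our own illustrative conventions, not the thesis's notation.

```python
# Candidate feature sets (domains) for each deictic pointer in the modified
# game domain.  Feature names are illustrative stand-ins for the state variables.
ADVERSARIES = range(8)

pointer_domains = {
    # Fixed projections with singleton domains:
    "self": [("agent_x", "agent_y")],
    "have": [("have",)],
    # Each of these pointers may bind to any one of the 8 adversaries:
    "delayer":   [(f"adv{k}_x", f"adv{k}_y") for k in ADVERSARIES],
    "retriever": [(f"adv{k}_x", f"adv{k}_y") for k in ADVERSARIES],
}

# Dependency structure used when factoring the weight vector:
pointer_depends_on = {
    "self": [],
    "have": [],
    "delayer": ["self"],
    "retriever": ["self", "have", "delayer"],
}
```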
Traditionally, such pointers are not viewed as deictic pointers, but would form part of the "background" information provided to the agent. But we treat them as special pointers in our formulation. The actions of the agent in the image can then be expressed with respect to the self pointer. Note that since the option MDP is an approximate homomorphic image, the homomorphism condition 4.2 is not strictly met by any of the projections. Therefore, in computing the weight updates, the influence of the features not used in the construction of the image MDP is ignored by marginalizing over them.
Experimental Results
The performance of the deictic agent is compared with that of a relativized agent that employs the same option MDP but chooses from a set H of 64 monolithic transformations, formed by the cross product of the 8 possible configurations of each of the deictic pointers.
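A minimal sketch of how such a monolithic candidate set could be enumerated, assuming each of the two adversary pointers can bind to any of the 8 adversaries (the dictionary representation of a transformation is ours):

```python
from itertools import product

# One monolithic transformation per joint assignment of the delayer and
# retriever pointers to adversaries: 8 x 8 = 64 candidates in H.
H = [{"delayer": d, "retriever": r} for d, r in product(range(8), repeat=2)]
assert len(H) == 64
```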
Both agents employ hierarchical SMDP Q-learning, with the learning rates for the option and the root task set to 0.1. The agents are both trained initially in the option MDP to acquire an approximate initial option policy that achieves the goal on some percentage of the trials but is not optimal. Both agents use greedy exploration.
On the learning trials both agents perform similarly, with the monolithic agent having better initial performance. This is not surprising if we look at the rate at which the transformation weights converge. Figure 6.14 shows that the monolithic agent identifies the right transformation rapidly, just as the deictic agent identifies the delayer rapidly (Figure 6.15(a)). But as Figure 6.15(b) shows, identifying the retriever takes much longer, and the deictic agent performs poorly till the retriever is correctly identified consistently.
[Plot: normalized likelihood measure vs. number of updates for Transform 18, Transform 19 and Transform 27.]
Figure 6.14. Typical evolution of a subset of weights of the monolithic agent on the task shown in Figure 6.12.
This result is not surprising, since the correct position for the retriever pointer depends on the position of the delayer pointer. Therefore, while the delayer is being learned, the weights for the retriever receive inconsistent updates, and it takes a while for the weights to get back on track. Further, the monolithic agent considers all possible combinations of pointer configurations simultaneously. Therefore, while it takes fewer update steps to converge to the right weights, both agents make a comparable number of weight updates, 1300 vs. 1550.
Discussion
The algorithm used above updates the weights for all the transformations after each transition in the world. This is possible since the transformations are assumed to be mathematical operations, and the agent could use different transformations to project the same transition onto the option SMDP. But deictic pointers are often implemented as physical sensors. In such cases, this is equivalent to sensing every adversary in the world before making a move, and then sensing them again after making the move, to gather the data required for the updates. Since the weights converge fairly rapidly compared to the convergence of the policy, the time the agent spends "looking around" would be a fraction of the total learning time.

[Plots: normalized likelihood measure vs. number of updates; legends: (a) Robot 3, Robot 6, Robot 4; (b) Robot 6, Robot 1, Robot 8.]

Figure 6.15. (a) Typical evolution of a subset of the delayer weights of the deictic agent on the task shown in Figure 6.12. (b) Typical evolution of a subset of the retriever weights on the same task.
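As a rough illustration of the sensing and update procedure discussed above, one world transition with full-width weight updates might look like the loop below. The `env`, `policy`, and `weight_components` interfaces are hypothetical stand-ins, not the implementation used in the experiments.

```python
def world_step(env, policy, image_state, weight_components):
    """One world transition with deictic weight updates (illustrative only)."""
    s = env.sense_all_adversaries()        # "look around" before the move
    a = policy.action(image_state)         # act according to the option policy
    reward, image_state = env.execute(a)
    s_next = env.sense_all_adversaries()   # sense everything again after the move
    for component in weight_components:
        component.update(s, a, s_next)     # apply Equation 6.3 per component
    return image_state, reward
```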