
6.3 Choosing Transformations

Given an SMDP, defining a relativized option requires extensive prior knowledge: the transition and reward structure of the entire SMDP, or at the very least of the parts of the SMDP spanned by the desired domain of the option; the parameters of the option SMDP; and the homomorphism that maps the original states and actions onto the option SMDP. All of this knowledge is seldom available a priori. Even when it is available, finding the correct option homomorphism is in general NP-hard.

In the absence of complete knowledge about the system, we need methods for learning some of the required components of a relativized option given the others.

One useful case is when the agent is given the parameters of the option SMDP and is required to choose the right transformations that constitute the option homomorphism, without complete knowledge of the parameters of the original SMDP. We assume that the agent has access to a set of candidate transformations. Using online experience, the agent learns the right transformation to apply depending on the circumstances under which the option is invoked. When combined, the chosen transformations define the option homomorphism.

Such a scenario often arises when an agent is trained in a small prototypical environment and is later required to act in a complex environment where the skills learned earlier are useful. In the example from Section 5.6, the agent may be trained to gather the object in one particular room and then be asked to gather the objects in the various rooms of the environment. The problem of navigating in each of these rooms can then be viewed simply as that of learning the suitable transformation to apply to the policy learned in the first room. An appropriate set of candidate transformations in this case is the set of reflections and rotations of the room.

Learning the right homomorphism through experience can also be viewed as online minimization without a completely specified model. Recall from the previous chapter that an option SMDP, $M_O = \langle S_O, A_O, \Psi_O, P_O, R_O \rangle$, is a homomorphic image of the original MDP and hence is constructed from the reward respecting SSP partition corresponding to the option homomorphism. Each element of $\Psi_O$ is therefore a unique representative of some block of the partition induced by the option homomorphism. Finding the right transformations is then equivalent to identifying the other members of each of these blocks. Since the agent is restricted to a limited set of transformations, the search for a reward respecting SSP partition of the original SMDP is limited to some fixed family of partitions of $\Psi_O$.

An intuitively more appealing interpretation is afforded by viewing relativized options as option schemas. Under this interpretation the schema, i.e., the option SMDP and its policy, is assumed to be given, and the problem is one of choosing the right bindings for the abstract states and actions from a set of possible bindings. This interpretation provides a better motivation for studying this particular formulation with insufficient prior knowledge. It is a natural interpretation of our approach to choosing homomorphisms and suggests many problem domains in which the approach may be used.

6.3.1 A Bayesian Algorithm

Formally, the problem may be stated as follows: given a set of candidate transformations $H$ and the option MDP $M_O = \langle S_O, A_O, \Psi_O, P_O, R_O \rangle$, how do we choose the right transformation to employ at each invocation of the option? We assume that $M = \langle S, A, \Psi, P, R \rangle$ is the MDP describing the actual task, and that $P$ and $R$ are not known.¹ Let $\psi(s)$ be a function of the current state $s$ that captures the features necessary to distinguish the particular context in which the option is invoked. In the example in Figure 5.2, the room number is sufficient, while in an object-based environment some property of the target object, say its color, might suffice. In practice, $\psi(s)$ is often a simple function of $s$, such as a projection onto a subset of features, as in the rooms example. The features $\psi(s)$ can be thought of as distinguishing the various instances of the family of sub-problems represented by the relativized option. The problem of choosing the right transformation is formulated as a family of Bayesian parameter estimation problems, one for each possible value of $\psi(s)$.
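For concreteness, the sketch below shows one way such a feature function might look, assuming a gridworld state that carries a room identifier among its features; the State fields and the name psi are illustrative only, not part of the thesis implementation.

```python
from typing import NamedTuple


class State(NamedTuple):
    """Hypothetical gridworld state: the room the agent occupies, its grid
    coordinates within that room, and whether it has gathered the object."""
    room: int
    x: int
    y: int
    have_object: bool


def psi(s: State) -> int:
    """psi(s): project the full state onto the feature that identifies the
    sub-problem; in the rooms example the room number suffices."""
    return s.room
```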

There is one parameter, $\theta$, which takes a finite number of values from $H$. Let $p(h, \psi(s))$ denote the prior probability that $\theta = h$, i.e., the prior probability that $h$ is the correct transformation to apply in the sub-problem represented by $\psi(s)$. The set of samples used for computing the posterior distribution is the sequence of transitions, $\langle s_1, a_1, s_2, a_2, \cdots \rangle$, observed while the option is executing. Note that the probability of observing a transition from $s_i$ to $s_{i+1}$ under $a_i$ is, for any $i$, independent of the other transitions in the sequence. Recursive Bayes learning is therefore used to update the posterior probabilities incrementally.

Let $p_n(h, \psi(s))$ be the posterior probability that $h$ is the correct transformation to apply in the sub-problem represented by $\psi(s)$ after $n$ time steps from when the option was invoked. Initialize $p_0(h, \psi(s)) = p(h, \psi(s))$ for all $h$ and $\psi(s)$. Let $E_n = \langle s_n, a_n, s_{n+1} \rangle$ be the transition observed after $n$ time steps of option execution. The posteriors for all $h = \langle f, \{g_s \mid s \in S\} \rangle$ are updated as follows:

$$
p_n(h, \psi(s)) = \frac{\Pr(E_n \mid h, \psi(s)) \, p_{n-1}(h, \psi(s))}{N}, \qquad (6.1)
$$

where $\Pr(E_n \mid h, \psi(s)) = P_O\bigl(f(s_n), g_{s_n}(a_n), f(s_{n+1})\bigr)$ is the probability of observing the $h$-projection of the transition $E_n$ in the option MDP, and $N = \sum_{h' = \langle f', \{g'_s\} \rangle \in H} P_O\bigl(f'(s_n), g'_{s_n}(a_n), f'(s_{n+1})\bigr) \, p_{n-1}(h', \psi(s))$ is a normalizing factor. When an option is executing, at time step $n$, $\hat{h} = \arg\max_h p_n(h, \psi(s))$ is used to project the state onto the option MDP and to lift the action back to the original MDP. After experiencing a transition, the posteriors of all the transformations in $H$ are updated using Equation 6.1.

¹We describe the algorithm in terms of MDPs for simplicity of exposition. The ideas extend naturally to SMDPs.

The term $\Pr(E_n \mid h, \psi(s))$ is a measure of how likely the transition $E_n$ is in the option MDP, assuming that the agent is in the sub-problem represented by $\psi(s)$ and that $h$ is used to project the transition onto the option MDP. Accumulating this “likelihood” over a sufficiently long sequence of transitions gives the best possible measure of how correct a transformation is for a given sub-problem. $N$ is a normalizing factor that prevents the values of $p_n$ from decaying to very small numbers as $n$ grows large. Without normalization, the $p_n$'s are likely to decay to zero, since the probability of observing any long sequence of transitions is very low even under the correct transformation.
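A minimal sketch of this recursive update is given below, assuming the option MDP's transition probabilities are available as a callable P_O and that each candidate transformation supplies a state projection f and a per-state action mapping g; the function and argument names are ours, not the thesis's.

```python
def update_posteriors(posteriors, transforms, P_O, s, a, s_next, eps=1e-12):
    """One application of Equation 6.1 for the sub-problem psi(s).

    posteriors: dict h -> p_{n-1}(h, psi(s)), the current posterior for each
                candidate transformation h in H.
    transforms: dict h -> (f, g), where f(s) projects a state onto the option
                MDP and g(s)(a) maps action a, taken in state s, to an option action.
    P_O:        callable giving the option-MDP transition probability
                P_O(s_O, a_O, s_O_next).
    """
    # Numerator of Equation 6.1: likelihood of the h-projection of the
    # observed transition <s, a, s'> times the previous posterior.
    numerators = {
        h: P_O(f(s), g(s)(a), f(s_next)) * posteriors[h]
        for h, (f, g) in transforms.items()
    }
    # Normalizing factor N; eps guards against division by zero if every
    # candidate assigns the observed transition probability zero.
    N = sum(numerators.values()) + eps
    return {h: num / N for h, num in numerators.items()}


def best_transform(posteriors):
    """hat-h = argmax_h p_n(h, psi(s)); used to project states and lift actions."""
    return max(posteriors, key=posteriors.get)
```

In use, the agent would project the current state with best_transform(posteriors) at each step, act according to the option policy in the option MDP, and then call update_posteriors with the observed transition before the next step.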

6.3.2 Experimental Illustration

We tested the algorithm on the gridworld in Figure 5.2. The agent, referred to as the Bayesian agent, has one get-object-and-exit-room relativized option defined in the option MDP of Figure 5.2(b). $H$ consists of all possible combinations of reflections about the $x$ and $y$ axes and rotations through integral multiples of 90 degrees; there are only 8 unique transformations in $H$. Since the rooms delineate the sub-tasks from one another, $\psi(s)$ is set to the room number. For each of the rooms in the world, there is one transformation in $H$ that is the desired one. This is a contrived example chosen to illustrate the algorithm, and a reduction in problem size is possible in this domain through more informed representation schemes. We briefly discuss the relation of such schemes to relativized options later in this section. As with the earlier experiments reported in Chapter 5, the agent employs hierarchical SMDP Q-learning with $\epsilon$-greedy exploration, with $\epsilon = 0.1$. The learning rate is set to 0.01 and the discount factor, $\gamma$, to 0.9. The root task description is the same as in Section 5.6. The priors in each of the rooms were initialized to a uniform distribution with $p_0(h, \psi(s)) = 0.125$ for all $h \in H$ and $\psi(s)$. Trials were terminated either on completion of the task or after 3000 steps. The results shown in Figure 6.1 are averaged over 100 independent runs.
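The candidate set H and the uniform priors used here can be enumerated directly. Below is a sketch under the assumption that each room is a square grid of cells indexed by (x, y); the grid size, room identifiers, and helper names are illustrative, and the per-state action mappings (which would permute the four compass actions consistently with each symmetry) are omitted for brevity.

```python
SIZE = 10  # assumed room width/height (illustrative)


def identity(xy):
    return xy


def rot90(xy):
    """Rotate a cell 90 degrees within a SIZE-by-SIZE room."""
    x, y = xy
    return (y, SIZE - 1 - x)


def reflect(xy):
    """Reflect a cell about the vertical axis of the room."""
    x, y = xy
    return (SIZE - 1 - x, y)


def compose(f, g):
    return lambda xy: f(g(xy))


# The 8 unique transformations in H: four rotations and their reflections
# (the symmetries of the square).
rotations = [identity,
             rot90,
             compose(rot90, rot90),
             compose(rot90, compose(rot90, rot90))]
H = rotations + [compose(reflect, r) for r in rotations]

# Uniform prior p_0(h, psi(s)) = 1/8 = 0.125 for every candidate and every room.
ROOMS = range(1, 6)  # illustrative room identifiers
priors = {room: {h_idx: 1.0 / len(H) for h_idx in range(len(H))} for room in ROOMS}
```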

Recall from Section 5.6 that the probability that an action produces a movement in a direction other than the desired one is called the slip of the environment. The greater the slip, the harder the problem, since the effects of the actions are more stochastic. As shown in Figure 6.1, the agent rapidly learned to apply the right transformation in each room under different levels of stochasticity. Compare this performance to that of the agent learning with primitive actions alone (Figure 5.4(a)): the primitive-action agent did not start improving its performance until after 30,000 iterations, and hence it is not employed in further experiments.

Figure 6.1 also compares the performance of the Bayesian agent with an agent that knew the right transformations a priori. As the figure illustrates, the difference in performance is not significant.² In this particular task our transformation-choosing algorithm manages to identify the correct transformations without much loss in performance, since there is nothing catastrophic in the environment and the agent is able to recover quickly from wrong initial choices.

²All the significance tests were two-sample t-tests, with a p-value of 0.01, on the distributions of the learning times under the two algorithms.

[Figure 6.1 plot: Average Steps per Trial versus Trial Number for slip = 0.1, 0.5, and 0.7, comparing the "know transforms" and "choose transforms" agents.]

Figure 6.1. Comparison of initial performance of agents with and without knowledge of the appropriate partial homomorphisms on the task shown in Figure 5.2 with various levels of stochasticity.

Figure 6.2 shows how the various posteriors evolve during a typical run. The graph plots the posteriors in room 5 for the transformations composed of pure rotations or reflections followed by rotations. The pure reflections were quickly discarded in this case. As is evident, by iteration 10 the posteriors of the incorrect transformations have decayed to 0. The posterior for transform 5, a rotation through 90 degrees, converges to 1.0.