
3. MDP HOMOMORPHISMS AND MINIMIZATION

3.4 Minimization Framework

Our approach to abstraction can be considered an instance of a general approach known as model minimization. The goal of MDP minimization is to form a reduced model of a system by ignoring irrelevant information. Solving this reduced model should then yield a solution to the original MDP. Frequently, minimization is accomplished by identifying states and actions that are equivalent in a well-defined sense and forming a "quotient" model by aggregating such states and actions. We build a minimization framework that is an extension of the framework of Dean and Givan (1997). It differs from their work in the notion of equivalence we employ, which is based on MDP homomorphisms.

In this section we show that homomorphic equivalence leads to preservation of optimal solutions. We start with the following theorem on optimal value equivalence, which extends the optimal value equivalence theorem developed in Givan et al. (2003) for stochastic bisimulations.

Theorem 1 (Optimal value equivalence): Let $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$ be the homomorphic image of the MDP $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ under the MDP homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$. For any $(s, a) \in \Psi$, $Q^*(s, a) = Q^*(f(s), g_s(a))$.

Proof (along the lines of Givan et al., 2003): Let us define the $m$-step optimal discounted action value function recursively, for all $(s, a) \in \Psi$ and for all non-negative integers $m$, as
$$
Q_m(s, a) = R(s, a) + \gamma \sum_{s_1 \in S} \left[ P(s, a, s_1) \max_{a_1 \in A_{s_1}} Q_{m-1}(s_1, a_1) \right],
$$
and set $Q_{-1}(s_1, a_1) = 0$. Letting $V_m(s_1) = \max_{a_1 \in A_{s_1}} Q_m(s_1, a_1)$, we can rewrite this as
$$
Q_m(s, a) = R(s, a) + \gamma \sum_{s_1 \in S} P(s, a, s_1)\, V_{m-1}(s_1).
$$

Now we prove the theorem by induction on $m$. For the base case $m = 0$, we have $Q_0(s, a) = R(s, a) = R'(f(s), g_s(a)) = Q_0(f(s), g_s(a))$. Assume now that $Q_j(s, a) = Q_j(f(s), g_s(a))$ for all values of $j$ less than $m$ and all state-action pairs in $\Psi$. Then we have

$$
\begin{aligned}
Q_m(s, a) &= R(s, a) + \gamma \sum_{s' \in S} P(s, a, s')\, V_{m-1}(s') \\
&= R(s, a) + \gamma \sum_{[s']_{B_h|_S} \in B_h|_S} T\bigl(s, a, [s']_{B_h|_S}\bigr)\, V_{m-1}(s') \\
&= R'(f(s), g_s(a)) + \gamma \sum_{s' \in S'} P'(f(s), g_s(a), s')\, V_{m-1}(s') \\
&= Q_m(f(s), g_s(a)).
\end{aligned}
$$

The second line regroups the sum over the blocks of $B_h|_S$; this is valid because the inductive hypothesis implies that $V_{m-1}$ is constant on each block (indeed $V_{m-1}(s') = V_{m-1}(f(s'))$), and the second and third lines then use the fact that $h$ is a homomorphism. Since $R$ is bounded, it follows by induction that $Q^*(s, a) = Q^*(f(s), g_s(a))$ for all $(s, a) \in \Psi$. □

Corollaries:

1. For any $h$-equivalent $(s_1, a_1), (s_2, a_2) \in \Psi$, $Q^*(s_1, a_1) = Q^*(s_2, a_2)$.

2. For all equivalent $s_1, s_2 \in S$, $V^*(s_1) = V^*(s_2)$.

3. For all $s \in S$, $V^*(s) = V^*(f(s))$.

Proof: Corollary 1 follows from Theorem 1. Corollaries 2 and 3 follow from Theorem 1 and the fact that $V^*(s) = \max_{a \in A_s} Q^*(s, a)$. □
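The $m$-step recursion used in the proof of Theorem 1 is ordinary value iteration written over admissible state-action pairs. The short Python sketch below illustrates it; the dictionary-based encoding (the names states, A, P, R and the discount gamma) is our own illustrative convention and not notation from the text.

```python
# A minimal sketch of the m-step optimal action-value recursion from the proof
# of Theorem 1, assuming a finite MDP encoded with plain Python dictionaries.

def q_m_values(states, A, P, R, gamma, m):
    """Return {(s, a): Q_m(s, a)} for all admissible pairs, starting from Q_{-1} = 0."""
    Q = {(s, a): 0.0 for s in states for a in A[s]}               # Q_{-1}
    for _ in range(m + 1):
        V = {s: max(Q[(s, a)] for a in A[s]) for s in states}    # V_{k-1} from Q_{k-1}
        Q = {(s, a): R[(s, a)] + gamma * sum(P.get((s, a, s1), 0.0) * V[s1]
                                             for s1 in states)
             for s in states for a in A[s]}
    return Q
```

As $m \to \infty$ the iterates converge to $Q^*$ for $\gamma < 1$ and bounded $R$, which is what the induction argument above exploits.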

As shown by Givan et al. (2003), optimal value equivalence is not a sufficient notion of equivalence for our stated minimization goal. In many cases, even when the optimal values are equal, the optimal policies might not be related, and hence we cannot easily transform solutions of $\mathcal{M}'$ into solutions of $\mathcal{M}$. But when $\mathcal{M}'$ is a homomorphic image, a policy in $\mathcal{M}'$ can induce a closely related policy in $\mathcal{M}$.

The following describes how to derive such an induced policy.

Definition: Let $\mathcal{M}'$ be an image of $\mathcal{M}$ under homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$. For any $s \in S$, $g_s^{-1}(a')$ denotes the set of actions that have the same image $a' \in A'_{f(s)}$ under $g_s$. Let $\pi'$ be a stochastic policy in $\mathcal{M}'$. Then $\pi'$ lifted to $\mathcal{M}$ is the policy $\pi'_{\mathcal{M}}$ such that for any $a \in g_s^{-1}(a')$, $\pi'_{\mathcal{M}}(s, a) = \pi'(f(s), a') \big/ \left|g_s^{-1}(a')\right|$.

Note: It is sufficient that $\sum_{a \in g_s^{-1}(a')} \pi'_{\mathcal{M}}(s, a) = \pi'(f(s), a')$, but we use the above definition to make the lifted policy unique.

Example 2

This example illustrates the process of lifting a policy from an image MDP to the original MDP. Consider the MDP $\mathcal{M}$ from Example 1 and $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$ with $S' = \{s_1', s_2'\}$, $A' = \{a_1', a_2'\}$ and $\Psi' = \{(s_1', a_1'), (s_1', a_2'), (s_2', a_1')\}$. Let $h = \langle f, \{g_s \mid s \in S\} \rangle$ be a homomorphism from $\mathcal{M}$ to $\mathcal{M}'$ defined by
$$
\begin{aligned}
&f(s_1) = s_1', \quad f(s_2) = s_2', \quad f(s_3) = s_2',\\
&g_{s_1}(a_1) = a_2', \quad g_{s_2}(a_1) = a_1', \quad g_{s_3}(a_1) = a_1',\\
&g_{s_1}(a_2) = a_1', \quad g_{s_2}(a_2) = a_1'.
\end{aligned}
$$
Let $\pi'$ be a policy in $\mathcal{M}'$ with
$$
\pi'(s_1', a_1') = 0.6, \quad \pi'(s_1', a_2') = 0.4, \quad \pi'(s_2', a_1') = 1.0.
$$
Now $\pi'$ lifted to $\mathcal{M}$, the policy $\pi'_{\mathcal{M}}$, is derived as follows:
$$
\begin{aligned}
\pi'_{\mathcal{M}}(s_1, a_1) &= \pi'(s_1', a_2') = 0.4, & \pi'_{\mathcal{M}}(s_1, a_2) &= \pi'(s_1', a_1') = 0.6,\\
\pi'_{\mathcal{M}}(s_2, a_1) &= \pi'(s_2', a_1')/2 = 0.5, & \pi'_{\mathcal{M}}(s_2, a_2) &= \pi'(s_2', a_1')/2 = 0.5,\\
\pi'_{\mathcal{M}}(s_3, a_1) &= \pi'(s_2', a_1') = 1.0. &&
\end{aligned}
$$
□
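The lifting operation is easy to mechanize. The following Python sketch encodes $f$, the $g_s$ maps, and $\pi'$ from Example 2 as dictionaries (an illustrative convention of ours) and reproduces the lifted values derived above.

```python
# Lifting a stochastic image policy: pi_M(s, a) = pi'(f(s), g_s(a)) / |g_s^{-1}(g_s(a))|.
# The maps below are those of Example 2, written as plain dictionaries.

f = {"s1": "s1'", "s2": "s2'", "s3": "s2'"}                       # state map f
g = {("s1", "a1"): "a2'", ("s1", "a2"): "a1'",                    # action maps g_s
     ("s2", "a1"): "a1'", ("s2", "a2"): "a1'",
     ("s3", "a1"): "a1'"}
A = {"s1": ["a1", "a2"], "s2": ["a1", "a2"], "s3": ["a1"]}        # admissible actions in M
pi_img = {("s1'", "a1'"): 0.6, ("s1'", "a2'"): 0.4, ("s2'", "a1'"): 1.0}

def lift(pi_img, f, g, A):
    pi_M = {}
    for s, actions in A.items():
        for a in actions:
            preimage = [b for b in actions if g[(s, b)] == g[(s, a)]]
            pi_M[(s, a)] = pi_img[(f[s], g[(s, a)])] / len(preimage)
    return pi_M

print(lift(pi_img, f, g, A))
# {('s1', 'a1'): 0.4, ('s1', 'a2'): 0.6, ('s2', 'a1'): 0.5, ('s2', 'a2'): 0.5, ('s3', 'a1'): 1.0}
```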

Theorem 2: Let $\mathcal{M}' = \langle S', A', \Psi', P', R' \rangle$ be the image of $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ under the homomorphism $h = \langle f, \{g_s \mid s \in S\} \rangle$. If $\pi'^*$ is an optimal policy for $\mathcal{M}'$, then $\pi'^*_{\mathcal{M}}$ is an optimal policy for $\mathcal{M}$.

Proof: Let $\pi'^*$ be an optimal policy in $\mathcal{M}'$. Consider any $(s, a) \in \Psi$ to which the lifted policy assigns positive probability, i.e., such that $\pi'^*(f(s), g_s(a)) > 0$. Since $\pi'^*$ is optimal, $Q^*(f(s), g_s(a))$ is the maximum value of the $Q^*$ function in state $f(s)$. From Theorem 1 we know that $Q^*(s, a) = Q^*(f(s), g_s(a))$, and by Corollary 3, $V^*(s) = V^*(f(s))$; hence $Q^*(s, a)$ is the maximum value of the $Q^*$ function in state $s$. Thus $a$ is an optimal action in state $s$, and hence $\pi'^*_{\mathcal{M}}$ is an optimal policy for $\mathcal{M}$. □

Theorem 2 establishes that an MDP can be solved by solving one of its homomorphic images. To achieve the most impact, we need to derive a smallest homomorphic image of the MDP, i.e., an image with the least number of admissible state-action pairs. The following definition formalizes this notion.


Figure 3.4. (a) Transition graph of the example MDP $\mathcal{M}$. This MDP is irreducible under a traditional minimization framework; our notion of homomorphic equivalence allows us to minimize it further. (b) Transition graph of the minimal image of the MDP $\mathcal{M}$ in (a).


Definition: An MDP $\mathcal{M}$ is a minimal MDP if for every homomorphic image $\mathcal{M}'$ of $\mathcal{M}$, there exists a homomorphism from $\mathcal{M}'$ to $\mathcal{M}$. A minimal image of an MDP $\mathcal{M}$ is a homomorphic image of $\mathcal{M}$ that is also a minimal MDP.

The model minimization problem can now be stated as: “find a minimal image of a given MDP”. Since this can be computationally prohibitive, we frequently settle for a reasonably reduced model, even if it is not a minimal MDP.
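As a rough, back-of-the-envelope illustration of why exhaustive search is prohibitive (this is not an algorithm from this chapter), consider only the state-aggregation part of the problem: the candidate partitions of an $n$-state set already number the Bell number $B_n$, computed in the sketch below via the Bell triangle, and the choice of action mappings only enlarges the search space further.

```python
# Bell numbers B_n count the ways to partition an n-element state set.
# Computed with the Bell triangle; the growth shows why naive enumeration
# of candidate state aggregations is infeasible.

def bell(n):
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

for n in (4, 10, 20):
    print(n, bell(n))   # 4 -> 15, 10 -> 115975, 20 -> 51724158235372
```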

Illustration of Minimization: An Abstract MDP Example

We illustrate our minimization framework on a very simple abstract MDP, shown in Figure 3.4(a), which we will use as a running example while we develop the framework further. We chose such a simple example to make it easier to present the computations involved in later stages. Note, though, that this MDP is irreducible under the state-equivalence-based MDP minimization framework of Dean and Givan (1997).

The parameters of $\mathcal{M} = \langle S, A, \Psi, P, R \rangle$ are $S = \{s_1, s_2, s_3, s_4\}$, $A = \{a_1, a_2\}$, $\Psi = S \times A$, $P$ defined as in Table 3.1, and $R$ given by $R(s_2, a_1) = R(s_3, a_2) = 0.8$ and $R(s_2, a_2) = R(s_3, a_1) = 0.2$; for all other values of $i$ and $j$, $R(s_i, a_j)$ equals zero.

(i) Under action a1:

from \ to    s1     s2     s3     s4
  s1         0      0.8    0.2    0
  s2         0.2    0      0      0.8
  s3         0.8    0      0      0.2
  s4         0      0      0      1.0

(ii) Under action a2:

from \ to    s1     s2     s3     s4
  s1         0      0.2    0.8    0
  s2         0.8    0      0      0.2
  s3         0.2    0      0      0.8
  s4         0      0      0      1.0

Table 3.1. Transition probabilities for the MDP $\mathcal{M}$ shown in Figure 3.4(a): (i) under action $a_1$; (ii) under action $a_2$.
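For later reference, the parameters of $\mathcal{M}$ can be written down directly in code. The numpy encoding below (states indexed 0 to 3 for $s_1$ through $s_4$, actions 0 and 1 for $a_1$ and $a_2$) is our own illustrative convention; the numbers are taken from Table 3.1 and the reward specification above.

```python
import numpy as np

# Transition matrices of M from Table 3.1: P[a][i, j] = P(s_{i+1}, a_{a+1}, s_{j+1}).
P = np.array([
    [[0.0, 0.8, 0.2, 0.0],   # action a1
     [0.2, 0.0, 0.0, 0.8],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 0.0, 1.0]],
    [[0.0, 0.2, 0.8, 0.0],   # action a2
     [0.8, 0.0, 0.0, 0.2],
     [0.2, 0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0, 1.0]],
])

# Rewards: R[a][i] = R(s_{i+1}, a_{a+1}); only s2 and s3 yield non-zero reward.
R = np.array([
    [0.0, 0.8, 0.2, 0.0],    # action a1
    [0.0, 0.2, 0.8, 0.0],    # action a2
])

assert np.allclose(P.sum(axis=2), 1.0)   # every row is a probability distribution
```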

The MDP $\mathcal{M}'$ shown in Figure 3.4(b) is a homomorphic image of $\mathcal{M}$. It has the following parameters: $S' = \{\sigma_1, \sigma_2, \sigma_3\}$, $A' = \{\alpha_1, \alpha_2\}$, $\Psi' = \{(\sigma_1, \alpha_1), (\sigma_2, \alpha_1), (\sigma_2, \alpha_2), (\sigma_3, \alpha_1)\}$, $P'$ as shown in Table 3.2, and $R'$ defined as follows: $R'(\sigma_2, \alpha_1) = 0.2$, $R'(\sigma_2, \alpha_2) = 0.8$, and all other rewards are zero.

$P'(\sigma_1, \alpha_1, \sigma_2) = 1.0$   $P'(\sigma_3, \alpha_1, \sigma_3) = 1.0$
$P'(\sigma_2, \alpha_1, \sigma_1) = 0.8$   $P'(\sigma_2, \alpha_2, \sigma_1) = 0.2$
$P'(\sigma_2, \alpha_1, \sigma_3) = 0.2$   $P'(\sigma_2, \alpha_2, \sigma_3) = 0.8$

Table 3.2. The transition probabilities of the MDP $\mathcal{M}'$ shown in Figure 3.4(b).

One can define a homomorphism $\langle f, \{g_s \mid s \in S\} \rangle$ from $\mathcal{M}$ to $\mathcal{M}'$ as follows: $f(s_1) = \sigma_1$, $f(s_2) = f(s_3) = \sigma_2$, and $f(s_4) = \sigma_3$; $g_{s_1}(a_i) = g_{s_4}(a_i) = \alpha_1$ for $i = 1, 2$, $g_{s_2}(a_1) = g_{s_3}(a_2) = \alpha_2$, and $g_{s_2}(a_2) = g_{s_3}(a_1) = \alpha_1$.
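The homomorphism conditions can be checked mechanically: for every $(s, a) \in \Psi$ we need $R(s, a) = R'(f(s), g_s(a))$, and the total probability of reaching each block $f^{-1}(\sigma)$ under $(s, a)$ must equal $P'(f(s), g_s(a), \sigma)$. The sketch below restates the parameters of $\mathcal{M}$ and $\mathcal{M}'$ from Tables 3.1 and 3.2 (the array encoding and variable names are our own convention) and verifies both conditions for the mapping just defined.

```python
import numpy as np

# MDP M (Table 3.1): P[a][i, j] = P(s_{i+1}, a_{a+1}, s_{j+1}); R[a][i] = R(s_{i+1}, a_{a+1}).
P = np.array([
    [[0.0, 0.8, 0.2, 0.0], [0.2, 0.0, 0.0, 0.8], [0.8, 0.0, 0.0, 0.2], [0.0, 0.0, 0.0, 1.0]],
    [[0.0, 0.2, 0.8, 0.0], [0.8, 0.0, 0.0, 0.2], [0.2, 0.0, 0.0, 0.8], [0.0, 0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 0.8, 0.2, 0.0], [0.0, 0.2, 0.8, 0.0]])

# Image MDP M' (Table 3.2): indices 0-2 for sigma_1..sigma_3, 0-1 for alpha_1, alpha_2.
P_img = np.zeros((2, 3, 3))
P_img[0, 0, 1] = 1.0                        # P'(sigma1, alpha1, sigma2)
P_img[0, 2, 2] = 1.0                        # P'(sigma3, alpha1, sigma3)
P_img[0, 1, 0], P_img[0, 1, 2] = 0.8, 0.2   # P'(sigma2, alpha1, .)
P_img[1, 1, 0], P_img[1, 1, 2] = 0.2, 0.8   # P'(sigma2, alpha2, .)
R_img = np.zeros((2, 3))
R_img[0, 1], R_img[1, 1] = 0.2, 0.8         # R'(sigma2, alpha1), R'(sigma2, alpha2)

f = [0, 1, 1, 2]                            # f(s1)=sigma1, f(s2)=f(s3)=sigma2, f(s4)=sigma3
g = [[0, 0], [1, 0], [0, 1], [0, 0]]        # g[s][a]: image action index of a_{a+1} in s_{s+1}

# Check the homomorphism conditions for every state-action pair of M.
for s in range(4):
    for a in range(2):
        assert np.isclose(R[a, s], R_img[g[s][a], f[s]])
        for sigma in range(3):
            block_prob = sum(P[a, s, t] for t in range(4) if f[t] == sigma)
            assert np.isclose(block_prob, P_img[g[s][a], f[s], sigma])
print("h is an MDP homomorphism from M to M'")
```

One could also compute optimal action values for both models with the same discount factor and observe that $Q^*(s, a) = Q^*(f(s), g_s(a))$ for every admissible pair, as Theorem 1 requires.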